class: center, middle, inverse, title-slide

.title[
# Winter Institute in Data Science and Big Data
]
.subtitle[
## Containers, Cloud Computing, and Code Reproducibility: Docker, Kubernetes, and Code Ocean
]
.author[
### Le Bao
]
.institute[
### .font80[Massive Data Institute, Georgetown University]
]
.date[
### .font80[January 10th, 2023]
]

---

<style>
.remark-slide-number {
  position: inherit;
}
.remark-slide-number .progress-bar-container {
  position: absolute;
  bottom: 0;
  height: 6px;
  display: block;
  left: 0;
  right: 0;
}
.remark-slide-number .progress-bar {
  height: 100%;
  background-color: #005099;
}
.orange {
  color: #C4122E;
}
</style>

# Why?

.center[![:scale 75%](fig/container_meme.png)]

???

Last time, we talked a bit about cloud computing and reproducibility vis-a-vis using command-line tools and non-interactive sessions to run code. I know it may be a bit overwhelming, but that's not our intention. There is never a best time to teach this; a lot of people learn it the hard way after they already have a steady workflow. The point is to give you a little exposure, even if you don't master it right away, so you have a sense that these tools are out there when you need them.

Today we will be talking about containers. This is another addition to your toolbox beyond the usual programming languages. It is not meant to replace R or Python in any way; they are still your major tools. Rather, it is an add-on that helps you use R and Python better. To begin with: why learn containers? Simply put, they are the future.

--

- Docker and Kubernetes are the cutting-edge tools for tech and tech-related industries and scientific research.

???

To give you some concrete reasons: Kubernetes (pronounced "koo-ber-NET-eez").

--

- "The share of jobs containing Docker as a skill on Indeed increased by 9,538% from 2014 to 2019."

???

If you pay attention to job market updates for software developers or data scientists, you may have already noticed this. Over the past few years, the number of jobs that require or want Docker as a skill has skyrocketed. Based on data from Indeed, the share of jobs listing Docker as a skill increased by 9,538%. Big tech companies wanted to hire people with Docker skills like crazy.

---

# The Problem of Reproducibility

.center[![:scale 95%](fig/multi_ppl.png)]

???

Docker is probably too powerful to be fully covered in this single session; there are so many use cases. But from a research/data science perspective, the main motivation is the problem of reproducibility.

Imagine you and your coworkers are working on a project. You have different operating systems: some of you use macOS, some use Linux. You all have different versions of R and R packages installed. As a result, each of you gets a different value from the same analysis. Then you send your code to your supervisor to have a look, and, maybe because she is not doing hands-on analysis anymore, she has much older versions of the software, and the entire analysis runs into errors. That would be terrifying.

---

# The Problem of Reproducibility

.center[![:scale 95%](fig/multi_proj.png)]

???

Or imagine another scenario: you, as a data scientist, have been developing or managing multiple projects. You may still be maintaining an older project using R 3.6, currently using R 4.0 to analyze a recent project, and, to be future-proof, using the development version of R and packages to prepare an upcoming project. All of this can be very overwhelming to manage, especially when you encounter reproducibility issues.

---

# The Problem of Reproducibility

- Our computing tools are increasingly powerful, diverse, and cloud-based.
  - The iPhone 6 is 32,600 times faster than the Apollo Guidance Computer (AGC).
  - Supercomputers are now accessible to everyone through cloud computing.

???

To put it into a bigger context: why has reproducibility become an issue? What is the big background?

- Powerful: the iPhone 6 is over thirty thousand times faster than the computer used to land a man on the moon.
- Even if you need something very powerful like a supercomputer, it's actually more accessible than ever. It used to be the case that only a limited number of institutions had supercomputers: NASA, NOAA, some national labs, or very rich universities. Now, through cloud computing, there are tons of accessible resources from Amazon, Google, and Microsoft. You may have learned a bit of deep learning over the past couple of days, which can be a very computationally intensive task. Starting just a few years ago, computer science students at Stanford/MIT/Carnegie Mellon working on deep learning wouldn't need their school's supercomputer anymore, since that requires applying for access and computing time, etc. They just register for AWS cloud services and use the machines there, which are more up to date and offer lots of choices.

--

- Our work is required to be more open, transparent, and collaborative.
  - Reproducible research, open-sourced projects, etc.

???

In terms of research and data analysis work: the idea is that we can learn from past scholars and research and build one step further. So we want to share our research and projects so we can learn from each other. Open-sourced projects are vibrant because everyone is contributing to them.

--

- Our data are becoming bigger, higher-dimensional, and multimodal.
  - Big data, image/voice data, etc.

???

High-dimensional: for example, if you have an X and a Y, that's just two dimensions. But our data are increasingly high-dimensional; they may have thousands or millions of dimensions, for example, genes and DNA, or images. If your subject is humans or human society, it is by nature high-dimensional; we were just not able to fully unpack it because of both computational and methodological challenges. So high-dimensional data is increasingly a hot topic in computer science and statistics.

--

- Our analysis needs to be fast, instant, and real-time.
  - Real-time analytics, the pandemic, OpenTable & the State of the Industry

???

Another trend, which has especially intensified during the pandemic, is real-time analytics. It is not a new term; it has been around for years, but it finally made a lot of noise over the last two years because of the pandemic. It has become increasingly important for decision-making, economically, politically, and in terms of health policy. A very intuitive example is OpenTable, the restaurant reservation service, which drew a lot of attention during the early days of the pandemic. It started to release data on the State of the Industry, with statistics about restaurant reservations and cancellations and the status of restaurant services (indoor or outdoor dining, etc.), providing a much more vivid picture of how the pandemic affected people's lives and how that varied by region. And it is real-time, compared to the case counts reported by health agencies, which have lags and even accuracy issues because of testing time and reporting across administrative levels. We are getting more and more such real-time data, from Uber/Lyft, Google Search, and of course social media; the problem is that we are still not fully capable of taking advantage of it.

--

- ...

--

- **Our computing environment is increasingly complex and convoluted.**

???

As a result,

---

# Computing Environment

- Computing environment:
  - Hardware
  - Software
  - Operating system
  - System dependencies
  - R/python: versions, packages/libraries (last time)
- Goals:
  - Control the whole computing environment.
  - Configure the environments as we want.
  - Make the environment reproducible.
  - Share the code (that may require a specific environment).

???

Our goals are to control the whole computing environment and configure it as we want without messing up our other projects. I used to be really careful about whether to upgrade my R/Python while I was working on a specific project.

---

# Today

- Container
- Docker
  - Run `R`/`python` with Docker
- Kubernetes
- Cloud computing and Code Ocean
  - Both front- and back-ends

???

Code Ocean is both a front-end and a back-end application of containers and Docker: it is itself built on container/Docker technology, and it also lets you use it kind of like a container.

---

# What is a Container?

.center[![:scale 45%](fig/house1.jpg)]

$$\downarrow$$

.center[![:scale 45%](fig/house2.jpg)]

???

Imagine you want to build a perfect house for yourself: you design it, find all the materials, and finally build the house. Then you find another job and need to relocate. You still want to live in the same house, since it's perfect for you; you just want to move it to a different location. The idea of the container is that you can build your house as a container, and anytime you want to move, you pack it all together, move it anywhere you want, and unpack it. You are good to go: living in your perfect house again with new surroundings and a new neighborhood. In fact, the slogan of container technology is "build once, run anywhere."

---

# What is a Container?

- A standard unit of software that packages up code and all its dependencies

???

To be a bit technical,

--

- Operating system: Linux (most common), Windows, macOS, etc., as well as data center, cloud, serverless.

--

- System-level dependencies

--

- Software packages: R, python, TensorFlow, MySQL, etc.

--

- Software dependencies: tidyverse, NumPy, PyTorch

--

- Including everything needed to run code

???

Basically, a container has everything you need to run the code.

---

# Why Container?

- Lightweight
- Standalone and standard implementation
- Isolated from host system
- Portable and shareable
- Secure

???

- It's lightweight compared to a complete operating system. I don't know if you mess around with your system and reinstall it as often as I do; it's time-consuming. And we don't need most of the system's functionality to run specific code.
- It's a standalone and standard implementation, isolated from the host system; in other words, independent of whatever else you have installed on your OS, and it won't mess up your system and other things.
- It's also portable and shareable.
- We probably don't need to worry too much about security, but for some industries, containers also provide another layer of protection.

--

- Container allows us to deploy, replicate, move, and back up a workload in one streamlined way

???

All in all,

---

# Docker

- Docker is a platform for building, configuring, and delivering containers
- Docker (2013-) is the *de facto* industry standard for containers (1970-)

???

- Docker is one application of containers: basically, a platform for building, configuring, and delivering them.
- The idea of the container was born in the 1970s and has been developed ever since. But it is Docker that made it widely applicable and popularized a standard implementation of containers.

--

- Features:
  - Improved and seamless portability from laptop to any desktop, data center, and cloud environment
  - Collaboration
  - HPC and cloud application
  - Isolated, transparent, and reproducible implementation
  - Open-sourced and community-optimized
- Docker Hub
  - User-created images
  - Versioned
    - Almost all the versions of R, python, etc.
  - Layered
    - Each additional layer is built upon the existing ones

???

- Same version of Linux but different versions of R/Python? You don't need to build Linux again: Docker will just reuse that particular Linux layer and then build the different versions of R/Python on top, which is more efficient and saves a lot of time.

---

# Docker Image

- A static, read-only template for creating containers.

.center[![:scale 65%](fig/docker-hub.png)]
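For example (a quick sketch; the exact output varies by Docker version), you can pull an image from Docker Hub and list the read-only layers it was assembled from:

```bash
# Download an image from Docker Hub
docker pull rocker/r-ver:4.2.0

# Show the layers the image was built from
docker history rocker/r-ver:4.2.0
```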
???

A little bit of jargon: one of the basic units in Docker is the image. Most of the time, we can just use existing images, optimized for different programming languages and software.

---

# Use Docker to Run `R`

- Rocker project: [https://www.rocker-project.org/](https://www.rocker-project.org/)

Image | Description
------------ | ------------------------
r-base | Current R version
r-devel | R-devel added side-by-side onto r-base
r-ver | Specify R version
rstudio | Adds RStudio
tidyverse | Adds tidyverse & devtools
verse | Adds TeX & publishing-related packages
geospatial | Adds geospatial libraries

???

A lot of big names in the R programming world are behind this project. Different distributions.

---

# Run `R` Using Pre-built Images

- Command:

```bash
docker run -it --rm rocker/r-base
```

- `docker run`: run processes based on an image
  - General form: `docker run [OPTIONS] IMAGE[:TAG|@DIGEST] [COMMAND] [ARG...]`
- `-it`: interactive session
- `--rm`: automatically remove container once stopped
- `rocker/r-base`: pull latest r-base from rocker repository
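The `[COMMAND] [ARG...]` part can also replace the interactive session with a one-off command. A minimal sketch (the printed summary will differ from run to run):

```bash
# Run a single R expression non-interactively, removing the container afterwards
docker run --rm rocker/r-base Rscript -e 'summary(rnorm(100))'
```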
---

# Run `R` Using Pre-built Images

- Pull a specific version using `rocker/r-ver:4.2.0`

```bash
docker run -it --rm rocker/r-ver:4.2.0
```

- Run an RStudio server

```bash
docker run --rm -p 8888:8787 -e PASSWORD='mypassword' rocker/rstudio:4.2.0

# For apple silicon machines:
docker run --rm -p 8888:8787 -e PASSWORD='mypassword' rocker/rstudio:latest-daily
```

- `-p`: publish a port (RStudio listens on 8787 inside the container; open `http://localhost:8888` in a browser)
- `-e`: set environment variable

---

# Using Docker to Run `python`

- Lots of images with different support and configurations

```bash
docker run -it --rm python
docker run -it --rm python:3.10.4
```
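To run one of your own scripts instead of an interactive interpreter, a common pattern is to bind-mount a local folder into the container. A minimal sketch, where `myscript.py` is a hypothetical placeholder for a file in your current directory:

```bash
# Mount the current directory at /work inside the container,
# set it as the working directory, and run the script there
docker run -it --rm -v "$PWD":/work -w /work python:3.10.4 python myscript.py
```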
---

# Build Your Own `R` Docker Image

`Dockerfile`: instructions for assembling and configuring a Docker image.

```bash
FROM rocker/r-ver:4.2.0

# System dependencies
RUN apt-get update && apt-get install -y curl libz-dev

# R packages
RUN Rscript -e 'install.packages("MASS")'
RUN install2.r readr dplyr ggplot2 forcats

## Copy files
RUN mkdir docker-demo
COPY data docker-demo/data
COPY code docker-demo/code
RUN mkdir docker-demo/output
#ADD . docker-demo

## Set working directory
WORKDIR docker-demo
```

---

# Build Your Own `R` Docker Image

- Build image

```bash
docker build -t demo:r .
```

  + `-t`: name tag for the image
  + `.`: the build context (here, the directory containing the Dockerfile)

- Run a container using the image

```bash
docker run -i -t demo:r /bin/bash
```

  + `-i`: interactive
  + `-t`: allocate a pseudo-terminal (note: unlike in `docker build`, this is not a name tag)
  + `/bin/bash`: start a bash session

---

# Build Your Own `python` Docker Image

```bash
FROM python:3.10.4

# python libraries
RUN pip install -U bs4

## Copy files
RUN mkdir docker-demo
RUN mkdir docker-demo/data
ADD . docker-demo

## Set working directory
WORKDIR docker-demo
```

- Build image

```bash
docker build -t demo:python .
```

- Run a container using the image

```bash
docker run -i -t demo:python /bin/bash
```

---

# Managing Docker Containers

- List and commit current containers

```bash
docker ps -l
docker commit [CONTAINER ID] [NAME]
```

- List images and create a container from an image

```bash
docker images
docker create [IMAGE ID]
```

- Extract files

```bash
docker cp [ID]:[Container PATH] [Local PATH]
```

- Remove containers and images

```bash
docker rm -f [Container ID]
docker image rm [Image ID]
docker system prune --all
```

---

# Exercise

- Verify Docker installation
  - Open Terminal (Mac/Linux) or Command Prompt/PowerShell (Windows)
  - Run `docker run hello-world`
- Run an R or python container with a specific version using `docker run`.
- Run R and python containers using a Dockerfile.
  - Go to `/docker-demo-python` and build a `demo-python` image using the provided Dockerfile.
  - *Feel free to use your own project.
  - Run the container and test the code script in `/docker-demo-python/code`
  - Extract the output file using `docker cp`
  - Follow the same procedure for `/docker-demo-r`

---

# Kubernetes

- What is Kubernetes (a.k.a. K8s)?

.center[![:scale 45%](fig/house2.jpg)]

$$\downarrow$$

.center[![:scale 45%](fig/house3.jpeg)]

???

Continuing the house analogy: Kubernetes is like a real estate developer, building and managing whole neighborhoods of container houses.

---

# What is Kubernetes (a.k.a. K8s)?

- Container management
  - Deployment, scaling, scheduling, etc.
  - Works with Docker, Containerd, CRI-O, etc.

.center[![:scale 85%](fig/k8s_cluster.png)]

???

Advanced and complicated. For now, we usually don't need to worry about it as data scientists or researchers.
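---

# What is Kubernetes (a.k.a. K8s)? (cont'd)

As a flavor of what K8s automates, here is a minimal sketch using `kubectl` (the K8s command-line tool), assuming you already have a configured cluster and that the (hypothetical) `demo:python` image from earlier has been pushed to a registry the cluster can reach:

```bash
# Create a deployment that runs the demo image
kubectl create deployment demo --image=demo:python

# Scale out to three replicas; K8s schedules them across the cluster
kubectl scale deployment demo --replicas=3

# Check the running pods (the units that wrap containers)
kubectl get pods
```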
---

# Code Ocean

- An integrative, collaborative platform for computational research
  - Develop, collaborate, share, and publish code using a web browser (without the need for much specialized knowledge)
  - Similar tools: Digital Ocean, Vultr, Kamatera, Google Cloud/Amazon Web Services, etc.
- Both:
  - a *product* of Docker and K8s *and*
  - an *application* of cloud computing
- **Capsule**
  - Container (using images supplied by CO)
  - Cloud computing + environment + code + (optional) data
- The backend of Code Ocean
  - AWS computing instance: 16 cores, 120 GB of memory
  - Docker for configuring the computing environment (user-accessible)
  - K8s for allocating and scheduling resources (not user-accessible)

---

# Exercise

- *You can use the example code or create your own capsule for your project.*
- Create a Code Ocean account at https://codeocean.com
  - A `.edu` account comes with 10 hours of computing time
- Create a new capsule
  - Add `R` (4.1.0) as the base environment
  - Install `python` support by adding `python3-pip:latest` to `apt-get`
  - Install system dependencies using `apt-get`: `libudunits2-dev`, `libgdal-dev`, `libgeos-dev`, `libproj-dev`, `libfontconfig1-dev`.
- Add packages/libraries along with their specific versions
  - Install `beautifulsoup4:4.11.1`, `requests:latest` via `pip3`
  - Install `dplyr:1.0.9`, `tidyr:1.2.0`, `ggplot2:3.4.0`, `sf:latest` via R (CRAN) and `fiftystater` via GitHub (`wmurphyrd/fiftystater`)
- Upload files
  - Upload code scripts to `/code`
  - Upload data to `/data`
  - The only runtime-writable folder is `/results`

(continued on next page ...)

---

# Exercise (cont'd)

- Create a `run` script for running the code and set it as the file to run

```bash
#!/usr/bin/env bash
set -ex

mkdir -p ../results/data    # make a dir for saving scraped data
mkdir -p ../results/output  # make a dir for saving output figs

python3 -u election-2020.py "$@"   # run the python scraping script
Rscript election-map-2020.R "$@"   # run the R script for analysis
```

- Edit metadata and readme
- Commit the changes
- Execute a Reproducible Run