As a data science engineer, I recently found myself in the midst of a common problem: setting up a local environment for a data pipeline script that required a specific version of MS SQL Server which was not supported by Ubuntu 22.04. The process of manually installing and configuring the software, not to mention troubleshooting any issues that arose along the way, was both time-consuming and frustrating.
That’s when I discovered the power of Docker. By using a pre-configured MS SQL Server image from Docker Hub, I was able to quickly and easily set up the environment I needed for my pipeline script. Not only did this save me valuable time, but it also ensured that my development environment was consistent and easily reproducible for other team members.
This experience got me thinking about all the other ways that Docker can help data engineers and data scientists streamline their workflows. From running databases and data processing tools to automating pipeline deployments and reproducing development environments, the possibilities are endless.
There are several ways to run a Docker image, depending on your use case and needs. Here are some common ways:
- From Docker Hub: The easiest way to run an image is by pulling it from a public registry such as Docker Hub. This is done with the `docker pull` command, followed by the image name and an optional tag. For example, to pull the latest version of the official Ubuntu image, you would run `docker pull ubuntu`. You can find more information and examples in the Docker Docs.
- From a private registry: You can also pull images from a private or enterprise registry. First log in to the registry with the `docker login` command, then use `docker pull` to fetch the image.
- From a Dockerfile: You can build your own image from a `Dockerfile`, a script that contains the instructions for building an image. The `docker build` command builds an image from a `Dockerfile`, taking the path to the build context as an argument. For example, to build an image from a `Dockerfile` in the current directory, you would run `docker build .` (a minimal example follows this list). You can find more information and examples on valohai.com.
- From a Compose file: You can use the `docker-compose` command to create and start one or more containers from a Compose file, which is a great way to organize multiple containers and services in a single file. The `docker-compose up` command starts the services defined in the Compose file. You can find more information and examples in the Docker Docs.
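To make the Dockerfile route concrete, here is a minimal sketch: a hypothetical `Dockerfile` for a small Python script (the file names `script.py` and `requirements.txt` are placeholders, not files from this post):

```
# Hypothetical Dockerfile: package a small Python script
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY script.py .
CMD ["python", "script.py"]
```

With that file in the current directory, building and running it looks like:

```
docker build -t my-script .   # build the image, tagging it my-script
docker run --rm my-script     # run the script in a throwaway container
```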
Now that we have a better understanding of how to pull down or create our own Docker images, let’s dive into some of the ways that Docker can be used in data engineering and science.
Setting up and running database systems
Setting up and running database systems, such as MS SQL Server, MySQL, and PostgreSQL, without the need for manual installation and configuration. For example, setting up and running MS SQL Server on Docker:

```
docker pull mcr.microsoft.com/mssql/server:2022-latest
docker run -e "ACCEPT_EULA=Y" -e "MSSQL_SA_PASSWORD=<YourStrong@Passw0rd>" \
    -p 1433:1433 --name sql1 --hostname sql1 \
    -d \
    mcr.microsoft.com/mssql/server:2022-latest
```

- The first command pulls the latest version of MS SQL Server from the Microsoft Container Registry (mcr.microsoft.com).
- The second command runs a container from this image, sets the `ACCEPT_EULA` and `MSSQL_SA_PASSWORD` environment variables for the SQL Server, and maps the host's port 1433 to the container's port 1433 so the server is reachable from the host machine.
- After running these commands, you can easily connect to the MS SQL Server instance from SQL Server Management Studio or sqlcmd on your host machine using the server name `localhost,1433` and the `SA` password you specified.
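You can also sanity-check the server from inside the container itself. A minimal sketch, assuming the image ships the sqlcmd client tools (the 2022 images place them under `/opt/mssql-tools18` and need the `-C` flag to trust the server's self-signed certificate; older images use `/opt/mssql-tools`):

```
# Open an interactive sqlcmd session inside the running container
docker exec -it sql1 /opt/mssql-tools18/bin/sqlcmd \
    -S localhost -U SA -P '<YourStrong@Passw0rd>' -C
```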
Creating isolated development environments
Creating isolated development environments that can be easily shared and reproduced across teams, without the need for complex setup and configuration. The following is a sample `docker-compose.yml` file, which allows us to define and run multiple containers together:

```
version: '3'
services:
  web:
    build: .
    ports:
      - "8000:8000"
  db:
    image: postgres
    environment:
      POSTGRES_USER: user
      POSTGRES_PASSWORD: password
      POSTGRES_DB: dbname
```

In this example, two services are defined, "web" and "db". The "web" service is built from the current directory and maps port 8000 of the host to port 8000 of the container. The "db" service uses the official PostgreSQL image from Docker Hub and sets environment variables for the user, password, and database name. Running `docker-compose up` in the same directory as this file spins up both services and makes them available to the host.
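A few companion commands for working with this file; a sketch that assumes the hypothetical `user` and `dbname` values above (note that within the Compose network, the web container can reach PostgreSQL simply by the service name `db`):

```
docker-compose up -d                          # start both services in the background
docker-compose exec db psql -U user dbname    # open a psql shell in the db service
docker-compose down                           # stop and remove the containers when done
```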
Running data processing and analysis tools
Running data processing and analysis tools, such as Apache Spark and Apache Hadoop, in a containerized environment, which can help to simplify deployment and scaling. The pattern is to pull a Spark image, such as the SequenceIQ image on Docker Hub, and then run an interactive container whose bootstrap script sets up the environment and drops you into a bash shell.
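A minimal sketch of those two commands, assuming the community-maintained `sequenceiq/spark` image (note that it is tagged by Spark version, e.g. `1.6.0`, rather than `latest`, and has not been updated in years, so a current Spark image may be a better fit for new work):

```
# Pull the SequenceIQ Spark image (tagged by Spark version, not latest)
docker pull sequenceiq/spark:1.6.0

# Run an interactive container; the image's bootstrap entrypoint sets up
# Hadoop/YARN and Spark, then drops into a bash shell
docker run -it -h sandbox sequenceiq/spark:1.6.0 bash
```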
Automating the build and deployment of data pipelines
Automating the build and deployment of data pipelines, by containerizing the pipeline components and using Docker Compose to manage their dependencies. In the Compose file below, three services are defined: "pipeline_component_1", "pipeline_component_2", and "pipeline_component_3". Each service is built from the current directory and runs a specific Python script, which lets us automate the deployment and execution of the pipeline.
```
version: '3'
services:
  pipeline_component_1:
    build: .
    command: python script1.py
  pipeline_component_2:
    build: .
    command: python script2.py
  pipeline_component_3:
    build: .
    command: python script3.py
```
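One caveat worth noting: by default Compose starts all three services concurrently. If the steps must run in sequence, `depends_on` can express start-up order, though it only controls when a container starts, not whether the previous script has finished; strictly sequential pipelines are better served by an orchestrator or a wrapper script. A sketch of the `depends_on` form:

```
version: '3'
services:
  pipeline_component_1:
    build: .
    command: python script1.py
  pipeline_component_2:
    build: .
    command: python script2.py
    depends_on:
      - pipeline_component_1   # start component 2 only after component 1 starts
```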
Running data visualization and reporting tools
Running data visualization and reporting tools, such as Tableau and Power BI, in a containerized environment, which can help to simplify deployment and scaling. This code snippet pulls a Tableau Server image from Docker Hub, then runs a container from it, maps ports 8850 and 8800 of the host to the same ports on the container, and mounts a volume so the container can persist data:

```
docker pull tableau/server:latest
docker run -p 8850:8850 -p 8800:8800 --name tableau -d \
    -v /your/local/path:/var/opt/tableau/tableau_server/data/tabsvc/files/ \
    tableau/server:latest
```
Using Docker to run Jupyter Notebook and RStudio
Using Docker to run Jupyter Notebook, RStudio, and other interactive development environments, making it easy to share and reproduce code, data, and dependencies. This command runs a container from the official Jupyter Data Science Notebook image on Docker Hub, maps port 8888 of the host to port 8888 of the container, and mounts a volume so the container can access files on the host:

```
docker run -it --rm -p 8888:8888 -v /your/local/path:/app jupyter/datascience-notebook
```
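On start-up the image prints a URL containing an access token; open it in a browser to reach the notebook. Note also that the Jupyter Docker Stacks images run as the `jovyan` user and document `/home/jovyan/work` as the conventional mount point, which makes mounted files appear directly in the notebook's file browser. A sketch using that convention:

```
# Mount the current directory where the notebook UI expects user files
docker run -it --rm -p 8888:8888 \
    -v "$PWD":/home/jovyan/work jupyter/datascience-notebook
```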
Using Docker to run Machine Learning and Deep Learning frameworks
Using Docker to run machine learning and deep learning frameworks like TensorFlow, PyTorch, and scikit-learn, which can help to simplify the deployment of these frameworks:

```
docker pull tensorflow/tensorflow:latest-gpu-py3
docker run --rm -it -p 8888:8888 tensorflow/tensorflow:latest-gpu-py3
```
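Two caveats worth flagging: GPU access from a container requires the NVIDIA Container Toolkit on the host plus Docker's `--gpus` flag, and the `-py3` tag suffix is a legacy naming scheme from the TensorFlow 1.x era (current images drop it, with a separate `-jupyter` variant that serves a notebook on port 8888). A sketch under those assumptions:

```
# Run the current GPU image with all host GPUs exposed and verify
# that TensorFlow can see them
docker run --rm -it --gpus all tensorflow/tensorflow:latest-gpu \
    python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"
```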
In summary, this blog has demonstrated the advantages of using Docker in data science and engineering, including reproducible environments and dependency management. The use cases and examples provided illustrate the potential of Docker to streamline data professionals' workflows. To learn more about using Docker with Python, check out the blog Isolate, Organize, and Deploy: The Benefits of Virtual Environments and Docker for Python.