Containerizing Your Python Pipelines with Docker
Package your Python data pipelines into Docker containers for consistent, reproducible execution — from Dockerfile to docker-compose, with environment management and production deployment patterns.
Your pipeline runs perfectly on your laptop. It crashes on the server. The Python version is different. A library is missing. The file paths do not match.
"Works on my machine" is the most common deployment failure. Docker eliminates it by packaging your pipeline with its exact dependencies, configuration, and runtime into a container that runs identically everywhere.
This guide takes a working Python pipeline and containerises it — from Dockerfile basics to docker-compose setups with databases, environment management, and production deployment patterns.
# Who This Is For
- Developers whose scripts work locally but break when deployed to another machine
- Data engineers moving pipelines from laptops to cloud servers or team environments
- Vibe coders who built something that works and now need to ship it reliably
- Operations teams tired of debugging environment differences between staging and production
You should know basic Python and be comfortable running commands in a terminal. No prior Docker experience required — this guide starts from scratch.
# The Container Architecture
flowchart LR
    subgraph Container["Docker Container"]
        P["Python 3.11"] --> D["Dependencies\n(requirements.txt)"]
        D --> S["Pipeline Script"]
        S --> V["Volume Mount\n(/data)"]
    end
    E["Environment\nVariables"] --> Container
    V --> H["Host File System\n(data/ directory)"]
    Container --> O["Output\n(reports, DB)"]
The container includes Python, all dependencies, and your code. Data and secrets are injected from outside — never baked into the image.
# What You Will Need
# Install Docker Desktop
# https://docs.docker.com/get-docker/
# Verify installation
docker --version
docker compose version
# Step 1: Project Structure
Organise your pipeline for containerisation:
pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .dockerignore
├── .env.example
├── pipeline.py
├── extract.py
├── transform.py
├── load.py
└── data/ ← mounted as volume (not in image)
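With this layout, pipeline.py stays a thin orchestrator over the stage modules. A minimal sketch (the function names are assumptions based on the file layout, not fixed by it):
# pipeline.py: thin orchestrator over the stage modules (names illustrative)
from extract import extract_data
from transform import transform
from load import load_data

def main():
    raw = extract_data()      # pull from API / files
    clean = transform(raw)    # clean and reshape
    load_data(clean)          # write to DB / reports

if __name__ == "__main__":
    main()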
# requirements.txt
pandas==2.2.0
requests==2.31.0
openpyxl==3.1.2
python-dotenv==1.0.1
# .dockerignore
# Do not copy these into the Docker image
.env
.git
__pycache__
*.pyc
data/
reports/
*.db
.venv
venv
# Step 2: The Dockerfile
# Use a specific Python version — never use :latest
FROM python:3.11-slim
# Set working directory
WORKDIR /app
# Install system dependencies (if needed for certain Python packages)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*
# Copy requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy pipeline code
COPY *.py ./
# Run as non-root user (security)
RUN useradd --create-home pipelineuser
# Create data directories for volume mounts, owned by the non-root user
RUN mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser
# Default command
CMD ["python", "pipeline.py"]
# Why This Order Matters
flowchart TD
    L1["Layer 1: Python base image\n(cached — rarely changes)"] --> L2["Layer 2: System packages\n(cached — rarely changes)"]
    L2 --> L3["Layer 3: pip install requirements\n(cached until requirements.txt changes)"]
    L3 --> L4["Layer 4: COPY pipeline code\n(rebuilds when code changes)"]
Docker caches each layer. By copying requirements.txt before your code, dependency installation is cached. Only code changes trigger a rebuild of the final layer — going from minutes to seconds.
# Build and Run
# Build the image
docker build -t sales-pipeline .
# Run with environment variables
docker run --rm \
    -e API_KEY=your-key-here \
    -e DB_PATH=/app/data/reports.db \
    -v $(pwd)/data:/app/data \
    sales-pipeline
# Run with .env file
docker run --rm \
    --env-file .env \
    -v $(pwd)/data:/app/data \
    sales-pipeline
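To confirm the variables and mounts actually arrived inside the container, a throwaway check helps. A sketch (check_env.py is a hypothetical helper; since the Dockerfile copies all *.py files it lands in the image, and you run it with docker run ... sales-pipeline python check_env.py):
# check_env.py: print what the container actually sees (illustrative helper)
import os

print("API_KEY set:", bool(os.environ.get("API_KEY")))
print("DB_PATH:", os.environ.get("DB_PATH"))
print("/app/data contents:", os.listdir("/app/data"))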
# Step 3: Environment Management
# .env.example (Committed to Git)
# Copy this to .env and fill in real values
API_KEY=your-api-key-here
API_URL=https://api.example.com/v1
DB_PATH=/app/data/reports.db
SMTP_HOST=smtp.example.com
SMTP_PASSWORD=your-smtp-password
LOG_LEVEL=INFO
# Reading Environment in Python
import os
from dotenv import load_dotenv
# Load .env in development, use system env vars in production
load_dotenv()
def get_config():
    """Load pipeline configuration from environment."""
    required = ["API_KEY", "API_URL"]
    missing = [var for var in required if not os.environ.get(var)]
    if missing:
        raise EnvironmentError(
            f"Missing required environment variables: {', '.join(missing)}\n"
            f"Copy .env.example to .env and fill in the values."
        )
    return {
        "api_key": os.environ["API_KEY"],
        "api_url": os.environ["API_URL"],
        "db_path": os.environ.get("DB_PATH", "/app/data/reports.db"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
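Call get_config() once at startup so a missing variable fails fast, before any data is touched. A minimal sketch:
# At the top of pipeline.py's entry point (illustrative)
config = get_config()  # raises EnvironmentError immediately if .env is incomplete
print(f"Running against {config['api_url']}")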
# Step 4: Docker Compose
For pipelines that need a database or multiple services, use docker-compose:
# docker-compose.yml
services:
  pipeline:
    build: .
    env_file: .env
    volumes:
      - ./data:/app/data
      - ./reports:/app/reports
    depends_on:
      db:
        condition: service_healthy
    networks:
      - pipeline-net

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: pipeline
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pipeline"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - pipeline-net

volumes:
  pgdata:

networks:
  pipeline-net:
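Inside the compose network, the pipeline reaches Postgres using the service name db as the hostname, not localhost or an IP. A minimal sketch, assuming psycopg2-binary has been added to requirements.txt:
import os

import psycopg2  # assumes psycopg2-binary is in requirements.txt

def get_db_connection():
    """Connect to the compose-managed Postgres via its service name."""
    return psycopg2.connect(
        host="db",  # compose service name, resolvable on pipeline-net
        dbname="pipeline",
        user="pipeline",
        password=os.environ["DB_PASSWORD"],
    )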
# Running with Compose
# Start database and run pipeline
docker compose up
# Run pipeline only (database already running)
docker compose run --rm pipeline
# Run with overridden command
docker compose run --rm pipeline python pipeline.py --days 7
# Tear down
docker compose down
# Step 5: Volume Mounts for Data Persistence
Data should live outside the container so it survives container restarts.
# docker-compose.yml — volume examples
services:
  pipeline:
    build: .
    volumes:
      # Source data (read-only)
      - ./data/input:/app/data/input:ro
      # Output data (read-write)
      - ./data/output:/app/data/output
      # Reports
      - ./reports:/app/reports
      # Logs
      - ./logs:/app/logs
# In your pipeline — use paths relative to /app
import os
from datetime import datetime

import pandas as pd

from transform import transform  # this project's transform.py
from load import export_report   # this project's load.py

INPUT_DIR = os.environ.get("INPUT_DIR", "/app/data/input")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/app/data/output")
REPORT_DIR = os.environ.get("REPORT_DIR", "/app/reports")

def run_pipeline():
    """Pipeline using container-friendly paths."""
    # Read from mounted input
    df = pd.read_excel(os.path.join(INPUT_DIR, "sales.xlsx"))
    # Process
    result = transform(df)
    # Write to mounted output
    result.to_csv(os.path.join(OUTPUT_DIR, "clean_sales.csv"), index=False)
    # Generate report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = os.path.join(REPORT_DIR, f"report_{timestamp}.xlsx")
    export_report(result, report_path)
# Step 6: Multi-Stage Builds (Production)
For production images, use multi-stage builds to reduce image size:
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# Stage 2: Runtime image (smaller)
FROM python:3.11-slim
WORKDIR /app
# Copy only installed packages from builder
COPY --from=builder /install /usr/local
# Copy pipeline code
COPY *.py ./
# Non-root user with write access to the data directories
RUN useradd --create-home pipelineuser \
    && mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser
CMD ["python", "pipeline.py"]
# Image Size Comparison
| Approach | Image size |
|---|---|
| python:3.11 (full) | ~900 MB |
| python:3.11-slim | ~150 MB |
| Multi-stage slim | ~120 MB |
| python:3.11-alpine | ~50 MB (compatibility issues) |
Use slim unless you have a specific reason to use alpine: Alpine ships musl instead of glibc, so many precompiled Python wheels will not install there and must be compiled from source.
# Step 7: Scheduling Containers
# Cron on Host
# Run the containerised pipeline every day at 6 AM
0 6 * * * cd /path/to/project && docker compose run --rm pipeline >> logs/docker.log 2>&1
# Docker-Native Scheduling
# docker-compose.yml — with Ofelia scheduler
services:
  scheduler:
    image: mcuadros/ofelia:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    labels:
      ofelia.job-exec.pipeline.schedule: "0 6 * * *"
      ofelia.job-exec.pipeline.container: "pipeline"
      ofelia.job-exec.pipeline.command: "python pipeline.py"
    depends_on:
      - pipeline

  pipeline:
    build: .
    env_file: .env
    volumes:
      - ./data:/app/data
# Step 8: Logging in Containers
Write logs to stdout/stderr — Docker captures them automatically.
import logging
import os
import sys

def setup_logging(level=None):
    """Configure logging for container environment."""
    log_level = level or os.environ.get("LOG_LEVEL", "INFO")
    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )
    return logging.getLogger("pipeline")
logger = setup_logging()
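With that in place, log at stage boundaries and let a non-zero exit code signal failure; both surface in docker compose logs and in whatever schedules the container. A sketch (run_pipeline is the function from Step 5):
def main():
    logger.info("Pipeline started")
    try:
        run_pipeline()
        logger.info("Pipeline finished")
    except Exception:
        logger.exception("Pipeline failed")  # full traceback to stdout
        raise SystemExit(1)                  # non-zero exit marks the run as failed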
# Viewing Logs
# Follow logs in real time
docker compose logs -f pipeline
# Last 100 lines
docker compose logs --tail 100 pipeline
# Logs with timestamps
docker compose logs -t pipeline
# Step 9: Health Checks
# Add to Dockerfile (this minimal check only verifies the interpreter starts)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import sys; sys.exit(0)" || exit 1
For long-running pipelines, add a health marker that the pipeline refreshes as it runs:
# health.py — simple health check
import os
from datetime import datetime
HEALTH_FILE = "/app/data/.health"
def update_health():
    """Update health marker after successful pipeline stage."""
    with open(HEALTH_FILE, "w") as f:
        f.write(datetime.now().isoformat())

def check_health(max_age_seconds=3600):
    """Check if pipeline has run recently."""
    if not os.path.exists(HEALTH_FILE):
        return False
    with open(HEALTH_FILE, "r") as f:
        last_run = datetime.fromisoformat(f.read().strip())
    age = (datetime.now() - last_run).total_seconds()
    return age < max_age_seconds
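To make the two work together, have the pipeline refresh the marker as stages complete, and point the Dockerfile HEALTHCHECK at check_health() instead of the placeholder above. A sketch (run_stage is an illustrative wrapper):
# In pipeline.py: refresh the marker after each completed stage
from health import update_health

def run_stage(stage_fn):
    result = stage_fn()
    update_health()  # container reports healthy while stages keep completing
    return result

# In the Dockerfile, replace the placeholder check with:
# HEALTHCHECK --interval=60s --timeout=10s --retries=3 \
#     CMD python -c "from health import check_health; import sys; sys.exit(0 if check_health() else 1)"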
# Common Docker Patterns for Pipelines
| Pattern | Use case | Example |
|---|---|---|
| One-shot container | Scheduled pipeline runs | docker compose run --rm pipeline |
| Long-running service | Worker processing queue | docker compose up -d pipeline |
| Sidecar database | Pipeline needs PostgreSQL | Compose with depends_on + healthcheck |
| Mounted volumes | Input/output data persistence | -v ./data:/app/data |
| Multi-stage build | Smaller production images | Builder → runtime stages |
| Non-root user | Security | RUN useradd ... + USER ... |
# What This Replaces
| Local approach | Docker equivalent |
|---|---|
| pip install on the server | requirements.txt built into image |
| "Works on my machine" | Identical container on every machine |
| Python version mismatch | Specific Python version in FROM |
| Manual server setup | docker compose up |
| Log files on disk | docker compose logs |
| Cannot reproduce a bug | Exact same image in dev and production |
# Troubleshooting
| Problem | Cause | Fix |
|---|---|---|
| ModuleNotFoundError | Missing from requirements.txt | Add to requirements.txt, rebuild |
| Permission denied on volume | Container user ≠ host user | Fix ownership or use same UID |
| Container exits immediately | Script finishes (expected for one-shot) | Use docker compose run, not up |
| Slow builds | Dependencies re-install every build | COPY requirements.txt before code |
| Large image size | Using full Python image | Switch to python:3.11-slim |
| Cannot connect to database | Wrong hostname | Use compose service name as hostname |
# Next Steps
Start by containerising your most important pipeline — the one that breaks when deployed to a new server. The Dockerfile pattern above works for 90% of Python data pipelines. Add docker-compose when you need a database or multiple services.
For building the pipelines to containerise, see How to Design Data Pipelines for Reliable Reporting. For orchestrating containerised pipelines with dependencies, see How to Schedule and Orchestrate Workflows with Prefect.
My automation services include containerising and deploying production pipeline systems. Get in touch to discuss deploying your pipelines with Docker.