Containerizing Your Python Pipelines with Docker


Package your Python data pipelines into Docker containers for consistent, reproducible execution — from Dockerfile to docker-compose, with environment management and production deployment patterns.


Your pipeline runs perfectly on your laptop. It crashes on the server. The Python version is different. A library is missing. The file paths do not match.

"Works on my machine" is the most common deployment failure. Docker eliminates it by packaging your pipeline with its exact dependencies, configuration, and runtime into a container that runs identically everywhere.

This guide takes a working Python pipeline and containerises it — from Dockerfile basics to docker-compose setups with databases, environment management, and production deployment patterns.

# Who This Is For

  • Developers whose scripts work locally but break when deployed to another machine
  • Data engineers moving pipelines from laptops to cloud servers or team environments
  • Vibe coders who built something that works and now need to ship it reliably
  • Operations teams tired of debugging environment differences between staging and production

You should know basic Python and be comfortable running commands in a terminal. No prior Docker experience required — this guide starts from scratch.

# The Container Architecture

mermaid
flowchart LR
  subgraph Container["Docker Container"]
    P["Python 3.11"] --> D["Dependencies\n(requirements.txt)"]
    D --> S["Pipeline Script"]
    S --> V["Volume Mount\n(/data)"]
  end

  E["Environment\nVariables"] --> Container
  V --> H["Host File System\n(data/ directory)"]
  Container --> O["Output\n(reports, DB)"]

The container includes Python, all dependencies, and your code. Data and secrets are injected from outside — never baked into the image.

# What You Will Need

bash
# Install Docker Desktop
# https://docs.docker.com/get-docker/

# Verify installation
docker --version
docker compose version

# Step 1: Project Structure

Organise your pipeline for containerisation:

text
pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .dockerignore
├── .env.example
├── pipeline.py
├── extract.py
├── transform.py
├── load.py
└── data/          ← mounted as volume (not in image)

# requirements.txt

text
pandas==2.2.0
requests==2.31.0
openpyxl==3.1.2
python-dotenv==1.0.1

# .dockerignore

text
# Do not copy these into the Docker image
.env
.git
__pycache__
*.pyc
data/
reports/
*.db
.venv
venv

# Step 2: The Dockerfile

dockerfile
# Use a specific Python version — never use :latest
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies (if needed for certain Python packages)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy pipeline code (multi-source COPY requires the destination to end with /)
COPY *.py ./

# Create data directories and a non-root user that owns them (security)
RUN useradd --create-home pipelineuser \
    && mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser

# Default command
CMD ["python", "pipeline.py"]

# Why This Order Matters

mermaid
flowchart TD
  L1["Layer 1: Python base image\n(cached — rarely changes)"] --> L2["Layer 2: System packages\n(cached — rarely changes)"]
  L2 --> L3["Layer 3: pip install requirements\n(cached until requirements.txt changes)"]
  L3 --> L4["Layer 4: COPY pipeline code\n(rebuilds when code changes)"]

Docker caches each layer. By copying requirements.txt before your code, dependency installation is cached. Only code changes trigger a rebuild of the final layer — going from minutes to seconds.

# Build and Run

bash
# Build the image
docker build -t sales-pipeline .

# Run with environment variables
docker run --rm \
    -e API_KEY=your-key-here \
    -e DB_PATH=/app/data/reports.db \
    -v $(pwd)/data:/app/data \
    sales-pipeline

# Run with .env file
docker run --rm \
    --env-file .env \
    -v $(pwd)/data:/app/data \
    sales-pipeline

# Step 3: Environment Management

# .env.example (Committed to Git)

bash
# Copy this to .env and fill in real values
API_KEY=your-api-key-here
API_URL=https://api.example.com/v1
DB_PATH=/app/data/reports.db
SMTP_HOST=smtp.example.com
SMTP_PASSWORD=your-smtp-password
LOG_LEVEL=INFO

# Reading Environment in Python

python
import os
from dotenv import load_dotenv

# Load .env in development; load_dotenv() does not override variables that
# are already set, so real environment variables still win in production
load_dotenv()

def get_config():
    """Load pipeline configuration from environment."""
    required = ["API_KEY", "API_URL"]
    missing = [var for var in required if not os.environ.get(var)]

    if missing:
        raise EnvironmentError(
            f"Missing required environment variables: {', '.join(missing)}\n"
            f"Copy .env.example to .env and fill in the values."
        )

    return {
        "api_key": os.environ["API_KEY"],
        "api_url": os.environ["API_URL"],
        "db_path": os.environ.get("DB_PATH", "/app/data/reports.db"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

# Step 4: Docker Compose

For pipelines that need a database or multiple services, use docker-compose:

yaml
# docker-compose.yml

services:
  pipeline:
    build: .
    env_file: .env
    volumes:
      - ./data:/app/data
      - ./reports:/app/reports
    depends_on:
      db:
        condition: service_healthy
    networks:
      - pipeline-net

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: pipeline
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pipeline"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - pipeline-net

volumes:
  pgdata:

networks:
  pipeline-net:
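
Inside the compose network, services reach each other by service name — the pipeline connects to Postgres with db as the hostname, not localhost. A minimal sketch of building the connection string (build_dsn is a hypothetical helper; a driver such as psycopg2, added to requirements.txt, would consume the DSN):

```python
import os

def build_dsn(env=None):
    """Build a Postgres connection string from compose-provided settings.

    Inside the compose network, the service name "db" is the hostname.
    """
    env = os.environ if env is None else env
    user = env.get("POSTGRES_USER", "pipeline")
    password = env["DB_PASSWORD"]      # required — same variable compose reads
    host = env.get("DB_HOST", "db")    # compose service name, not localhost
    dbname = env.get("POSTGRES_DB", "pipeline")
    return f"postgresql://{user}:{password}@{host}:5432/{dbname}"
```

The pipeline would then connect with something like psycopg2.connect(build_dsn()).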

# Running with Compose

bash
# Start database and run pipeline
docker compose up

# Run pipeline only (database already running)
docker compose run --rm pipeline

# Run with overridden command
docker compose run --rm pipeline python pipeline.py --days 7

# Tear down
docker compose down

# Step 5: Volume Mounts for Data Persistence

Data should live outside the container so it survives container restarts.

yaml
# docker-compose.yml — volume examples

services:
  pipeline:
    build: .
    volumes:
      # Source data (read-only)
      - ./data/input:/app/data/input:ro

      # Output data (read-write)
      - ./data/output:/app/data/output

      # Reports
      - ./reports:/app/reports

      # Logs
      - ./logs:/app/logs

python
# In your pipeline — use paths relative to /app
import os
from datetime import datetime

import pandas as pd

# transform() and export_report() come from the project's
# transform.py / load.py modules (see Step 1)

INPUT_DIR = os.environ.get("INPUT_DIR", "/app/data/input")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/app/data/output")
REPORT_DIR = os.environ.get("REPORT_DIR", "/app/reports")

def run_pipeline():
    """Pipeline using container-friendly paths."""
    # Read from mounted input
    df = pd.read_excel(os.path.join(INPUT_DIR, "sales.xlsx"))

    # Process
    result = transform(df)

    # Write to mounted output
    result.to_csv(os.path.join(OUTPUT_DIR, "clean_sales.csv"), index=False)

    # Generate report
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = os.path.join(REPORT_DIR, f"report_{timestamp}.xlsx")
    export_report(result, report_path)

# Step 6: Multi-Stage Builds (Production)

For production images, use multi-stage builds to reduce image size:

dockerfile
# Stage 1: Build dependencies
FROM python:3.11-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime image (smaller)
FROM python:3.11-slim

WORKDIR /app

# Copy only installed packages from builder
COPY --from=builder /install /usr/local

# Copy pipeline code (multi-source COPY requires the destination to end with /)
COPY *.py ./

# Create directories and a non-root user that owns them
RUN useradd --create-home pipelineuser \
    && mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser

CMD ["python", "pipeline.py"]

# Image Size Comparison

| Approach | Image size |
| --- | --- |
| python:3.11 (full) | ~900 MB |
| python:3.11-slim | ~150 MB |
| Multi-stage slim | ~120 MB |
| python:3.11-alpine | ~50 MB (compatibility issues) |

Use slim unless you have a specific reason to use alpine (glibc vs musl issues).

# Step 7: Scheduling Containers

# Cron on Host

bash
# Run the containerised pipeline every day at 6 AM
0 6 * * * cd /path/to/project && docker compose run --rm pipeline >> logs/docker.log 2>&1

# Docker-Native Scheduling

yaml
# docker-compose.yml — with Ofelia scheduler

services:
  scheduler:
    image: mcuadros/ofelia:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    labels:
      ofelia.job-exec.pipeline.schedule: "0 6 * * *"
      ofelia.job-exec.pipeline.container: "pipeline"
      ofelia.job-exec.pipeline.command: "python pipeline.py"
    depends_on:
      - pipeline

  pipeline:
    build: .
    env_file: .env
    # Keep the container running so the scheduler can exec into it
    command: ["sleep", "infinity"]
    volumes:
      - ./data:/app/data

# Step 8: Logging in Containers

Write logs to stdout/stderr — Docker captures them automatically.

python
import logging
import os
import sys

def setup_logging(level=None):
    """Configure logging for container environment."""
    log_level = level or os.environ.get("LOG_LEVEL", "INFO")

    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    return logging.getLogger("pipeline")

logger = setup_logging()

# Viewing Logs

bash
# Follow logs in real time
docker compose logs -f pipeline

# Last 100 lines
docker compose logs --tail 100 pipeline

# Logs with timestamps
docker compose logs -t pipeline

# Step 9: Health Checks

dockerfile
# Add to Dockerfile — this baseline check only confirms the interpreter starts
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import sys; sys.exit(0)" || exit 1

For long-running pipelines, add a health endpoint:

python
# health.py — simple health check
import os
from datetime import datetime

HEALTH_FILE = "/app/data/.health"

def update_health():
    """Update health marker after successful pipeline stage."""
    with open(HEALTH_FILE, "w") as f:
        f.write(datetime.now().isoformat())

def check_health(max_age_seconds=3600):
    """Check if pipeline has run recently."""
    if not os.path.exists(HEALTH_FILE):
        return False

    with open(HEALTH_FILE, "r") as f:
        last_run = datetime.fromisoformat(f.read().strip())

    age = (datetime.now() - last_run).total_seconds()
    return age < max_age_seconds
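
update_health is meant to be called as stages complete, so the marker's age reflects real progress. A sketch of that wiring (run_stages and the stage callables are hypothetical; the marker path is parameterised here for testability):

```python
from datetime import datetime

def run_stages(stages, health_file="/app/data/.health"):
    """Run pipeline stages in order, refreshing the health marker after each.

    `stages` is a list of zero-argument callables; a failing stage stops the
    run, so a stale marker shows where progress halted.
    """
    for stage in stages:
        stage()
        # Same marker format that check_health() parses
        with open(health_file, "w") as f:
            f.write(datetime.now().isoformat())
```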

# Common Docker Patterns for Pipelines

| Pattern | Use case | Example |
| --- | --- | --- |
| One-shot container | Scheduled pipeline runs | docker compose run --rm pipeline |
| Long-running service | Worker processing queue | docker compose up -d pipeline |
| Sidecar database | Pipeline needs PostgreSQL | Compose with depends_on + healthcheck |
| Mounted volumes | Input/output data persistence | -v ./data:/app/data |
| Multi-stage build | Smaller production images | Builder → runtime stages |
| Non-root user | Security | RUN useradd ... + USER ... |
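
The long-running service pattern replaces the one-shot script with a polling loop. A minimal sketch (run_worker and its parameters are illustrative; the work function is passed in rather than hardcoded):

```python
import logging
import time

logger = logging.getLogger("pipeline")

def run_worker(process_batch, poll_seconds=60, max_iterations=None):
    """Poll for work forever (or for max_iterations, which is useful in tests).

    Started with `docker compose up -d pipeline`; each cycle runs one batch
    and survives failures so the container keeps running.
    """
    done = 0
    while max_iterations is None or done < max_iterations:
        try:
            process_batch()
        except Exception:
            logger.exception("Batch failed; retrying next cycle")
        done += 1
        if max_iterations is None or done < max_iterations:
            time.sleep(poll_seconds)
    return done
```

Catching broad Exception is deliberate here: one bad batch should be logged, not kill the service.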

# What This Replaces

| Local approach | Docker equivalent |
| --- | --- |
| pip install on the server | requirements.txt built into image |
| "Works on my machine" | Identical container on every machine |
| Python version mismatch | Specific Python version in FROM |
| Manual server setup | docker compose up |
| Log files on disk | docker compose logs |
| Cannot reproduce a bug | Exact same image in dev and production |

# Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| ModuleNotFoundError | Missing from requirements.txt | Add to requirements.txt, rebuild |
| Permission denied on volume | Container user ≠ host user | Fix ownership or use same UID |
| Container exits immediately | Script finishes (expected for one-shot) | Use docker compose run, not up |
| Slow builds | Dependencies re-install every build | COPY requirements.txt before code |
| Large image size | Using full Python image | Switch to python:3.11-slim |
| Cannot connect to database | Wrong hostname | Use compose service name as hostname |

# Next Steps

Start by containerising your most important pipeline — the one that breaks when deployed to a new server. The Dockerfile pattern above works for 90% of Python data pipelines. Add docker-compose when you need a database or multiple services.

For building the pipelines to containerise, see How to Design Data Pipelines for Reliable Reporting. For orchestrating containerised pipelines with dependencies, see How to Schedule and Orchestrate Workflows with Prefect.

Automation services include containerising and deploying production pipeline systems.

Get in touch to discuss deploying your pipelines with Docker.
