Containerizing Your Python Pipelines with Docker

· 8 min read · Automation

Package your Python data pipelines into Docker containers for consistent, reproducible execution — from Dockerfile to docker-compose, with environment management and production deployment patterns.

Your pipeline runs perfectly on your laptop. It crashes on the server. The Python version is different. A library is missing. The file paths do not match.

“Works on my machine” is one of the most common deployment failures. Docker eliminates it by packaging your pipeline with its exact dependencies, configuration, and runtime into a container that runs the same way on every host.

This guide takes a working Python pipeline and containerises it — from Dockerfile basics to docker-compose setups with databases, environment management, and production deployment patterns.

Who This Is For

  • Developers whose scripts work locally but break when deployed to another machine
  • Data engineers moving pipelines from laptops to cloud servers or team environments
  • Vibe coders who built something that works and now need to ship it reliably
  • Operations teams tired of debugging environment differences between staging and production

You should know basic Python and be comfortable running commands in a terminal. No prior Docker experience required — this guide starts from scratch.

The Container Architecture

The container includes Python, all dependencies, and your code. Data and secrets are injected from outside — never baked into the image.

What You Will Need

# Install Docker Desktop
# https://docs.docker.com/get-docker/

# Verify installation
docker --version
docker compose version

Step 1: Project Structure

Organise your pipeline for containerisation:

pipeline/
├── Dockerfile
├── docker-compose.yml
├── requirements.txt
├── .dockerignore
├── .env.example
├── pipeline.py
├── extract.py
├── transform.py
├── load.py
└── data/          ← mounted as volume (not in image)

requirements.txt

pandas==2.2.0
requests==2.31.0
openpyxl==3.1.2
python-dotenv==1.0.1
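
Exact pins only help if every line is actually pinned. As a rough sketch of a check you could run before building, assuming a plain requirements.txt without `-r` includes or URL requirements (the function name `unpinned_requirements` is hypothetical, not part of any tool):

```python
def unpinned_requirements(text):
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


# Sample content for illustration only
sample = """\
pandas==2.2.0
requests>=2.31
openpyxl
python-dotenv==1.0.1  # pinned
"""
print(unpinned_requirements(sample))  # → ['requests>=2.31', 'openpyxl']
```

An empty list means every dependency resolves to the same version on every build.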

.dockerignore

# Do not copy these into the Docker image
.env
.git
__pycache__
*.pyc
data/
reports/
*.db
.venv
venv

Step 2: The Dockerfile

# Use a specific Python version — never use :latest
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies (if needed for certain Python packages)
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements first (Docker layer caching)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

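# Copy pipeline code (a wildcard source needs a directory destination ending in /)
COPY *.py ./

# Create the non-root user and the directories it must be able to write to.
# Directories created by root would not be writable after the USER switch.
RUN useradd --create-home pipelineuser \
    && mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser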

# Default command
CMD ["python", "pipeline.py"]

Why This Order Matters

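Docker caches each layer and reuses it until that layer, or any layer before it, changes. Because requirements.txt is copied before your code, the dependency install stays cached when only the code changes. The rebuild restarts at the step that copies the .py files, which typically cuts build time from minutes to seconds.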
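
The invalidation rule can be illustrated with a toy model. This is a simplified sketch, not Docker's actual cache implementation: it assumes each layer's cache key is a hash of the parent key, the instruction, and the copied content, and the helper names are hypothetical.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per step; each key depends on all earlier steps."""
    keys = []
    parent = ""
    for instruction, content in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(digest)
        parent = digest
    return keys


def first_rebuilt_step(old_steps, new_steps):
    """Index of the first step whose cache key changed, or None if all are cached."""
    pairs = zip(layer_keys(old_steps), layer_keys(new_steps))
    for i, (old, new) in enumerate(pairs):
        if old != new:
            return i
    return None


before = [
    ("COPY requirements.txt .", "pandas==2.2.0"),
    ("RUN pip install -r requirements.txt", ""),
    ("COPY *.py ./", "print('v1')"),
]
# Only the pipeline code changes; requirements.txt is untouched.
after = before[:2] + [("COPY *.py ./", "print('v2')")]

print(first_rebuilt_step(before, after))  # → 2
```

In this model a code change invalidates only the last step, while a change to requirements.txt would invalidate step 0 and everything after it.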

Build and Run

# Build the image
docker build -t sales-pipeline .

# Run with environment variables
docker run --rm \
    -e API_KEY=your-key-here \
    -e DB_PATH=/app/data/reports.db \
    -v "$(pwd)/data:/app/data" \
    sales-pipeline

# Run with .env file
docker run --rm \
    --env-file .env \
    -v "$(pwd)/data:/app/data" \
    sales-pipeline
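
If you launch the container from another Python script (a test harness or a deploy script, for example), it can help to build the same command as an argument list. The following is a sketch under that assumption; `docker_run_args` is a hypothetical helper, and only the `docker run` flags shown above are used.

```python
import os
import shlex


def docker_run_args(image, data_dir, env_file=None, env=None):
    """Build the argument list for a one-shot `docker run` of the pipeline."""
    args = ["docker", "run", "--rm"]
    if env_file:
        args += ["--env-file", env_file]
    for key, value in (env or {}).items():
        args += ["-e", f"{key}={value}"]
    # Bind mounts need an absolute host path
    args += ["-v", f"{os.path.abspath(data_dir)}:/app/data"]
    args.append(image)
    return args


args = docker_run_args("sales-pipeline", "data", env_file=".env")
print(shlex.join(args))
# To execute it: subprocess.run(args, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems when paths or values contain spaces.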

Step 3: Environment Management

.env.example (Committed to Git)

# Copy this to .env and fill in real values
API_KEY=your-api-key-here
API_URL=https://api.example.com/v1
DB_PATH=/app/data/reports.db
SMTP_HOST=smtp.example.com
SMTP_PASSWORD=your-smtp-password
DB_PASSWORD=your-db-password
LOG_LEVEL=INFO

Reading Environment in Python

import os
from dotenv import load_dotenv

# Load .env in development, use system env vars in production
load_dotenv()

def get_config():
    """Load pipeline configuration from environment."""
    required = ["API_KEY", "API_URL"]
    missing = [var for var in required if not os.environ.get(var)]

    if missing:
        raise EnvironmentError(
            f"Missing required environment variables: {', '.join(missing)}\n"
            f"Copy .env.example to .env and fill in the values."
        )

    return {
        "api_key": os.environ["API_KEY"],
        "api_url": os.environ["API_URL"],
        "db_path": os.environ.get("DB_PATH", "/app/data/reports.db"),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }
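
A related check is whether a local .env has drifted from the committed template. Here is a minimal sketch, assuming simple KEY=value lines without quoting or `export` prefixes; both function names are hypothetical.

```python
def env_keys(text):
    """Extract variable names from KEY=value lines, skipping comments and blanks."""
    keys = set()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        keys.add(line.split("=", 1)[0].strip())
    return keys


def missing_from_env(example_text, env_text):
    """Names declared in .env.example but absent from .env, sorted."""
    return sorted(env_keys(example_text) - env_keys(env_text))


# Inline sample content; in practice read both files from disk
example = "# template\nAPI_KEY=your-api-key-here\nAPI_URL=https://api.example.com/v1\nLOG_LEVEL=INFO\n"
actual = "API_KEY=abc123\nLOG_LEVEL=DEBUG\n"
print(missing_from_env(example, actual))  # → ['API_URL']
```

Running this before `docker run --env-file .env` reports a missing variable before the container starts, not as a failure inside it.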

Step 4: Docker Compose

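For pipelines that need a database or multiple services, use Docker Compose. The ${DB_PASSWORD} reference in the file below is interpolated from the .env file in the project directory, so define it there.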

# docker-compose.yml

services:
  pipeline:
    build: .
    env_file: .env
    volumes:
      - ./data:/app/data
      - ./reports:/app/reports
    depends_on:
      db:
        condition: service_healthy
    networks:
      - pipeline-net

  db:
    image: postgres:16-alpine
    environment:
      POSTGRES_DB: pipeline
      POSTGRES_USER: pipeline
      POSTGRES_PASSWORD: ${DB_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U pipeline"]
      interval: 5s
      timeout: 3s
      retries: 5
    networks:
      - pipeline-net

volumes:
  pgdata:

networks:
  pipeline-net:

Running with Compose

# Start database and run pipeline
docker compose up

# Run the pipeline as a one-off container (starts db first if it is not already running)
docker compose run --rm pipeline

# Run with overridden command
docker compose run --rm pipeline python pipeline.py --days 7

# Tear down
docker compose down
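
Inside the Compose network the pipeline reaches PostgreSQL through the service name, not localhost. The sketch below builds a connection URL on that basis. It assumes DB_PASSWORD is present in the pipeline's environment; `DB_HOST` and `DB_PORT` are hypothetical override variables, and no database driver is included in requirements.txt, so this only constructs the URL.

```python
import os
from urllib.parse import quote


def database_url(env=None):
    """Build a PostgreSQL URL; the host defaults to the compose service name."""
    env = os.environ if env is None else env
    host = env.get("DB_HOST", "db")      # hypothetical override; "db" is the service name
    port = env.get("DB_PORT", "5432")    # hypothetical override; 5432 is the PostgreSQL default
    user = env.get("POSTGRES_USER", "pipeline")
    name = env.get("POSTGRES_DB", "pipeline")
    # Escape characters such as @ or / so they cannot break the URL structure
    password = quote(env["DB_PASSWORD"], safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


print(database_url({"DB_PASSWORD": "s3cr@t/pw"}))
# → postgresql://pipeline:s3cr%40t%2Fpw@db:5432/pipeline
```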

Step 5: Volume Mounts for Data Persistence

Data should live outside the container so it survives when the container is removed or recreated.

# docker-compose.yml — volume examples

services:
  pipeline:
    build: .
    volumes:
      # Source data (read-only)
      - ./data/input:/app/data/input:ro

      # Output data (read-write)
      - ./data/output:/app/data/output

      # Reports
      - ./reports:/app/reports

      # Logs
      - ./logs:/app/logs
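
In your pipeline, read and write through the mounted paths under /app:

import os
from datetime import datetime

import pandas as pd

# transform() and export_report() are this project's own helpers. The module
# names below are assumed from the project layout; adjust them to your code.
from transform import transform
from load import export_report

INPUT_DIR = os.environ.get("INPUT_DIR", "/app/data/input")
OUTPUT_DIR = os.environ.get("OUTPUT_DIR", "/app/data/output")
REPORT_DIR = os.environ.get("REPORT_DIR", "/app/reports")

def run_pipeline():
    """Pipeline using container-friendly paths."""
    # Read from mounted input
    df = pd.read_excel(os.path.join(INPUT_DIR, "sales.xlsx"))

    # Process
    result = transform(df)

    # Write to mounted output
    result.to_csv(os.path.join(OUTPUT_DIR, "clean_sales.csv"), index=False)

    # Generate a report with a timestamped filename
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    report_path = os.path.join(REPORT_DIR, f"report_{timestamp}.xlsx")
    export_report(result, report_path)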
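
Mount problems are easier to diagnose before any data is processed. As a sketch of a startup check, assuming the directory layout above (the function name `check_mounts` is hypothetical):

```python
import os
import tempfile


def check_mounts(readable, writable):
    """Return a list of problems found with the mounted directories."""
    problems = []
    for path in readable:
        if not os.path.isdir(path):
            problems.append(f"missing input directory: {path}")
    for path in writable:
        try:
            os.makedirs(path, exist_ok=True)
            # Creating a temp file proves the container user can really write here
            with tempfile.TemporaryFile(dir=path):
                pass
        except OSError as exc:
            problems.append(f"cannot write to {path}: {exc}")
    return problems


# Demonstration against a throwaway directory instead of /app
with tempfile.TemporaryDirectory() as base:
    out_dir = os.path.join(base, "output")
    print(check_mounts([base], [out_dir]))                        # no problems
    print(check_mounts([os.path.join(base, "nope")], [out_dir]))  # one missing input
```

In the pipeline you might call it as `check_mounts([INPUT_DIR], [OUTPUT_DIR, REPORT_DIR])` and stop early if the list is not empty.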

Step 6: Multi-Stage Builds (Production)

For production images, use multi-stage builds to reduce image size:

# Stage 1: Build dependencies
FROM python:3.11-slim AS builder

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: Runtime image (smaller)
FROM python:3.11-slim

WORKDIR /app

# Copy only installed packages from builder
COPY --from=builder /install /usr/local

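# Copy pipeline code (a wildcard source needs a directory destination ending in /)
COPY *.py ./

# Non-root user that owns the writable directories
RUN useradd --create-home pipelineuser \
    && mkdir -p /app/data /app/reports \
    && chown -R pipelineuser:pipelineuser /app/data /app/reports
USER pipelineuser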

CMD ["python", "pipeline.py"]

Image Size Comparison

| Approach | Image size |
| --- | --- |
| python:3.11 (full) | ~900 MB |
| python:3.11-slim | ~150 MB |
| Multi-stage slim | ~120 MB |
| python:3.11-alpine | ~50 MB (compatibility issues) |

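Use slim unless you have a specific reason to use alpine. Alpine ships musl instead of glibc, so Python packages that publish only glibc (manylinux) wheels must be compiled from source, which slows builds and can fail.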

Step 7: Scheduling Containers

Cron on Host

# Run the containerised pipeline every day at 6 AM (-T disables TTY allocation, since cron has no terminal)
0 6 * * * cd /path/to/project && docker compose run --rm -T pipeline >> logs/docker.log 2>&1

Docker-Native Scheduling

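# docker-compose.yml with the Ofelia scheduler
# job-exec runs a command inside an already-running container, so the
# pipeline service is kept alive here and Ofelia triggers each run.

services:
  scheduler:
    image: mcuadros/ofelia:latest   # pin a release tag in production
    command: daemon --docker
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    depends_on:
      - pipeline

  pipeline:
    build: .
    env_file: .env
    command: sleep infinity
    volumes:
      - ./data:/app/data
    labels:
      ofelia.enabled: "true"
      # Ofelia uses Go cron syntax: the first field is seconds
      ofelia.job-exec.pipeline.schedule: "0 0 6 * * *"
      ofelia.job-exec.pipeline.command: "python pipeline.py"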

Step 8: Logging in Containers

Write logs to stdout/stderr — Docker captures them automatically.

import logging
import os
import sys

def setup_logging(level=None):
    """Configure logging for container environment."""
    log_level = level or os.environ.get("LOG_LEVEL", "INFO")

    logging.basicConfig(
        level=getattr(logging, log_level.upper()),
        format="%(asctime)s [%(levelname)s] %(name)s: %(message)s",
        handlers=[logging.StreamHandler(sys.stdout)],
    )

    return logging.getLogger("pipeline")

logger = setup_logging()
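
One caveat: `getattr(logging, ...)` raises AttributeError if LOG_LEVEL holds an unknown name. A small sketch of a more forgiving lookup, where `resolve_level` is a hypothetical helper and not part of the logging module:

```python
import logging


def resolve_level(name, default=logging.INFO):
    """Map a level name such as 'debug' to its numeric value, or fall back."""
    level = logging.getLevelName(str(name).strip().upper())
    # getLevelName returns an int for known names and a string otherwise
    return level if isinstance(level, int) else default


print(resolve_level("debug"))    # → 10
print(resolve_level("verbose"))  # → 20 (unknown name, falls back to INFO)
```

Passing `level=resolve_level(log_level)` to `logging.basicConfig` keeps a typo in the environment from crashing the pipeline at startup.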

Viewing Logs

# Follow logs in real time
docker compose logs -f pipeline

# Last 100 lines
docker compose logs --tail 100 pipeline

# Logs with timestamps
docker compose logs -t pipeline

Step 9: Health Checks

# Add to Dockerfile (this placeholder only proves the Python interpreter starts)
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD python -c "import sys; sys.exit(0)" || exit 1

For long-running pipelines, write a health marker file that a check can inspect:

# health.py — simple health check
import os
from datetime import datetime

HEALTH_FILE = "/app/data/.health"

def update_health():
    """Update health marker after successful pipeline stage."""
    with open(HEALTH_FILE, "w") as f:
        f.write(datetime.now().isoformat())

def check_health(max_age_seconds=3600):
    """Check if pipeline has run recently."""
    if not os.path.exists(HEALTH_FILE):
        return False

    with open(HEALTH_FILE, "r") as f:
        last_run = datetime.fromisoformat(f.read().strip())

    age = (datetime.now() - last_run).total_seconds()
    return age < max_age_seconds
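
To make the marker useful to Docker, the check has to become an exit code. Here is a sketch of that idea, written as a standalone variant of the functions above so it can run on its own; the names `is_healthy` and `main` are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timedelta


def is_healthy(path, max_age_seconds=3600, now=None):
    """True if the marker file exists and holds a recent ISO timestamp."""
    now = now or datetime.now()
    try:
        with open(path) as f:
            last_run = datetime.fromisoformat(f.read().strip())
    except (OSError, ValueError):
        return False  # missing, unreadable, or corrupt marker counts as unhealthy
    return (now - last_run).total_seconds() < max_age_seconds


def main(path):
    """Exit-code helper: 0 when healthy, 1 otherwise."""
    return 0 if is_healthy(path) else 1


# Demonstration with a temporary marker file
with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, ".health")
    print(main(marker))  # no marker yet → 1
    with open(marker, "w") as f:
        f.write(datetime.now().isoformat())
    print(main(marker))  # fresh marker → 0
    with open(marker, "w") as f:
        f.write((datetime.now() - timedelta(hours=2)).isoformat())
    print(main(marker))  # stale marker → 1
```

Saved as a script that ends with `sys.exit(main("/app/data/.health"))`, this could replace the placeholder command in the HEALTHCHECK instruction.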

Common Docker Patterns for Pipelines

| Pattern | Use case | Example |
| --- | --- | --- |
| One-shot container | Scheduled pipeline runs | docker compose run --rm pipeline |
| Long-running service | Worker processing queue | docker compose up -d pipeline |
| Sidecar database | Pipeline needs PostgreSQL | Compose with depends_on + healthcheck |
| Mounted volumes | Input/output data persistence | -v ./data:/app/data |
| Multi-stage build | Smaller production images | Builder → runtime stages |
| Non-root user | Security | RUN useradd ... + USER ... |

What This Replaces

| Local approach | Docker equivalent |
| --- | --- |
| pip install on the server | requirements.txt built into image |
| “Works on my machine” | Identical container on every machine |
| Python version mismatch | Specific Python version in FROM |
| Manual server setup | docker compose up |
| Log files on disk | docker compose logs |
| Cannot reproduce a bug | Exact same image in dev and production |

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| ModuleNotFoundError | Missing from requirements.txt | Add to requirements.txt, rebuild |
| Permission denied on volume | Container user ≠ host user | Fix ownership or use same UID |
| Container exits immediately | Script finishes (expected for one-shot) | Use docker compose run, not up |
| Slow builds | Dependencies re-install every build | COPY requirements.txt before code |
| Large image size | Using full Python image | Switch to python:3.11-slim |
| Cannot connect to database | Wrong hostname | Use compose service name as hostname |
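
For the volume permission case, it helps to see which UID the container process runs as and who owns the mounted path. The following is a diagnostic sketch for Linux containers; `ownership_report` is a hypothetical helper.

```python
import os
import tempfile


def ownership_report(path):
    """Compare the current process UID with the owner of a mounted path."""
    st = os.stat(path)
    uid = os.getuid()
    return {
        "path": path,
        "process_uid": uid,
        "owner_uid": st.st_uid,
        "match": uid == st.st_uid,
        "writable": os.access(path, os.W_OK),
    }


# Demonstration on a directory the current user just created
with tempfile.TemporaryDirectory() as tmp:
    report = ownership_report(tmp)
    print(report["match"], report["writable"])
```

If the UIDs differ for /app/data, one option is to start the container as the host user with `docker run --user "$(id -u):$(id -g)"`; another is to change ownership of the host directory.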

Next Steps

Start by containerising your most important pipeline: the one that breaks when deployed to a new server. The Dockerfile pattern above covers most Python data pipelines. Add Docker Compose when you need a database or multiple services.

For building the pipelines to containerise, see How to Design Data Pipelines for Reliable Reporting. For orchestrating containerised pipelines with dependencies, see How to Schedule and Orchestrate Workflows with Prefect.

