SQL for Data Engineers: Window Functions, CTEs, and Query Optimization

9 min read · Data & Dashboards

Go beyond SELECT * — master window functions, CTEs, and query optimization techniques that turn slow, convoluted SQL into fast, readable queries for reporting pipelines.


Most data pipeline SQL starts simple: SELECT * FROM orders WHERE created_at > '2026-01-01'. Then requirements grow. You need running totals. You need rankings within groups. You need to compare each row to the previous one.

Without the right SQL patterns, queries become nested subqueries three levels deep — slow, unreadable, and impossible to debug. Window functions and CTEs solve this by expressing complex logic cleanly and efficiently.

This guide covers practical SQL patterns for data engineering and reporting. Every example uses real scenarios: sales analysis, funnel tracking, and pipeline monitoring.

# The Query Architecture

mermaid
flowchart LR
  R["Raw Tables"] --> CTE["CTEs\n(staging)"]
  CTE --> W["Window Functions\n(analysis)"]
  W --> A["Aggregation\n(summary)"]
  A --> O["Output\n(report)"]
  CTE -.-> I["Indexes\n(performance)"]

CTEs break complex queries into readable stages. Window functions compute analytics without collapsing rows. Indexes make it all fast.

# Window Functions: The Core Skill

A window function performs a calculation across a set of rows related to the current row — without grouping them into one output row. This is the key difference from GROUP BY.

# Syntax

sql
function_name() OVER (
    PARTITION BY column    -- group rows
    ORDER BY column        -- order within group
    ROWS BETWEEN ...       -- optional: define the window frame
)
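
The GROUP BY contrast is easy to verify on a toy table. Here is a minimal sketch using Python's built-in sqlite3 (schema and values are made up for illustration; window functions need an SQLite build of 3.25 or newer, which modern Python bundles):

```python
import sqlite3

# Toy data: three orders across two customers (illustrative values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, revenue REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(101, 100.0), (101, 50.0), (102, 75.0)])

# GROUP BY collapses each customer's orders into a single summary row.
grouped = conn.execute(
    "SELECT customer_id, SUM(revenue) FROM orders GROUP BY customer_id"
).fetchall()

# The window function computes the same per-customer total
# but keeps every order row.
windowed = conn.execute(
    "SELECT customer_id, revenue, "
    "SUM(revenue) OVER (PARTITION BY customer_id) AS customer_total "
    "FROM orders"
).fetchall()

print(len(grouped), len(windowed))  # 2 3
```

Two summary rows from GROUP BY versus three detail rows from the window version, each detail row carrying its customer's total.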

# ROW_NUMBER: Rank Within Groups

Find the top order per customer:

sql
-- Top order by revenue for each customer
SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY revenue DESC
    ) AS rank_by_revenue
FROM orders;
text
 customer_id | order_id | order_date | revenue | rank_by_revenue
-------------+----------+------------+---------+-----------------
         101 |     1015 | 2026-03-20 | 4500.00 |               1
         101 |     1003 | 2026-03-05 | 2200.00 |               2
         101 |     1008 | 2026-03-12 |  850.00 |               3
         102 |     1022 | 2026-03-25 | 3100.00 |               1
         102 |     1011 | 2026-03-15 | 1650.00 |               2

To get only the top order per customer:

sql
WITH ranked AS (
    SELECT
        customer_id,
        order_id,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS rn
    FROM orders
)
SELECT customer_id, order_id, revenue
FROM ranked
WHERE rn = 1;
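
You can sanity-check the rn = 1 filter against the sample rows from the table above. A quick sqlite3 sketch (toy schema, same illustrative values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INT, order_id INT, revenue REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", [
    (101, 1015, 4500.0), (101, 1003, 2200.0), (101, 1008, 850.0),
    (102, 1022, 3100.0), (102, 1011, 1650.0),
])

# Rank orders within each customer, then keep only rank 1.
top_orders = conn.execute("""
    WITH ranked AS (
        SELECT customer_id, order_id, revenue,
               ROW_NUMBER() OVER (
                   PARTITION BY customer_id ORDER BY revenue DESC
               ) AS rn
        FROM orders
    )
    SELECT customer_id, order_id, revenue FROM ranked WHERE rn = 1
    ORDER BY customer_id
""").fetchall()

print(top_orders)  # [(101, 1015, 4500.0), (102, 1022, 3100.0)]
```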

# LAG and LEAD: Compare Adjacent Rows

Calculate day-over-day revenue change:

sql
SELECT
    order_date,
    daily_revenue,
    LAG(daily_revenue) OVER (ORDER BY order_date) AS prev_day,
    daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date) AS change,
    ROUND(
        (daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date))
        / NULLIF(LAG(daily_revenue) OVER (ORDER BY order_date), 0)
        * 100, 1
    ) AS change_pct
FROM (
    SELECT
        DATE(created_at) AS order_date,
        SUM(revenue) AS daily_revenue
    FROM orders
    GROUP BY DATE(created_at)
) daily
ORDER BY order_date;
text
 order_date | daily_revenue | prev_day |  change | change_pct
------------+---------------+----------+---------+------------
 2026-03-01 |       4520.00 |     NULL |    NULL |       NULL
 2026-03-02 |       3890.00 |  4520.00 | -630.00 |      -13.9
 2026-03-03 |       5210.00 |  3890.00 | 1320.00 |       33.9
 2026-03-04 |       4750.00 |  5210.00 | -460.00 |       -8.8
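
The same day-over-day arithmetic can be checked in sqlite3 using the first three daily totals from the table above (toy table name, same values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily (order_date TEXT, daily_revenue REAL)")
conn.executemany("INSERT INTO daily VALUES (?, ?)", [
    ("2026-03-01", 4520.0), ("2026-03-02", 3890.0), ("2026-03-03", 5210.0),
])

rows = conn.execute("""
    SELECT order_date, daily_revenue,
           daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date)
               AS change,
           ROUND((daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date))
                 / NULLIF(LAG(daily_revenue) OVER (ORDER BY order_date), 0)
                 * 100, 1) AS change_pct
    FROM daily
    ORDER BY order_date
""").fetchall()

# The first row has no previous day, so LAG yields NULL (None in Python).
changes = [r[2] for r in rows]
print(changes)  # [None, -630.0, 1320.0]
```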

# Running Totals with SUM OVER

sql
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7_day_avg
FROM daily_sales
ORDER BY order_date;
text
 order_date | daily_revenue | cumulative_revenue | rolling_7_day_avg
------------+---------------+--------------------+-------------------
 2026-03-01 |       4520.00 |            4520.00 |           4520.00
 2026-03-02 |       3890.00 |            8410.00 |           4205.00
 2026-03-03 |       5210.00 |           13620.00 |           4540.00
 2026-03-04 |       4750.00 |           18370.00 |           4592.50
 2026-03-05 |       4080.00 |           22450.00 |           4490.00
 2026-03-06 |       5590.00 |           28040.00 |           4673.33
 2026-03-07 |       3920.00 |           31960.00 |           4565.71
 2026-03-08 |       4650.00 |           36610.00 |           4584.29
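
A quick check of the frame arithmetic on the first four days from the table above, again via sqlite3 (toy table, same values):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (order_date TEXT, daily_revenue REAL)")
conn.executemany("INSERT INTO daily_sales VALUES (?, ?)", [
    ("2026-03-01", 4520.0), ("2026-03-02", 3890.0),
    ("2026-03-03", 5210.0), ("2026-03-04", 4750.0),
])

rows = conn.execute("""
    SELECT order_date,
           SUM(daily_revenue) OVER (
               ORDER BY order_date
               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
           ) AS cumulative,
           AVG(daily_revenue) OVER (
               ORDER BY order_date
               ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
           ) AS rolling_avg
    FROM daily_sales
    ORDER BY order_date
""").fetchall()

cumulative = [r[1] for r in rows]
print(cumulative)  # [4520.0, 8410.0, 13620.0, 18370.0]
```

With fewer than seven rows, the 6 PRECEDING frame simply averages everything seen so far, which is why the early rolling values match a plain running average.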

# NTILE: Distribute Into Buckets

Segment customers into quartiles by total spend:

sql
SELECT
    customer_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile,
    CASE NTILE(4) OVER (ORDER BY total_spend DESC)
        WHEN 1 THEN 'Top 25%'
        WHEN 2 THEN 'Upper Mid'
        WHEN 3 THEN 'Lower Mid'
        WHEN 4 THEN 'Bottom 25%'
    END AS segment
FROM (
    SELECT customer_id, SUM(revenue) AS total_spend
    FROM orders
    GROUP BY customer_id
) customer_totals;
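
NTILE's even-split behavior is easiest to see with a round number of rows. A sketch with eight hypothetical customers in sqlite3:

```python
import sqlite3

# Eight made-up customers with descending spend totals.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_totals (customer_id INT, total_spend REAL)")
conn.executemany("INSERT INTO customer_totals VALUES (?, ?)",
                 [(i, s) for i, s in enumerate(
                     [900.0, 800.0, 700.0, 600.0,
                      500.0, 400.0, 300.0, 200.0], start=1)])

quartiles = conn.execute("""
    SELECT customer_id, total_spend,
           NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile
    FROM customer_totals
    ORDER BY total_spend DESC
""").fetchall()

# Eight rows split evenly: two per bucket, top spenders in quartile 1.
buckets = [q for _, _, q in quartiles]
print(buckets)  # [1, 1, 2, 2, 3, 3, 4, 4]
```

When the row count doesn't divide evenly, NTILE puts the extra rows in the earlier buckets, so bucket sizes differ by at most one.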

# Common Table Expressions (CTEs)

CTEs let you break complex queries into named stages. Each stage is a readable SELECT that feeds into the next.

# Basic CTE Pattern

sql
WITH
-- Stage 1: Get raw daily data
daily_metrics AS (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE(created_at)
),
-- Stage 2: Add rolling averages
with_trends AS (
    SELECT
        day,
        order_count,
        total_revenue,
        unique_customers,
        AVG(total_revenue) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS revenue_7d_avg,
        AVG(order_count) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS orders_7d_avg
    FROM daily_metrics
),
-- Stage 3: Flag anomalies
flagged AS (
    SELECT
        *,
        CASE
            WHEN total_revenue < revenue_7d_avg * 0.5 THEN 'LOW'
            WHEN total_revenue > revenue_7d_avg * 1.5 THEN 'HIGH'
            ELSE 'NORMAL'
        END AS revenue_flag
    FROM with_trends
)
-- Final output
SELECT * FROM flagged ORDER BY day;

Each CTE is independently testable. You can run just the first one to verify your daily metrics. Then add the next stage.

# Recursive CTEs

Generate a date series (useful when your data has gaps):

sql
WITH RECURSIVE date_series AS (
    -- Base case: start date
    SELECT DATE '2026-03-01' AS day

    UNION ALL

    -- Recursive case: add one day (Postgres: date + integer = date)
    SELECT day + 1
    FROM date_series
    WHERE day < DATE '2026-03-31'
)
SELECT
    ds.day,
    COALESCE(o.order_count, 0) AS order_count,
    COALESCE(o.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS revenue
    FROM orders
    GROUP BY DATE(created_at)
) o ON ds.day = o.day
ORDER BY ds.day;

This guarantees every date appears in your output — even days with zero orders.
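
The same gap-filling pattern runs in sqlite3, where date arithmetic is spelled DATE(day, '+1 day') instead. A toy run with orders on only two days of a five-day range:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (created_at TEXT, revenue REAL)")
# Orders exist on only two of the five days: the series fills the gaps.
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [("2026-03-01", 100.0), ("2026-03-03", 250.0)])

rows = conn.execute("""
    WITH RECURSIVE date_series AS (
        SELECT '2026-03-01' AS day
        UNION ALL
        SELECT DATE(day, '+1 day') FROM date_series
        WHERE day < '2026-03-05'
    )
    SELECT ds.day, COALESCE(o.revenue, 0) AS revenue
    FROM date_series ds
    LEFT JOIN (
        SELECT DATE(created_at) AS day, SUM(revenue) AS revenue
        FROM orders GROUP BY DATE(created_at)
    ) o ON ds.day = o.day
    ORDER BY ds.day
""").fetchall()

# Five rows, one per calendar day, zero-filled where no orders exist.
print(rows)
```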

# CTE for Funnel Analysis

sql
WITH
sessions AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'page_view'
),
product_views AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'product_view'
),
add_to_cart AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'add_to_cart'
),
checkouts AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'checkout_started'
),
purchases AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'purchase'
),
funnel AS (
    SELECT 'Sessions' AS stage, 1 AS step, total FROM sessions
    UNION ALL SELECT 'Product Views', 2, total FROM product_views
    UNION ALL SELECT 'Add to Cart', 3, total FROM add_to_cart
    UNION ALL SELECT 'Checkout', 4, total FROM checkouts
    UNION ALL SELECT 'Purchase', 5, total FROM purchases
)
SELECT
    stage,
    total,
    ROUND(total * 100.0 / FIRST_VALUE(total) OVER (ORDER BY step), 1) AS conversion_pct,
    LAG(total) OVER (ORDER BY step) - total AS drop_off
FROM funnel
ORDER BY step;
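
A toy check of the final SELECT, with hypothetical stage totals standing in for the per-stage CTEs (drop-off is expressed here as previous-stage total minus current, so losses read as positive counts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

rows = conn.execute("""
    WITH funnel(stage, step, total) AS (
        VALUES ('Sessions', 1, 1000), ('Product Views', 2, 600),
               ('Add to Cart', 3, 300), ('Checkout', 4, 150),
               ('Purchase', 5, 100)
    )
    SELECT stage, total,
           -- each stage as a percentage of the first stage
           ROUND(total * 100.0 / FIRST_VALUE(total) OVER (ORDER BY step), 1)
               AS conversion_pct,
           -- sessions lost since the previous stage
           LAG(total) OVER (ORDER BY step) - total AS drop_off
    FROM funnel ORDER BY step
""").fetchall()

print(rows[-1])  # ('Purchase', 100, 10.0, 50)
```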

# Query Optimization

Fast queries on large tables require understanding how the database actually executes your SQL.

# Reading EXPLAIN ANALYZE

sql
EXPLAIN ANALYZE
SELECT customer_id, SUM(revenue) AS total
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;
text
Limit  (cost=1250.42..1250.45 rows=10 width=40) (actual time=45.2..45.3 rows=10)
  ->  Sort  (cost=1250.42..1255.12 rows=1880 width=40) (actual time=45.2..45.2 rows=10)
        Sort Key: (sum(revenue)) DESC
        ->  HashAggregate  (cost=1180.00..1200.00 rows=1880 width=40) (actual time=44.1..44.5 rows=1880)
              ->  Seq Scan on orders  (cost=0..980.00 rows=40000 width=16) (actual time=0.01..12.3 rows=38500)
                    Filter: (created_at >= '2026-01-01')
                    Rows Removed by Filter: 11500

mermaid
flowchart TD
  S["Seq Scan: orders\n40K rows → 38.5K\n(12.3ms)"] --> H["HashAggregate\n38.5K → 1.8K groups\n(44.1ms)"]
  H --> SO["Sort by total DESC\n(45.2ms)"]
  SO --> L["Limit 10\n(45.3ms)"]

The bottleneck: a sequential scan reading every row. An index on created_at would fix this.

# Index Strategy

sql
-- Speed up date range filters
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- Speed up customer lookups with date range
CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at);

-- Speed up the exact query pattern above
CREATE INDEX idx_orders_date_covering ON orders (created_at)
    INCLUDE (customer_id, revenue);

| Index type | Best for | Example |
|---|---|---|
| Single column | Filters on one column | WHERE created_at > ... |
| Composite | Filters on multiple columns | WHERE customer_id = ... AND created_at > ... |
| Covering | Avoiding table lookups | Index includes all columns needed by the query |
| Partial | Filtering a subset | CREATE INDEX ... WHERE status = 'active' |

# After Adding the Index

text
Limit  (cost=250.42..250.45 rows=10 width=40) (actual time=5.2..5.3 rows=10)
  ->  Sort  (cost=250.42..255.12 rows=1880 width=40) (actual time=5.2..5.2 rows=10)
        ->  HashAggregate  (cost=180.00..200.00 rows=1880 width=40) (actual time=4.1..4.5 rows=1880)
              ->  Index Scan on idx_orders_created_at  (cost=0..120.00 rows=38500 width=16) (actual time=0.01..2.1 rows=38500)

From 45ms to 5ms. On a table with millions of rows, the difference would be seconds to milliseconds.
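
The plans above are Postgres, but the scan-to-index shift is portable. SQLite's EXPLAIN QUERY PLAN shows the same shape, and the table can even stay empty because the planner only reports the chosen access path (toy schema; a single-column select so the index is covering):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (customer_id INT, revenue REAL, created_at TEXT)")

query = "SELECT created_at FROM orders WHERE created_at >= '2026-01-01'"

def plan(sql):
    # Each EXPLAIN QUERY PLAN row ends with a human-readable detail string.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

before = plan(query)  # full scan: detail reads like "SCAN orders"
conn.execute("CREATE INDEX idx_orders_created_at ON orders (created_at)")
after = plan(query)   # bounded SEARCH via the covering index

print(before, after)
```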

# Common Optimization Patterns

sql
-- Slow: function on indexed column prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-03-15';

-- Fast: range query uses the index
SELECT * FROM orders
WHERE created_at >= '2026-03-15' AND created_at < '2026-03-16';

-- Slow: OR can prevent index use
SELECT * FROM orders WHERE region = 'North' OR region = 'South';

-- Fast: IN uses the index efficiently
SELECT * FROM orders WHERE region IN ('North', 'South');

-- Slow: SELECT * fetches unused columns
SELECT * FROM orders WHERE customer_id = 101;

-- Fast: select only what you need
SELECT order_id, revenue, created_at FROM orders WHERE customer_id = 101;

-- Slow: correlated subquery runs per row
SELECT *,
    (SELECT SUM(revenue) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS total
FROM orders o1;

-- Fast: window function computes in one pass
SELECT *,
    SUM(revenue) OVER (PARTITION BY customer_id) AS total
FROM orders;
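
The first pattern, a function call defeating the index, can be demonstrated directly. A sqlite3 sketch with toy timestamps chosen so both predicates must return the same rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (created_at TEXT)")
conn.execute("CREATE INDEX idx_orders_created_at ON orders (created_at)")
conn.executemany("INSERT INTO orders VALUES (?)", [
    ("2026-03-14 23:59:00",),
    ("2026-03-15 10:00:00",),
    ("2026-03-16 00:00:00",),
])

def plan(sql):
    return " ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

slow = "SELECT created_at FROM orders WHERE DATE(created_at) = '2026-03-15'"
fast = ("SELECT created_at FROM orders "
        "WHERE created_at >= '2026-03-15' AND created_at < '2026-03-16'")

# Same answer either way...
assert conn.execute(slow).fetchall() == conn.execute(fast).fetchall()

# ...but wrapping the column in DATE() forces the planner to visit every
# row (SCAN), while the raw range predicate gets a bounded SEARCH.
print("SCAN" in plan(slow), "SEARCH" in plan(fast))  # True True
```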

# Real-World Reporting Queries

# Monthly Revenue with Year-over-Year Comparison

sql
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(revenue) AS revenue,
        COUNT(*) AS orders
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    revenue,
    orders,
    LAG(revenue, 12) OVER (ORDER BY month) AS revenue_last_year,
    ROUND(
        (revenue - LAG(revenue, 12) OVER (ORDER BY month))
        / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0)
        * 100, 1
    ) AS yoy_growth_pct
FROM monthly
ORDER BY month DESC
LIMIT 12;

# Customer Cohort Retention

sql
WITH first_purchase AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),
activity AS (
    SELECT
        fp.cohort_month,
        DATE_TRUNC('month', o.created_at) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY fp.cohort_month, DATE_TRUNC('month', o.created_at)
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM first_purchase
    GROUP BY cohort_month
)
SELECT
    a.cohort_month,
    cs.cohort_size,
    a.activity_month,
    a.active_customers,
    ROUND(a.active_customers * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM activity a
JOIN cohort_sizes cs ON a.cohort_month = cs.cohort_month
ORDER BY a.cohort_month, a.activity_month;
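
The cohort logic can be traced on a two-customer toy dataset in sqlite3, with strftime('%Y-%m', ...) standing in for Postgres's DATE_TRUNC: one customer returns in month two, one churns, so January's cohort retains 100% then 50%:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INT, created_at TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [
    (1, "2026-01-05"), (1, "2026-02-10"),  # retained into month 2
    (2, "2026-01-20"),                     # churned after month 1
])

rows = conn.execute("""
    WITH first_purchase AS (
        SELECT customer_id,
               strftime('%Y-%m', MIN(created_at)) AS cohort_month
        FROM orders GROUP BY customer_id
    ),
    activity AS (
        SELECT fp.cohort_month,
               strftime('%Y-%m', o.created_at) AS activity_month,
               COUNT(DISTINCT o.customer_id) AS active_customers
        FROM orders o
        JOIN first_purchase fp ON o.customer_id = fp.customer_id
        GROUP BY fp.cohort_month, strftime('%Y-%m', o.created_at)
    ),
    cohort_sizes AS (
        SELECT cohort_month, COUNT(*) AS cohort_size
        FROM first_purchase GROUP BY cohort_month
    )
    SELECT a.cohort_month, a.activity_month,
           ROUND(a.active_customers * 100.0 / cs.cohort_size, 1)
               AS retention_pct
    FROM activity a
    JOIN cohort_sizes cs ON a.cohort_month = cs.cohort_month
    ORDER BY a.activity_month
""").fetchall()

print(rows)  # [('2026-01', '2026-01', 100.0), ('2026-01', '2026-02', 50.0)]
```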

# What This Replaces

| Old approach | New approach |
|---|---|
| Nested subqueries three levels deep | CTEs with named stages |
| GROUP BY losing row detail | Window functions that keep rows |
| Correlated subqueries (slow) | Single-pass window aggregation |
| Guessing why queries are slow | EXPLAIN ANALYZE + index strategy |
| Python post-processing for rankings | ROW_NUMBER() / RANK() in SQL |
| Multiple queries combined in Python | One CTE query returning everything |

# Pattern Reference

| Pattern | Function | Use case |
|---|---|---|
| Ranking | ROW_NUMBER(), RANK(), DENSE_RANK() | Top-N per group |
| Compare previous | LAG(), LEAD() | Day-over-day, period-over-period |
| Running total | SUM() OVER (ORDER BY ...) | Cumulative metrics |
| Moving average | AVG() OVER (ROWS BETWEEN n PRECEDING ...) | Trend smoothing |
| Percentile bucket | NTILE() | Customer segmentation |
| First/last value | FIRST_VALUE(), LAST_VALUE() | Within-group extremes |
| Staged logic | CTEs (WITH ... AS) | Break complex queries into steps |
| Fill date gaps | Recursive CTE + LEFT JOIN | Continuous time series |

# Next Steps

Start with CTEs to restructure your most complex query. The readability improvement alone is worth it. Then add window functions for any logic that currently requires subqueries or Python post-processing.

For building the Python pipelines that consume these SQL queries, see How to Design Data Pipelines for Reliable Reporting. For building dashboards from the output, see How to Build a Data Dashboard Without Manual Excel Work.

Data analytics services include designing and optimizing the SQL queries that power reporting systems.

Get in touch to discuss improving your reporting queries.

