SQL for Data Engineers: Window Functions, CTEs, and Query Optimization

· 10 min read · Data & Dashboards

Go beyond SELECT * — master window functions, CTEs, and query optimization techniques that turn slow, convoluted SQL into fast, readable queries for reporting pipelines.

SQL for Data Engineers: Window Functions, CTEs, and Query Optimization

Most data pipeline SQL starts simple: SELECT * FROM orders WHERE date > '2026-01-01'. Then requirements grow. You need running totals. You need rankings within groups. You need to compare each row to the previous one.

Without the right SQL patterns, queries become nested subqueries three levels deep — slow, unreadable, and impossible to debug. Window functions and CTEs solve this by expressing complex logic cleanly and efficiently.

This guide covers practical SQL patterns for data engineering and reporting. Every example uses real scenarios: sales analysis, funnel tracking, and pipeline monitoring.

Who This Is For

  • Analysts writing queries that have become unreadable nests of subqueries
  • Data engineers who need performant SQL for pipelines that process millions of rows
  • Vibe coders building data tools who want to push more logic into SQL instead of Python
  • Backend developers who know basic SQL but have never used window functions or CTEs

You need basic SQL knowledge (SELECT, WHERE, GROUP BY, JOIN). This guide builds on those fundamentals with patterns that handle running totals, rankings, period comparisons, and performance optimisation.

The Query Architecture

CTEs break complex queries into readable stages. Window functions compute analytics without collapsing rows. Indexes make it all fast.

Window Functions: The Core Skill

A window function performs a calculation across a set of rows related to the current row — without grouping them into one output row. This is the key difference from GROUP BY.

Syntax

function_name() OVER (
    PARTITION BY column    -- group rows
    ORDER BY column        -- order within group
    ROWS BETWEEN ...       -- optional: define the window frame
)

ROW_NUMBER: Rank Within Groups

Find the top order per customer:

-- Top order by revenue for each customer
SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY revenue DESC
    ) AS rank_by_revenue
FROM orders;
 customer_id | order_id | order_date | revenue | rank_by_revenue
-------------+----------+------------+---------+-----------------
         101 |     1015 | 2026-03-20 | 4500.00 |               1
         101 |     1003 | 2026-03-05 | 2200.00 |               2
         101 |     1008 | 2026-03-12 |  850.00 |               3
         102 |     1022 | 2026-03-25 | 3100.00 |               1
         102 |     1011 | 2026-03-15 | 1650.00 |               2

To get only the top order per customer:

WITH ranked AS (
    SELECT
        customer_id,
        order_id,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS rn
    FROM orders
)
SELECT customer_id, order_id, revenue
FROM ranked
WHERE rn = 1;

LAG and LEAD: Compare Adjacent Rows

Calculate day-over-day revenue change:

SELECT
    order_date,
    daily_revenue,
    LAG(daily_revenue) OVER (ORDER BY order_date) AS prev_day,
    daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date) AS change,
    ROUND(
        (daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date))
        / NULLIF(LAG(daily_revenue) OVER (ORDER BY order_date), 0)
        * 100, 1
    ) AS change_pct
FROM (
    SELECT
        DATE(created_at) AS order_date,
        SUM(revenue) AS daily_revenue
    FROM orders
    GROUP BY DATE(created_at)
) daily
ORDER BY order_date;
 order_date | daily_revenue | prev_day |  change | change_pct
------------+---------------+----------+---------+------------
 2026-03-01 |       4520.00 |     NULL |    NULL |       NULL
 2026-03-02 |       3890.00 |  4520.00 | -630.00 |      -13.9
 2026-03-03 |       5210.00 |  3890.00 | 1320.00 |       33.9
 2026-03-04 |       4750.00 |  5210.00 | -460.00 |       -8.8

Running Totals with SUM OVER

SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7_day_avg
FROM daily_sales
ORDER BY order_date;
 order_date | daily_revenue | cumulative_revenue | rolling_7_day_avg
------------+---------------+--------------------+-------------------
 2026-03-01 |       4520.00 |            4520.00 |           4520.00
 2026-03-02 |       3890.00 |            8410.00 |           4205.00
 2026-03-03 |       5210.00 |           13620.00 |           4540.00
 2026-03-04 |       4750.00 |           18370.00 |           4592.50
 2026-03-05 |       4080.00 |           22450.00 |           4490.00
 2026-03-06 |       5590.00 |           28040.00 |           4673.33
 2026-03-07 |       3920.00 |           31960.00 |           4565.71
 2026-03-08 |       4650.00 |           36610.00 |           4584.29

NTILE: Distribute Into Buckets

Segment customers into quartiles by total spend:

SELECT
    customer_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile,
    CASE NTILE(4) OVER (ORDER BY total_spend DESC)
        WHEN 1 THEN 'Top 25%'
        WHEN 2 THEN 'Upper Mid'
        WHEN 3 THEN 'Lower Mid'
        WHEN 4 THEN 'Bottom 25%'
    END AS segment
FROM (
    SELECT customer_id, SUM(revenue) AS total_spend
    FROM orders
    GROUP BY customer_id
) customer_totals;

Common Table Expressions (CTEs)

CTEs let you break complex queries into named stages. Each stage is a readable SELECT that feeds into the next.

Basic CTE Pattern

WITH
-- Stage 1: Get raw daily data
daily_metrics AS (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE(created_at)
),
-- Stage 2: Add rolling averages
with_trends AS (
    SELECT
        day,
        order_count,
        total_revenue,
        unique_customers,
        AVG(total_revenue) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS revenue_7d_avg,
        AVG(order_count) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS orders_7d_avg
    FROM daily_metrics
),
-- Stage 3: Flag anomalies
flagged AS (
    SELECT
        *,
        CASE
            WHEN total_revenue < revenue_7d_avg * 0.5 THEN 'LOW'
            WHEN total_revenue > revenue_7d_avg * 1.5 THEN 'HIGH'
            ELSE 'NORMAL'
        END AS revenue_flag
    FROM with_trends
)
-- Final output
SELECT * FROM flagged ORDER BY day;

Each CTE is independently testable. You can run just the first one to verify your daily metrics. Then add the next stage.

Recursive CTEs

Generate a date series (useful when your data has gaps):

WITH RECURSIVE date_series AS (
    -- Base case: start date
    SELECT DATE('2026-03-01') AS day

    UNION ALL

    -- Recursive case: add one day
    SELECT DATE(day, '+1 day')
    FROM date_series
    WHERE day < DATE('2026-03-31')
)
SELECT
    ds.day,
    COALESCE(o.order_count, 0) AS order_count,
    COALESCE(o.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS revenue
    FROM orders
    GROUP BY DATE(created_at)
) o ON ds.day = o.day
ORDER BY ds.day;

This guarantees every date appears in your output — even days with zero orders.

CTE for Funnel Analysis

WITH
sessions AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'page_view'
),
product_views AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'product_view'
),
add_to_cart AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'add_to_cart'
),
checkouts AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'checkout_started'
),
purchases AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'purchase'
),
funnel AS (
    SELECT 'Sessions' AS stage, 1 AS step, total FROM sessions
    UNION ALL SELECT 'Product Views', 2, total FROM product_views
    UNION ALL SELECT 'Add to Cart', 3, total FROM add_to_cart
    UNION ALL SELECT 'Checkout', 4, total FROM checkouts
    UNION ALL SELECT 'Purchase', 5, total FROM purchases
)
SELECT
    stage,
    total,
    ROUND(total * 100.0 / FIRST_VALUE(total) OVER (ORDER BY step), 1) AS conversion_pct,
    total - LAG(total) OVER (ORDER BY step) AS drop_off
FROM funnel
ORDER BY step;

Query Optimization

Fast queries on large tables require understanding how the database actually executes your SQL.

Reading EXPLAIN ANALYZE

EXPLAIN ANALYZE
SELECT customer_id, SUM(revenue) AS total
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;
Limit  (cost=1250.42..1250.45 rows=10 width=40) (actual time=45.2..45.3 rows=10)
  ->  Sort  (cost=1250.42..1255.12 rows=1880 width=40) (actual time=45.2..45.2 rows=10)
        Sort Key: (sum(revenue)) DESC
        ->  HashAggregate  (cost=1180.00..1200.00 rows=1880 width=40) (actual time=44.1..44.5 rows=1880)
              ->  Seq Scan on orders  (cost=0..980.00 rows=40000 width=16) (actual time=0.01..12.3 rows=38500)
                    Filter: (created_at >= '2026-01-01')
                    Rows Removed by Filter: 11500

The bottleneck: a sequential scan reading every row. An index on created_at would fix this.

Index Strategy

-- Speed up date range filters
CREATE INDEX idx_orders_created_at ON orders (created_at);

-- Speed up customer lookups with date range
CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at);

-- Speed up the exact query pattern above
CREATE INDEX idx_orders_date_covering ON orders (created_at)
    INCLUDE (customer_id, revenue);
Index typeBest forExample
Single columnFilters on one columnWHERE created_at > ...
CompositeFilters on multiple columnsWHERE customer_id = ... AND created_at > ...
CoveringAvoiding table lookupsIndex includes all columns needed by the query
PartialFiltering a subsetCREATE INDEX ... WHERE status = 'active'

After Adding the Index

Limit  (cost=250.42..250.45 rows=10 width=40) (actual time=5.2..5.3 rows=10)
  ->  Sort  (cost=250.42..255.12 rows=1880 width=40) (actual time=5.2..5.2 rows=10)
        ->  HashAggregate  (cost=180.00..200.00 rows=1880 width=40) (actual time=4.1..4.5 rows=1880)
              ->  Index Scan on idx_orders_created_at  (cost=0..120.00 rows=38500 width=16) (actual time=0.01..2.1 rows=38500)

From 45ms to 5ms. On a table with millions of rows, the difference would be seconds to milliseconds.

Common Optimization Patterns

-- Slow: function on indexed column prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-03-15';

-- Fast: range query uses the index
SELECT * FROM orders
WHERE created_at >= '2026-03-15' AND created_at < '2026-03-16';

-- Slow: OR can prevent index use
SELECT * FROM orders WHERE region = 'North' OR region = 'South';

-- Fast: IN uses the index efficiently
SELECT * FROM orders WHERE region IN ('North', 'South');

-- Slow: SELECT * fetches unused columns
SELECT * FROM orders WHERE customer_id = 101;

-- Fast: select only what you need
SELECT order_id, revenue, created_at FROM orders WHERE customer_id = 101;

-- Slow: correlated subquery runs per row
SELECT *,
    (SELECT SUM(revenue) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS total
FROM orders o1;

-- Fast: window function computes in one pass
SELECT *,
    SUM(revenue) OVER (PARTITION BY customer_id) AS total
FROM orders;

Real-World Reporting Queries

Monthly Revenue with Year-over-Year Comparison

WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(revenue) AS revenue,
        COUNT(*) AS orders
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    revenue,
    orders,
    LAG(revenue, 12) OVER (ORDER BY month) AS revenue_last_year,
    ROUND(
        (revenue - LAG(revenue, 12) OVER (ORDER BY month))
        / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0)
        * 100, 1
    ) AS yoy_growth_pct
FROM monthly
ORDER BY month DESC
LIMIT 12;

Customer Cohort Retention

WITH first_purchase AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),
activity AS (
    SELECT
        fp.cohort_month,
        DATE_TRUNC('month', o.created_at) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY fp.cohort_month, DATE_TRUNC('month', o.created_at)
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM first_purchase
    GROUP BY cohort_month
)
SELECT
    a.cohort_month,
    cs.cohort_size,
    a.activity_month,
    a.active_customers,
    ROUND(a.active_customers * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM activity a
JOIN cohort_sizes cs ON a.cohort_month = cs.cohort_month
ORDER BY a.cohort_month, a.activity_month;

What This Replaces

Old approachNew approach
Nested subqueries 3 levels deepCTEs with named stages
GROUP BY losing row detailWindow functions that keep rows
Correlated subqueries (slow)Single-pass window aggregation
Guessing why queries are slowEXPLAIN ANALYZE + index strategy
Python post-processing for rankingsROW_NUMBER() / RANK() in SQL
Multiple queries combined in PythonOne CTE query returning everything

Pattern Reference

PatternFunctionUse case
RankingROW_NUMBER(), RANK(), DENSE_RANK()Top-N per group
Compare previousLAG(), LEAD()Day-over-day, period-over-period
Running totalSUM() OVER (ORDER BY ...)Cumulative metrics
Moving averageAVG() OVER (ROWS BETWEEN n PRECEDING ...)Trend smoothing
Percentile bucketNTILE()Customer segmentation
First/last valueFIRST_VALUE(), LAST_VALUE()Within-group extremes
Staged logicCTEs (WITH ... AS)Break complex queries into steps
Fill date gapsRecursive CTE + LEFT JOINContinuous time series

Next Steps

Start with CTEs to restructure your most complex query. The readability improvement alone is worth it. Then add window functions for any logic that currently requires subqueries or Python post-processing.

For building the Python pipelines that consume these SQL queries, see How to Design Data Pipelines for Reliable Reporting. For building dashboards from the output, see How to Build a Data Dashboard Without Manual Excel Work.

Data analytics services include designing and optimizing the SQL queries that power reporting systems.

Get in touch to discuss improving your reporting queries.

sql window functions common table expressions sql sql query optimization sql for data engineers sql running total rank partition sql recursive cte sql sql explain analyze sql index strategy sql reporting queries

Enjoyed this article?

Get notified when I publish new articles on automation, ecommerce, and data engineering.

Get in touch

Related Articles