SQL for Data Engineers: Window Functions, CTEs, and Query Optimization
Go beyond SELECT * — master window functions, CTEs, and query optimization techniques that turn slow, convoluted SQL into fast, readable queries for reporting pipelines.

Most data pipeline SQL starts simple: SELECT * FROM orders WHERE date > '2026-01-01'. Then requirements grow. You need running totals. You need rankings within groups. You need to compare each row to the previous one.
Without the right SQL patterns, queries become nested subqueries three levels deep — slow, unreadable, and impossible to debug. Window functions and CTEs solve this by expressing complex logic cleanly and efficiently.
This guide covers practical SQL patterns for data engineering and reporting. Every example uses real scenarios: sales analysis, funnel tracking, and pipeline monitoring.
# The Query Architecture
flowchart LR
    R["Raw Tables"] --> CTE["CTEs\n(staging)"]
    CTE --> W["Window Functions\n(analysis)"]
    W --> A["Aggregation\n(summary)"]
    A --> O["Output\n(report)"]
    CTE -.-> I["Indexes\n(performance)"]
CTEs break complex queries into readable stages. Window functions compute analytics without collapsing rows. Indexes make it all fast.
# Window Functions: The Core Skill
A window function performs a calculation across a set of rows related to the current row — without grouping them into one output row. This is the key difference from GROUP BY.
# Syntax
function_name() OVER (
    PARTITION BY column    -- group rows
    ORDER BY column        -- order within group
    ROWS BETWEEN ...       -- optional: define the window frame
)
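The difference from GROUP BY is easiest to see side by side. A minimal sketch against the same orders table used throughout this guide:

```sql
-- GROUP BY collapses each customer's orders into one row
SELECT customer_id, SUM(revenue) AS total
FROM orders
GROUP BY customer_id;

-- A window function keeps every order row and attaches the total
SELECT
    customer_id,
    order_id,
    revenue,
    SUM(revenue) OVER (PARTITION BY customer_id) AS customer_total
FROM orders;
```

The second query returns one row per order, each carrying its customer's total alongside the order-level detail.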
# ROW_NUMBER: Rank Within Groups
Find the top order per customer:
-- Top order by revenue for each customer
SELECT
    customer_id,
    order_id,
    order_date,
    revenue,
    ROW_NUMBER() OVER (
        PARTITION BY customer_id
        ORDER BY revenue DESC
    ) AS rank_by_revenue
FROM orders;
 customer_id | order_id | order_date | revenue | rank_by_revenue
-------------+----------+------------+---------+-----------------
         101 |     1015 | 2026-03-20 | 4500.00 |               1
         101 |     1003 | 2026-03-05 | 2200.00 |               2
         101 |     1008 | 2026-03-12 |  850.00 |               3
         102 |     1022 | 2026-03-25 | 3100.00 |               1
         102 |     1011 | 2026-03-15 | 1650.00 |               2
To get only the top order per customer:
WITH ranked AS (
    SELECT
        customer_id,
        order_id,
        revenue,
        ROW_NUMBER() OVER (
            PARTITION BY customer_id
            ORDER BY revenue DESC
        ) AS rn
    FROM orders
)
SELECT customer_id, order_id, revenue
FROM ranked
WHERE rn = 1;
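One caveat: ROW_NUMBER breaks ties arbitrarily, so if two orders share the top revenue, one of them is silently dropped. When ties should share a rank, swap in RANK or DENSE_RANK:

```sql
-- RANK gives tied rows the same rank and skips the next one (1, 1, 3);
-- DENSE_RANK gives tied rows the same rank with no gaps (1, 1, 2)
SELECT
    customer_id,
    order_id,
    revenue,
    RANK()       OVER (PARTITION BY customer_id ORDER BY revenue DESC) AS rnk,
    DENSE_RANK() OVER (PARTITION BY customer_id ORDER BY revenue DESC) AS dense_rnk
FROM orders;
```

Filtering on `rnk = 1` then returns all tied top orders instead of an arbitrary one.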
# LAG and LEAD: Compare Adjacent Rows
Calculate day-over-day revenue change:
SELECT
    order_date,
    daily_revenue,
    LAG(daily_revenue) OVER (ORDER BY order_date) AS prev_day,
    daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date) AS change,
    ROUND(
        (daily_revenue - LAG(daily_revenue) OVER (ORDER BY order_date))
        / NULLIF(LAG(daily_revenue) OVER (ORDER BY order_date), 0)
        * 100, 1
    ) AS change_pct
FROM (
    SELECT
        DATE(created_at) AS order_date,
        SUM(revenue) AS daily_revenue
    FROM orders
    GROUP BY DATE(created_at)
) daily
ORDER BY order_date;
 order_date | daily_revenue | prev_day |  change | change_pct
------------+---------------+----------+---------+------------
 2026-03-01 |       4520.00 |     NULL |    NULL |       NULL
 2026-03-02 |       3890.00 |  4520.00 | -630.00 |      -13.9
 2026-03-03 |       5210.00 |  3890.00 | 1320.00 |       33.9
 2026-03-04 |       4750.00 |  5210.00 | -460.00 |       -8.8
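LEAD is the mirror image: it looks forward instead of backward. A sketch, assuming the same orders table with a DATE-typed order_date column, computing how long each customer waits before ordering again:

```sql
-- Days between each order and the same customer's next order
-- (NULL for the customer's most recent order)
SELECT
    customer_id,
    order_date,
    LEAD(order_date) OVER (
        PARTITION BY customer_id
        ORDER BY order_date
    ) - order_date AS days_to_next_order
FROM orders;
```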
# Running Totals with SUM OVER
SELECT
    order_date,
    daily_revenue,
    SUM(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS cumulative_revenue,
    AVG(daily_revenue) OVER (
        ORDER BY order_date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS rolling_7_day_avg
FROM daily_sales
ORDER BY order_date;
 order_date | daily_revenue | cumulative_revenue | rolling_7_day_avg
------------+---------------+--------------------+-------------------
 2026-03-01 |       4520.00 |            4520.00 |           4520.00
 2026-03-02 |       3890.00 |            8410.00 |           4205.00
 2026-03-03 |       5210.00 |           13620.00 |           4540.00
 2026-03-04 |       4750.00 |           18370.00 |           4592.50
 2026-03-05 |       4080.00 |           22450.00 |           4490.00
 2026-03-06 |       5590.00 |           28040.00 |           4673.33
 2026-03-07 |       3920.00 |           31960.00 |           4565.71
 2026-03-08 |       4650.00 |           36610.00 |           4584.29
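The explicit ROWS frame is worth spelling out: with an ORDER BY but no frame clause, the SQL-standard default is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which treats rows with equal sort keys as peers and includes all of them in the frame. A fragment showing the contrast (same table as above):

```sql
-- Default RANGE frame: rows sharing an order_date are peers, so
-- duplicates all get the same cumulative total
SUM(daily_revenue) OVER (ORDER BY order_date) AS cumulative_range

-- Explicit ROWS frame: advances strictly one row at a time
SUM(daily_revenue) OVER (
    ORDER BY order_date
    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_rows
```

On a table with one row per date the two are identical, but on duplicate sort keys they diverge, so stating ROWS explicitly is the safer habit.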
# NTILE: Distribute Into Buckets
Segment customers into quartiles by total spend:
SELECT
    customer_id,
    total_spend,
    NTILE(4) OVER (ORDER BY total_spend DESC) AS spend_quartile,
    CASE NTILE(4) OVER (ORDER BY total_spend DESC)
        WHEN 1 THEN 'Top 25%'
        WHEN 2 THEN 'Upper Mid'
        WHEN 3 THEN 'Lower Mid'
        WHEN 4 THEN 'Bottom 25%'
    END AS segment
FROM (
    SELECT customer_id, SUM(revenue) AS total_spend
    FROM orders
    GROUP BY customer_id
) customer_totals;
# Common Table Expressions (CTEs)
CTEs let you break complex queries into named stages. Each stage is a readable SELECT that feeds into the next.
# Basic CTE Pattern
WITH
-- Stage 1: Get raw daily data
daily_metrics AS (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE(created_at)
),
-- Stage 2: Add rolling averages
with_trends AS (
    SELECT
        day,
        order_count,
        total_revenue,
        unique_customers,
        AVG(total_revenue) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS revenue_7d_avg,
        AVG(order_count) OVER (
            ORDER BY day ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
        ) AS orders_7d_avg
    FROM daily_metrics
),
-- Stage 3: Flag anomalies
flagged AS (
    SELECT
        *,
        CASE
            WHEN total_revenue < revenue_7d_avg * 0.5 THEN 'LOW'
            WHEN total_revenue > revenue_7d_avg * 1.5 THEN 'HIGH'
            ELSE 'NORMAL'
        END AS revenue_flag
    FROM with_trends
)
-- Final output
SELECT * FROM flagged ORDER BY day;
Each CTE is independently testable. You can run just the first one to verify your daily metrics. Then add the next stage.
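To test a stage in isolation, keep the WITH clause and point the final SELECT at it. For example, to eyeball just the first stage before layering on the trend logic:

```sql
-- Verify the daily aggregation on its own before adding later stages
WITH daily_metrics AS (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS total_revenue,
        COUNT(DISTINCT customer_id) AS unique_customers
    FROM orders
    WHERE created_at >= CURRENT_DATE - INTERVAL '30 days'
    GROUP BY DATE(created_at)
)
SELECT * FROM daily_metrics ORDER BY day LIMIT 10;
```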
# Recursive CTEs
Generate a date series (useful when your data has gaps):
WITH RECURSIVE date_series AS (
    -- Base case: start date
    SELECT DATE '2026-03-01' AS day
    UNION ALL
    -- Recursive case: add one day (date + integer in PostgreSQL)
    SELECT day + 1
    FROM date_series
    WHERE day < DATE '2026-03-31'
)
SELECT
    ds.day,
    COALESCE(o.order_count, 0) AS order_count,
    COALESCE(o.revenue, 0) AS revenue
FROM date_series ds
LEFT JOIN (
    SELECT
        DATE(created_at) AS day,
        COUNT(*) AS order_count,
        SUM(revenue) AS revenue
    FROM orders
    GROUP BY DATE(created_at)
) o ON ds.day = o.day
ORDER BY ds.day;
This guarantees every date appears in your output — even days with zero orders.
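If you are on PostgreSQL specifically, generate_series does the same job with less ceremony; a sketch of the series half of the query:

```sql
-- Postgres-only alternative to the recursive CTE:
-- generate_series emits one timestamp per day, cast back to date
SELECT gs::date AS day
FROM generate_series(
    DATE '2026-03-01',
    DATE '2026-03-31',
    INTERVAL '1 day'
) AS gs;
```

The recursive CTE version remains the portable choice across databases that support WITH RECURSIVE.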
# CTE for Funnel Analysis
WITH
sessions AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'page_view'
),
product_views AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'product_view'
),
add_to_cart AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'add_to_cart'
),
checkouts AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'checkout_started'
),
purchases AS (
    SELECT COUNT(DISTINCT session_id) AS total
    FROM events WHERE event_type = 'purchase'
),
funnel AS (
    SELECT 'Sessions' AS stage, 1 AS step, total FROM sessions
    UNION ALL SELECT 'Product Views', 2, total FROM product_views
    UNION ALL SELECT 'Add to Cart', 3, total FROM add_to_cart
    UNION ALL SELECT 'Checkout', 4, total FROM checkouts
    UNION ALL SELECT 'Purchase', 5, total FROM purchases
)
SELECT
    stage,
    total,
    ROUND(total * 100.0 / FIRST_VALUE(total) OVER (ORDER BY step), 1) AS conversion_pct,
    total - LAG(total) OVER (ORDER BY step) AS drop_off
FROM funnel
ORDER BY step;
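Five separate CTEs mean five scans of the events table. On a large table, conditional aggregation produces the same five counts in a single pass; a sketch:

```sql
-- One scan of events instead of five: CASE without ELSE yields NULL,
-- and COUNT(DISTINCT ...) ignores NULLs
SELECT
    COUNT(DISTINCT CASE WHEN event_type = 'page_view'        THEN session_id END) AS sessions,
    COUNT(DISTINCT CASE WHEN event_type = 'product_view'     THEN session_id END) AS product_views,
    COUNT(DISTINCT CASE WHEN event_type = 'add_to_cart'      THEN session_id END) AS add_to_cart,
    COUNT(DISTINCT CASE WHEN event_type = 'checkout_started' THEN session_id END) AS checkouts,
    COUNT(DISTINCT CASE WHEN event_type = 'purchase'         THEN session_id END) AS purchases
FROM events;
```

This returns the counts as columns rather than rows; keep the CTE version when you want the row-per-stage shape for the conversion and drop-off calculations.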
# Query Optimization
Fast queries on large tables require understanding how the database actually executes your SQL.
# Reading EXPLAIN ANALYZE
EXPLAIN ANALYZE
SELECT customer_id, SUM(revenue) AS total
FROM orders
WHERE created_at >= '2026-01-01'
GROUP BY customer_id
ORDER BY total DESC
LIMIT 10;
Limit  (cost=1250.42..1250.45 rows=10 width=40) (actual time=45.2..45.3 rows=10)
  ->  Sort  (cost=1250.42..1255.12 rows=1880 width=40) (actual time=45.2..45.2 rows=10)
        Sort Key: (sum(revenue)) DESC
        ->  HashAggregate  (cost=1180.00..1200.00 rows=1880 width=40) (actual time=44.1..44.5 rows=1880)
              ->  Seq Scan on orders  (cost=0.00..980.00 rows=40000 width=16) (actual time=0.01..12.3 rows=38500)
                    Filter: (created_at >= '2026-01-01')
                    Rows Removed by Filter: 11500
flowchart TD
    S["Seq Scan: orders\n40K rows → 38.5K\n(12.3ms)"] --> H["HashAggregate\n38.5K → 1.8K groups\n(44.1ms)"]
    H --> SO["Sort by total DESC\n(45.2ms)"]
    SO --> L["Limit 10\n(45.3ms)"]
The bottleneck: a sequential scan reading every row. An index on created_at would fix this.
# Index Strategy
-- Speed up date range filters
CREATE INDEX idx_orders_created_at ON orders (created_at);
-- Speed up customer lookups with date range
CREATE INDEX idx_orders_customer_date ON orders (customer_id, created_at);
-- Speed up the exact query pattern above
CREATE INDEX idx_orders_date_covering ON orders (created_at)
    INCLUDE (customer_id, revenue);
| Index type | Best for | Example |
|---|---|---|
| Single column | Filters on one column | WHERE created_at > ... |
| Composite | Filters on multiple columns | WHERE customer_id = ... AND created_at > ... |
| Covering | Avoiding table lookups | Index includes all columns needed by the query |
| Partial | Filtering a subset | CREATE INDEX ... WHERE status = 'active' |
# After Adding the Index
Limit  (cost=250.42..250.45 rows=10 width=40) (actual time=5.2..5.3 rows=10)
  ->  Sort  (cost=250.42..255.12 rows=1880 width=40) (actual time=5.2..5.2 rows=10)
        ->  HashAggregate  (cost=180.00..200.00 rows=1880 width=40) (actual time=4.1..4.5 rows=1880)
              ->  Index Scan using idx_orders_created_at on orders  (cost=0.00..120.00 rows=38500 width=16) (actual time=0.01..2.1 rows=38500)
From 45ms to 5ms. On a table with millions of rows, the difference would be seconds to milliseconds.
# Common Optimization Patterns
-- Slow: function on indexed column prevents index use
SELECT * FROM orders WHERE DATE(created_at) = '2026-03-15';
-- Fast: range query uses the index
SELECT * FROM orders
WHERE created_at >= '2026-03-15' AND created_at < '2026-03-16';
-- Sometimes slow: OR conditions can prevent index use,
-- especially when they span different columns
SELECT * FROM orders WHERE region = 'North' OR region = 'South';
-- Clearer, and at least as index-friendly: IN
SELECT * FROM orders WHERE region IN ('North', 'South');
-- Slow: SELECT * fetches unused columns
SELECT * FROM orders WHERE customer_id = 101;
-- Fast: select only what you need
SELECT order_id, revenue, created_at FROM orders WHERE customer_id = 101;
-- Slow: correlated subquery runs per row
SELECT *,
    (SELECT SUM(revenue) FROM orders o2 WHERE o2.customer_id = o1.customer_id) AS total
FROM orders o1;
-- Fast: window function computes in one pass
SELECT *,
    SUM(revenue) OVER (PARTITION BY customer_id) AS total
FROM orders;
# Real-World Reporting Queries
# Monthly Revenue with Year-over-Year Comparison
WITH monthly AS (
    SELECT
        DATE_TRUNC('month', created_at) AS month,
        SUM(revenue) AS revenue,
        COUNT(*) AS orders
    FROM orders
    GROUP BY DATE_TRUNC('month', created_at)
)
SELECT
    month,
    revenue,
    orders,
    LAG(revenue, 12) OVER (ORDER BY month) AS revenue_last_year,
    ROUND(
        (revenue - LAG(revenue, 12) OVER (ORDER BY month))
        / NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0)
        * 100, 1
    ) AS yoy_growth_pct
FROM monthly
ORDER BY month DESC
LIMIT 12;
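One assumption hides in LAG(revenue, 12): it counts rows, not months, so a single missing month shifts every later comparison to the wrong year. If gaps are possible, a self-join on an explicit 12-month offset is safer; a sketch reusing the monthly CTE above:

```sql
-- Robust to missing months: match on an explicit calendar offset
-- instead of a 12-row offset
SELECT
    m.month,
    m.revenue,
    prev.revenue AS revenue_last_year
FROM monthly m
LEFT JOIN monthly prev
    ON prev.month = m.month - INTERVAL '12 months';
```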
# Customer Cohort Retention
WITH first_purchase AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', MIN(created_at)) AS cohort_month
    FROM orders
    GROUP BY customer_id
),
activity AS (
    SELECT
        fp.cohort_month,
        DATE_TRUNC('month', o.created_at) AS activity_month,
        COUNT(DISTINCT o.customer_id) AS active_customers
    FROM orders o
    JOIN first_purchase fp ON o.customer_id = fp.customer_id
    GROUP BY fp.cohort_month, DATE_TRUNC('month', o.created_at)
),
cohort_sizes AS (
    SELECT cohort_month, COUNT(*) AS cohort_size
    FROM first_purchase
    GROUP BY cohort_month
)
SELECT
    a.cohort_month,
    cs.cohort_size,
    a.activity_month,
    a.active_customers,
    ROUND(a.active_customers * 100.0 / cs.cohort_size, 1) AS retention_pct
FROM activity a
JOIN cohort_sizes cs ON a.cohort_month = cs.cohort_month
ORDER BY a.cohort_month, a.activity_month;
# What This Replaces
| Old approach | New approach |
|---|---|
| Nested subqueries 3 levels deep | CTEs with named stages |
| GROUP BY losing row detail | Window functions that keep rows |
| Correlated subqueries (slow) | Single-pass window aggregation |
| Guessing why queries are slow | EXPLAIN ANALYZE + index strategy |
| Python post-processing for rankings | ROW_NUMBER() / RANK() in SQL |
| Multiple queries combined in Python | One CTE query returning everything |
# Pattern Reference
| Pattern | Function | Use case |
|---|---|---|
| Ranking | ROW_NUMBER(), RANK(), DENSE_RANK() | Top-N per group |
| Compare previous | LAG(), LEAD() | Day-over-day, period-over-period |
| Running total | SUM() OVER (ORDER BY ...) | Cumulative metrics |
| Moving average | AVG() OVER (ROWS BETWEEN n PRECEDING ...) | Trend smoothing |
| Percentile bucket | NTILE() | Customer segmentation |
| First/last value | FIRST_VALUE(), LAST_VALUE() | Within-group extremes |
| Staged logic | CTEs (WITH ... AS) | Break complex queries into steps |
| Fill date gaps | Recursive CTE + LEFT JOIN | Continuous time series |
# Next Steps
Start with CTEs to restructure your most complex query. The readability improvement alone is worth it. Then add window functions for any logic that currently requires subqueries or Python post-processing.
For building the Python pipelines that consume these SQL queries, see How to Design Data Pipelines for Reliable Reporting. For building dashboards from the output, see How to Build a Data Dashboard Without Manual Excel Work.
Data analytics services include designing and optimizing the SQL queries that power reporting systems.
Get in touch to discuss improving your reporting queries.