Data Engineering Cheatsheet

A single-page recall reference for data engineering interviews: SQL, Python, Spark, Airflow, dbt, Kafka, schema design, and pipeline patterns.


SQL · Python · Spark · Airflow · dbt · Kafka · Schema · Numbers


This is a recall aid, not a tutorial. If you have never seen window functions, this page will not teach them. If you have used them and need to recall the syntax under pressure, this page is for you. For deeper teaching, follow the link to a structured lesson.

SQL

Window functions

SELECT
  user_id,
  event_time,
  ROW_NUMBER()    OVER (PARTITION BY user_id ORDER BY event_time)         AS rn,
  RANK()          OVER (PARTITION BY user_id ORDER BY event_time)         AS rk,
  DENSE_RANK()    OVER (PARTITION BY user_id ORDER BY event_time)         AS drk,
  LAG(event_time)  OVER (PARTITION BY user_id ORDER BY event_time)        AS prev_t,
  LEAD(event_time) OVER (PARTITION BY user_id ORDER BY event_time)        AS next_t,
  SUM(amount)      OVER (PARTITION BY user_id ORDER BY event_time
                         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum,
  AVG(amount)      OVER (PARTITION BY user_id ORDER BY event_time
                         ROWS BETWEEN 6 PRECEDING AND CURRENT ROW)         AS rolling_7
FROM events;

Traps:

  • ROWS counts physical rows. RANGE counts values. If days can be missing, use RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW.
  • RANK leaves gaps. DENSE_RANK does not. ROW_NUMBER is unique.
  • LAG and LEAD accept a default for the boundary: LAG(x, 1, 0).
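The first trap is easy to see in plain Python (a toy sketch, not a SQL engine): a 7-row window and a 7-day window disagree as soon as days are missing.

```python
from datetime import date, timedelta

rows = [(date(2024, 1, 1), 10), (date(2024, 1, 2), 20),
        (date(2024, 1, 9), 30)]  # note the 6-day gap before Jan 9

def rolling_rows(data, n):
    # ROWS BETWEEN n-1 PRECEDING AND CURRENT ROW: counts physical rows
    return [sum(v for _, v in data[max(0, i - n + 1):i + 1])
            for i in range(len(data))]

def rolling_range_days(data, days):
    # RANGE over a day interval: counts calendar days, not rows
    return [sum(v for t, v in data if d - timedelta(days=days - 1) <= t <= d)
            for d, _ in data]

print(rolling_rows(rows, 7))        # [10, 30, 60]: all three rows fit in 7 rows
print(rolling_range_days(rows, 7))  # [10, 30, 30]: Jan 9's window excludes Jan 1-2
```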

Lesson: window functions advanced. Drill: datadriven.io/sql-window-functions-practice.

Joins

| Type | Use when | Trap |
| --- | --- | --- |
| INNER | Both sides must match | Silently drops rows |
| LEFT | Keep all left rows | Right-side fan-out blows up the result |
| FULL OUTER | Keep everything | Often a sign you should have unioned |
| CROSS | Cartesian product | Almost always a mistake |
| LATERAL | Reference left from a right-side subquery | Postgres and Snowflake syntax differs |
| Semi (EXISTS) | Filter, do not project | Faster than IN for large lists |
| Anti (NOT EXISTS) | Filter out matches | NOT IN is NULL-unsafe |

Lesson: joins advanced.

NULL semantics

SELECT NULL = NULL;            -- NULL, not TRUE
SELECT NULL <> NULL;           -- NULL, not FALSE
SELECT NULL IS NULL;           -- TRUE
WHERE col IN (1, 2, NULL);     -- TRUE if col=1 or 2, NULL otherwise
WHERE col NOT IN (1, 2, NULL); -- always NULL, filters out everything

Use IS DISTINCT FROM and IS NOT DISTINCT FROM for NULL safe equality.
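The `NOT IN` trap follows from three-valued logic, which can be simulated in Python (illustrative only; Python's `None` does not natively behave like SQL NULL):

```python
# Toy three-valued IN, using None to stand in for SQL NULL / UNKNOWN
def sql_in(value, items):
    if any(value == x for x in items if x is not None):
        return True                              # definite match
    return None if None in items else False      # UNKNOWN if a NULL lurks

def sql_not_in(value, items):
    r = sql_in(value, items)
    return None if r is None else not r          # NOT UNKNOWN is still UNKNOWN

print(sql_in(1, [1, 2, None]))      # True -> row kept
print(sql_not_in(3, [1, 2, None]))  # None -> UNKNOWN, row filtered out
```

A `WHERE` clause keeps only rows that evaluate to TRUE, so UNKNOWN rows vanish, which is why `NOT IN` with a NULL in the list filters out everything.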

Aggregation patterns

-- Conditional count
SELECT COUNT(*) FILTER (WHERE status = 'success') AS successes
FROM requests;

-- Conditional aggregation (the universal pivot)
SELECT
  user_id,
  SUM(CASE WHEN event = 'view'  THEN 1 ELSE 0 END) AS views,
  SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks
FROM events
GROUP BY user_id;

-- Approximate distinct count for big data
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;

Lesson: aggregating advanced.

Date and time

DATE_TRUNC('day', ts)
DATEDIFF('day', start_ts, end_ts)
DATE_TRUNC('month', CURRENT_DATE)
DATE_TRUNC('week', event_time)

Most "weekly counts off by one" bugs are timezone bugs. TIMESTAMP and TIMESTAMPTZ are not the same. Reference: datadriven.io/sql-tutorial.
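The off-by-one mechanics, sketched in Python with a fixed offset (avoiding any tzdata dependency): the same instant truncates to different day buckets in different zones.

```python
from datetime import datetime, timezone, timedelta

# 03:00 UTC on Jan 1 is still Dec 31 in UTC-5, so truncating to 'day'
# in different zones puts the event in different buckets
ts = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
est = timezone(timedelta(hours=-5))  # fixed offset stand-in for a local zone

utc_day = ts.date()
local_day = ts.astimezone(est).date()
print(utc_day, local_day)  # 2024-01-01 2023-12-31
```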

Performance reasoning

Mention these in any "why is this slow" question:

  1. Predicate pushdown. Filters applied early.
  2. Partition pruning. Partition column in the filter.
  3. Join order. Smaller side on the build side of a hash join.
  4. Skew. One key dominating distribution.
  5. Spill to disk. Memory pressure during sort or aggregate.
  6. Sort keys (Redshift), clustering keys (Snowflake, BigQuery).

Reference: datadriven.io/sql-query-optimization.

Python

Data wrangling patterns

import itertools
from collections import defaultdict, Counter

# Chunk an iterable into batches of n
def batched(iterable, n):
    it = iter(iterable)
    while chunk := list(itertools.islice(it, n)):
        yield chunk

# Group by key
grouped = defaultdict(list)
for record in records:
    grouped[record["key"]].append(record)

# Top N with Counter
top10 = Counter(items).most_common(10)

# Hash partitioning (note: hash() of str is salted per process;
# use zlib.crc32 or hashlib for a hash that is stable across runs)
def partition(record, n_partitions):
    return hash(record["id"]) % n_partitions

# Interval merging
def merge(intervals):
    if not intervals:
        return []
    intervals.sort()
    out = [intervals[0]]
    for start, end in intervals[1:]:
        if start <= out[-1][1]:
            out[-1] = (out[-1][0], max(end, out[-1][1]))
        else:
            out.append((start, end))
    return out

# Sessionization with inactivity gap
def sessionize(events, gap_seconds):
    events.sort(key=lambda e: e["t"])
    last_t, sid = None, 0
    for e in events:
        if last_t is None or e["t"] - last_t > gap_seconds:
            sid += 1
        e["session_id"] = sid
        last_t = e["t"]
    return events

Lesson: collections advanced.

Complexity cheatsheet

| Operation | List | Dict | Set |
| --- | --- | --- | --- |
| Lookup | O(n) search, O(1) by index | O(1) | O(1) |
| Insert | O(1) append (amortized) | O(1) | O(1) |
| Delete | O(n) | O(1) | O(1) |
| Membership (`in`) | O(n) | O(1) | O(1) |

Nested loops are usually flattenable with a dict or set lookup.
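For example, a membership check inside a loop drops from O(n·m) to O(n + m) once the inner list becomes a set:

```python
orders = [{"id": 1, "user": "a"}, {"id": 2, "user": "b"}]
churned = [{"user": "b"}, {"user": "c"}]

# slow: rescans churned for every order, O(n * m)
slow = [o for o in orders if any(c["user"] == o["user"] for c in churned)]

# fast: build the set once, then O(1) membership checks, O(n + m)
churned_users = {c["user"] for c in churned}
fast = [o for o in orders if o["user"] in churned_users]

assert slow == fast  # same result, linear instead of quadratic
```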

Generators

import json

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def parse(lines):
    for line in lines:
        yield json.loads(line)

def filter_recent(records, since):
    for r in records:
        if r["ts"] >= since:
            yield r

# pipeline composition, constant memory
for r in filter_recent(parse(read_lines("events.jsonl")), since):
    process(r)

Spark

Spark is lazy. Transformations build a DAG. Actions trigger execution. Catalyst optimizes the DAG before any work happens.

Narrow vs wide

| Narrow (no shuffle) | Wide (causes shuffle) |
| --- | --- |
| filter, map, select, withColumn | groupBy, join, distinct, repartition, orderBy |

Performance tactics

  1. Predicate pushdown. Filter early, especially on partitioned columns.
  2. Column pruning. Read only what you need.
  3. Broadcast join when one side is under ~10 MB: broadcast(small_df).
  4. Salt high frequency keys to fix skew.
  5. Enable Adaptive Query Execution (AQE).
  6. Target 128 to 512 MB output files.
  7. Z order (Delta) or cluster (Iceberg) on common filter columns.

If your data fits on one machine, use DuckDB or pandas. Spark has cold start cost.
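Salting (tactic 4) can be sketched without a cluster: append a random suffix so one logical hot key spreads across several partitions. The join's other side must then be exploded with all salt variants; the bucket count here is an arbitrary illustration.

```python
import random
from collections import Counter

random.seed(0)

# 90% of rows share one hot key: a single shuffle partition gets hammered
keys = ["hot"] * 900 + [f"k{i}" for i in range(100)]

N_SALTS = 8
def salt(key):
    # append a random salt so one logical key spreads over N_SALTS buckets
    return f"{key}#{random.randrange(N_SALTS)}"

spread = Counter(salt(k) for k in keys if k == "hot")
print(len(spread))  # the hot key now lands on 8 distinct salted keys
```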

Airflow

from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    "my_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 * * * *",
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 3},
) as dag:
    extract   = PythonOperator(task_id="extract",   python_callable=extract_fn)
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load      = PythonOperator(task_id="load",      python_callable=load_fn)
    extract >> transform >> load

Vocabulary

DAG, task, operator, sensor, XCom, backfill, catchup, pool, TaskGroup, variable, connection, idempotency.

Common questions

  1. Backfill 6 months of data? airflow dags backfill with bounded date range and parallelism cap. Confirm idempotency first.
  2. DAG running long? Check task duration in the UI, pool saturation, parallelism settings, executor health.
  3. Avoid scheduling drift? Use the schedule argument (cron expression or timetable), not start_date arithmetic. Set catchup=False.
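The idempotency point in question 1 is worth making concrete. A toy in-memory sketch of the delete-then-insert (overwrite-by-partition) pattern: a retried run converges to the same state instead of duplicating rows.

```python
warehouse = {}  # partition date -> rows (stand-in for a warehouse table)

def load_partition(ds, rows):
    # overwrite the whole partition, never append: safe to retry
    warehouse[ds] = list(rows)

load_partition("2024-01-01", [1, 2, 3])
load_partition("2024-01-01", [1, 2, 3])  # retried backfill run
print(warehouse["2024-01-01"])  # [1, 2, 3], no duplicates
```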

dbt

dbt compiles SQL SELECT statements into CREATE TABLE or CREATE VIEW, in dependency order.

Materializations

| Type | Use when | Tradeoff |
| --- | --- | --- |
| view | Cheap models | Recomputed on every query |
| table | Expensive transforms read often | Storage cost, freshness lag |
| incremental | Big append-or-update tables | Backfill complexity |
| ephemeral | Reused CTEs | Hard to debug |

Incremental pattern

{{ config(materialized='incremental', unique_key='id') }}

SELECT * FROM {{ source('raw', 'events') }}
{% if is_incremental() %}
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}
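What the incremental filter amounts to, sketched in plain Python with toy in-memory tables: append only rows newer than the target's current high-water mark.

```python
target = [{"id": 1, "event_time": 10}]
source = [{"id": 1, "event_time": 10}, {"id": 2, "event_time": 15}]

# high-water mark of the existing table ({{ this }} in dbt terms)
watermark = max(r["event_time"] for r in target) if target else float("-inf")

# only strictly newer rows are appended; the old row is not reprocessed
target.extend(r for r in source if r["event_time"] > watermark)
print([r["id"] for r in target])  # [1, 2]
```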

Tests

models:
  - name: orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id

Kafka

Distributed, partitioned, replicated, append only log.

Vocabulary

Topic, partition, replication factor, producer, consumer, consumer group, offset, retention.

Delivery semantics

| Semantic | When |
| --- | --- |
| At most once | Commit offsets before processing |
| At least once | Commit offsets after processing (the common default to design for) |
| Exactly once | Transactional producer + consumer reading with isolation.level=read_committed |

Common questions

  1. Guarantee ordering? Per partition only. Pick a partition key that groups what must be ordered together.
  2. Reprocess? Reset offset, or read from a different consumer group.
  3. Rebalance? Consumers join or leave a group, partition assignment recomputed, brief processing pause.
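At-least-once delivery means redeliveries, so the consumer side must be idempotent. A toy sketch: deduplicate by event id before applying effects.

```python
seen, totals = set(), {}

def handle(event):
    if event["id"] in seen:        # duplicate from a redelivery
        return
    seen.add(event["id"])
    totals[event["user"]] = totals.get(event["user"], 0) + event["amount"]

for e in [{"id": "e1", "user": "a", "amount": 5},
          {"id": "e1", "user": "a", "amount": 5},   # redelivered
          {"id": "e2", "user": "a", "amount": 3}]:
    handle(e)

print(totals)  # {'a': 8}: the duplicate is counted once
```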

Schema design

Star schema

            +-------------+
            | dim_date    |
            +-------------+
                  |
+----------+    +------------+    +-------------+
| dim_user |----| fact_order |----| dim_product |
+----------+    +------------+    +-------------+
                  |
            +-------------+
            | dim_store   |
            +-------------+

Slowly changing dimensions

| Type | Behavior |
| --- | --- |
| 0 | No change. History lost. |
| 1 | Overwrite. History lost. |
| 2 | New row with effective dates. History preserved. |
| 3 | New column for the previous value. Limited history. |
| 4 | Separate history table. |
| 6 | Combination of 1, 2, and 3. |

When in doubt, use type 2.
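The type 2 mechanics, sketched in Python (column names here are illustrative assumptions): close the current row, insert a new version with effective dates.

```python
from datetime import date

dim = [{"user_id": 1, "city": "NYC",
        "valid_from": date(2023, 1, 1), "valid_to": None}]

def scd2_update(dim, user_id, new_city, as_of):
    for row in dim:
        if row["user_id"] == user_id and row["valid_to"] is None:
            if row["city"] == new_city:
                return                    # no change, nothing to do
            row["valid_to"] = as_of       # close the current version
    dim.append({"user_id": user_id, "city": new_city,
                "valid_from": as_of, "valid_to": None})

scd2_update(dim, 1, "SF", date(2024, 6, 1))
print(len(dim))  # 2: old row closed, new current row appended
```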

Lesson: SCD.

Fact table grain

Pick the grain first. Examples: one row per order, per order line, per page view, per session, per user per day. Mixing grains in one table is the most common schema design mistake.

Pipeline architecture vocabulary

| Term | Means |
| --- | --- |
| Bronze, silver, gold | Medallion lakehouse layers (raw, cleaned, business-ready) |
| CDC | Change data capture |
| Backfill | Re-run a pipeline for a historical window |
| Replay | Re-emit messages from a log to recover state |
| Late-arriving fact | Fact whose dimension is not yet present |
| Late-arriving dimension | Dimension that updates after its facts |
| Watermark | Threshold past which late events are dropped |
| Idempotency | Same input produces same output, even on retry |
| Lambda | Batch + stream paths merged at serving |
| Kappa | Stream only; replay the log to backfill |
| Reverse ETL | Push warehouse data back into operational tools |

Glossary: datadriven.io/data-engineering-concepts.

Storage formats

| Format | Use when | Avoid when |
| --- | --- | --- |
| Parquet | Columnar analytics (the default) | Row-level updates |
| ORC | Hive ecosystem | Outside Hive |
| Avro | Row-oriented streaming, schema evolution | Analytical queries |
| JSON | Flexible, debuggable | Anything at scale |
| CSV | Interoperability | Anything at scale |
| Delta, Iceberg, Hudi | ACID on object storage, time travel | Pure batch with no updates |

Numbers

Order of magnitude reference for back of the envelope estimates.

| Operation | Order |
| --- | --- |
| L1 cache reference | 1 ns |
| Main memory reference | 100 ns |
| SSD random read | 100 µs |
| Network round trip, same DC | 500 µs |
| Disk seek | 10 ms |
| Network round trip, cross-continent | 150 ms |
| Modern Kafka cluster throughput | millions of records per second |
| Daily events at a top consumer app | billions |
| Bytes per row in a compressed event log | 100 to 500 |
| S3 PUT cost | ~0.005 USD per 1,000 requests |
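A worked estimate using the bytes-per-row figure above (the one-billion-events input is an assumed example, not a quoted benchmark):

```python
# 1B events/day at ~300 compressed bytes per event
events_per_day = 1_000_000_000
bytes_per_event = 300

gb_per_day = events_per_day * bytes_per_event / 1e9
tb_per_month = gb_per_day * 30 / 1000
print(gb_per_day, tb_per_month)  # 300.0 GB/day, 9.0 TB/month, before replication
```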

Companion repos

Contributing

Spot a wrong fact, an outdated link, or a missing pattern? Open a PR. Keep entries terse. This is a recall aid. If your addition needs more than ten lines, link to a longer reference.

License

CC BY-SA 4.0. Lessons hosted at datadriven.io.
