Single page recall reference for data engineering interviews. SQL, Python, Spark, Airflow, dbt, Kafka, schema design, and pipeline patterns.
SQL · Python · Spark · Airflow · dbt · Kafka · Schema · Numbers
This is a recall aid, not a tutorial. If you have never seen window functions, this page will not teach them. If you have used them and need to recall the syntax under pressure, this page is for you. For deeper teaching, follow the link to a structured lesson.
```sql
SELECT
  user_id,
  event_time,
  ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY event_time) AS rn,
  RANK()       OVER (PARTITION BY user_id ORDER BY event_time) AS rk,
  DENSE_RANK() OVER (PARTITION BY user_id ORDER BY event_time) AS drk,
  LAG(event_time)  OVER (PARTITION BY user_id ORDER BY event_time) AS prev_t,
  LEAD(event_time) OVER (PARTITION BY user_id ORDER BY event_time) AS next_t,
  SUM(amount) OVER (PARTITION BY user_id ORDER BY event_time
                    ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_sum,
  AVG(amount) OVER (PARTITION BY user_id ORDER BY event_time
                    ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS rolling_7
FROM events;
```

Traps: `ROWS` counts physical rows; `RANGE` counts values. If days can be missing, use `RANGE BETWEEN INTERVAL '6' DAY PRECEDING AND CURRENT ROW`. `RANK` leaves gaps, `DENSE_RANK` does not, `ROW_NUMBER` is unique. `LAG` and `LEAD` accept a default for the boundary: `LAG(x, 1, 0)`.
Lesson: window functions advanced. Drill: datadriven.io/sql-window-functions-practice.
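The ROWS vs RANGE trap is easy to demo locally. A minimal sketch using Python's built-in sqlite3 as a stand-in for a warehouse (assumes the bundled SQLite is 3.28+, which supports numeric RANGE frames); the table and values are made up:

```python
import sqlite3

# Days 4-9 are missing, which is exactly when ROWS and RANGE diverge.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (day INTEGER, amount INTEGER)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 1), (2, 1), (3, 1), (10, 1)])

rows_sum, range_sum = con.execute("""
    SELECT
      SUM(amount) OVER (ORDER BY day
                        ROWS  BETWEEN 6 PRECEDING AND CURRENT ROW),
      SUM(amount) OVER (ORDER BY day
                        RANGE BETWEEN 6 PRECEDING AND CURRENT ROW)
    FROM events
    ORDER BY day
""").fetchall()[-1]

# At day 10, ROWS looks back 6 physical rows (all four rows),
# while RANGE only covers days 4..10 (just day 10 itself).
print(rows_sum, range_sum)  # 4 1
```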
| Type | Use when | Trap |
|---|---|---|
| INNER | Both sides must match | Silently drops rows |
| LEFT | Keep all left rows | Right side fan out blows up the result |
| FULL OUTER | Keep everything | Often a sign you should have unioned |
| CROSS | Cartesian product | Almost always a mistake |
| LATERAL | Reference left from a right side subquery | Postgres and Snowflake syntax differs |
| Semi (`EXISTS`) | Filter, do not project | Faster than `IN` for large lists |
| Anti (`NOT EXISTS`) | Filter out matches | `NOT IN` is NULL unsafe |
Lesson: joins advanced.
```sql
SELECT NULL = NULL;   -- NULL, not TRUE
SELECT NULL <> NULL;  -- NULL, not FALSE
SELECT NULL IS NULL;  -- TRUE

WHERE col IN (1, 2, NULL);      -- TRUE if col = 1 or 2, NULL otherwise
WHERE col NOT IN (1, 2, NULL);  -- always NULL, filters out everything
```

Use `IS DISTINCT FROM` and `IS NOT DISTINCT FROM` for NULL safe equality.
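The `NOT IN` trap is worth seeing once with real rows. A quick check using Python's built-in sqlite3 (table and values are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (col INTEGER)")
con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

in_rows = con.execute(
    "SELECT col FROM t WHERE col IN (1, 2, NULL)").fetchall()
not_in_rows = con.execute(
    "SELECT col FROM t WHERE col NOT IN (1, 2, NULL)").fetchall()

print(in_rows)      # [(1,), (2,)]
print(not_in_rows)  # [] -- col <> NULL is NULL, so no row ever passes
```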
```sql
-- Conditional count
SELECT COUNT(*) FILTER (WHERE status = 'success') AS successes
FROM requests;

-- Conditional aggregation (the universal pivot)
SELECT
  user_id,
  SUM(CASE WHEN event = 'view' THEN 1 ELSE 0 END) AS views,
  SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks
FROM events
GROUP BY user_id;

-- Approximate distinct count for big data
SELECT APPROX_COUNT_DISTINCT(user_id) FROM events;
```

Lesson: aggregating advanced.
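The CASE WHEN pivot runs anywhere, including Python's built-in sqlite3, which makes it easy to rehearse; the toy data here is made up:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, event TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?)", [
    (1, "view"), (1, "view"), (1, "click"),
    (2, "view"),
])

# One row per user, one column per event type.
pivot = con.execute("""
    SELECT
      user_id,
      SUM(CASE WHEN event = 'view'  THEN 1 ELSE 0 END) AS views,
      SUM(CASE WHEN event = 'click' THEN 1 ELSE 0 END) AS clicks
    FROM events
    GROUP BY user_id
    ORDER BY user_id
""").fetchall()

print(pivot)  # [(1, 2, 1), (2, 1, 0)]
```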
```sql
DATE_TRUNC('day', ts)
DATEDIFF('day', start_ts, end_ts)
DATE_TRUNC('month', CURRENT_DATE)
DATE_TRUNC('week', event_time)
```

Most "weekly counts off by one" bugs are timezone bugs. TIMESTAMP and TIMESTAMPTZ are not the same. Reference: datadriven.io/sql-tutorial.
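The timezone failure mode in one snippet: the same instant truncates to two different days depending on the zone. A sketch using stdlib `zoneinfo` (Python 3.9+, assumes a tz database is available):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# One instant, two calendar days: the root of most off-by-one
# daily/weekly count bugs.
ts = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)

utc_day = ts.date()
ny_day = ts.astimezone(ZoneInfo("America/New_York")).date()

print(utc_day, ny_day)  # 2024-01-01 2023-12-31
```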
Mention these in any "why is this slow" question:
- Predicate pushdown. Filters applied early.
- Partition pruning. Partition column in the filter.
- Join order. Smaller side on the build side of a hash join.
- Skew. One key dominating distribution.
- Spill to disk. Memory pressure during sort or aggregate.
- Sort keys (Redshift), clustering keys (Snowflake, BigQuery).
Reference: datadriven.io/sql-query-optimization.
```python
import itertools
from collections import defaultdict, Counter

# Chunk an iterable into batches of n
def batched(iterable, n):
    it = iter(iterable)
    while chunk := list(itertools.islice(it, n)):
        yield chunk

# Group by key
grouped = defaultdict(list)
for record in records:
    grouped[record["key"]].append(record)

# Top N with Counter
top10 = Counter(items).most_common(10)

# Hash partitioning (careful: hash() is salted per process for str keys;
# use a stable hash such as zlib.crc32 if partitions must survive restarts)
def partition(record, n_partitions):
    return hash(record["id"]) % n_partitions

# Interval merging
def merge(intervals):
    intervals.sort()
    out = [intervals[0]]
    for start, end in intervals[1:]:
        if start <= out[-1][1]:
            out[-1] = (out[-1][0], max(end, out[-1][1]))
        else:
            out.append((start, end))
    return out

# Sessionization with inactivity gap
def sessionize(events, gap_seconds):
    events.sort(key=lambda e: e["t"])
    last_t, sid = None, 0
    for e in events:
        if last_t is None or e["t"] - last_t > gap_seconds:
            sid += 1
        e["session_id"] = sid
        last_t = e["t"]
    return events
```

Lesson: collections advanced.
| Operation | List | Dict | Set |
|---|---|---|---|
| Lookup | O(n) | O(1) | O(1) |
| Insert | O(1) end | O(1) | O(1) |
| Delete | O(n) | O(1) | O(1) |
| Membership (`in`) | O(n) | O(1) | O(1) |
Nested loops are usually flattenable with a dict or set lookup.
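A minimal before/after of that flattening, with made-up records; the O(n·m) scan and the O(n+m) set lookup return the same rows:

```python
orders = [{"id": 1, "user": "a"}, {"id": 2, "user": "b"},
          {"id": 3, "user": "a"}]
banned = [{"user": "b"}, {"user": "c"}]

# O(n*m): for every order, scan the whole banned list
slow = [o for o in orders
        if not any(o["user"] == b["user"] for b in banned)]

# O(n+m): build the set once, then O(1) membership checks
banned_users = {b["user"] for b in banned}
fast = [o for o in orders if o["user"] not in banned_users]

assert slow == fast
print([o["id"] for o in fast])  # [1, 3]
```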
```python
import json

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.strip()

def parse(lines):
    for line in lines:
        yield json.loads(line)

def filter_recent(records, since):
    for r in records:
        if r["ts"] >= since:
            yield r

# pipeline composition, constant memory
for r in filter_recent(parse(read_lines("events.jsonl")), since):
    process(r)
```

Spark is lazy. Transformations build a DAG. Actions trigger execution. Catalyst optimizes the DAG before any work happens.
| Narrow (no shuffle) | Wide (causes shuffle) |
|---|---|
| `filter`, `map`, `select`, `withColumn` | `groupBy`, `join`, `distinct`, `repartition`, `orderBy` |
- Predicate pushdown. Filter early, especially on partitioned columns.
- Column pruning. Read only what you need.
- Broadcast join when one side is under ~10 MB: `broadcast(small_df)`.
- Salt high frequency keys to fix skew.
- Enable Adaptive Query Execution (AQE).
- Target 128 to 512 MB output files.
- Z order (Delta) or cluster (Iceberg) on common filter columns.
If your data fits on one machine, use DuckDB or pandas. Spark has cold start cost.
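Salting, sketched in plain Python rather than Spark so the mechanics are visible: split one hot key across N synthetic keys, aggregate the partials in parallel, then strip the salt and combine. The key names and salt count are illustrative:

```python
import random
from collections import Counter

N_SALTS = 4

def salted(key):
    # Spread one hot key across N_SALTS synthetic keys
    return f"{key}#{random.randrange(N_SALTS)}"

events = ["hot_user"] * 1000 + ["quiet_user"] * 10

# Stage 1: partial aggregation on salted keys (the parallelizable part;
# no single reducer sees all 1000 hot_user rows)
partial = Counter(salted(k) for k in events)

# Stage 2: strip the salt and combine the partial counts
final = Counter()
for salted_key, count in partial.items():
    final[salted_key.split("#")[0]] += count

print(final["hot_user"], final["quiet_user"])  # 1000 10
```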
```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime

with DAG(
    "my_pipeline",
    start_date=datetime(2026, 1, 1),
    schedule="0 * * * *",
    catchup=False,
    max_active_runs=1,
    default_args={"retries": 3},
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_fn)
    transform = PythonOperator(task_id="transform", python_callable=transform_fn)
    load = PythonOperator(task_id="load", python_callable=load_fn)
    extract >> transform >> load
```

DAG, task, operator, sensor, XCom, backfill, catchup, pool, TaskGroup, variable, connection, idempotency.
- Backfill 6 months of data? `airflow dags backfill` with a bounded date range and a parallelism cap. Confirm idempotency first.
- DAG running long? Check task duration in the UI, pool saturation, parallelism settings, executor health.
- Avoid scheduling drift? Use `schedule` (called `schedule_interval` before Airflow 2.4), not `start_date` arithmetic. Set `catchup=False`.
dbt compiles SQL SELECT statements into CREATE TABLE or CREATE VIEW, in dependency order.
| Type | Use when | Tradeoff |
|---|---|---|
| view | Cheap models | Recomputed on every query |
| table | Expensive transforms read often | Storage cost, freshness lag |
| incremental | Big append or update tables | Backfill complexity |
| ephemeral | Reused CTEs | Hard to debug |
```sql
{{ config(materialized='incremental', unique_key='id') }}

SELECT * FROM {{ source('raw', 'events') }}

{% if is_incremental() %}
WHERE event_time > (SELECT MAX(event_time) FROM {{ this }})
{% endif %}
```

```yaml
models:
  - name: orders
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: customer_id
```

Distributed, partitioned, replicated, append only log.
Topic, partition, replication factor, producer, consumer, consumer group, offset, retention.
| Semantic | When |
|---|---|
| At most once | Commit before processing |
| At least once | Commit after processing (default) |
| Exactly once | Idempotent producer + transactional consumer |
- Guarantee ordering? Per partition only. Pick a partition key that groups what must be ordered together.
- Reprocess? Reset offset, or read from a different consumer group.
- Rebalance? Consumers join or leave a group, partition assignment recomputed, brief processing pause.
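The per-partition ordering guarantee falls out of key-based partitioning. A sketch in plain Python, with `zlib.crc32` standing in for Kafka's murmur2 hash and all names made up:

```python
import zlib

N_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Stable hash of the key picks the partition, so every message
    # for one key lands in the same partition, in send order.
    return zlib.crc32(key.encode()) % N_PARTITIONS

messages = [("user_42", "login"), ("user_7", "view"),
            ("user_42", "click"), ("user_42", "logout")]

partitions = [[] for _ in range(N_PARTITIONS)]
for key, value in messages:
    partitions[partition_for(key)].append((key, value))

# Other keys may share the partition; ordering is only guaranteed
# per key within its partition.
u42 = [v for k, v in partitions[partition_for("user_42")]
       if k == "user_42"]
print(u42)  # ['login', 'click', 'logout']
```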
```
              +-------------+
              |  dim_date   |
              +-------------+
                     |
+----------+   +------------+   +-------------+
| dim_user |---| fact_order |---| dim_product |
+----------+   +------------+   +-------------+
                     |
              +-------------+
              |  dim_store  |
              +-------------+
```
| Type | Behavior |
|---|---|
| 0 | No change. History lost. |
| 1 | Overwrite. History lost. |
| 2 | New row with effective dates. History preserved. |
| 3 | New column for previous value. Limited history. |
| 4 | Separate history table. |
| 6 | Combination of 1, 2, 3. |
When in doubt, use type 2.
Lesson: SCD.
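A minimal type 2 apply step, sketched in Python with illustrative column names: close the current version with an end date, then append the new version.

```python
from datetime import date

dim = [
    {"customer_id": 1, "city": "Oslo",
     "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True},
]

def apply_scd2(dim, customer_id, new_city, as_of):
    for row in dim:
        if row["customer_id"] == customer_id and row["is_current"]:
            if row["city"] == new_city:
                return dim  # no change, nothing to do
            row["valid_to"] = as_of   # close the old version
            row["is_current"] = False
    dim.append({"customer_id": customer_id, "city": new_city,
                "valid_from": as_of, "valid_to": None, "is_current": True})
    return dim

apply_scd2(dim, 1, "Bergen", date(2024, 6, 1))
current = [r for r in dim if r["is_current"]]
print(len(dim), current[0]["city"])  # 2 Bergen
```

History is preserved: the Oslo row survives with `valid_to` set, so a join on the transaction date still finds the city that was current at the time.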
Pick the grain first. Examples: one row per order, per order line, per page view, per session, per user per day. Mixing grains in one table is the most common schema design mistake.
| Term | Means |
|---|---|
| Bronze, silver, gold | Medallion lakehouse layers (raw, cleaned, business ready) |
| CDC | Change data capture |
| Backfill | Re run a pipeline for a historical window |
| Replay | Re emit messages from a log to recover state |
| Late arriving fact | Fact that arrives after its event window was processed |
| Late arriving dimension | Dimension that updates after facts |
| Watermark | Threshold past which late events are dropped |
| Idempotency | Same input produces same output, even on retry |
| Lambda | Batch + stream paths merged in serving |
| Kappa | Stream only, replay log to backfill |
| Reverse ETL | Push warehouse data back into operational tools |
Glossary: datadriven.io/data-engineering-concepts.
| Format | Use when | Avoid when |
|---|---|---|
| Parquet | Columnar analytics (default) | Row level updates |
| ORC | Hive ecosystem | Outside Hive |
| Avro | Row oriented streaming, schema evolution | Analytical queries |
| JSON | Flexible, debuggable | Anything at scale |
| CSV | Interoperability | Anything at scale |
| Delta, Iceberg, Hudi | ACID on object storage, time travel | Pure batch with no updates |
Order of magnitude reference for back of the envelope estimates.
| Operation | Order |
|---|---|
| L1 cache reference | 1 ns |
| Main memory reference | 100 ns |
| SSD random read | 100 us |
| Network round trip same DC | 500 us |
| Disk seek | 10 ms |
| Network round trip cross continent | 150 ms |
| Modern Kafka cluster throughput | millions of records per second |
| Daily events at a top consumer app | billions |
| Bytes per row in compressed event log | 100 to 500 |
| S3 PUT cost | ~0.005 USD per 1000 |
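A worked estimate using the table above: 1B events per day at 300 compressed bytes per row (the 300-byte figure is a midpoint assumption from the 100 to 500 range).

```python
events_per_day = 1_000_000_000
bytes_per_row = 300  # assumed midpoint of the 100-500 range above

daily_gb = events_per_day * bytes_per_row / 1e9   # bytes -> GB
yearly_tb = daily_gb * 365 / 1e3                  # GB -> TB

print(round(daily_gb), round(yearly_tb, 1))  # 300 109.5
```

So "billions of events per day" means hundreds of GB per day and roughly 100 TB per year before replication, which is the scale to anchor partitioning and retention answers on.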
- data-engineering-interview-handbook. Full chapter by chapter handbook.
- data-engineering-interview-questions. 1418 practice problems.
- system-design-for-data-engineers. 120 case studies.
- data-engineer-interview-prep. 8 week structured practice.
- data-engineer-interview-handbook. 7 day sprint.
- awesome-data-engineering-interview. Curated resources.
Spot a wrong fact, an outdated link, or a missing pattern? Open a PR. Keep entries terse. This is a recall aid. If your addition needs more than ten lines, link to a longer reference.
CC BY-SA 4.0. Lessons hosted at datadriven.io.