Add 60 interview-style problems and tracking index by shiningflash · Pull Request #1 · shiningflash/data-engineering-practice-problems

shiningflash · 2026-05-14T03:15:31Z

Summary

Adds 60 new interview problems (Problems 6 through 65) across 10 new categories: Fundamentals, SQL Thinking, System Design, Scenarios, Cloud Decisions, Data Modeling, Debugging, Cost and Performance, Streaming, and People and Process.
Adds PROBLEMS.md, a single index table linking every problem to its question and solution, with category, topic, and difficulty tags suitable for filtering on a website.
Adds Problem 5 (Merging Messy CSVs from Multiple Partners) as a worked example matching the existing Problem 1-4 style.

What each problem contains

Every problem has its own folder with:

question.md: scenario, task list, and what a good answer should cover.
solution.md (or solution.py for code-style problems): a walkthrough written the way an experienced engineer would explain it on a whiteboard.
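
The folder layout above can be checked mechanically. The following is a minimal sketch, not part of this PR: the helper name `missing_files` is hypothetical, and it assumes only the file names listed above (question.md plus either solution.md or solution.py).

```python
from pathlib import Path


def missing_files(problem_dir):
    """Return the expected files that are absent from one problem folder.

    Hypothetical helper: assumes each folder needs question.md and
    either solution.md or solution.py, as described in this PR.
    """
    problem_dir = Path(problem_dir)
    missing = []
    if not (problem_dir / "question.md").is_file():
        missing.append("question.md")
    # Either solution format satisfies the requirement.
    solution_names = ("solution.md", "solution.py")
    if not any((problem_dir / name).is_file() for name in solution_names):
        missing.append("solution.md or solution.py")
    return missing
```

An empty list means the folder has the expected shape; anything else names what to add.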

Solutions include diagrams (ASCII or Mermaid-style boxes) where the topic benefits from visualization, capacity math where relevant, common-mistake lists, and bonus follow-up questions to anticipate.

Format consistency

All new problems follow the same shape as the existing Problems 1-4:

Scenario.
Task list.
"What a good answer covers."
Solution: short version, walkthrough, picture, common mistakes, bonus follow-up.

Coverage

Category	Problems
Fundamentals	6-14
SQL Thinking	15-20
System Design	21-28
Scenarios	29-34
Cloud Decisions	35-40
Data Modeling	41-45
Debugging	46-50
Cost & Performance	51-55
Streaming	56-58
People & Process	59-65
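
As a quick arithmetic check on the table above, this sketch confirms that the ranges are contiguous from Problem 6 to Problem 65 and add up to 60 problems. The ranges are copied from the table; the function name `check_coverage` is an illustrative assumption.

```python
# Category ranges copied from the Coverage table (inclusive on both ends).
COVERAGE = {
    "Fundamentals": (6, 14),
    "SQL Thinking": (15, 20),
    "System Design": (21, 28),
    "Scenarios": (29, 34),
    "Cloud Decisions": (35, 40),
    "Data Modeling": (41, 45),
    "Debugging": (46, 50),
    "Cost & Performance": (51, 55),
    "Streaming": (56, 58),
    "People & Process": (59, 65),
}


def check_coverage(coverage, first, last):
    """Return per-category counts; raise if the ranges gap or overlap."""
    counts = {}
    expected_start = first
    for category, (start, end) in coverage.items():
        if start != expected_start:
            raise ValueError(
                f"{category}: expected to start at {expected_start}, got {start}"
            )
        if end < start:
            raise ValueError(f"{category}: range {start}-{end} is empty")
        counts[category] = end - start + 1
        expected_start = end + 1
    if expected_start != last + 1:
        raise ValueError(f"ranges end at {expected_start - 1}, expected {last}")
    return counts


counts = check_coverage(COVERAGE, first=6, last=65)
total = sum(counts.values())
print(total)  # 60
```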

Test plan

Verify that the PROBLEMS.md table renders correctly on GitHub.
Spot-check 3-4 problem folders to confirm question/solution pairs are coherent.
Confirm category and topic labels are useful for filtering on the website.
Check that links in PROBLEMS.md resolve to the right files (URL-encoded paths).
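
The link check in the last step could be scripted. The following is a rough sketch under stated assumptions: it assumes PROBLEMS.md uses inline markdown links of the form `[text](relative/path)` with percent-encoded paths relative to the index file, and the helper name `find_broken_links` is hypothetical. The actual link style in PROBLEMS.md is not shown in this PR description.

```python
import re
from pathlib import Path
from urllib.parse import unquote

# Matches inline markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")


def find_broken_links(index_path):
    """Return relative link targets in a markdown file that do not exist on disk.

    Hypothetical helper: paths are resolved relative to the index file,
    and percent-encoding (for example %20 for a space) is decoded first.
    """
    index_path = Path(index_path)
    root = index_path.parent
    broken = []
    for line in index_path.read_text(encoding="utf-8").splitlines():
        for target in LINK_RE.findall(line):
            # Skip external URLs, mail links, and in-page anchors.
            if "://" in target or target.startswith(("#", "mailto:")):
                continue
            # Drop any fragment, then decode the URL-encoded path.
            relative = unquote(target.split("#", 1)[0])
            if not (root / relative).exists():
                broken.append(target)
    return broken
```

Calling `find_broken_links("PROBLEMS.md")` from the repository root would return the targets that fail to resolve; an empty list means every relative link points at an existing file.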

Adds 9 interview-style fundamentals problems with full question and solution markdown files, including diagrams and concrete examples. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6 SQL-focused interview problems with worked examples and EXPLAIN plan walkthroughs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

8 system design problems covering electricity retailer platform, banking widgets, surge pricing, streaming aggregations, billing pipelines, real-time driver tracking, year-in-review batch, and notification dedup. Each includes architecture diagrams, capacity math, and risk discussion. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6 scenario-based interview problems covering silent data bugs, cost spikes, analyst trust, pipeline ownership transfer, executive pressure, and Kafka data loss recovery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

6 cloud-decision interview problems comparing Lambda vs Cloud Run, scheduled serverless jobs, BigQuery vs Snowflake, S3 vs warehouse storage, managed Airflow vs self-hosted, and BigQuery access control models. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 data modeling problems covering Airbnb-style schema, subscription history with valid_from/to, mixing facts and dimensions, explaining grain, and current state vs event history. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 debugging-scenario interview problems covering region-zero revenue, silent task success with empty output, sudden query slowdowns, vague user reports, and recurring partition anomalies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

5 cost/performance problems covering BigQuery bill investigation, Spark job tuning, daily-data hourly-scan waste, the 'throw more memory' reflex, and partitioning vs clustering vs materialized views. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

3 streaming-fundamentals problems covering watermarks, Kafka per-partition ordering, and diagnosing growing consumer lag before scaling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7 people/process problems covering analyst onboarding, fast-vs-right trade-offs, metric ownership disputes, blameless postmortems, inheriting undocumented pipelines, breaking dbt changes with many consumers, and Airflow scheduler scaling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add rows for problems 6-65 across 10 new categories (Fundamentals, SQL Thinking, System Design, Scenarios, Cloud Decisions, Data Modeling, Debugging, Cost and Performance, Streaming, People and Process) plus expanded legend and difficulty guide. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

shiningflash and others added 11 commits May 14, 2026 03:54

Add Problems 6-14 (Fundamentals and Concepts)

2e5f72b

Add Problems 15-20 (SQL Thinking)

be9633e

Add Problems 29-34 (Scenarios and War Stories)

2ce628d

Add Problems 41-45 (Data Modeling)

d836742

Add Problems 56-58 (Streaming Concepts)

ecab99c

shiningflash wants to merge 11 commits into main from problems
