Merged
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
.env
.DS_Store

ref
92 changes: 92 additions & 0 deletions skills/tidb-query-tuning/AGENTS.md
@@ -0,0 +1,92 @@
# TiDB Query Tuning Agent Notes

This skill directory draws on two different knowledge sources:

- curated topic references under `references/`
- field experience mined from GitHub issues and generated into markdown files

## When to mine issue experience

Use GitHub issue mining when:

- the local references do not cover a customer-facing symptom well enough
- you need a recent field precedent instead of a general tuning rule
- you want fix PRs, merge timestamps, and open issue reminders
- you want to build or refresh a local corpus of customer-driven planner or stats bugs

Do not start from issue mining if a stable reference under `references/` already answers the question. Use the issue corpus to complement the topic docs, not replace them.

## Default workflow

1. Start with the local references in `references/`.
2. Search `references/optimizer-oncall-experiences-redacted/` for a symptom match.
3. Search `references/tidb-customer-planner-issues/` if you need linked PRs, merge times, or still-open customer gaps.
4. If the local corpora are still missing the pattern, mine GitHub issues with the script in `scripts/`.
5. Review the generated files, then fold reusable learnings back into the relevant reference docs when appropriate.
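Steps 2-4 above can be sketched as a quick local search before falling back to issue mining. The symptom string below is a placeholder, and the paths are relative to the repository root:

```shell
# Search both local corpora for a symptom match; only suggest GitHub issue
# mining when neither corpus contains the pattern.
symptom="wrong plan after restart"
match="$(grep -ril "$symptom" \
  skills/tidb-query-tuning/references/optimizer-oncall-experiences-redacted/ \
  skills/tidb-query-tuning/references/tidb-customer-planner-issues/ \
  2>/dev/null || true)"
if [ -z "$match" ]; then
  echo "no local match for '$symptom': consider mining GitHub issues"
else
  printf '%s\n' "$match"
fi
```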

## Issue mining script

Use:

`scripts/generate_tidb_issue_experiences.py`

The script:

- searches GitHub issues with a provided query
- follows issue timeline cross-references and explicit fix comments
- collects linked PR metadata and changed files
- writes one markdown file per issue
- writes an index `README.md` into the output directory

The current checked-in issue corpus lives under:

- `references/tidb-customer-planner-issues/`

## Recommended query patterns

For customer-driven planner issues:

```text
repo:pingcap/tidb is:issue label:"report/customer" label:"sig/planner" created:>=2024-01-01
```

For stats-heavy issue mining:

```text
repo:pingcap/tidb is:issue label:"report/customer" (label:"sig/planner" OR label:"sig/execution") stats created:>=2024-01-01
```

Adjust the query rather than hardcoding different behaviors into the script.

## Usage example

```bash
python3 skills/tidb-query-tuning/scripts/generate_tidb_issue_experiences.py \
--query 'repo:pingcap/tidb is:issue label:"report/customer" label:"sig/planner" created:>=2024-01-01' \
--out-dir outputs/tidb-customer-planner-issues
```

## What to keep from generated issue files

Keep and reuse:

- customer-facing symptom descriptions
- investigation clues
- linked fix PRs
- merge timestamps
- affected modules
- open issues that should remain on the reminder list

Do not treat every generated issue as a mature tuning rule. Generated files are raw field precedents. Promote them into `references/` only after the pattern is stable and reusable.

## Tooling assumptions

- `gh` CLI must be installed and authenticated
- network access to GitHub must be available
- the output directory should be treated as generated content
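A minimal preflight sketch for the first two assumptions, classifying the `gh` CLI state before running the mining script (the state labels are illustrative, not script output):

```shell
# Check that gh is installed and authenticated; record which precondition
# fails, if any, so the failure mode is explicit.
if ! command -v gh >/dev/null 2>&1; then
  gh_state="missing"         # install the GitHub CLI first
elif ! gh auth status >/dev/null 2>&1; then
  gh_state="unauthenticated" # run: gh auth login
else
  gh_state="ready"
fi
echo "gh CLI state: $gh_state"
```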

## Editing guidance

- Keep curated docs in `references/` concise and topic-oriented.
- Keep generated issue corpora outside the hand-written topic docs unless they are intentionally promoted.
- If you regenerate a corpus, prefer writing into a fresh output directory or knowingly replacing the previous generated set.
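One way to keep regenerations from silently replacing the previous set is a date-stamped output directory. The naming convention below is a suggestion, not something the script enforces:

```shell
# Build a fresh, date-stamped output directory for this regeneration run so
# the previous corpus survives until it is deliberately removed.
out_dir="outputs/tidb-customer-planner-issues-$(date +%Y-%m-%d)"
mkdir -p "$out_dir"
echo "regenerate into: $out_dir"
```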
66 changes: 66 additions & 0 deletions skills/tidb-query-tuning/SKILL.md
@@ -0,0 +1,66 @@
# TiDB Query Tuning

**name:** tidb-query-tuning

**description:** Diagnose and optimize slow TiDB queries using optimizer hints, session variables, join strategy selection, subquery optimization, and index tuning. Use when a query is slow, produces a bad plan, or needs performance guidance on TiDB.

---

## Workflow

1. **Capture the current plan:**
- Run `EXPLAIN ANALYZE <query>` to get actual execution stats.
- Compare `estRows` vs `actRows` — large divergence means stale or missing statistics.
- Note the most expensive operators (wall time, memory, rows processed).

2. **Check statistics health:**
- `SHOW STATS_HEALTHY WHERE Db_name = '<db>' AND Table_name = '<table>';`
- If health < 80 or `actRows` diverges significantly from `estRows`, run `ANALYZE TABLE <table>;` and re-check the plan.
- For specific indexes: `ANALYZE TABLE <table> INDEX <index_name>;`

3. **Identify the bottleneck pattern:**
- **Bad join order or strategy** → see `references/join-strategies.md`
- **Subquery not handled well** → see `references/subquery-optimization.md`
- **Wrong or missing index** → see `references/index-selection.md`
- **Optimizer choosing a suboptimal plan despite good stats** → see `references/optimizer-hints.md` and `references/session-variables.md`
- **Stats are stale or auto analyze cannot keep up** → see `references/stats-health-and-auto-analyze.md`
- **Plans change after restart or sync stats loading times out** → see `references/stats-loading-and-startup.md`
- **Need to tune analyze version, column coverage, or memory-heavy stats collection** → see `references/stats-version-and-analyze-configuration.md`
- **Need a matching field incident, workaround, or fixed-version precedent** → search `references/optimizer-oncall-experiences-redacted/`
- **Need recent customer issue precedents with linked PRs and merge timestamps** → search `references/tidb-customer-planner-issues/`

4. **Apply the fix:**
- Prefer the least invasive change: refresh stats → add index → hint → session variable.
- Use hints when the fix is query-specific.
- Use session variables when the fix applies to a workload pattern.

5. **Verify the improvement:**
- Re-run `EXPLAIN ANALYZE` with the fix applied.
- Confirm `actRows` and execution time improved.
- If the fix is a hint, document it in a SQL comment so future readers understand why.

## High-signal rules

- **Always check stats first.** Most bad plans in TiDB come from stale or missing statistics, not optimizer bugs.
- **Treat stats maintenance as capacity planning.** If `AUTO ANALYZE` cannot keep up with stats decay, plan quality will drift even when SQL does not change.
- **`EXPLAIN ANALYZE` is the ground truth.** `EXPLAIN` alone shows estimates; `ANALYZE` shows what actually happened.
- **Search known field cases before inventing a new workaround.** The oncall corpus under `references/optimizer-oncall-experiences-redacted/` is useful for symptom matching, investigation signals, and fix-version lookup.
- **Search recent GitHub issue precedents when fix lineage matters.** The corpus under `references/tidb-customer-planner-issues/` is useful when you need linked PRs, merge times, and still-open customer gaps.
- **Correlated subqueries:** TiDB decorrelates by default. When the subquery is well-indexed and the outer query is selective, `NO_DECORRELATE()` often wins. See `references/subquery-optimization.md`.
- **Join strategies matter:** TiDB supports hash join, index join, merge join, and shuffle joins. The right choice depends on table sizes, index availability, and data distribution. See `references/join-strategies.md`.
- **Hints are per-query; variables are per-session/global.** Use hints for surgical fixes, variables for workload-wide tuning.
- **TiFlash acceleration:** For analytical queries on large tables, push computation to TiFlash replicas using `READ_FROM_STORAGE(TIFLASH[<table>])`. See `references/session-variables.md`.

## References

- `references/optimizer-hints.md` — Optimizer hints: syntax, catalog, and when to use each.
- `references/session-variables.md` — Session/global variables that affect plan choice.
- `references/join-strategies.md` — Join algorithms, when TiDB picks each, and how to override.
- `references/subquery-optimization.md` — Decorrelation, semi-join, EXISTS/IN patterns and NO_DECORRELATE.
- `references/index-selection.md` — Index hints, invisible indexes, index advisor, composite index guidance.
- `references/explain-patterns.md` — Reading EXPLAIN ANALYZE output to identify bottlenecks.
- `references/stats-health-and-auto-analyze.md` — Statistics health, auto analyze backlog diagnosis, and safe concurrency tuning.
- `references/stats-loading-and-startup.md` — Init stats, sync load, restart-time plan instability, and version-based mitigation.
- `references/stats-version-and-analyze-configuration.md` — Stats versioning, analyze coverage, and memory-safe stats collection settings.
- `references/optimizer-oncall-experiences-redacted/` — Redacted optimizer oncall case corpus with user symptoms, investigation signals, workarounds, and fixed versions.
- `references/tidb-customer-planner-issues/README.md` — Generated GitHub issue corpus with one file per customer-driven planner issue, including linked PRs and merge timestamps.
142 changes: 142 additions & 0 deletions skills/tidb-query-tuning/references/explain-patterns.md
@@ -0,0 +1,142 @@
# Reading EXPLAIN ANALYZE Output

`EXPLAIN ANALYZE` is the primary tool for understanding how TiDB actually executes a query. It runs the query and reports actual row counts, execution time, and memory usage for each operator.

## Syntax

```sql
-- Basic (text format):
EXPLAIN ANALYZE SELECT ...;

-- Structured JSON format (better for programmatic analysis):
EXPLAIN FORMAT = "tidb_json" SELECT ...;

-- Estimate only (does NOT execute the query):
EXPLAIN SELECT ...;
```

**Use `EXPLAIN ANALYZE` for tuning.** Plain `EXPLAIN` shows estimates only, which may be wrong.

## Key columns in EXPLAIN ANALYZE output

| Column | Meaning |
|--------|---------|
| `id` | Operator name and position in the tree |
| `estRows` | Estimated row count from optimizer statistics |
| `actRows` | Actual row count observed during execution |
| `task` | Where the operator runs: `root` (TiDB), `cop[tikv]` (TiKV coprocessor), `cop[tiflash]` (TiFlash) |
| `access object` | Table, index, or partition being accessed |
| `execution info` | Wall time, loops, memory, disk usage, concurrency |
| `operator info` | Filter conditions, join keys, sort keys |

## What to look for

### 1. estRows vs actRows divergence

Large differences indicate stale or inaccurate statistics.

```
estRows: 100 actRows: 500000 ← Stats are wrong!
```

**Action:** Run `ANALYZE TABLE <table>` and re-check.
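The divergence check can be mechanized with a rough 10x rule of thumb. The sample line below mimics the `estRows`/`actRows` fields; real `EXPLAIN ANALYZE` output is column-formatted and varies by version, so treat the parsing as a sketch:

```shell
# Extract estRows and actRows from a sample plan line and flag a likely
# stale-stats operator when actual rows exceed the estimate by more than 10x.
line="IndexRangeScan_8  estRows:100  actRows:500000"
est=$(printf '%s\n' "$line" | sed -n 's/.*estRows:\([0-9]*\).*/\1/p')
act=$(printf '%s\n' "$line" | sed -n 's/.*actRows:\([0-9]*\).*/\1/p')
if [ "$act" -gt $((est * 10)) ]; then
  echo "stats likely stale: estRows=$est actRows=$act"
fi
```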

### 2. Expensive operators (by wall time)

Look at `execution info` for `time:` values. The operator with the longest time is the bottleneck.

```
execution info: time:2.5s, loops:1, ... ← This is the bottleneck
```

### 3. Full table scans on large tables

```
TableFullScan table:orders actRows:10000000
```

**Action:** Add an index or use a hint to force an index scan.

### 4. Hash join with large build side

```
HashJoin actRows:50000
├── Build actRows:5000000 ← Build side is huge
└── Probe actRows:50000
```

**Action:** Consider `INL_JOIN` if the large side's join key is indexed, or use the `HASH_JOIN_BUILD()` / `HASH_JOIN_PROBE()` hints to choose which side builds the hash table (`LEADING` changes join order, not build/probe roles).

### 5. Unnecessary Sort operators

```
Sort actRows:1000000
└── TableFullScan
```

If there's an index that provides the sort order, use `ORDER_INDEX` to eliminate the Sort.

### 6. Apply operator (correlated subquery)

```
Apply actRows:1000
├── Outer actRows:1000
└── Inner actRows:1000 (executed 1000 times)
```

This means correlated execution. If `actRows` on the outer side is small, this is fine. If large, consider letting TiDB decorrelate (remove `NO_DECORRELATE`) or adding an index on the inner side.

## Operator reference

### Scan operators

| Operator | Meaning |
|----------|---------|
| `TableFullScan` | Full table scan — reads every row |
| `TableRangeScan` | Scans a range of the primary key |
| `IndexRangeScan` | Scans a range of an index |
| `IndexFullScan` | Scans the entire index |
| `IndexLookUp` | Two-phase: index scan → table lookup for remaining columns |

### Join operators

| Operator | Meaning |
|----------|---------|
| `HashJoin` | Hash join (look for Build and Probe children) |
| `IndexJoin` | Index nested loop join |
| `MergeJoin` | Sort-merge join |
| `Apply` | Correlated subquery execution (per-row) |

### Aggregation operators

| Operator | Meaning |
|----------|---------|
| `HashAgg` | Hash-based aggregation |
| `StreamAgg` | Stream aggregation (requires sorted input) |

### Other operators

| Operator | Meaning |
|----------|---------|
| `Sort` | Sorts rows (expensive for large datasets) |
| `TopN` | Sort + limit combined (more efficient than separate Sort + Limit) |
| `Selection` | Filters rows (WHERE conditions not pushed to scan) |
| `Projection` | Computes output columns |
| `Limit` | Returns only N rows |

## Diagnostic workflow

```
1. Run EXPLAIN ANALYZE
2. Find the operator with the highest wall time
3. Check estRows vs actRows for that operator and its children
├── Big divergence → ANALYZE TABLE, then re-run
└── Stats are fine → Operator choice is the problem
4. Identify the pattern:
├── Full scan → Add index or USE_INDEX hint
├── Wrong join strategy → HASH_JOIN / INL_JOIN hint
├── Wrong join order → LEADING hint
├── Expensive correlated subquery → Check NO_DECORRELATE guidance
└── Unnecessary Sort → ORDER_INDEX hint or add sorted index
5. Apply fix and re-run EXPLAIN ANALYZE to verify
```