Merged
3 changes: 2 additions & 1 deletion .gitignore
@@ -1,3 +1,4 @@
.env
.DS_Store

ref
92 changes: 92 additions & 0 deletions skills/tidb-query-tuning/AGENTS.md
@@ -0,0 +1,92 @@
# TiDB Query Tuning Agent Notes

This skill directory draws on two different knowledge sources:

- curated topic references under `references/`
- field experience mined from GitHub issues and generated into markdown files

## When to mine issue experience

Use GitHub issue mining when:

- the local references do not cover a customer-facing symptom well enough
- you need a recent field precedent instead of a general tuning rule
- you want fix PRs, merge timestamps, and open issue reminders
- you want to build or refresh a local corpus of customer-driven planner or stats bugs

Do not start from issue mining if a stable reference under `references/` already answers the question. Use the issue corpus to complement the topic docs, not replace them.

## Default workflow

1. Start with the local references in `references/`.
2. Search `references/optimizer-oncall-experiences-redacted/` for a symptom match.
3. Search `references/tidb-customer-planner-issues/` if you need linked PRs, merge times, or still-open customer gaps.
4. If the local corpora are still missing the pattern, mine GitHub issues with the script in `scripts/`.
5. Review the generated files, then fold reusable learnings back into the relevant reference docs when appropriate.
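Steps 2-4 above can be sketched as a quick local search before falling back to issue mining. The symptom string below is a placeholder, and the paths are relative to the repository root:

```shell
# Search both local corpora for a symptom match; only suggest GitHub issue
# mining when neither corpus contains the pattern.
symptom="wrong plan after restart"
match="$(grep -ril "$symptom" \
  skills/tidb-query-tuning/references/optimizer-oncall-experiences-redacted/ \
  skills/tidb-query-tuning/references/tidb-customer-planner-issues/ \
  2>/dev/null || true)"
if [ -z "$match" ]; then
  echo "no local match for '$symptom': consider mining GitHub issues"
else
  printf '%s\n' "$match"
fi
```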

## Issue mining script

Use:

`scripts/generate_tidb_issue_experiences.py`

The script:

- searches GitHub issues with a provided query
- follows issue timeline cross-references and explicit fix comments
- collects linked PR metadata and changed files
- writes one markdown file per issue
- writes an index `README.md` into the output directory

The current checked-in issue corpus lives under:

- `references/tidb-customer-planner-issues/`

## Recommended query patterns

For customer-driven planner issues:

```text
repo:pingcap/tidb is:issue label:"report/customer" label:"sig/planner" created:>=2024-01-01
```

For stats-heavy issue mining:

```text
repo:pingcap/tidb is:issue label:"report/customer" (label:"sig/planner" OR label:"sig/execution") stats created:>=2024-01-01
```

Adjust the query rather than hardcoding different behaviors into the script.

## Usage example

```bash
python3 skills/tidb-query-tuning/scripts/generate_tidb_issue_experiences.py \
--query 'repo:pingcap/tidb is:issue label:"report/customer" label:"sig/planner" created:>=2024-01-01' \
--out-dir outputs/tidb-customer-planner-issues
```

## What to keep from generated issue files

Keep and reuse:

- customer-facing symptom descriptions
- investigation clues
- linked fix PRs
- merge timestamps
- affected modules
- open issues that should remain on the reminder list

Do not treat every generated issue as a mature tuning rule. Generated files are raw field precedents. Promote them into `references/` only after the pattern is stable and reusable.

## Tooling assumptions

- `gh` CLI must be installed and authenticated
- network access to GitHub must be available
- the output directory should be treated as generated content
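A minimal preflight sketch for the first two assumptions, classifying the `gh` CLI state before running the mining script (the state labels are illustrative, not script output):

```shell
# Check that gh is installed and authenticated; record which precondition
# fails, if any, so the failure mode is explicit.
if ! command -v gh >/dev/null 2>&1; then
  gh_state="missing"         # install the GitHub CLI first
elif ! gh auth status >/dev/null 2>&1; then
  gh_state="unauthenticated" # run: gh auth login
else
  gh_state="ready"
fi
echo "gh CLI state: $gh_state"
```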

## Editing guidance

- Keep curated docs in `references/` concise and topic-oriented.
- Keep generated issue corpora outside the hand-written topic docs unless they are intentionally promoted.
- If you regenerate a corpus, prefer writing into a fresh output directory or knowingly replacing the previous generated set.
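One way to keep regenerations from silently replacing the previous set is a date-stamped output directory. The naming convention below is a suggestion, not something the script enforces:

```shell
# Build a fresh, date-stamped output directory for this regeneration run so
# the previous corpus survives until it is deliberately removed.
out_dir="outputs/tidb-customer-planner-issues-$(date +%Y-%m-%d)"
mkdir -p "$out_dir"
echo "regenerate into: $out_dir"
```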
66 changes: 66 additions & 0 deletions skills/tidb-query-tuning/SKILL.md
@@ -0,0 +1,66 @@
# TiDB Query Tuning

**name:** tidb-query-tuning

**description:** Diagnose and optimize slow TiDB queries using optimizer hints, session variables, join strategy selection, subquery optimization, and index tuning. Use when a query is slow, produces a bad plan, or needs performance guidance on TiDB.

---

## Workflow

1. **Capture the current plan:**
- Run `EXPLAIN ANALYZE <query>` to get actual execution stats.
- Compare `estRows` vs `actRows` — large divergence means stale or missing statistics.
- Note the most expensive operators (wall time, memory, rows processed).

2. **Check statistics health:**
- `SHOW STATS_HEALTHY WHERE Db_name = '<db>' AND Table_name = '<table>';`
- If health < 80 or `actRows` diverges significantly from `estRows`, run `ANALYZE TABLE <table>;` and re-check the plan.
- For specific indexes: `ANALYZE TABLE <table> INDEX <index_name>;`

3. **Identify the bottleneck pattern:**
- **Bad join order or strategy** → see `references/join-strategies.md`
- **Subquery not handled well** → see `references/subquery-optimization.md`
- **Wrong or missing index** → see `references/index-selection.md`
- **Optimizer choosing a suboptimal plan despite good stats** → see `references/optimizer-hints.md` and `references/session-variables.md`
- **Stats are stale or auto analyze cannot keep up** → see `references/stats-health-and-auto-analyze.md`
- **Plans change after restart or sync stats loading times out** → see `references/stats-loading-and-startup.md`
- **Need to tune analyze version, column coverage, or memory-heavy stats collection** → see `references/stats-version-and-analyze-configuration.md`
- **Need a matching field incident, workaround, or fixed-version precedent** → search `references/optimizer-oncall-experiences-redacted/`
- **Need recent customer issue precedents with linked PRs and merge timestamps** → search `references/tidb-customer-planner-issues/`

4. **Apply the fix:**
- Prefer the least invasive change: refresh stats → add index → hint → session variable.
- Use hints when the fix is query-specific.
- Use session variables when the fix applies to a workload pattern.

5. **Verify the improvement:**
- Re-run `EXPLAIN ANALYZE` with the fix applied.
- Confirm `actRows` and execution time improved.
- If the fix is a hint, document it in a SQL comment so future readers understand why.

## High-signal rules

- **Always check stats first.** Most bad plans in TiDB come from stale or missing statistics, not optimizer bugs.
- **Treat stats maintenance as capacity planning.** If `AUTO ANALYZE` cannot keep up with stats decay, plan quality will drift even when SQL does not change.
- **`EXPLAIN ANALYZE` is the ground truth.** `EXPLAIN` alone shows estimates; `ANALYZE` shows what actually happened.
- **Search known field cases before inventing a new workaround.** The oncall corpus under `references/optimizer-oncall-experiences-redacted/` is useful for symptom matching, investigation signals, and fix-version lookup.
- **Search recent GitHub issue precedents when fix lineage matters.** The corpus under `references/tidb-customer-planner-issues/` is useful when you need linked PRs, merge times, and still-open customer gaps.
- **Correlated subqueries:** TiDB decorrelates by default. When the subquery is well-indexed and the outer query is selective, `NO_DECORRELATE()` often wins. See `references/subquery-optimization.md`.
- **Join strategies matter:** TiDB supports hash join, index join, merge join, and shuffle joins. The right choice depends on table sizes, index availability, and data distribution. See `references/join-strategies.md`.
- **Hints are per-query; variables are per-session/global.** Use hints for surgical fixes, variables for workload-wide tuning.
- **TiFlash acceleration:** For analytical queries on large tables, push computation to TiFlash replicas using `READ_FROM_STORAGE(TIFLASH[<table>])`. See `references/session-variables.md`.

## References

- `references/optimizer-hints.md` — Optimizer hints: syntax, catalog, and when to use each.
- `references/session-variables.md` — Session/global variables that affect plan choice.
- `references/join-strategies.md` — Join algorithms, when TiDB picks each, and how to override.
- `references/subquery-optimization.md` — Decorrelation, semi-join, EXISTS/IN patterns and NO_DECORRELATE.
- `references/index-selection.md` — Index hints, invisible indexes, index advisor, composite index guidance.
- `references/explain-patterns.md` — Reading EXPLAIN ANALYZE output to identify bottlenecks.
- `references/stats-health-and-auto-analyze.md` — Statistics health, auto analyze backlog diagnosis, and safe concurrency tuning.
- `references/stats-loading-and-startup.md` — Init stats, sync load, restart-time plan instability, and version-based mitigation.
- `references/stats-version-and-analyze-configuration.md` — Stats versioning, analyze coverage, and memory-safe stats collection settings.
- `references/optimizer-oncall-experiences-redacted/` — Redacted optimizer oncall case corpus with user symptoms, investigation signals, workarounds, and fixed versions.
- `references/tidb-customer-planner-issues/README.md` — Generated GitHub issue corpus with one file per customer-driven planner issue, including linked PRs and merge timestamps.
142 changes: 142 additions & 0 deletions skills/tidb-query-tuning/references/explain-patterns.md
@@ -0,0 +1,142 @@
# Reading EXPLAIN ANALYZE Output

`EXPLAIN ANALYZE` is the primary tool for understanding how TiDB actually executes a query. It runs the query and reports actual row counts, execution time, and memory usage for each operator.

## Syntax

```sql
-- Basic (text format):
EXPLAIN ANALYZE SELECT ...;

-- Structured JSON format (better for programmatic analysis):
EXPLAIN FORMAT = "tidb_json" SELECT ...;

-- Estimate only (does NOT execute the query):
EXPLAIN SELECT ...;
```

**Use `EXPLAIN ANALYZE` for tuning.** Plain `EXPLAIN` shows estimates only, which may be wrong.

## Key columns in EXPLAIN ANALYZE output

| Column | Meaning |
|--------|---------|
| `id` | Operator name and position in the tree |
| `estRows` | Estimated row count from optimizer statistics |
| `actRows` | Actual row count observed during execution |
| `task` | Where the operator runs: `root` (TiDB), `cop[tikv]` (TiKV coprocessor), `cop[tiflash]` (TiFlash) |
| `access object` | Table, index, or partition being accessed |
| `execution info` | Wall time, loops, memory, disk usage, concurrency |
| `operator info` | Filter conditions, join keys, sort keys |

## What to look for

### 1. estRows vs actRows divergence

Large differences indicate stale or inaccurate statistics.

```
estRows: 100 actRows: 500000 ← Stats are wrong!
```

**Action:** Run `ANALYZE TABLE <table>` and re-check.
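The divergence check can be mechanized with a rough 10x rule of thumb. The sample line below mimics the `estRows`/`actRows` fields; real `EXPLAIN ANALYZE` output is column-formatted and varies by version, so treat the parsing as a sketch:

```shell
# Extract estRows and actRows from a sample plan line and flag a likely
# stale-stats operator when actual rows exceed the estimate by more than 10x.
line="IndexRangeScan_8  estRows:100  actRows:500000"
est=$(printf '%s\n' "$line" | sed -n 's/.*estRows:\([0-9]*\).*/\1/p')
act=$(printf '%s\n' "$line" | sed -n 's/.*actRows:\([0-9]*\).*/\1/p')
if [ "$act" -gt $((est * 10)) ]; then
  echo "stats likely stale: estRows=$est actRows=$act"
fi
```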

### 2. Expensive operators (by wall time)

Look at `execution info` for `time:` values. The operator with the longest time is the bottleneck.

```
execution info: time:2.5s, loops:1, ... ← This is the bottleneck
```

### 3. Full table scans on large tables

```
TableFullScan table:orders actRows:10000000
```

**Action:** Add an index or use a hint to force an index scan.

### 4. Hash join with large build side

```
HashJoin actRows:50000
├── Build actRows:5000000 ← Build side is huge
└── Probe actRows:50000
```

**Action:** Consider `INL_JOIN` if the large side's join key is indexed, or use the `HASH_JOIN_BUILD()` / `HASH_JOIN_PROBE()` hints to choose which side builds the hash table (`LEADING` changes join order, not build/probe roles).

### 5. Unnecessary Sort operators

```
Sort actRows:1000000
└── TableFullScan
```

If there's an index that provides the sort order, use `ORDER_INDEX` to eliminate the Sort.

### 6. Apply operator (correlated subquery)

```
Apply actRows:1000
├── Outer actRows:1000
└── Inner actRows:1000 (executed 1000 times)
```

This means correlated execution. If `actRows` on the outer side is small, this is fine. If large, consider letting TiDB decorrelate (remove `NO_DECORRELATE`) or adding an index on the inner side.

## Operator reference

### Scan operators

| Operator | Meaning |
|----------|---------|
| `TableFullScan` | Full table scan — reads every row |
| `TableRangeScan` | Scans a range of the primary key |
| `IndexRangeScan` | Scans a range of an index |
| `IndexFullScan` | Scans the entire index |
| `IndexLookUp` | Two-phase: index scan → table lookup for remaining columns |

### Join operators

| Operator | Meaning |
|----------|---------|
| `HashJoin` | Hash join (look for Build and Probe children) |
| `IndexJoin` | Index nested loop join |
| `MergeJoin` | Sort-merge join |
| `Apply` | Correlated subquery execution (per-row) |

### Aggregation operators

| Operator | Meaning |
|----------|---------|
| `HashAgg` | Hash-based aggregation |
| `StreamAgg` | Stream aggregation (requires sorted input) |

### Other operators

| Operator | Meaning |
|----------|---------|
| `Sort` | Sorts rows (expensive for large datasets) |
| `TopN` | Sort + limit combined (more efficient than separate Sort + Limit) |
| `Selection` | Filters rows (WHERE conditions not pushed to scan) |
| `Projection` | Computes output columns |
| `Limit` | Returns only N rows |

## Diagnostic workflow

```
1. Run EXPLAIN ANALYZE
2. Find the operator with the highest wall time
3. Check estRows vs actRows for that operator and its children
├── Big divergence → ANALYZE TABLE, then re-run
└── Stats are fine → Operator choice is the problem
4. Identify the pattern:
├── Full scan → Add index or USE_INDEX hint
├── Wrong join strategy → HASH_JOIN / INL_JOIN hint
├── Wrong join order → LEADING hint
├── Expensive correlated subquery → Check NO_DECORRELATE guidance
└── Unnecessary Sort → ORDER_INDEX hint or add sorted index
5. Apply fix and re-run EXPLAIN ANALYZE to verify
```