
docs: add text-to-sql dev note#349

Open
dhruvnathawani wants to merge 15 commits into main from dhruv/devnotes/text-to-sql

Conversation


@dhruvnathawani dhruvnathawani commented Feb 23, 2026

Summary

Add a dev note documenting the enterprise-grade text-to-SQL SDG pipeline used to generate training data for Nemotron's SQL capabilities across PostgreSQL, MySQL, and SQLite.

What's in the post

  • Motivation: The "real-world gap" between academic benchmarks (Spider >85%) and production SQL (Spider 2.0 Lite <50%) — dialect specificity, dirty data, distractor tables, industry-specific schemas, complexity gradients
  • Pipeline walkthrough: Conditional samplers (60 industries, 700 topics, 90 SQL concepts) → three-stage LLM generation (natural language prompt → database context with dirty data + distractors → SQL with chain-of-thought reasoning) → quality waterfall (SQLFluff syntax validation + 4-dimension LLM judge scoring)
  • ASCII pipeline diagram showing the 4-stage flow (Conditional Samplers → Three-Stage LLM Generation → Quality Waterfall → Output)
  • Deep dive into SubcategorySamplerParams for two-level conditional sampling (industry→topic, complexity→sql_concept)
  • Quality waterfall breakdown: 300k generated → ~180k after SQLFluff → 96.5k after judge filtering (68% total rejection)
  • Rich metadata table (industry, topic, complexity, concept, dialect, instruction style, 4 judge scores) enabling precision filtering for downstream training
  • Discussion of chain-of-thought reasoning traces teaching models to think like Data Engineers
  • Results: 96.5k filtered records, 3 SQL dialects, 60 industries, 700 topics, 89 SQL concepts, 100% syntax-verified, all judge dimensions ≥ 3/4
  • 7 key takeaways covering conditional sampling, three-stage generation, dirty data, distractor tables, hard validators, multi-dimension scoring and CoT reasoning
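The quality waterfall and metadata-driven filtering described above can be sketched in a few lines. This is a hypothetical illustration only: the column names (`judge_*`, `industry`, `complexity`, `sql_dialect`) are stand-ins for the post's metadata fields, not the actual dataset schema.

```python
# Minimal sketch of judge-threshold filtering plus metadata slicing.
# Field names are hypothetical stand-ins for the post's metadata table.
records = [
    {"industry": "Healthcare", "complexity": "Advanced", "sql_dialect": "postgres",
     "judge_relevance": 4, "judge_correctness": 4,
     "judge_readability": 3, "judge_schema_alignment": 4},
    {"industry": "Finance", "complexity": "Basic", "sql_dialect": "mysql",
     "judge_relevance": 3, "judge_correctness": 2,   # fails the >= 3 cut
     "judge_readability": 4, "judge_schema_alignment": 4},
]

def passes_waterfall(rec: dict, threshold: int = 3) -> bool:
    """Keep a record only if every judge dimension scores >= threshold (out of 4)."""
    return all(v >= threshold for k, v in rec.items() if k.startswith("judge_"))

kept = [r for r in records if passes_waterfall(r)]
# The rich metadata then enables targeted slices, e.g. advanced PostgreSQL only:
advanced_pg = [r for r in kept
               if r["sql_dialect"] == "postgres" and r["complexity"] == "Advanced"]
print(len(kept), len(advanced_pg))  # -> 1 1
```

The same pattern scales to the post's 300k → 96.5k waterfall: a hard syntax gate first, then the per-dimension judge threshold, then arbitrary metadata slices for downstream training mixes.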

Files changed

  • docs/devnotes/posts/text-to-sql.md (new)

@nabinchha nabinchha requested a review from 3mei February 26, 2026 00:11
@dhruvnathawani dhruvnathawani marked this pull request as ready for review March 9, 2026 22:01
@dhruvnathawani dhruvnathawani requested review from a team as code owners March 9, 2026 22:01

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR adds a new dev note (docs/devnotes/posts/text-to-sql.md) documenting the enterprise-grade text-to-SQL synthetic data generation pipeline used to train Nemotron Super v3's SQL capabilities, along with two supporting images and two new author entries. The post is well-structured and technically detailed, covering the five-stage pipeline (seeding, prompt generation, schema generation, SQL generation, and quality waterfall), BIRD benchmark results (+15 EX points), and rich metadata for precision filtering.

Several issues were identified and addressed in prior review threads (score-0 falsy check, sql_dialect seeding, Window Functions taxonomy contradiction, EHR naming mismatch, num_records vs. 300k claim, BIRD score framing, MySQL dialect restrictions). Two new minor clarity issues remain:

  • Key Takeaway #5 cites REPLACE() vs regexp_replace as an example of dialect-specific syntax, but REPLACE() is a universal function available identically in all three target dialects (SQLite, MySQL, PostgreSQL), making the comparison misleading.
  • The Results table formats the minimum judge threshold as ≥ 3/4, which is ambiguous (fraction vs. score-out-of-4); ≥ 3 out of 4 would be clearer.

Confidence Score: 4/5

  • Safe to merge with minor documentation clarity fixes applied.
  • This is a documentation-only PR with no code changes to production logic. Most previously raised issues have been addressed. The two remaining new findings are style-level documentation clarity issues (a misleading dialect example and an ambiguous notation) that do not affect correctness. The pipeline description and code snippets are internally consistent in the current diff.
  • docs/devnotes/posts/text-to-sql.md — review the Key Takeaway #5 REPLACE() example and the ≥ 3/4 threshold notation in the Results table.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/.authors.yml | Adds two new author entries (ymeyer, mvansegbroeck) with correct avatar URLs and descriptions; no issues. |
| docs/devnotes/posts/images/bird-benchmark-results.jpg | New binary image file for the BIRD benchmark results visualization; no issues. |
| docs/devnotes/posts/images/text-to-sql-pipeline.jpg | New binary image file for the text-to-SQL pipeline diagram; no issues. |
| docs/devnotes/posts/text-to-sql.md | 598-line dev note documenting the enterprise text-to-SQL SDG pipeline. Several issues remain from prior review threads (num_records/300k discrepancy, Window Functions categorisation, EHR Systems naming). Two new minor issues identified: REPLACE() listed as dialect-specific when it is universal, and the "≥ 3/4" judge threshold notation is ambiguous. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Stage 1: Seeding & Diversification\n(CategorySampler + SubcategorySampler)\nindustry×topic, complexity×sql_concept,\ndialect, task_type, prompt style"] --> B
    B["Stage 2: Prompt Generation\n(Reasoning LLM)\nNatural-language business request\n(no SQL jargon)"] --> C
    C["Stage 3: Schema + Data Generation\n(Reasoning LLM)\nDialect DDL + INSERT\n+ distractor tables/columns\n+ dirty data injection"] --> D
    D["Stage 4: SQL Generation\n(Reasoning LLM)\nDialect-specific executable SQL\nwith chain-of-thought reasoning\n+ dirty data handling"] --> E
    E["Stage 5: Quality Waterfall\nSyntax validator (SQLFluff)\n+ 5 LLM judges × 15 dimensions\n0–4 scale per dimension"] --> F
    F{"Pass threshold?\n≥ 3/4 on all\ndimensions"}
    F -- Yes --> G["Final Dataset\n~32k records / dialect\n96.5k total\n(3 dialects)"]
    F -- No --> H["Rejected (~68%)"]
    style G fill:#2e7d32,color:#fff
    style H fill:#c62828,color:#fff
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 483

Comment:
**`REPLACE()` is not dialect-specific**

Key Takeaway #5 lists `REPLACE()` vs `regexp_replace` as an example of dialect-specific syntax differences:

> "the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `REPLACE()` vs `regexp_replace`)"

However, `REPLACE()` is a universal SQL function supported identically in SQLite, MySQL, and PostgreSQL — it is not dialect-specific. The comparison implies that some dialects use `REPLACE()` while others use `regexp_replace`, but both can be present in the same dialect (e.g., PostgreSQL supports both). A more accurate example pair would be something like `strftime('%Y', col)` (SQLite) vs. `YEAR(col)` (MySQL) vs. `DATE_PART('year', col)` (PostgreSQL), which the `strftime vs DATE_SUB vs interval` pair already illustrates.

```suggestion
5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `strftime('%Y', col)` vs `YEAR()` vs `DATE_PART()`). Each dialect gets its own tailored prompts, validators, and judge prompts.
```

How can I resolve this? If you propose a fix, please make it concise.
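For reference, the SQLite half of the suggested example pair can be sanity-checked directly with Python's built-in sqlite3 module. This is a quick illustrative sketch, not part of the pipeline; the table and column names are made up.

```python
import sqlite3

# SQLite's strftime('%Y', col) extracts the year; MySQL would use
# YEAR(col) and PostgreSQL DATE_PART('year', col) — a genuinely
# dialect-specific trio, unlike the universal REPLACE().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, ordered_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, '2026-03-09')")
year, = conn.execute(
    "SELECT strftime('%Y', ordered_at) FROM orders"
).fetchone()
print(year)  # -> 2026
```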

---

This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 449

Comment:
**Ambiguous "≥ 3/4" notation**

The results table entry `≥ 3/4 across all dimensions` uses a fraction-like notation that is ambiguous. A reader could interpret it as "at least 75% (the fraction three-quarters)" rather than the intended meaning of "at least 3 out of a maximum of 4". Given that the judge scoring scale (0–4) is defined just above this table, writing it as `≥ 3 out of 4` makes the threshold unambiguous.

```suggestion
| Minimum judge score | ≥ 3 out of 4 across all dimensions |
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 541787f

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
PR feedback fixes:
- Fix Window Functions contradiction: Key Takeaway #1 now uses
  "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check
  in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead
  of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
  team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
  ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
  provider pattern (matches structured-outputs dev note), remove unused
  explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note
New content:
- Add BIRD Benchmark Results section with bar chart (JPG), data table,
  BIRD caveat paragraph, and Jocelyn Huang acknowledgement
  (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
  Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post
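The score-0 truthiness fix noted above is easy to demonstrate in isolation. A minimal Jinja2 sketch (not the pipeline's actual expression columns; `judge_score` is a hypothetical field name):

```python
from jinja2 import Template

zero_score = {"judge_score": 0}      # legitimate 0 on the 0-4 scale
no_score = {"judge_score": None}     # judge failed to return a score

# Buggy truthy check: conflates a valid score of 0 with a missing one.
buggy = Template("{{ 'missing' if not judge_score else judge_score }}")
# Fixed: explicit `is none` test keeps 0 as a real score.
fixed = Template("{{ 'missing' if judge_score is none else judge_score }}")

print(buggy.render(**zero_score))  # -> missing   (wrong: 0 silently dropped)
print(fixed.render(**zero_score))  # -> 0
print(fixed.render(**no_score))    # -> missing
```

The truthy version would silently discard every record a judge scored 0, which is exactly the population a quality filter most needs to see.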
dhruvnathawani and others added 3 commits March 11, 2026 11:54
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not
  runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline
  details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
… recipe

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative,
  with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom
  imports from prompts.py, rubrics.py, text2sql_seed.json) with
  snippet include of enterprise_text_to_sql.py recipe — self-contained
  and runnable, consistent with other merged dev notes (nabinchha)
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line
The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.
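A minimal sketch of what that seeding fix accomplishes, using plain Jinja2 rather than the data_designer sampler configs (the prompt text and `request` field are hypothetical):

```python
from jinja2 import Template

# Once sql_dialect is seeded, the Step 3/4 templates resolve cleanly.
# The real pipeline runs the full generation once per dialect rather
# than mixing dialects within a single run.
SQL_DIALECTS = ["postgresql", "mysql", "sqlite"]

sql_prompt = Template(
    "Write executable {{ sql_dialect }} SQL answering: {{ request }}"
)

for dialect in SQL_DIALECTS:  # one complete pipeline run per dialect
    print(sql_prompt.render(sql_dialect=dialect,
                            request="monthly revenue per region"))
```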

Made-with: Cursor
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction
  with the 41.80%/38.25% direct-generation results shown later (those
  higher figures come from specialized systems with schema linking)
- Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
  CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
  portability, not because the dialect forbids them
