
docs: add text-to-sql dev note#349

Open
dhruvnathawani wants to merge 15 commits into main from dhruv/devnotes/text-to-sql

Conversation


@dhruvnathawani dhruvnathawani commented Feb 23, 2026

Summary

Add a dev note documenting the enterprise-grade text-to-SQL SDG pipeline used to generate training data for Nemotron's SQL capabilities across PostgreSQL, MySQL, and SQLite.

What's in the post

  • Motivation: The "real-world gap" between academic benchmarks (Spider >85%) and production SQL (Spider 2.0 Lite <50%) — dialect specificity, dirty data, distractor tables, industry-specific schemas, complexity gradients
  • Pipeline walkthrough: Conditional samplers (60 industries, 700 topics, 90 SQL concepts) → three-stage LLM generation (natural language prompt → database context with dirty data + distractors → SQL with chain-of-thought reasoning) → quality waterfall (SQLFluff syntax validation + 4-dimension LLM judge scoring)
  • ASCII pipeline diagram showing the 4-stage flow (Conditional Samplers → Three-Stage LLM Generation → Quality Waterfall → Output)
  • Deep dive into SubcategorySamplerParams for two-level conditional sampling (industry→topic, complexity→sql_concept)
  • Quality waterfall breakdown: 300k generated → ~180k after SQLFluff → 96.5k after judge filtering (68% total rejection)
  • Rich metadata table (industry, topic, complexity, concept, dialect, instruction style, 4 judge scores) enabling precision filtering for downstream training
  • Discussion of chain-of-thought reasoning traces teaching models to think like Data Engineers
  • Results: 96.5k filtered records, 3 SQL dialects, 60 industries, 700 topics, 89 SQL concepts, 100% syntax-verified, all judge dimensions ≥ 3/4
  • 7 key takeaways covering conditional sampling, three-stage generation, dirty data, distractor tables, hard validators, multi-dimension scoring and CoT reasoning
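The quality waterfall and metadata-driven filtering described above can be sketched in a few lines. This is a hypothetical illustration only: the column names (`judge_*`, `industry`, `complexity`, `sql_dialect`) are stand-ins for the post's metadata fields, not the actual dataset schema.

```python
# Minimal sketch of judge-threshold filtering plus metadata slicing.
# Field names are hypothetical stand-ins for the post's metadata table.
records = [
    {"industry": "Healthcare", "complexity": "Advanced", "sql_dialect": "postgres",
     "judge_relevance": 4, "judge_correctness": 4,
     "judge_readability": 3, "judge_schema_alignment": 4},
    {"industry": "Finance", "complexity": "Basic", "sql_dialect": "mysql",
     "judge_relevance": 3, "judge_correctness": 2,   # fails the >= 3 cut
     "judge_readability": 4, "judge_schema_alignment": 4},
]

def passes_waterfall(rec: dict, threshold: int = 3) -> bool:
    """Keep a record only if every judge dimension scores >= threshold (out of 4)."""
    return all(v >= threshold for k, v in rec.items() if k.startswith("judge_"))

kept = [r for r in records if passes_waterfall(r)]
# The rich metadata then enables targeted slices, e.g. advanced PostgreSQL only:
advanced_pg = [r for r in kept
               if r["sql_dialect"] == "postgres" and r["complexity"] == "Advanced"]
print(len(kept), len(advanced_pg))  # -> 1 1
```

The same pattern scales to the post's 300k → 96.5k waterfall: a hard syntax gate first, then the per-dimension judge threshold, then arbitrary metadata slices for downstream training mixes.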

Files changed

  • docs/devnotes/posts/text-to-sql.md (new)

@nabinchha nabinchha requested a review from 3mei February 26, 2026 00:11
@dhruvnathawani dhruvnathawani marked this pull request as ready for review March 9, 2026 22:01
@dhruvnathawani dhruvnathawani requested review from a team as code owners March 9, 2026 22:01

greptile-apps bot commented Mar 9, 2026

Greptile Summary

This PR adds a new dev note (docs/devnotes/posts/text-to-sql.md) documenting the enterprise-grade text-to-SQL synthetic data generation pipeline used to train Nemotron Super v3's SQL capabilities, along with two supporting images and two new author entries. The post is well-structured and technically detailed, covering the five-stage pipeline (seeding, prompt generation, schema generation, SQL generation, and quality waterfall), BIRD benchmark results (+15 EX points), and rich metadata for precision filtering.

Several issues were identified and addressed in prior review threads (score-0 falsy check, sql_dialect seeding, Window Functions taxonomy contradiction, EHR naming mismatch, num_records vs. 300k claim, BIRD score framing, MySQL dialect restrictions). Two new minor clarity issues remain:

  • Key Takeaway #5 cites REPLACE() vs regexp_replace as an example of dialect-specific syntax, but REPLACE() is a universal function available identically in all three target dialects (SQLite, MySQL, PostgreSQL), making the comparison misleading.
  • The Results table formats the minimum judge threshold as ≥ 3/4, which is ambiguous (fraction vs. score-out-of-4); ≥ 3 out of 4 would be clearer.

Confidence Score: 4/5

  • Safe to merge with minor documentation clarity fixes applied.
  • This is a documentation-only PR with no code changes to production logic. Most previously raised issues have been addressed. The two remaining new findings are style-level documentation clarity issues (a misleading dialect example and an ambiguous notation) that do not affect correctness. The pipeline description and code snippets are internally consistent in the current diff.
  • docs/devnotes/posts/text-to-sql.md — review the Key Takeaway #5 REPLACE() example and the ≥ 3/4 threshold notation in the Results table.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/devnotes/.authors.yml | Adds two new author entries (ymeyer, mvansegbroeck) with correct avatar URLs and descriptions; no issues. |
| docs/devnotes/posts/images/bird-benchmark-results.jpg | New binary image file for the BIRD benchmark results visualization; no issues. |
| docs/devnotes/posts/images/text-to-sql-pipeline.jpg | New binary image file for the text-to-SQL pipeline diagram; no issues. |
| docs/devnotes/posts/text-to-sql.md | 598-line dev note documenting the enterprise text-to-SQL SDG pipeline. Several issues remain from prior review threads (num_records/300k discrepancy, Window Functions categorisation, EHR Systems naming). Two new minor issues identified: REPLACE() listed as dialect-specific when it is universal, and the "≥ 3/4" judge threshold notation is ambiguous. |

Flowchart

```mermaid
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Stage 1: Seeding & Diversification\n(CategorySampler + SubcategorySampler)\nindustry×topic, complexity×sql_concept,\ndialect, task_type, prompt style"] --> B
    B["Stage 2: Prompt Generation\n(Reasoning LLM)\nNatural-language business request\n(no SQL jargon)"] --> C
    C["Stage 3: Schema + Data Generation\n(Reasoning LLM)\nDialect DDL + INSERT\n+ distractor tables/columns\n+ dirty data injection"] --> D
    D["Stage 4: SQL Generation\n(Reasoning LLM)\nDialect-specific executable SQL\nwith chain-of-thought reasoning\n+ dirty data handling"] --> E
    E["Stage 5: Quality Waterfall\nSyntax validator (SQLFluff)\n+ 5 LLM judges × 15 dimensions\n0–4 scale per dimension"] --> F
    F{"Pass threshold?\n≥ 3/4 on all\ndimensions"}
    F -- Yes --> G["Final Dataset\n~32k records / dialect\n96.5k total\n(3 dialects)"]
    F -- No --> H["Rejected (~68%)"]
    style G fill:#2e7d32,color:#fff
    style H fill:#c62828,color:#fff
```
Prompt To Fix All With AI
This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 483

Comment:
**`REPLACE()` is not dialect-specific**

Key Takeaway #5 lists `REPLACE()` vs `regexp_replace` as an example of dialect-specific syntax differences:

> "the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `REPLACE()` vs `regexp_replace`)"

However, `REPLACE()` is a universal SQL function supported identically in SQLite, MySQL, and PostgreSQL — it is not dialect-specific. The comparison implies that some dialects use `REPLACE()` while others use `regexp_replace`, but both can be present in the same dialect (e.g., PostgreSQL supports both). A more accurate example pair would be something like `strftime('%Y', col)` (SQLite) vs. `YEAR(col)` (MySQL) vs. `DATE_PART('year', col)` (PostgreSQL), which the `strftime vs DATE_SUB vs interval` pair already illustrates.

```suggestion
5. **Per-dialect generation avoids lowest-common-denominator SQL.** Rather than generating ANSI SQL and hoping it works everywhere, the pipeline produces dialect-specific schemas and queries with appropriate syntax (`strftime` vs `DATE_SUB` vs `interval`, `strftime('%Y', col)` vs `YEAR()` vs `DATE_PART()`). Each dialect gets its own tailored prompts, validators, and judge prompts.
```

How can I resolve this? If you propose a fix, please make it concise.
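For reference, the SQLite half of the suggested example pair can be sanity-checked directly with Python's built-in sqlite3 module. This is a quick illustrative sketch, not part of the pipeline; the table and column names are made up.

```python
import sqlite3

# SQLite's strftime('%Y', col) extracts the year; MySQL would use
# YEAR(col) and PostgreSQL DATE_PART('year', col) — a genuinely
# dialect-specific trio, unlike the universal REPLACE().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, ordered_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, '2026-03-09')")
year, = conn.execute(
    "SELECT strftime('%Y', ordered_at) FROM orders"
).fetchone()
print(year)  # -> 2026
```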

---

This is a comment left during a code review.
Path: docs/devnotes/posts/text-to-sql.md
Line: 449

Comment:
**Ambiguous "≥ 3/4" notation**

The results table entry `≥ 3/4 across all dimensions` uses a fraction-like notation that is ambiguous. A reader could interpret it as "at least 75% (the fraction three-quarters)" rather than the intended meaning of "at least 3 out of a maximum of 4". Given that the judge scoring scale (0–4) is defined just above this table, writing it as `≥ 3 out of 4` makes the threshold unambiguous.

```suggestion
| Minimum judge score | ≥ 3 out of 4 across all dimensions |
```

How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 541787f

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
PR feedback fixes:
- Fix Window Functions contradiction: Key Takeaway #1 now uses
  "Geospatial SQL" (Advanced) instead of "Window Functions" (Intermediate)
- Fix score-0 truthiness bug: use `is not none` instead of truthy check
  in Jinja2 expression columns (inline example + production pipeline)
- Soften Code Sandbox language: "A natural next step would be..." instead
  of "We are actively implementing..."
- Cut Gretel reference per mvansegbroeck: replaced with NVIDIA/Nemotron
  team description
- Replace Qwen model references with Nemotron per mvansegbroeck: MODEL_NAME,
  ASCII diagram labels, Pipeline Overview prose
- Rename sdg_qwen_235b.py -> sdg_ndd_text2sql.py per mvansegbroeck
- Fix Try It Yourself: use MODEL_ALIAS = "nvidia-text" with default
  provider pattern (matches structured-outputs dev note), remove unused
  explicit ModelConfig
- Remove placeholder dataset link (#), add "Dataset: Internal" note
New content:
- Add BIRD Benchmark Results section with bar chart (JPG), data table,
  BIRD caveat paragraph, and Jocelyn Huang acknowledgement
  (Nemotron Super EX: 26.77% -> 41.80%, +15 pts, beats GPT-OSS-120B)
- Replace "Looking Ahead: Code Sandbox" with broader "Next Steps":
  Code Sandbox, RL on BIRD via NeMo Gym, schema representation, Spider 2.0
- Add Project Summary table at end of post
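The score-0 truthiness fix noted above is easy to demonstrate in isolation. A minimal Jinja2 sketch (not the pipeline's actual expression columns; `judge_score` is a hypothetical field name):

```python
from jinja2 import Template

zero_score = {"judge_score": 0}      # legitimate 0 on the 0-4 scale
no_score = {"judge_score": None}     # judge failed to return a score

# Buggy truthy check: conflates a valid score of 0 with a missing one.
buggy = Template("{{ 'missing' if not judge_score else judge_score }}")
# Fixed: explicit `is none` test keeps 0 as a real score.
fixed = Template("{{ 'missing' if judge_score is none else judge_score }}")

print(buggy.render(**zero_score))  # -> missing   (wrong: 0 silently dropped)
print(fixed.render(**zero_score))  # -> 0
print(fixed.render(**no_score))    # -> missing
```

The truthy version would silently discard every record a judge scored 0, which is exactly the population a quality filter most needs to see.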
dhruvnathawani and others added 3 commits March 11, 2026 11:54
- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying code snippets are illustrative, not
  runnable, with link to Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Add companion file note and recipe link to production pipeline
  details block for prompts.py, rubrics.py, text2sql_seed.json (nabinchha)
… recipe

- Fix "EHR Systems" -> "Electronic Health Records" in Key Takeaway #1
  to match the exact taxonomy string in the code example (greptile)
- Add admonition clarifying inline code snippets are illustrative,
  with link to runnable Enterprise Text-to-SQL Recipe (nabinchha)
- Add context before score extraction snippet referencing the five
  LLMJudgeColumnConfig columns and linking to full recipe (nabinchha)
- Replace production pipeline <details> block (230 lines with phantom
  imports from prompts.py, rubrics.py, text2sql_seed.json) with
  snippet include of enterprise_text_to_sql.py recipe — self-contained
  and runnable, consistent with other merged dev notes (nabinchha)
- Wrap minimal inline example in collapsible <details> dropdown
- Rename "A Team Effort" section to "Summary"
- Remove redundant Scale/Dialects/Dataset line
The Step 3/4 prompt templates reference {{ sql_dialect }} but the
Step 1 seeding code never defined it, leaving an unresolved Jinja2
variable for readers following along. Add the sql_dialect sampler
with a comment explaining the pipeline runs once per dialect.
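A minimal sketch of what that seeding fix accomplishes, using plain Jinja2 rather than the data_designer sampler configs (the prompt text and `request` field are hypothetical):

```python
from jinja2 import Template

# Once sql_dialect is seeded, the Step 3/4 templates resolve cleanly.
# The real pipeline runs the full generation once per dialect rather
# than mixing dialects within a single run.
SQL_DIALECTS = ["postgresql", "mysql", "sqlite"]

sql_prompt = Template(
    "Write executable {{ sql_dialect }} SQL answering: {{ request }}"
)

for dialect in SQL_DIALECTS:  # one complete pipeline run per dialect
    print(sql_prompt.render(sql_dialect=dialect,
                            request="monthly revenue per region"))
```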

Made-with: Cursor
- Remove specific "60-70%" BIRD claim from intro to avoid contradiction
  with the 41.80%/38.25% direct-generation results shown later (those
  higher figures come from specialized systems with schema linking)
- Reword MySQL "forbids" to "prompts exclude" -- REGEXP_REPLACE and
  CONVERT_TZ are valid MySQL functions; the pipeline excluded them for
  portability, not because the dialect forbids them
