docs(devnotes): add Nemotron-Personas dev note #611

Open
3mei wants to merge 9 commits into main from yev/nemotron_personas_dev_note

Conversation

@3mei
Contributor

@3mei 3mei commented May 7, 2026

📋 Summary

Adds the Inside Nemotron-Personas dev note covering how the multi-locale Nemotron-Personas HF collection is built (4-stage compound-AI pipeline) and how it's used as a seeding primitive across Nemotron training (long-context, tool-use, formal logic, safety refusals, instruction-following). Ships alongside a runnable Tutorial 7 demonstrating reproduction + customization, plus a Colab variant.

🔗 Related Issue

N/A

🔄 Changes

✨ Added

  • docs/devnotes/posts/nemotron-personas.md — new dev note
  • docs/devnotes/posts/assets/nemotron-personas/ — four images: three pipeline-stage diagrams from the partner repo plus a black-background Nemotron-Personas world-map hero
  • docs/notebook_source/7-nemotron-personas.py — jupytext source for the Reproducing & Customizing Nemotron-Personas tutorial;
  • docs/colab_notebooks/7-nemotron-personas.ipynb — committed Colab variant; i

🔧 Changed

  • docs/scripts/generate_colab_notebooks.py — adds an ADDITIONAL_SETUP_CELLS map paralleling ADDITIONAL_DEPENDENCIES; injects NGC CLI install + NGC_API_KEY cells. Future devnote-paired tutorials needing extra Colab bootstrap can register one-line entries in the same map.
  • mkdocs.yml — adds Reproducing & Customizing Nemotron-Personas under the Tutorials nav

🧪 Testing

  • make test passes
  • Notebook runs end-to-end via jupytext --to ipynb --execute
  • make generate-colab-notebooks regenerates the Colab .ipynb cleanly with the NGC setup cells in the expected position
  • Unit tests added/updated (N/A — this PR is docs + tutorial assets; no engine code changed)
  • E2E tests added/updated (N/A — Tutorial 7 is opt-in via make convert-execute-notebooks and gated on NVIDIA_API_KEY + on-disk NGC dataset, matching how Tutorials 5/6 are gated on OPENROUTER_API_KEY)

✅ Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Architecture docs updated (N/A — no architectural changes)

3mei added 2 commits May 7, 2026 02:04
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@github-actions
Contributor

github-actions Bot commented May 7, 2026

MkDocs preview: https://cb8a7ea0.dd-docs-preview.pages.dev

Fern preview: https://nvidia-preview-pr-611.docs.buildwithfern.com/nemo/datadesigner

Fern previews include the docs-website version archive with PR changes synced into latest. Notebook tutorials are rendered without execution outputs in previews.

@3mei 3mei changed the title Nemotron-Personas Dev Note docs(devnotes): add Nemotron-Personas dev note May 7, 2026
Contributor

@danecor danecor left a comment

Looks good! Some possible issues / suggestions attached.

Comment thread docs/scripts/generate_colab_notebooks.py
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/devnotes/posts/nemotron-personas.md Outdated
Comment thread docs/notebook_source/7-nemotron-personas.py Outdated
Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@3mei 3mei requested review from danecor and johnnygreco May 27, 2026 23:02
…_dev_note

# Conflicts:
#	docs/scripts/generate_colab_notebooks.py
@3mei 3mei marked this pull request as ready for review May 28, 2026 01:29
@3mei 3mei requested a review from a team as a code owner May 28, 2026 01:29
@github-actions
Contributor

Code Review: PR #611 (docs(devnotes): add Nemotron-Personas dev note)

Summary

A docs-and-tutorial PR that ships:

  • docs/devnotes/posts/nemotron-personas.md — long-form dev note covering how the multi-locale Nemotron-Personas HF collection is built (a 4-stage compound-AI pipeline) and how those personas seed Nemotron training (long-context, tool-use, formal-logic, safety refusals, instruction-following).
  • docs/notebook_source/7-nemotron-personas.py (732 lines) — jupytext source for Reproducing & Customizing Nemotron-Personas. Reproduces the released schema from the NGC-hosted Nemotron-Personas-USA artifact via PersonSampler + two LLMStructuredColumnConfig stages, then layers a small tech_persona customization example.
  • docs/colab_notebooks/7-nemotron-personas.ipynb — committed Colab variant.
  • docs/scripts/generate_colab_notebooks.py — extends the Colab-cell generator with ADDITIONAL_API_KEY_BLOCKS (and a parallel-but-currently-unused ADDITIONAL_SETUP_CELLS) so this notebook can request an NGC_API_KEY in addition to the standard NVIDIA_API_KEY.
  • mkdocs.yml — adds the Tutorial 7 nav entry.
  • Four pipeline-diagram / hero PNGs under docs/devnotes/posts/assets/nemotron-personas/.
  • Tiny doc fix: quick-start/ → latest/quick-start/ in docs/notebook_source/README.md.

The diff is +2228/−4. No engine code is touched; structural invariants (import direction, lazy heavy imports, etc.) are not at risk.

Findings

Accuracy & content (dev note + tutorial)

  • Frontmatter matches existing dev-note convention (design-principles.md, text-to-sql.md): date + authors only. ✅
  • Quotes from the Nemotron 3 Super Technical Report are attributed and linked. Each block-quote includes a citation and the same URL is repeated rather than relying on an unstable shorthand. ✅
  • Pipeline narrative is consistent with the tutorial code. Stage 1 (OCEAN), Stage 2 (PGM/PersonSampler), Stage 3 (PersonaAttributes), Stage 4 (Personas) line up across both files. The dev note correctly says "nine cohesive persona descriptions" and the Personas Pydantic model in the tutorial defines exactly nine fields (professional, finance, healthcare, sports, arts, travel, culinary, concise, detailed). ✅
  • Locale list is consistent across the dev note, tutorial markdown, and the Try it yourself / Next Steps sections (en_US, en_IN, en_SG, fr_FR, hi_Deva_IN, hi_Latn_IN, ja_JP, ko_KR, pt_BR). ✅
  • Cross-link to other dev notes (design-principles.md, text-to-sql.md, push-datasets-to-hugging-face-hub.md) uses relative paths, which is correct for mkdocs. ✅
  • Image references inside the dev note use relative paths (assets/nemotron-personas/...) — correct. The two embedded diagrams in the tutorial source use raw GitHub URLs pointing at main, which means they won't render in a freshly-rendered notebook until this PR is merged. That's the same approach Tutorials 5/6 take, so consistent — but worth being aware of when reviewing the rendered output before merge.

Minor accuracy issue

  • Typo "experince" appears twice in 7-nemotron-personas.py (lines 515 and 604) inside PERSONA_SYSTEM_PROMPT and the inline prompt string ("A neonatal nurse with decades of experince…"). The same typo appears verbatim in the dev note's prompt copy. It's user-facing prompt text fed to the LLM, so the impact is small, but worth fixing — easy win and the typo presumably came from an upstream copy.

Code quality — 7-nemotron-personas.py

  • from __future__ import annotations ✅, modern type syntax (dict[str, dict[str, str]], int | None) ✅, absolute imports ✅, type-annotated helpers ✅. Consistent with project style.
  • The "verify dataset is on disk" cell raises SystemExit with a clear pointer back to the setup cell — good UX for a notebook that depends on an out-of-band download.
  • The SAMPLE_FROM_SDG_PGM = True branch is gated behind raise NotImplementedError with an informative message. The dead code above the raise is a sketch of the eventual integration. This is a reasonable pattern for a tutorial that documents a "future path", but consider one of:
    • Move the sketch into a markdown cell (it's documentation, not code that runs), or
    • At minimum, add # pragma: no cover / a comment that it's intentionally unreachable.
    • Today, lints / static analysis on docs/notebook_source/ may flag the unreachable lines as dead. (Not a blocker; the notebook isn't part of the import path.)
  • Validator workaround: several ExpressionColumnConfigs use {{ field if field else ' ' }} (single space) and the comment notes "DD's validator rejects expression columns that render to ''". Reasonable workaround for a tutorial; if this is a frequent pattern, a follow-up engine change to allow nullable expression columns would be cleaner — flag for the engine team but not in scope here.
  • temperature=UniformDistribution(low=0.9, high=1.1) is unusual (>1.0 can be aggressive on some endpoints). The markdown above the cell explicitly tells the user to consult the model card — that's the right escape hatch for a tutorial.
  • NUM_RECORDS = 50 for the scale-up cell is appropriately small for a runnable tutorial; the surrounding text correctly notes the released artifact scales to millions.
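
The validator workaround noted above relies on Jinja's inline-if, which has the same shape as Python's conditional expression. The sketch below is a plain-Python analogue only, not Data Designer's rendering code; `render_or_space` and the sample values are hypothetical names chosen for illustration.

```python
from typing import Optional


def render_or_space(field: Optional[str]) -> str:
    """Plain-Python analogue of the Jinja inline-if {{ field if field else ' ' }}."""
    # Falsy inputs (None or "") fall back to a single space, so the rendered
    # value is never the empty string.
    return field if field else " "


# Hypothetical sample values: one populated field and two "missing" variants.
rendered = [render_or_space(value) for value in ("Travis County", "", None)]
```

A downstream consumer that wants to treat the placeholder as missing can call `.strip()` on the value and check for an empty result.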

generate_colab_notebooks.py extension

  • The change is backward-compatible: new parameters on create_colab_setup_cells have defaults, and existing callers (notebooks 1–6) pick up empty maps. ✅
  • Joining the NGC API-key block into the existing os.environ/getpass cell (rather than emitting a duplicate cell with its own imports) is the right call. The comment explains the rationale clearly.
  • Minor concern: ADDITIONAL_SETUP_CELLS is added but currently empty. The comment ("Currently unused; left in place so future tutorials can register…") explicitly flags it as speculative. AGENTS.md style guidance is "Don't design for hypothetical future requirements". This is a docs script, not engine code, and the cost is small (one empty dict + a .get call), so I'd flag it as a nit rather than a blocker — but if a future tutorial needs setup cells, the dict could be added at that time with no extra ceremony. Consider removing the unused map and the corresponding parameter, then re-adding when the first real consumer lands.
  • One small naming nit: the public hash key is "7-nemotron-personas.py" (with .py), matching the existing ADDITIONAL_DEPENDENCIES convention. Consistent. ✅
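
The joining behavior described above can be sketched roughly as follows. This is an illustration under assumptions, not the script's actual code: the names `ADDITIONAL_API_KEY_BLOCKS`, `NGC_API_KEY_BLOCK`, `COLAB_API_KEY_CELL`, and the `"7-nemotron-personas.py"` key come from this thread, while `build_api_key_cell` and all string contents are hypothetical placeholders.

```python
# Hypothetical sketch: per-notebook API-key blocks are appended to one shared
# cell, so the imports at the top of that cell appear only once.

# Placeholder for the shared cell; the real cell text is not shown in this thread.
COLAB_API_KEY_CELL = (
    "import getpass\n"
    "import os\n"
    "\n"
    "# ... standard NVIDIA_API_KEY handling goes here ...\n"
)

# Placeholder for the extra block; it reuses the imports from the shared cell.
NGC_API_KEY_BLOCK = "# ... NGC_API_KEY handling goes here ...\n"

# Keyed by notebook source filename (with .py), like ADDITIONAL_DEPENDENCIES.
ADDITIONAL_API_KEY_BLOCKS = {
    "7-nemotron-personas.py": [NGC_API_KEY_BLOCK],
}


def build_api_key_cell(notebook_name: str) -> str:
    """Return the source of the single API-key cell for one notebook."""
    # Notebooks without an entry get an empty list, so their cell is unchanged.
    extra_blocks = ADDITIONAL_API_KEY_BLOCKS.get(notebook_name, [])
    return "\n".join([COLAB_API_KEY_CELL, *extra_blocks])
```

Because the lookup defaults to an empty list, notebooks that register nothing produce exactly the original cell, which is the backward-compatibility property the review calls out.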

Notebook execution & gating

  • The PR description states the notebook is "opt-in via make convert-execute-notebooks and gated on NVIDIA_API_KEY + on-disk NGC dataset, matching how Tutorials 5/6 are gated on OPENROUTER_API_KEY". I did not verify the gating mechanism in this review, but the verify-on-disk cell (raise SystemExit if the parquet isn't found) provides a clean fail-fast path locally, and the tutorial author confirms make test passes. Worth a sanity check on CI that this notebook is excluded from auto-execution unless the NGC asset is present.
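
For reference, a fail-fast check of the kind described above might look like the sketch below. It is not the tutorial's cell: `require_dataset`, the glob pattern, and the message wording are assumptions; only the idea of raising `SystemExit` when no parquet file is found comes from the review.

```python
from pathlib import Path


def require_dataset(dataset_dir: Path, pattern: str = "*.parquet") -> list:
    """Return the matching files, or exit with a pointer back to the setup step."""
    # A missing directory is treated the same as an empty one.
    files = sorted(dataset_dir.glob(pattern)) if dataset_dir.is_dir() else []
    if not files:
        # SystemExit stops a notebook run with a short message instead of
        # letting a later cell fail with a less obvious error.
        raise SystemExit(
            f"No files matching {pattern!r} found in {dataset_dir}. "
            "Run the dataset download step in the setup cell first."
        )
    return files
```

A CI job could call the same helper as a precondition and skip the notebook when it exits, which is one way to keep auto-execution gated on the asset being present.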

Performance & security

  • No new dependencies. ✅
  • The Colab cell stores the NGC API key via os.environ from userdata.get(...) with a getpass fallback — same pattern as the existing NVIDIA-API-key cell. No secrets logged in the notebook source. ✅
  • Output is written via data_designer.create(...) to a named dataset; nothing exotic. ✅
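
The key-handling pattern mentioned above (Colab `userdata.get(...)` first, `getpass` as fallback) can be sketched as below. This is an assumption-laden illustration rather than the generated cell: `ensure_api_key` and `_colab_secret` are hypothetical helpers, and the injectable `secret_lookup` / `prompt` parameters exist only so the logic can be exercised outside Colab.

```python
import getpass
import os
from typing import Callable, Optional


def _colab_secret(name: str) -> Optional[str]:
    """Read a Colab secret; return None outside Colab or if the lookup fails."""
    try:
        from google.colab import userdata  # only importable inside Colab
        return userdata.get(name)
    except Exception:
        return None


def ensure_api_key(
    name: str,
    secret_lookup: Callable[[str], Optional[str]] = _colab_secret,
    prompt: Callable[[str], str] = getpass.getpass,
) -> None:
    """Populate os.environ[name] without printing the key."""
    if os.environ.get(name):
        return  # already set, e.g. exported in the shell
    # Prefer the stored secret; fall back to an interactive hidden prompt.
    value = secret_lookup(name) or prompt(f"Enter {name}: ")
    os.environ[name] = value
```

In a notebook this would be called once per key, for example `ensure_api_key("NVIDIA_API_KEY")` followed by `ensure_api_key("NGC_API_KEY")`.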

mkdocs.yml

  • The added entry "Reproducing & Customizing Nemotron-Personas": notebooks/7-nemotron-personas.ipynb slots in after Tutorials 1–6 in the Tutorials section. Naming style (long descriptive title in quotes) is consistent with the rest of the nav. ✅

Suggestions

  1. Fix the experince typo in PERSONA_SYSTEM_PROMPT (line 515) and in the inline prompt at line 604 of docs/notebook_source/7-nemotron-personas.py, plus the matching block in docs/devnotes/posts/nemotron-personas.md. Re-run make generate-colab-notebooks after.
  2. Consider removing ADDITIONAL_SETUP_CELLS from generate_colab_notebooks.py until a real consumer needs it. The repo style is to avoid speculative abstractions. If kept, the comment is honest about its status, so this is a soft nit.
  3. Consider moving the if SAMPLE_FROM_SDG_PGM: integration sketch into a markdown cell. The cell currently contains dead-but-illustrative code followed by a raise NotImplementedError. As a markdown-only "future-shape" snippet it would be more obviously instructional and avoid confusing readers (or linters) into treating it as live code.
  4. (Optional) Cross-link Tutorial 7 from the dev note's "Try it yourself" section using the notebooks/7-... mkdocs route, in addition to the Colab link. Mirrors how some other dev notes link to both surfaces.

Test coverage

Per the PR checklist, no unit-test changes are expected (docs + tutorial PR). The author confirms make test passes and that the notebook executes end-to-end via jupytext --to ipynb --execute, which is the right validation surface for this kind of change. The make generate-colab-notebooks regeneration claim is consistent with the diff (the committed .ipynb includes the NGC API-key block and badge cell).

Verdict

Looks good — approve with minor follow-ups. This is a well-structured, well-cited dev note plus a runnable, self-contained tutorial that exercises a real ingestion path (PersonSampler against the NGC artifact). The Colab generator change is small, backward-compatible, and well-commented. The only items worth fixing before merge are the experince typo (cheap) and a judgment call on whether to keep the unused ADDITIONAL_SETUP_CELLS slot. Neither is a blocker; both are easy follow-ups. No risk to engine packages or structural invariants.

@greptile-apps
Contributor

greptile-apps Bot commented May 28, 2026

Greptile Summary

This PR ships the Inside Nemotron-Personas dev note alongside Tutorial 7 ("Reproducing & Customizing Nemotron-Personas"), which demonstrates the four-stage compound-AI pipeline that builds the Nemotron-Personas HF collection. It also extends generate_colab_notebooks.py to inject an NGC_API_KEY Colab setup cell alongside the standard NVIDIA_API_KEY cell.

  • New tutorial (docs/notebook_source/7-nemotron-personas.py / docs/colab_notebooks/7-nemotron-personas.ipynb): reproduces the PGM-grounded OCEAN → persona-attributes → persona-descriptions pipeline and demonstrates domain-specific extension with a TechPersona schema.
  • generate_colab_notebooks.py: adds ADDITIONAL_API_KEY_BLOCKS (plus an empty ADDITIONAL_SETUP_CELLS hook) so Tutorial 7's Colab variant appends a well-formed NGC_API_KEY try/except block into the shared API-key cell rather than generating a second cell with duplicate imports.
  • Nav & fern assets: mkdocs.yml and fern/versions/latest.yml are updated; pipeline-stage images are mirrored into both docs/devnotes/posts/assets/ and fern/assets/.

Confidence Score: 5/5

Safe to merge — no engine code is touched; all changes are documentation, tutorial assets, and a small Colab generation script extension.

The changes are purely additive docs and tutorial content. The Colab notebook generator change is small and correct: NGC_API_KEY handling is appended into the shared API-key cell with no duplicate imports. The tutorial notebook's two previously noted edge cases (the SDG-PGM hook and always-true age guards) are pre-existing design choices acknowledged in the code comments, not regressions introduced here.

fern/versions/latest.yml is missing the Tutorial 7 entry in its Tutorials nav section, creating a small discoverability gap on the fern-based docs site.

Important Files Changed

| Filename | Overview |
| --- | --- |
| docs/scripts/generate_colab_notebooks.py | Adds ADDITIONAL_API_KEY_BLOCKS and NGC_API_KEY_BLOCK to inject NGC_API_KEY env-var handling into Tutorial 7's Colab cell; ADDITIONAL_SETUP_CELLS added as an empty extension point. Logic is correct — imports are already present in COLAB_API_KEY_CELL and the block is joined cleanly. |
| docs/notebook_source/7-nemotron-personas.py | New tutorial notebook reproducing the Nemotron-Personas pipeline. SAMPLE_FROM_SDG_PGM=True path is intentionally a hook (raises NotImplementedError) but Next Steps prose advertises flipping it (previously flagged). Age conditionals are always true given age_range=[18,114] (previously flagged). |
| docs/devnotes/posts/nemotron-personas.md | New dev note covering the 4-stage pipeline, Nemotron training usage, and customization pattern. Well-structured and consistent with the notebook's code examples. |
| fern/versions/latest.yml | Adds the dev note to Dev Notes but does not add Tutorial 7 to the Tutorials section, creating a nav asymmetry versus mkdocs.yml. |
| mkdocs.yml | Adds Tutorial 7 entry under Tutorials nav; clean 2-line addition. |
| fern/versions/latest/pages/devnotes/posts/nemotron-personas.mdx | Fern-flavored MDX mirror of the dev note; content is consistent with the mkdocs variant. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[NGC-hosted Nemotron-Personas Dataset\nor SDG-PGMs custom PGM] -->|PersonSampler| B

    subgraph Stage1 ["Stage 1: OCEAN Big-Five Sampling"]
        B[Sample OCEAN T-scores\nmu=50, sigma=10, clip 20-80]
        B --> B2[Score to label to prose description per trait]
    end

    subgraph Stage2 ["Stage 2: Demographically-Grounded Sampling"]
        C[PGM-grounded demographic record\nage x education x occupation x geography]
    end

    B2 --> D
    C --> D

    subgraph Stage3 ["Stage 3: Persona Attributes via LLM Structured Output"]
        D[LLMStructuredColumnConfig\nPersonaAttributes schema]
        D --> D2["cultural_background / skills_and_expertise\ncareer_goals_and_ambitions / hobbies_and_interests"]
    end

    D2 --> E

    subgraph Stage4 ["Stage 4: Persona Descriptions via LLM Structured Output"]
        E[LLMStructuredColumnConfig\nPersonas schema]
        E --> E2["professional / finance / healthcare\nsports / arts / travel / culinary\nconcise / detailed persona"]
    end

    E2 --> F[Released Nemotron-Personas Dataset\n~53M personas across 7 locales]
    E2 --> G[Custom Extension\ne.g. TechPersona schema]
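
Stage 1 in the flowchart (T-scores with mu=50, sigma=10, clipped to 20-80, then mapped to a label) can be sketched with the standard library as below. This is an illustration only, not the tutorial's sampler: the function names are hypothetical, and the label cut-offs at 40 and 60 are assumed for the example since the thread does not state the real bands.

```python
import random

# The five OCEAN (Big Five) traits.
OCEAN_TRAITS = (
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
)


def sample_t_score(rng, mu=50.0, sigma=10.0, low=20.0, high=80.0):
    """Draw one T-score from a normal distribution and clip it into [low, high]."""
    return min(high, max(low, rng.gauss(mu, sigma)))


def score_to_label(score):
    """Map a T-score to a coarse label (assumed cut-offs, for illustration only)."""
    if score < 40:
        return "low"
    if score > 60:
        return "high"
    return "average"


def sample_ocean(seed=0):
    """Return {trait: (t_score, label)} for one synthetic person."""
    rng = random.Random(seed)  # seeded so a given profile is reproducible
    profile = {}
    for trait in OCEAN_TRAITS:
        score = sample_t_score(rng)
        profile[trait] = (score, score_to_label(score))
    return profile
```

Clipping rather than resampling keeps the code short; it does place a small amount of probability mass exactly at the bounds, which may or may not match what the released pipeline does.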
Prompt To Fix All With AI
Fix the following 1 code review issue. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 1
fern/versions/latest.yml:64-79
**Tutorial 7 missing from fern Tutorials nav**

`mkdocs.yml` adds *Reproducing & Customizing Nemotron-Personas* as Tutorial 7 under the Tutorials section, but `fern/versions/latest.yml` only adds the dev note to Dev Notes — the Tutorials section here still ends at Tutorial 6 (Image-to-Image Editing). Users browsing the fern-based docs won't find Tutorial 7 through the Tutorials nav; they can only reach the Colab notebook via the dev note link.

Reviews (4): Last reviewed commit: "docs(devnotes): move Nemotron-Personas t..."

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
danecor
danecor previously approved these changes May 28, 2026
Contributor

@danecor danecor left a comment

LGTM!

…navs

Signed-off-by: Yev Meyer <ymeyer@nvidia.com>
@johnnygreco
Contributor

Hey Yev, leaving a few flags from Codex review here so they are visible before the human review comes through. A human review is still coming.

  • The new-locale / SDG-PGMs path looks overstated in the Dev Note. The post says users can declare a PGMGenerator / PGMGeneratorPluginConfig path, but the tutorial currently marks SAMPLE_FROM_SDG_PGM=True as TODO and raises NotImplementedError. Either implementing that path or framing it as future / advanced work would avoid sending readers toward a non-working branch.

  • The PersonSampler field access example appears inaccurate. The post says {{ person.county }} is available directly, but the notebook maps person.district into a county expression column. A reader copying the post as-is may get a broken Jinja reference.

  • A few “inside Nemotron training” claims probably need tighter sourcing or narrower wording. The Super report supports personas in long-context samples, general tool use, and formal logic, but I could not verify the SSCR / general-chat / instruction-following claims from that report as written. The Japanese model card supports Japanese tool-calling data seeded by Nemotron-Personas-Japan, but not broad instruction-following + general-chat data in the current wording.

  • The PR adds duplicate Dev Note prose under legacy docs/ and wires mkdocs.yml. Current docs guidance says Dev Notes prose should live under fern/, so keeping both copies may create drift unless there is still an intentional legacy publish path here.

Narratively, the post reads well: the flow from why personas matter, to how they are used, to how Data Designer builds and customizes them is strong. These are mostly accuracy / maintenance flags rather than a request for a structural rewrite.

Comment thread mkdocs.yml
@3mei
Contributor Author

3mei commented May 29, 2026

@johnnygreco

Re review from Codex:

"The new-locale / SDG-PGMs path looks overstated in the Dev Note."
Codex was confused, as this landed in SDG-PGMs: https://github.com/NVIDIA-NeMo/SDG-PGMs/tree/main/examples/us_person , https://github.com/NVIDIA-NeMo/SDG-PGMs/tree/main/src/data_designer_plugins

I updated the language in the note to make this a bit more clear.

Rebased to bring the prompt sensitivity and updated mkdocs/fern. Should be good to go.

@3mei 3mei requested a review from johnnygreco May 29, 2026 22:20
Contributor

@johnnygreco johnnygreco left a comment

this is an awesome post @3mei!!! thanks!

Note that I think the blog card is missing. Up to you if you want to add now or in a follow up
