feat: add preview review reference and update interactive iterate step #441
base: main
@@ -0,0 +1,30 @@
# Preview Review Guide

## Mindset

Quality is statistical, not per-record. Fix systemic issues that affect many records; don't chase cosmetic flaws in individual ones. But don't stop early: clear patterns of broken data or ignored instructions are worth fixing.

## Reading Sample Records

Load `dataset.parquet` from the preview results directory (printed as `Results path:` by the preview command, or the most recent `artifacts/preview_results_*/` directory). Use pandas to load the parquet file and print the records in a compact, reviewable format.
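A minimal pandas sketch of this step. The helper name, truncation width, and directory-selection logic are illustrative assumptions; the `artifacts/preview_results_*/` pattern comes from the guide itself.

```python
import contextlib
import glob
import os

import pandas as pd


def print_records(df: pd.DataFrame, n: int = 5) -> None:
    """Print the first n records, one truncated field per line."""
    for i, row in df.head(n).iterrows():
        print(f"--- record {i} ---")
        for col, val in row.items():
            # Flatten newlines and truncate so each field stays on one line.
            text = str(val).replace("\n", " ")
            print(f"{col}: {text[:120]}")


# Pick the most recent preview results directory, as the guide suggests.
dirs = sorted(glob.glob("artifacts/preview_results_*/"), key=os.path.getmtime)
if dirs:
    records = pd.read_parquet(os.path.join(dirs[-1], "dataset.parquet"))
    print_records(records)
```

If the preview command printed an explicit `Results path:`, pass that directory instead of globbing.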

## What to Look For

The specifics depend on the dataset and its intended use. The categories below are common starting points; adapt based on what matters for this dataset.

### Diversity

- **Mode collapse**: are records clustering around the same patterns, topics, or phrasings?
- **Sampler effectiveness**: are samplers being used effectively to steer diversity in the dataset?
- **Structural monotony**: do LLM-generated columns follow the same template across records?
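The mode-collapse check can be quantified with a rough sketch like the following. The function name and the two signals (exact-duplicate rate, concentration in the top values) are illustrative assumptions, not part of Data Designer.

```python
import pandas as pd


def diversity_report(df: pd.DataFrame, column: str, top_k: int = 5) -> dict:
    """Rough diversity signals for one column: the exact-duplicate rate and
    how concentrated values are in the top_k most common entries."""
    values = df[column].astype(str)
    dup_rate = 1 - values.nunique() / len(values)
    top_share = values.value_counts(normalize=True).head(top_k).sum()
    return {"duplicate_rate": float(dup_rate), f"top_{top_k}_share": float(top_share)}
```

A high duplicate rate or top-k share on a column that is supposed to vary is a hint of mode collapse; the exact thresholds depend on the dataset.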

### Data Quality

- **Instruction compliance**: does generated content follow prompt constraints (step counts, format requirements, allowed values)?
- **Internal consistency**: does data within a record agree with itself?
- **Encoding integrity**: no garbled encoding, mojibake, or broken unicode.
- **Plausibility**: do examples look like they could come from the real domain, or are they obviously synthetic?
- **Judge calibration** (if applicable): are scores consistent across similar-quality records? Does the judge catch visible problems?
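The allowed-values and encoding-integrity checks above can be sketched with pandas. The helper name, the `allowed` set, and the mojibake markers are illustrative assumptions; real checks should match the dataset's own constraints.

```python
import pandas as pd


def quality_checks(df: pd.DataFrame, column: str, allowed: set) -> pd.DataFrame:
    """Flag records whose value is outside the allowed set or contains
    common mojibake markers (e.g. 'â€' runs from mis-decoded UTF-8)."""
    col = df[column].astype(str)
    bad_value = ~col.isin(allowed)
    mojibake = col.str.contains("â€|Ã©|\ufffd", regex=True)
    return df.assign(bad_value=bad_value, mojibake=mojibake)
```

Sorting or filtering on the flag columns then surfaces the records worth reading closely.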

### Design Choices

Are the right Data Designer features being used? For example:

- A text column that consistently produces structured data or code might be better as a specialized column type.
- Values drawn from a fixed set or known distribution could use a sampler instead of an LLM column.
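A rough heuristic for the sampler suggestion above: columns whose values come from a small fixed set are candidates for a sampler instead of an LLM column. The function name and cardinality cutoff are arbitrary illustrations.

```python
import pandas as pd


def sampler_candidates(df: pd.DataFrame, max_unique: int = 10) -> list:
    """List columns with at most max_unique distinct values."""
    return [c for c in df.columns if df[c].nunique(dropna=True) <= max_unique]
```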

skills/data-designer/workflows/autopilot.md
@@ -20,7 +20,7 @@ In this mode, make reasonable design decisions autonomously based on the dataset
 - Note the sample records directory printed by the `data-designer preview` command
 - Give the user a clickable link: `file://<sample-records-dir>/sample_records_browser.html`
 7. **Create** — If the user specified a record count:
-  - 50 or fewer: run `data-designer create <path> --num-records <N> --dataset-name <name>` directly.
-  - More than 50: warn that generation can take a long time and ask for confirmation before running.
+  - Run `data-designer create <path> --num-records <N> --dataset-name <name>`.
+  - Generation speed depends heavily on the dataset configuration and the user's inference setup. For larger datasets, warn the user and ask for confirmation before running.
Comment on lines +23 to +24

**Contributor**

**Contradictory instruction order in Create step**

The first bullet instructs the agent to run the create command unconditionally, but the second bullet says to warn and ask for confirmation before running for larger datasets. An agent following these bullets in order would execute the long-running command first, then warn the user, making the confirmation meaningless. The warning/confirmation check should come before the run instruction. Consider restructuring so the guard comes first:

```suggestion
- Generation speed depends heavily on the dataset configuration and the user's inference setup. For larger datasets, warn the user and ask for confirmation before running.
- Run `data-designer create <path> --num-records <N> --dataset-name <name>`.
```
   - If no record count was specified, skip this step.
 8. **Present** — Summarize what was built: columns, samplers used, key design choices. If the create command was run, share the results. Ask the user if they want any changes. If so, edit the script, re-validate, re-preview, and iterate.
**`dataset.parquet` existence not guaranteed by workflow**

The guide instructs the agent to load `dataset.parquet` from the preview results directory, but the **Preview** step in both workflow files only documents that `data-designer preview --save-results` produces HTML files (specifically `sample_records_browser.html`). There is no mention anywhere in the workflow documentation that a `dataset.parquet` file is written to that directory.

If `--save-results` does not produce a parquet file, an agent following this guide will hit a missing-file error when trying to load it for self-review. Either:

- confirm that `--save-results` always produces a `dataset.parquet` alongside the HTML output (and note this in the **Preview** steps of both workflow files), or