Fix conversation loading logic in UltraChat dataset by jzh26 · Pull Request #1680 · NVIDIA/Model-Optimizer

jzh26 · 2026-06-11T10:04:55Z

What does this PR do?

Type of change: Bug fix.

Previously, only the first user prompt was extracted from each example, discarding all subsequent turns. UltraChat stores full multi-turn conversations in the "messages" field, so switching to that field preserves the complete dialogue rather than truncating to a single user message.

Usage

# Add a code snippet demonstrating how to use this

Testing

python make_dataset.py -f test_cfg.yaml --full
python make_dataset.py -f test_cfg.yaml

      - name: "ultrachat"
        splits:
          train_gen: 100
          train_sft: 100

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

Is this change backward compatible?: ✅ / ❌ / N/A
If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
Did you write any new necessary tests?: ✅ / ❌ / N/A
Did you update Changelog?: ✅ / ❌ / N/A
Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

Refactor
- Updated the UltraChat dataset conversation loader to construct conversations directly from message data and standardized conversation ID formatting based on prompt information.

Previously, only the first user prompt was extracted from each example, discarding all subsequent turns. UltraChat stores full multi-turn conversations in the "messages" field, so switching to that field preserves the complete dialogue rather than truncating to a single user message. Signed-off-by: jzh26 <226629529+jzh26@users.noreply.github.com>

coderabbitai · 2026-06-11T10:05:12Z

📝 Walkthrough

Walkthrough

The PR updates the UltraChat conversation loader in examples/dataset/make_dataset.py to construct conversations directly from each dataset row's messages field instead of the prompt field. The new logic skips empty message samples, derives or generates prompt_id, and formats conversation identifiers as ultrachat-{split_name}-{prompt_id}. The dataset mixing, deduplication, and output flow remain unchanged.

Changes

UltraChat Messages Loading

Layer / File(s)	Summary
UltraChat message-based conversation loading `examples/dataset/make_dataset.py`	The `_load_ultrachat_conversations` function now extracts conversation turns from each row's `messages` field, skips rows with empty messages, resolves `prompt_id` from the row data or generates it from message content, and yields standardized conversation records with `ultrachat-{split_name}-{prompt_id}` identifiers.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly describes the primary change: fixing the conversation loading logic in the UltraChat dataset by extracting full multi-turn conversations from the messages field instead of just the first prompt.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	All six security anti-patterns from SECURITY.md were checked in the modified examples/dataset/make_dataset.py. No instances of torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcod...

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/dataset/make_dataset.py`:
- Around line 279-285: The code calls .strip() on ds[i]["prompt_id"] which can
raise if prompt_id is missing or not a string; change to first retrieve
prompt_id safely (e.g., use ds[i].get("prompt_id") or check key), normalize to a
string only when appropriate (if not isinstance(prompt_id, str) set prompt_id =
""), then call .strip(); keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate
this logic around the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fff0e3cf-b26b-4b9d-a9e5-a225ae664e6d

📥 Commits

Reviewing files that changed from the base of the PR and between c88b62b and 777ac0d.

📒 Files selected for processing (1)

examples/dataset/make_dataset.py

coderabbitai · 2026-06-11T10:08:26Z

        prompt_id = ds[i]["prompt_id"].strip()
-        if prompt:
-            msgs = [{"role": "user", "content": prompt}]
-            if not prompt_id:
-                prompt_id = id_for_conversation(msgs)
-            prompt_id = f"ultrachat-{split_name}-{prompt_id}"
-            yield {"conversation_id": prompt_id, "conversations": msgs}
+        msgs = ds[i]["messages"]
+        if not msgs:
+            continue
+        if not prompt_id:
+            prompt_id = id_for_conversation(msgs)
+        prompt_id = f"ultrachat-{split_name}-{prompt_id}"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle missing/non-string prompt_id before calling .strip()

Line 279 can crash (AttributeError/KeyError) when a row has null/missing prompt_id, so the new fallback ID logic is never reached. Normalize first, then strip.

Proposed fix

- prompt_id = ds[i]["prompt_id"].strip() + raw_prompt_id = ds[i].get("prompt_id") + prompt_id = raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else "" msgs = ds[i]["messages"] if not msgs: continue if not prompt_id: prompt_id = id_for_conversation(msgs)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/dataset/make_dataset.py` around lines 279 - 285, The code calls .strip() on ds[i]["prompt_id"] which can raise if prompt_id is missing or not a string; change to first retrieve prompt_id safely (e.g., use ds[i].get("prompt_id") or check key), normalize to a string only when appropriate (if not isinstance(prompt_id, str) set prompt_id = ""), then call .strip(); keep the existing fallback that sets prompt_id = id_for_conversation(msgs) when prompt_id is empty, and preserve the final formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate this logic around the prompt_id handling in make_dataset.py (variables: prompt_id, ds, msgs, id_for_conversation, split_name).

codecov · 2026-06-11T10:14:18Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.22%. Comparing base (2c52e7b) to head (777ac0d).
⚠️ Report is 28 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1680      +/-   ##
==========================================
- Coverage   77.48%   75.22%   -2.27%     
==========================================
  Files         489      511      +22     
  Lines       54415    59882    +5467     
==========================================
+ Hits        42165    45045    +2880     
- Misses      12250    14837    +2587

Flag	Coverage Δ
examples	`41.54% <ø> (-1.38%)`	⬇️
regression	`14.88% <ø> (+0.07%)`	⬆️
unit	`54.34% <ø> (+0.33%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

jzh26 requested a review from a team as a code owner June 11, 2026 10:04

jzh26 requested a review from ChenhanYu June 11, 2026 10:04

coderabbitai Bot reviewed Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix conversation loading logic in UltraChat dataset#1680

Fix conversation loading logic in UltraChat dataset#1680
jzh26 wants to merge 1 commit into
NVIDIA:mainfrom
jzh26:ultrachat

jzh26 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Uh oh!

codecov Bot commented Jun 11, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jzh26 commented Jun 11, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do?

Usage

Testing

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented Jun 11, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jzh26 commented Jun 11, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 11, 2026 •

edited

Loading

codecov Bot commented Jun 11, 2026 •

edited

Loading