Skip to content

Fix conversation loading logic in UltraChat dataset#1680

Open
jzh26 wants to merge 1 commit into
NVIDIA:mainfrom
jzh26:ultrachat
Open

Fix conversation loading logic in UltraChat dataset#1680
jzh26 wants to merge 1 commit into
NVIDIA:mainfrom
jzh26:ultrachat

Conversation

@jzh26

@jzh26 jzh26 commented Jun 11, 2026

Copy link
Copy Markdown

What does this PR do?

Type of change: Bug fix.

Previously, only the first user prompt was extracted from each example, discarding all subsequent turns. UltraChat stores full multi-turn conversations in the "messages" field, so switching to that field preserves the complete dialogue rather than truncating to a single user message.

Usage

# Add a code snippet demonstrating how to use this

Testing

python make_dataset.py -f test_cfg.yaml --full
python make_dataset.py -f test_cfg.yaml

      - name: "ultrachat"
        splits:
          train_gen: 100
          train_sft: 100

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A
  • Did you get Claude approval on this PR?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • Refactor
    • Updated the UltraChat dataset conversation loader to construct conversations directly from message data and standardized conversation ID formatting based on prompt information.

Previously, only the first user prompt was extracted from each example,
discarding all subsequent turns. UltraChat stores full multi-turn
conversations in the "messages" field, so switching to that field
preserves the complete dialogue rather than truncating to a single
user message.

Signed-off-by: jzh26 <226629529+jzh26@users.noreply.github.com>
@jzh26 jzh26 requested a review from a team as a code owner June 11, 2026 10:04
@jzh26 jzh26 requested a review from ChenhanYu June 11, 2026 10:04
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR updates the UltraChat conversation loader in examples/dataset/make_dataset.py to construct conversations directly from each dataset row's messages field instead of the prompt field. The new logic skips empty message samples, derives or generates prompt_id, and formats conversation identifiers as ultrachat-{split_name}-{prompt_id}. The dataset mixing, deduplication, and output flow remain unchanged.

Changes

UltraChat Messages Loading

Layer / File(s) Summary
UltraChat message-based conversation loading
examples/dataset/make_dataset.py
The _load_ultrachat_conversations function now extracts conversation turns from each row's messages field, skips rows with empty messages, resolves prompt_id from the row data or generates it from message content, and yields standardized conversation records with ultrachat-{split_name}-{prompt_id} identifiers.

🎯 2 (Simple) | ⏱️ ~8 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the primary change: fixing the conversation loading logic in the UltraChat dataset by extracting full multi-turn conversations from the messages field instead of just the first prompt.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All six security anti-patterns from SECURITY.md were checked in the modified examples/dataset/make_dataset.py. No instances of torch.load(weights_only=False), numpy.load(allow_pickle=True), hardcod...

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/dataset/make_dataset.py`:
- Around line 279-285: The code calls .strip() on ds[i]["prompt_id"] which can
raise if prompt_id is missing or not a string; change to first retrieve
prompt_id safely (e.g., use ds[i].get("prompt_id") or check key), normalize to a
string only when appropriate (if not isinstance(prompt_id, str) set prompt_id =
""), then call .strip(); keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate
this logic around the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: fff0e3cf-b26b-4b9d-a9e5-a225ae664e6d

📥 Commits

Reviewing files that changed from the base of the PR and between c88b62b and 777ac0d.

📒 Files selected for processing (1)
  • examples/dataset/make_dataset.py

Comment on lines 279 to +285
prompt_id = ds[i]["prompt_id"].strip()
if prompt:
msgs = [{"role": "user", "content": prompt}]
if not prompt_id:
prompt_id = id_for_conversation(msgs)
prompt_id = f"ultrachat-{split_name}-{prompt_id}"
yield {"conversation_id": prompt_id, "conversations": msgs}
msgs = ds[i]["messages"]
if not msgs:
continue
if not prompt_id:
prompt_id = id_for_conversation(msgs)
prompt_id = f"ultrachat-{split_name}-{prompt_id}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle missing/non-string prompt_id before calling .strip()

Line 279 can crash (AttributeError/KeyError) when a row has null/missing prompt_id, so the new fallback ID logic is never reached. Normalize first, then strip.

Proposed fix
-        prompt_id = ds[i]["prompt_id"].strip()
+        raw_prompt_id = ds[i].get("prompt_id")
+        prompt_id = raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else ""
         msgs = ds[i]["messages"]
         if not msgs:
             continue
         if not prompt_id:
             prompt_id = id_for_conversation(msgs)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/dataset/make_dataset.py` around lines 279 - 285, The code calls
.strip() on ds[i]["prompt_id"] which can raise if prompt_id is missing or not a
string; change to first retrieve prompt_id safely (e.g., use
ds[i].get("prompt_id") or check key), normalize to a string only when
appropriate (if not isinstance(prompt_id, str) set prompt_id = ""), then call
.strip(); keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate
this logic around the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).

@codecov

codecov Bot commented Jun 11, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.22%. Comparing base (2c52e7b) to head (777ac0d).
⚠️ Report is 28 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1680      +/-   ##
==========================================
- Coverage   77.48%   75.22%   -2.27%     
==========================================
  Files         489      511      +22     
  Lines       54415    59882    +5467     
==========================================
+ Hits        42165    45045    +2880     
- Misses      12250    14837    +2587     
Flag Coverage Δ
examples 41.54% <ø> (-1.38%) ⬇️
regression 14.88% <ø> (+0.07%) ⬆️
unit 54.34% <ø> (+0.33%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant