Fix conversation loading logic in UltraChat dataset#1680
Conversation
Previously, only the first user prompt was extracted from each example, discarding all subsequent turns. UltraChat stores full multi-turn conversations in the "messages" field, so switching to that field preserves the complete dialogue rather than truncating to a single user message. Signed-off-by: jzh26 <226629529+jzh26@users.noreply.github.com>
📝 WalkthroughWalkthroughThe PR updates the UltraChat conversation loader in ChangesUltraChat Messages Loading
🎯 2 (Simple) | ⏱️ ~8 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/dataset/make_dataset.py`:
- Around line 279-285: The code calls .strip() on ds[i]["prompt_id"] which can
raise if prompt_id is missing or not a string; change to first retrieve
prompt_id safely (e.g., use ds[i].get("prompt_id") or check key), normalize to a
string only when appropriate (if not isinstance(prompt_id, str) set prompt_id =
""), then call .strip(); keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate
this logic around the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: fff0e3cf-b26b-4b9d-a9e5-a225ae664e6d
📒 Files selected for processing (1)
examples/dataset/make_dataset.py
| prompt_id = ds[i]["prompt_id"].strip() | ||
| if prompt: | ||
| msgs = [{"role": "user", "content": prompt}] | ||
| if not prompt_id: | ||
| prompt_id = id_for_conversation(msgs) | ||
| prompt_id = f"ultrachat-{split_name}-{prompt_id}" | ||
| yield {"conversation_id": prompt_id, "conversations": msgs} | ||
| msgs = ds[i]["messages"] | ||
| if not msgs: | ||
| continue | ||
| if not prompt_id: | ||
| prompt_id = id_for_conversation(msgs) | ||
| prompt_id = f"ultrachat-{split_name}-{prompt_id}" |
There was a problem hiding this comment.
Handle missing/non-string prompt_id before calling .strip()
Line 279 can crash (AttributeError/KeyError) when a row has null/missing prompt_id, so the new fallback ID logic is never reached. Normalize first, then strip.
Proposed fix
- prompt_id = ds[i]["prompt_id"].strip()
+ raw_prompt_id = ds[i].get("prompt_id")
+ prompt_id = raw_prompt_id.strip() if isinstance(raw_prompt_id, str) else ""
msgs = ds[i]["messages"]
if not msgs:
continue
if not prompt_id:
prompt_id = id_for_conversation(msgs)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/dataset/make_dataset.py` around lines 279 - 285, The code calls
.strip() on ds[i]["prompt_id"] which can raise if prompt_id is missing or not a
string; change to first retrieve prompt_id safely (e.g., use
ds[i].get("prompt_id") or check key), normalize to a string only when
appropriate (if not isinstance(prompt_id, str) set prompt_id = ""), then call
.strip(); keep the existing fallback that sets prompt_id =
id_for_conversation(msgs) when prompt_id is empty, and preserve the final
formatting that prefixes with f"ultrachat-{split_name}-{prompt_id}" so locate
this logic around the prompt_id handling in make_dataset.py (variables:
prompt_id, ds, msgs, id_for_conversation, split_name).
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1680 +/- ##
==========================================
- Coverage 77.48% 75.22% -2.27%
==========================================
Files 489 511 +22
Lines 54415 59882 +5467
==========================================
+ Hits 42165 45045 +2880
- Misses 12250 14837 +2587
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
What does this PR do?
Type of change: Bug fix.
Previously, only the first user prompt was extracted from each example, discarding all subsequent turns. UltraChat stores full multi-turn conversations in the "messages" field, so switching to that field preserves the complete dialogue rather than truncating to a single user message.
Usage
# Add a code snippet demonstrating how to use thisTesting
python make_dataset.py -f test_cfg.yaml --full
python make_dataset.py -f test_cfg.yaml
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit