Add backwards compatibility for transformers v4.57#684
Conversation
Note: Reviews paused — it looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the review settings.
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info — Configuration used: defaults | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough
Adds a Transformers v4/v5 compatibility layer for tokenizer special-token handling: introduces SPECIAL_TOKENS_KEY and get_extra_special_tokens(), updates tokenizer setup and callers/tests to use the version-agnostic key, and relaxes the transformers version constraint in requirements.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (warning)
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@requirements.txt`:
- Line 6: Replace the yanked version spec "transformers>=4.57.0" in
requirements.txt with a non-yanked stable floor (for example
"transformers>=4.56.2") so installs won't fail; update the line that currently
reads transformers>=4.57.0 to the chosen safe minimum and re-run pip install to
verify resolution.
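For illustration, the resulting requirements line would look like the fragment below (the exact floor is the reviewer's suggestion, not verified here):

```
transformers>=4.56.2
```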
In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-19: Tests hard-code the legacy key "additional_special_tokens"
when calling tokenizer.add_special_tokens, which breaks with transformers v5;
import the centralized SPECIAL_TOKENS_KEY from
src.instructlab.training.tokenizer_utils and replace those hard-coded string
keys in tests (the failing calls around tests/unit/test_data_process.py and
tests/unit/test_unmask_messages.py) so calls like tokenizer.add_special_tokens({
"additional_special_tokens": [...] }) become tokenizer.add_special_tokens({
SPECIAL_TOKENS_KEY: [...] }), ensuring compatibility with both v4 and v5 (you
can also reference get_extra_special_tokens if needed to read tokens from a
tokenizer).
ℹ️ Review info — Configuration used: defaults | Review profile: CHILL | Plan: Pro
📒 Files selected for processing (3)
- requirements.txt
- src/instructlab/training/data_process.py
- src/instructlab/training/tokenizer_utils.py
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-14: The split call that computes _TRANSFORMERS_V5 currently
uses transformers.__version__.split(".")[0] and triggers pylint C0207; change
the split to use a maxsplit of 1 (i.e., transformers.__version__.split(".",
maxsplit=1)[0]) so only the major version is parsed; update the expression that
defines _TRANSFORMERS_V5 accordingly where it appears.
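The pylint C0207 (use-maxsplit-arg) fix the comment describes can be shown on a sample version string; only the major component is needed, so splitting once is sufficient:

```python
version = "4.57.0"  # stand-in for transformers.__version__

# Before: splits on every "." even though only the first piece is used,
# which triggers pylint C0207 (use-maxsplit-arg).
major_before = version.split(".")[0]

# After: maxsplit=1 stops after the first ".", satisfying the checker.
major_after = version.split(".", maxsplit=1)[0]

assert major_before == major_after == "4"
_TRANSFORMERS_V5 = int(major_after) >= 5  # mirrors the flag named in the comment
```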
📒 Files selected for processing (2)
- src/instructlab/training/data_process.py
- src/instructlab/training/tokenizer_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/instructlab/training/data_process.py
RobotSail
left a comment
LGTM, just check the CodeRabbit comments.
🧹 Nitpick comments (1)
tests/unit/test_data_process.py (1)
524-526: Remove the overwritten encode.side_effect assignment.
At line 524, self.mock_tokenizer.encode.side_effect is set, but it is replaced at line 556 before test execution. This dead write adds noise and can confuse future maintenance.
♻️ Suggested cleanup

```diff
- self.mock_tokenizer.encode.side_effect = lambda text, add_special_tokens=False: [
-     hash(text) % 1000 for _ in text.split()
- ]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/test_data_process.py` around lines 524-526, the assignment to self.mock_tokenizer.encode.side_effect near the top of the test is dead because it is overwritten later; remove that earlier assignment (the lambda creating token ids via hash(text) % 1000) and keep only the later, intended side_effect so tests are not confused by a redundant setup on self.mock_tokenizer.encode; ensure any needed behaviour is provided by the remaining side_effect or by a shared helper used by the tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tests/unit/test_data_process.py`:
- Around line 524-526: The assignment to self.mock_tokenizer.encode.side_effect
near the top of the test is dead because it is overwritten later; remove that
earlier assignment (the lambda creating token ids via hash(text) % 1000) and
keep only the later, intended side_effect so tests are not confused by a
redundant setup on self.mock_tokenizer.encode; ensure any needed behaviour is
provided by the remaining side_effect or by a shared helper used by the tests.
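The dead-write pattern flagged here can be reproduced in a few lines; this is an illustrative stand-in, not the actual test file, showing why the first assignment never has an observable effect:

```python
from unittest.mock import MagicMock

tok = MagicMock()

# Dead write: replaced two lines later, before the mock is ever called.
tok.encode.side_effect = lambda text, add_special_tokens=False: [
    hash(text) % 1000 for _ in text.split()
]

# The later assignment wins; only this behaviour reaches the test body.
tok.encode.side_effect = lambda text, add_special_tokens=False: [1, 2, 3]

assert tok.encode("hello world") == [1, 2, 3]
```

Because `side_effect` is a plain attribute, each assignment silently discards the previous one, so removing the earlier lambda changes nothing in the test's behaviour.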
📒 Files selected for processing (3)
- requirements.txt
- tests/unit/test_data_process.py
- tests/unit/test_unmask_messages.py
🚧 Files skipped from review as they are similar to previous changes (1)
- requirements.txt
The v5 compatibility changes renamed additional_special_tokens to extra_special_tokens and pinned transformers>=5.0.0. This breaks the LoRA training path in
training_hub, which depends on unsloth and still requires transformers v4.57.
This PR adds version-conditional handling so both v4.57 and v5+ work:
| transformers version | special-tokens key |
| --- | --- |
| v4.x | `additional_special_tokens` |
| v5+ | `extra_special_tokens` |
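The version-conditional handling described above can be sketched as a caller that goes through one shared key. `SPECIAL_TOKENS_KEY` here is hard-wired to the v4 value and `FakeTokenizer` is a test stub; both are illustrative assumptions, not the PR's real code:

```python
# Hypothetical caller built on a version-agnostic key.
SPECIAL_TOKENS_KEY = "additional_special_tokens"  # "extra_special_tokens" on v5+

def add_chat_tokens(tokenizer, tokens):
    """Register extra special tokens under whichever key this version expects."""
    return tokenizer.add_special_tokens({SPECIAL_TOKENS_KEY: list(tokens)})

class FakeTokenizer:
    """Tiny stub mimicking add_special_tokens returning a count of tokens added."""
    def __init__(self):
        self.registered = {}
    def add_special_tokens(self, mapping):
        self.registered.update(mapping)
        return sum(len(v) for v in mapping.values())

tok = FakeTokenizer()
assert add_chat_tokens(tok, ["<|user|>", "<|assistant|>"]) == 2
assert tok.registered[SPECIAL_TOKENS_KEY] == ["<|user|>", "<|assistant|>"]
```

Routing every caller and test through the single constant is what lets one codebase run on both transformers v4.57 and v5+.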
Summary by CodeRabbit
- Refactor: added a version-agnostic special-tokens compatibility layer (SPECIAL_TOKENS_KEY, get_extra_special_tokens()).
- Tests: updated unit tests to use the version-agnostic key.
- Chore: relaxed the transformers version constraint in requirements.