Add backwards compatibility for transformers v4.57 #684

Merged
RobotSail merged 6 commits into main from v4-backwards-compat
Feb 25, 2026
Conversation

@Maxusmusti (Collaborator) commented Feb 23, 2026

The v5 compatibility changes renamed additional_special_tokens to extra_special_tokens and pinned transformers>=5.0.0. This breaks the LoRA training path in
training_hub, which depends on unsloth and still requires transformers v4.57.

This PR adds version-conditional handling so both v4.57 and v5+ work:

  • Adds SPECIAL_TOKENS_KEY constant and get_extra_special_tokens() helper to tokenizer_utils.py that resolve to the correct API based on the installed
    transformers version
  • Updates all extra_special_tokens usages in tokenizer_utils.py and data_process.py to use the compat helpers
  • Lowers the version floor in requirements.txt from >=5.0.0 to >=4.57.0
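The helpers above can be sketched roughly as follows. SPECIAL_TOKENS_KEY and get_extra_special_tokens() are the names introduced by this PR; the bodies here are an illustration of the version-conditional approach, not the exact merged implementation.

```python
# A minimal sketch of the compat shim, assuming the rename happened at major
# version 5. The fallback version string is only so the sketch runs without
# transformers installed.
try:
    import transformers
    _version = transformers.__version__
except ImportError:
    _version = "4.57.1"  # illustrative fallback

# Only the major component matters; maxsplit=1 avoids splitting the full string.
_TRANSFORMERS_V5 = int(_version.split(".", maxsplit=1)[0]) >= 5

# transformers v5 renamed additional_special_tokens -> extra_special_tokens.
SPECIAL_TOKENS_KEY = (
    "extra_special_tokens" if _TRANSFORMERS_V5 else "additional_special_tokens"
)

def get_extra_special_tokens(tokenizer):
    """Return the tokenizer's extra special tokens under either API name."""
    return getattr(tokenizer, SPECIAL_TOKENS_KEY, None) or []
```

Callers then use `{SPECIAL_TOKENS_KEY: [...]}` when building the dict passed to add_special_tokens, so the same code path works on both v4.57 and v5+.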

Summary by CodeRabbit

  • Refactor

    • Made tokenizer handling version-agnostic so special tokens are detected and applied consistently across different Transformers releases.
  • Tests

    • Updated unit tests to use centralized special-token handling and simplified related mocks.
  • Chore

    • Adjusted Transformers compatibility baseline in project requirements.

@Maxusmusti Maxusmusti requested a review from RobotSail February 23, 2026 23:40
@coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c44b13 and 69f067a.

📒 Files selected for processing (1)
  • tests/unit/test_data_process.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_data_process.py

📝 Walkthrough

Adds a Transformers v4/v5 compatibility layer for tokenizer special-token handling: introduces SPECIAL_TOKENS_KEY and get_extra_special_tokens(), updates tokenizer setup and callers/tests to use the version-agnostic key, and relaxes the transformers version constraint in requirements.

Changes

  • Dependency Update (requirements.txt): Relaxed transformers constraint from >=5.0.0 to >=4.57.1.
  • Tokenizer Compatibility (src/instructlab/training/tokenizer_utils.py): Added _TRANSFORMERS_V5 flag and SPECIAL_TOKENS_KEY; added get_extra_special_tokens(tokenizer) and made tokenizer setup functions version-agnostic (use the constant key, return the annotated tokenizer).
  • Integration / Callers (src/instructlab/training/data_process.py): Imported SPECIAL_TOKENS_KEY and replaced literal special-token dict keys with the constant in add_special_tokens calls.
  • Tests (tests/unit/test_data_process.py, tests/unit/test_unmask_messages.py): Updated tests to import and use SPECIAL_TOKENS_KEY instead of literal special-token keys.
  • Metadata / Manifests (pyproject.toml, manifest_file): No behavioral changes; present in diff metadata.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through tokens, old and new,
I nudged a key so both could chew.
A tiny shim, a gentle tweak,
Now v4 and v5 both speak.
Carrots, code, compatibility too!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title directly summarizes the main objective: adding backwards compatibility for transformers v4.57. It accurately reflects the core change of introducing version-conditional handling to support both v4.57 and v5+.


@Maxusmusti Maxusmusti self-assigned this Feb 23, 2026
@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file labels Feb 23, 2026
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@requirements.txt`:
- Line 6: Replace the yanked version spec "transformers>=4.57.0" in
requirements.txt with a non-yanked stable floor (for example
"transformers>=4.56.2") so installs won't fail; update the line that currently
reads transformers>=4.57.0 to the chosen safe minimum and re-run pip install to
verify resolution.

In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-19: Tests hard-code the legacy key "additional_special_tokens"
when calling tokenizer.add_special_tokens, which breaks with transformers v5;
import the centralized SPECIAL_TOKENS_KEY from
src.instructlab.training.tokenizer_utils and replace those hard-coded string
keys in tests (the failing calls around tests/unit/test_data_process.py and
tests/unit/test_unmask_messages.py) so calls like tokenizer.add_special_tokens({
"additional_special_tokens": [...] }) become tokenizer.add_special_tokens({
SPECIAL_TOKENS_KEY: [...] }), ensuring compatibility with both v4 and v5 (you
can also reference get_extra_special_tokens if needed to read tokens from a
tokenizer).
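The test change described above can be sketched like this. The token strings are made up for illustration; in the real tests SPECIAL_TOKENS_KEY would be imported from instructlab.training.tokenizer_utils rather than defined locally.

```python
# Illustrative before/after for the test fix: call add_special_tokens with the
# centralized key instead of hard-coding the legacy v4 name.
from unittest.mock import MagicMock

# Stand-in for: from instructlab.training.tokenizer_utils import SPECIAL_TOKENS_KEY
SPECIAL_TOKENS_KEY = "additional_special_tokens"

tokenizer = MagicMock()

# before (v4-only): tokenizer.add_special_tokens({"additional_special_tokens": [...]})
# after (version-agnostic):
tokenizer.add_special_tokens({SPECIAL_TOKENS_KEY: ["<|user|>", "<|assistant|>"]})

tokenizer.add_special_tokens.assert_called_once_with(
    {SPECIAL_TOKENS_KEY: ["<|user|>", "<|assistant|>"]}
)
```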

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c517712 and 5b7972f.

📒 Files selected for processing (3)
  • requirements.txt
  • src/instructlab/training/data_process.py
  • src/instructlab/training/tokenizer_utils.py

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-14: The split call that computes _TRANSFORMERS_V5 currently
uses transformers.__version__.split(".")[0] and triggers pylint C0207; change
the split to use a maxsplit of 1 (i.e., transformers.__version__.split(".",
maxsplit=1)[0]) so only the major version is parsed; update the expression that
defines _TRANSFORMERS_V5 accordingly where it appears.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b7972f and 257805c.

📒 Files selected for processing (2)
  • src/instructlab/training/data_process.py
  • src/instructlab/training/tokenizer_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/instructlab/training/data_process.py

@RobotSail (Member) left a comment

LGTM, just check the CodeRabbit comments.

@mergify mergify bot added the one-approval label Feb 24, 2026
@mergify mergify bot added the testing Relates to testing label Feb 25, 2026
@coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/unit/test_data_process.py (1)

524-526: Remove overwritten encode.side_effect assignment.

At Line 524, self.mock_tokenizer.encode.side_effect is set, but it is replaced at Line 556 before test execution. This dead write adds noise and can confuse future maintenance.

♻️ Suggested cleanup
-        self.mock_tokenizer.encode.side_effect = lambda text, add_special_tokens=False: [
-            hash(text) % 1000 for _ in text.split()
-        ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_data_process.py` around lines 524 - 526, The assignment to
self.mock_tokenizer.encode.side_effect near the top of the test is dead because
it is overwritten later; remove that earlier assignment (the lambda creating
token ids via hash(text) % 1000) and keep only the later, intended side_effect
so tests are not confused by a redundant setup on self.mock_tokenizer.encode;
ensure any needed behaviour is provided by the remaining side_effect or by a
shared helper used by the tests.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08027e3 and 8c44b13.

📒 Files selected for processing (3)
  • requirements.txt
  • tests/unit/test_data_process.py
  • tests/unit/test_unmask_messages.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt

@mergify mergify bot added ci-failure and removed ci-failure labels Feb 25, 2026
@RobotSail RobotSail merged commit 320b794 into main Feb 25, 2026
14 of 18 checks passed
@RobotSail RobotSail deleted the v4-backwards-compat branch February 25, 2026 16:37