Add backwards compatibility for transformers v4.57 #684

Merged
RobotSail merged 6 commits into main from v4-backwards-compat
Feb 25, 2026
Conversation

@Maxusmusti (Collaborator) commented Feb 23, 2026

The v5 compatibility changes renamed additional_special_tokens to extra_special_tokens and pinned transformers>=5.0.0. This breaks the LoRA training path in
training_hub, which depends on unsloth and still requires transformers v4.57.

This PR adds version-conditional handling so both v4.57 and v5+ work:

  • Adds SPECIAL_TOKENS_KEY constant and get_extra_special_tokens() helper to tokenizer_utils.py that resolve to the correct API based on the installed
    transformers version
  • Updates all extra_special_tokens usages in tokenizer_utils.py and data_process.py to use the compat helpers
  • Lowers the version floor in requirements.txt from >=5.0.0 to >=4.57.0
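The helpers above can be sketched roughly as follows. SPECIAL_TOKENS_KEY and get_extra_special_tokens() are the names introduced by this PR; the bodies here are an illustration of the version-conditional approach, not the exact merged implementation.

```python
# A minimal sketch of the compat shim, assuming the rename happened at major
# version 5. The fallback version string is only so the sketch runs without
# transformers installed.
try:
    import transformers
    _version = transformers.__version__
except ImportError:
    _version = "4.57.1"  # illustrative fallback

# Only the major component matters; maxsplit=1 avoids splitting the full string.
_TRANSFORMERS_V5 = int(_version.split(".", maxsplit=1)[0]) >= 5

# transformers v5 renamed additional_special_tokens -> extra_special_tokens.
SPECIAL_TOKENS_KEY = (
    "extra_special_tokens" if _TRANSFORMERS_V5 else "additional_special_tokens"
)

def get_extra_special_tokens(tokenizer):
    """Return the tokenizer's extra special tokens under either API name."""
    return getattr(tokenizer, SPECIAL_TOKENS_KEY, None) or []
```

Callers then use `{SPECIAL_TOKENS_KEY: [...]}` when building the dict passed to add_special_tokens, so the same code path works on both v4.57 and v5+.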

Summary by CodeRabbit

  • Refactor

    • Made tokenizer handling version-agnostic so special tokens are detected and applied consistently across different Transformers releases.
  • Tests

    • Updated unit tests to use centralized special-token handling and simplified related mocks.
  • Chore

    • Adjusted Transformers compatibility baseline in project requirements.

@Maxusmusti Maxusmusti requested a review from RobotSail February 23, 2026 23:40
@coderabbitai bot commented Feb 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8c44b13 and 69f067a.

📒 Files selected for processing (1)
  • tests/unit/test_data_process.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/unit/test_data_process.py

📝 Walkthrough

Adds a Transformers v4/v5 compatibility layer for tokenizer special-token handling: introduces SPECIAL_TOKENS_KEY and get_extra_special_tokens(), updates tokenizer setup and callers/tests to use the version-agnostic key, and relaxes the transformers version constraint in requirements.

Changes

  • Dependency Update (requirements.txt): Relaxed transformers constraint from >=5.0.0 to >=4.57.1.
  • Tokenizer Compatibility (src/instructlab/training/tokenizer_utils.py): Added _TRANSFORMERS_V5 flag and SPECIAL_TOKENS_KEY; added get_extra_special_tokens(tokenizer) and made tokenizer setup functions version-agnostic (use the constant key, return the annotated tokenizer).
  • Integration / Callers (src/instructlab/training/data_process.py): Imported SPECIAL_TOKENS_KEY and replaced literal special-token dict keys with the constant in add_special_tokens calls.
  • Tests (tests/unit/test_data_process.py, tests/unit/test_unmask_messages.py): Updated tests to import and use SPECIAL_TOKENS_KEY instead of literal special-token keys.
  • Metadata / Manifests (pyproject.toml, manifest_file): No behavioral changes; present in diff metadata.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through tokens, old and new,
I nudged a key so both could chew.
A tiny shim, a gentle tweak,
Now v4 and v5 both speak.
Carrots, code, compatibility too!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The title directly summarizes the main objective: adding backwards compatibility for transformers v4.57. It accurately reflects the core change of introducing version-conditional handling to support both v4.57 and v5+.


@Maxusmusti Maxusmusti self-assigned this Feb 23, 2026
@mergify mergify bot added ci-failure dependencies Pull requests that update a dependency file labels Feb 23, 2026
@coderabbitai bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@requirements.txt`:
- Line 6: Replace the yanked version spec "transformers>=4.57.0" in
requirements.txt with a non-yanked stable floor (for example
"transformers>=4.56.2") so installs won't fail; update the line that currently
reads transformers>=4.57.0 to the chosen safe minimum and re-run pip install to
verify resolution.

In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-19: Tests hard-code the legacy key "additional_special_tokens"
when calling tokenizer.add_special_tokens, which breaks with transformers v5;
import the centralized SPECIAL_TOKENS_KEY from
src.instructlab.training.tokenizer_utils and replace those hard-coded string
keys in tests (the failing calls around tests/unit/test_data_process.py and
tests/unit/test_unmask_messages.py) so calls like tokenizer.add_special_tokens({
"additional_special_tokens": [...] }) become tokenizer.add_special_tokens({
SPECIAL_TOKENS_KEY: [...] }), ensuring compatibility with both v4 and v5 (you
can also reference get_extra_special_tokens if needed to read tokens from a
tokenizer).
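The test change described above can be sketched like this. The token strings are made up for illustration; in the real tests SPECIAL_TOKENS_KEY would be imported from instructlab.training.tokenizer_utils rather than defined locally.

```python
# Illustrative before/after for the test fix: call add_special_tokens with the
# centralized key instead of hard-coding the legacy v4 name.
from unittest.mock import MagicMock

# Stand-in for: from instructlab.training.tokenizer_utils import SPECIAL_TOKENS_KEY
SPECIAL_TOKENS_KEY = "additional_special_tokens"

tokenizer = MagicMock()

# before (v4-only): tokenizer.add_special_tokens({"additional_special_tokens": [...]})
# after (version-agnostic):
tokenizer.add_special_tokens({SPECIAL_TOKENS_KEY: ["<|user|>", "<|assistant|>"]})

tokenizer.add_special_tokens.assert_called_once_with(
    {SPECIAL_TOKENS_KEY: ["<|user|>", "<|assistant|>"]}
)
```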

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c517712 and 5b7972f.

📒 Files selected for processing (3)
  • requirements.txt
  • src/instructlab/training/data_process.py
  • src/instructlab/training/tokenizer_utils.py

@coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/instructlab/training/tokenizer_utils.py`:
- Around line 10-14: The split call that computes _TRANSFORMERS_V5 currently
uses transformers.__version__.split(".")[0] and triggers pylint C0207; change
the split to use a maxsplit of 1 (i.e., transformers.__version__.split(".",
maxsplit=1)[0]) so only the major version is parsed; update the expression that
defines _TRANSFORMERS_V5 accordingly where it appears.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5b7972f and 257805c.

📒 Files selected for processing (2)
  • src/instructlab/training/data_process.py
  • src/instructlab/training/tokenizer_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/instructlab/training/data_process.py

@RobotSail (Member) left a comment

LGTM, just check the CodeRabbit comments.

@mergify mergify bot added the one-approval label Feb 24, 2026
@mergify mergify bot added the testing Relates to testing label Feb 25, 2026
@coderabbitai bot left a comment

🧹 Nitpick comments (1)
tests/unit/test_data_process.py (1)

524-526: Remove overwritten encode.side_effect assignment.

At Line 524, self.mock_tokenizer.encode.side_effect is set, but it is replaced at Line 556 before test execution. This dead write adds noise and can confuse future maintenance.

♻️ Suggested cleanup
-        self.mock_tokenizer.encode.side_effect = lambda text, add_special_tokens=False: [
-            hash(text) % 1000 for _ in text.split()
-        ]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/test_data_process.py` around lines 524 - 526, The assignment to
self.mock_tokenizer.encode.side_effect near the top of the test is dead because
it is overwritten later; remove that earlier assignment (the lambda creating
token ids via hash(text) % 1000) and keep only the later, intended side_effect
so tests are not confused by a redundant setup on self.mock_tokenizer.encode;
ensure any needed behaviour is provided by the remaining side_effect or by a
shared helper used by the tests.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 08027e3 and 8c44b13.

📒 Files selected for processing (3)
  • requirements.txt
  • tests/unit/test_data_process.py
  • tests/unit/test_unmask_messages.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt

@mergify mergify bot added ci-failure and removed ci-failure labels Feb 25, 2026
@RobotSail RobotSail merged commit 320b794 into main Feb 25, 2026
14 of 18 checks passed
@RobotSail RobotSail deleted the v4-backwards-compat branch February 25, 2026 16:37