
Fix ByteLevel-BPE tokenizers silently breaking in LlamaTokenizer#45345

Closed
ansley wants to merge 1 commit into huggingface:main from ansley:ansley/tokenizer-bug

Conversation

@ansley

@ansley ansley commented Apr 9, 2026

The `transformers` V5 "rm slow tokenizers" refactor (#40936) aliased
`LlamaTokenizerFast` to `LlamaTokenizer`, whose `__init__`
unconditionally installs a SentencePiece Metaspace pre-tokenizer. This
is correct for classic Llama/Llama-2 models but silently breaks newer
models that use ByteLevel BPE under the same
`tokenizer_class="LlamaTokenizerFast"` label.
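The failure mode above can be sketched in plain Python. This is a minimal, hypothetical model of the guard the fix needs (the class and function names below are stand-ins, not the actual `transformers` internals, which inspect the `tokenizers` backend object):

```python
# Hypothetical sketch: only install the Metaspace pre-tokenizer when the
# backend is not already configured with ByteLevel. All names are stand-ins.

class FakeBackend:
    """Stand-in for a tokenizers.Tokenizer; real code would inspect
    tokenizer.backend_tokenizer.pre_tokenizer instead of a plain string."""
    def __init__(self, pre_tokenizer):
        self.pre_tokenizer = pre_tokenizer

def maybe_install_metaspace(backend):
    # Classic Llama/Llama-2 (SentencePiece) tokenizers need Metaspace;
    # ByteLevel-BPE tokenizers must be left untouched, or the ByteLevel
    # space-encoding is clobbered and decoded text loses its spaces.
    if backend.pre_tokenizer != "ByteLevel":
        backend.pre_tokenizer = "Metaspace"
    return backend.pre_tokenizer
```

Under this sketch, `maybe_install_metaspace(FakeBackend("ByteLevel"))` leaves the ByteLevel pre-tokenizer intact, while `maybe_install_metaspace(FakeBackend(None))` installs Metaspace, preserving the V4 behavior for classic SentencePiece models.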

Reproduction

transformers V4 (correct behavior)

~/modular $ python3
Python 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
4.57.3
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1097, 7990, 1294, 1261, 26428]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
a cat in a garden
>>> 

transformers V5 (incorrect behavior)

(venv) ~/modular $ python3
Python 3.13.11 (main, Dec  9 2025, 19:04:10) [Clang 21.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
5.2.0
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1413, 8002, 1393, 38083]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
acatinagarden
>>> 

transformers V5 with bugfix (correct behavior)

(venv) ~/transformers $ python3
Python 3.13.11 (main, Dec  9 2025, 19:04:10) [Clang 21.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
5.6.0.dev0
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1097, 7990, 1294, 1261, 26428]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
a cat in a garden
>>> 

Validation

  • I confirm that this is not a pure code agent PR.
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
    No, it was not discussed. I ran into this bug at work, and the fix was clear to me, so I made the change without bothering to file an issue.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
    No, I didn't feel like any documentation changes were relevant, since this is just a small bugfix.
  • Did you write any new necessary tests?
    Yes, though I don't think tests were strictly necessary in this case.

@ansley
Author

ansley commented Apr 9, 2026

cc @ArthurZucker @itazap for review

ansley force-pushed the ansley/tokenizer-bug branch from e5577e1 to 1ffe5e4 on April 9, 2026 at 14:53
ansley changed the title from "Fix ByteLevel-BPE tokenizers breaking in subclasses with custom __init__" to "Fix ByteLevel-BPE tokenizers silently breaking in LlamaTokenizer" on Apr 9, 2026
ansley force-pushed the ansley/tokenizer-bug branch from 1ffe5e4 to 9befb3c on April 9, 2026 at 15:38
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45345&sha=9befb3

@ansley
Author

ansley commented Apr 9, 2026

I'm unfamiliar with the transformers CI, so I'd appreciate some guidance. It looks like the failing test is test_from_pretrained_dynamic_processor. Is this a known issue? The test passes on both main and my feature branch on my local B200, so I was unable to repro. Thanks!

@ArthurZucker
Collaborator

ArthurZucker commented Apr 10, 2026

Hey! The issue is that the model "black-forest-labs/FLUX.2-dev" would need an update to use `PreTrainedTokenizerFast` as its `tokenizer_class`, since it's not a `LlamaTokenizer`. As you say, `LlamaTokenizer` is for v1-2; v3 does not use it.

v5 did break this, but it's fully documented in the release notes! 🤗
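The repo-side fix described here amounts to a one-field change in the model's `tokenizer_config.json`. A minimal sketch, operating on a hypothetical local copy of that file's contents rather than the actual Hub repo:

```python
import json

# Hypothetical local copy of the model repo's tokenizer_config.json contents.
config = {"tokenizer_class": "LlamaTokenizerFast"}

# The canonical fix: point tokenizer_class at PreTrainedTokenizerFast so
# AutoTokenizer skips the Llama-specific __init__ and loads tokenizer.json
# (including its ByteLevel pre-tokenizer) unmodified.
config["tokenizer_class"] = "PreTrainedTokenizerFast"

print(json.dumps(config, indent=2))
```

With that change in place, `AutoTokenizer.from_pretrained(...)` resolves to the generic fast tokenizer and no model-specific pre-tokenizer is installed on top of it.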

@ansley
Copy link
Copy Markdown
Author

ansley commented Apr 10, 2026

@ArthurZucker Thanks for your fast response! I'll close this PR out, then. We (Modular) already have a fix in our internal codebase where we're swapping the tokenizer out, so it's good to know that that's the canonical solution.
