
Fix ByteLevel-BPE tokenizers silently breaking in LlamaTokenizer#45345

Closed
ansley wants to merge 1 commit into huggingface:main from ansley:ansley/tokenizer-bug

Conversation

@ansley

@ansley ansley commented Apr 9, 2026

The `transformers` V5 "rm slow tokenizers" refactor (#40936) aliased
`LlamaTokenizerFast` to `LlamaTokenizer`, whose `__init__`
unconditionally installs a SentencePiece Metaspace pre-tokenizer. This
is correct for classic Llama/Llama-2 models but silently breaks newer
models that use ByteLevel BPE under the same
`tokenizer_class="LlamaTokenizerFast"` label.
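The failure mode above can be sketched in plain Python. This is a minimal, hypothetical model of the guard the fix needs (the class and function names below are stand-ins, not the actual `transformers` internals, which inspect the `tokenizers` backend object):

```python
# Hypothetical sketch: only install the Metaspace pre-tokenizer when the
# backend is not already configured with ByteLevel. All names are stand-ins.

class FakeBackend:
    """Stand-in for a tokenizers.Tokenizer; real code would inspect
    tokenizer.backend_tokenizer.pre_tokenizer instead of a plain string."""
    def __init__(self, pre_tokenizer):
        self.pre_tokenizer = pre_tokenizer

def maybe_install_metaspace(backend):
    # Classic Llama/Llama-2 (SentencePiece) tokenizers need Metaspace;
    # ByteLevel-BPE tokenizers must be left untouched, or the ByteLevel
    # space-encoding is clobbered and decoded text loses its spaces.
    if backend.pre_tokenizer != "ByteLevel":
        backend.pre_tokenizer = "Metaspace"
    return backend.pre_tokenizer
```

Under this sketch, `maybe_install_metaspace(FakeBackend("ByteLevel"))` leaves the ByteLevel pre-tokenizer intact, while `maybe_install_metaspace(FakeBackend(None))` installs Metaspace, preserving the V4 behavior for classic SentencePiece models.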

Reproduction

transformers V4 (correct behavior)

~/modular $ python3
Python 3.10.12 (main, Mar  3 2026, 11:56:32) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
4.57.3
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1097, 7990, 1294, 1261, 26428]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
a cat in a garden
>>> 

transformers V5 (incorrect behavior)

(venv) ~/modular $ python3
Python 3.13.11 (main, Dec  9 2025, 19:04:10) [Clang 21.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
5.2.0
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1413, 8002, 1393, 38083]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
acatinagarden
>>> 

transformers V5 with bugfix (correct behavior)

(venv) ~/transformers $ python3
Python 3.13.11 (main, Dec  9 2025, 19:04:10) [Clang 21.1.4 ] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import transformers
>>> print(transformers.__version__)
5.6.0.dev0
>>> from transformers import AutoTokenizer
>>> tok = AutoTokenizer.from_pretrained("black-forest-labs/FLUX.2-dev", subfolder="tokenizer")
>>> print(tok.encode("a cat in a garden", add_special_tokens=False))
[1097, 7990, 1294, 1261, 26428]
>>> print(tok.decode(tok.encode("a cat in a garden", add_special_tokens=False)))
a cat in a garden
>>> 

Validation

  • I confirm that this is not a pure code agent PR.
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
    No, it was not discussed. I ran into this bug at work, and the fix was clear to me, so I made the change without bothering to file an issue.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
    No, I didn't feel like any documentation changes were relevant, since this is just a small bugfix.
  • Did you write any new necessary tests?
    Yes, though I don't think tests were strictly necessary in this case.

@ansley
Author

ansley commented Apr 9, 2026

cc @ArthurZucker @itazap for review

ansley force-pushed the ansley/tokenizer-bug branch from e5577e1 to 1ffe5e4 on April 9, 2026 at 14:53
ansley changed the title from "Fix ByteLevel-BPE tokenizers breaking in subclasses with custom __init__" to "Fix ByteLevel-BPE tokenizers silently breaking in LlamaTokenizer" on Apr 9, 2026
ansley force-pushed the ansley/tokenizer-bug branch from 1ffe5e4 to 9befb3c on April 9, 2026 at 15:38
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: llama

@github-actions
Contributor

github-actions bot commented Apr 9, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45345&sha=9befb3

@ansley
Author

ansley commented Apr 9, 2026

I'm unfamiliar with the transformers CI, so I'd appreciate some guidance. It looks like the failing test is test_from_pretrained_dynamic_processor. Is this a known issue? The test passes on both main and my feature branch on my local B200, so I was unable to repro. Thanks!

@ArthurZucker
Collaborator

ArthurZucker commented Apr 10, 2026

Hey! The issue is that the model "black-forest-labs/FLUX.2-dev" would need an update to use `PreTrainedTokenizerFast` as its `tokenizer_class`, since it's not a `LlamaTokenizer`. As you say, `LlamaTokenizer` is for v1-2; v3 does not use it.

v5 did break this, but it's fully documented in the release notes! 🤗
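The repo-side fix described here amounts to a one-field change in the model's `tokenizer_config.json`. A minimal sketch, operating on a hypothetical local copy of that file's contents rather than the actual Hub repo:

```python
import json

# Hypothetical local copy of the model repo's tokenizer_config.json contents.
config = {"tokenizer_class": "LlamaTokenizerFast"}

# The canonical fix: point tokenizer_class at PreTrainedTokenizerFast so
# AutoTokenizer skips the Llama-specific __init__ and loads tokenizer.json
# (including its ByteLevel pre-tokenizer) unmodified.
config["tokenizer_class"] = "PreTrainedTokenizerFast"

print(json.dumps(config, indent=2))
```

With that change in place, `AutoTokenizer.from_pretrained(...)` resolves to the generic fast tokenizer and no model-specific pre-tokenizer is installed on top of it.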

@ansley
Copy link
Copy Markdown
Author

ansley commented Apr 10, 2026

@ArthurZucker Thanks for your fast response! I'll close this PR out, then. We (Modular) already have a fix in our internal codebase where we're swapping the tokenizer out, so it's good to know that that's the canonical solution.
