Fix ByteLevel-BPE tokenizers silently breaking in LlamaTokenizer#45345
ansley wants to merge 1 commit into huggingface:main
cc @ArthurZucker @itazap for review
The `transformers` V5 "rm slow tokenizers" refactor (huggingface#40936) aliased `LlamaTokenizerFast` to `LlamaTokenizer`, whose `__init__` unconditionally installs a SentencePiece Metaspace pre-tokenizer. This is correct for classic Llama/Llama-2 models but silently breaks newer models that use ByteLevel BPE under the same `tokenizer_class="LlamaTokenizerFast"` label.
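The shape of the fix can be sketched as a guard over the serialized `tokenizer.json`: only install the Metaspace pre-tokenizer when the checkpoint does not already declare a ByteLevel one. This is a minimal sketch, not the actual patch — the helper name `needs_metaspace` and the inline JSON fragments are hypothetical; the real change lives in `LlamaTokenizer.__init__`:

```python
def needs_metaspace(tokenizer_json: dict) -> bool:
    """Return True only when it is safe to install the SentencePiece
    Metaspace pre-tokenizer (hypothetical helper for illustration)."""
    pre = tokenizer_json.get("pre_tokenizer") or {}

    def uses_byte_level(node) -> bool:
        # ByteLevel-BPE checkpoints declare a ByteLevel pre-tokenizer,
        # possibly nested inside a Sequence; overriding it corrupts them.
        if not isinstance(node, dict):
            return False
        if node.get("type") == "ByteLevel":
            return True
        return any(uses_byte_level(p) for p in node.get("pretokenizers", []))

    return not uses_byte_level(pre)


# Classic Llama/Llama-2 tokenizer.json files carry no pre_tokenizer entry:
classic = {"model": {"type": "BPE"}}
# Newer ByteLevel-BPE models shipped under the same tokenizer_class label:
byte_level = {"pre_tokenizer": {"type": "ByteLevel", "add_prefix_space": False}}

print(needs_metaspace(classic))     # True
print(needs_metaspace(byte_level))  # False
```

The recursion over `pretokenizers` matters because real checkpoints often wrap ByteLevel inside a `Sequence` pre-tokenizer rather than using it at the top level.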
[For maintainers] Suggested jobs to run (before merge): run-slow: llama
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45345&sha=9befb3
I'm unfamiliar with the
Hey! The issue is that the v5 release did break this, but it's fully documented in the release notes! 🤗
@ArthurZucker Thanks for your fast response! I'll close this PR out, then. We (Modular) already have a fix in our internal codebase where we swap the tokenizer out, so it's good to know that that's the canonical solution.
Reproduction

- transformers V4 (correct behavior)
- transformers V5 (incorrect behavior)
- transformers V5 with bugfix (correct behavior)

Validation
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
  No, it was not discussed. I ran into this bug at work, and the fix was clear to me, so I made the change without bothering to file an issue.
- Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  No, I didn't feel like any documentation changes were relevant, since this is just a small bugfix.
- Did you write any new necessary tests?
  Yes, although tests were not necessary in this case.