[Qwen3.5] Fix caching bug in GDN layer for autoregressive mode #3907
Open
Rohan-Bierneni wants to merge 1 commit into
Add mini config model support for q3.5
Wrong config name updated
Remove special casing for caching since using existing kvcache class
return kvcache instead of active_cache
Use kvcache class and remove extra logic in decoders.py
Add logic for proper batching of gdn caches
Update for nnx issue when batch size > 1
Remove GDN specific cache
Fixed linter issues
Run linter on qwen3.py
Description
When caching was first enabled in the GDN layer during Qwen3-Next model bringup, autoregressive mode had a bug that surfaced when running inference with api_server, our benchmarking tool. api_server managed the GDN caches incorrectly, which caused repeated tokens in the decoded output.
This PR fixes the bug by reusing the existing kvcache class in kvcache.py to store the GDN recurrent_state and conv_state. The kvcache class appears to be well integrated with api_server for batching, padding, updating, and so on.
For new models with special cache structures, reusing the kvcache class as much as possible appears to be the fastest route to functional decoding, since it is already integrated with our inference framework.
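To illustrate why per-slot decode state matters, here is a minimal toy sketch in plain Python. It is not the kvcache.py API and not the code in this PR: `ToyGdnCache`, `decode_step`, `CONV_WIDTH`, and the `write_back` flag are all hypothetical names, and the "layer" is a scalar stand-in for the real GDN math. It only shows one way stale cache state can produce identical outputs on every decode step; the actual root cause in api_server may differ.

```python
from dataclasses import dataclass, field

CONV_WIDTH = 4  # hypothetical conv kernel width, chosen only for illustration


@dataclass
class ToyGdnCache:
    """Per-batch-slot decode state. Hypothetical; not the kvcache.py class."""

    batch_size: int
    conv_state: list = field(init=False)       # last CONV_WIDTH - 1 inputs per slot
    recurrent_state: list = field(init=False)  # one scalar recurrent state per slot

    def __post_init__(self):
        self.conv_state = [[0.0] * (CONV_WIDTH - 1) for _ in range(self.batch_size)]
        self.recurrent_state = [0.0] * self.batch_size


def decode_step(cache, slot, x, decay=0.5, write_back=True):
    """Run one toy autoregressive step for a single batch slot."""
    # Causal conv over the cached window plus the new input.
    window = cache.conv_state[slot] + [x]
    conv_out = sum(window) / CONV_WIDTH
    # Toy gated recurrence: decay the old state, add the new contribution.
    new_state = decay * cache.recurrent_state[slot] + conv_out
    if write_back:
        # Persist both states so the next step sees this step's history.
        cache.conv_state[slot] = window[1:]
        cache.recurrent_state[slot] = new_state
    return new_state


good = ToyGdnCache(batch_size=2)
good_outputs = [decode_step(good, slot=0, x=1.0) for _ in range(3)]

stale = ToyGdnCache(batch_size=2)
stale_outputs = [decode_step(stale, slot=0, x=1.0, write_back=False) for _ in range(3)]

print(good_outputs)   # state evolves from step to step
print(stale_outputs)  # same value every step, analogous to repeated tokens
```

In this sketch, each slot owns its own conv_state and recurrent_state, so updating slot 0 leaves slot 1 untouched, which is the property batching relies on.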
Tests
I ran the standard decode.py and an inference benchmark on api_server for Qwen3.5, which uses the GDN layer.
decode.py output: https://paste.googleplex.com/6084567757881344
Test run with api_server for Qwen3.5-35b-a3b:
Checklist
Before submitting this PR, please make sure (put X in square brackets):
gemini-review label.