
Drop unused Gemma4TextAttention weights when sharing KV Cache#45328

Open
RyanMullins wants to merge 4 commits into huggingface:main from RyanMullins:gemma-kv-cache-sharing

Conversation

Contributor

@RyanMullins RyanMullins commented Apr 8, 2026

What does this PR do?

Fixes #45242

  • Drops k_proj, k_norm, and v_proj weights for Gemma4TextAttention modules from the checkpoint if the layer shares KV cache values.

These changes can also be adapted to Gemma 3n if that's desirable.

I have a weights update I can push to drop the unused matrices from the checkpoints, but I want to wait until this PR lands before we push those.
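The change described above can be sketched in plain Python. This is a minimal illustration, not the actual transformers implementation: `Linear` here is a stand-in for `torch.nn.Linear`, and the `is_kv_shared_layer` flag and module names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """Minimal stand-in for torch.nn.Linear, tracking only weight shapes."""
    in_features: int
    out_features: int


class TextAttention:
    """Sketch of an attention block that skips K/V weights on shared layers."""

    def __init__(self, hidden_size: int, is_kv_shared_layer: bool):
        self.is_kv_shared_layer = is_kv_shared_layer
        # Query and output projections are needed by every layer.
        self.q_proj = Linear(hidden_size, hidden_size)
        self.o_proj = Linear(hidden_size, hidden_size)
        # k_proj / v_proj (and, in the real model, k_norm) only exist on
        # layers that compute their own KV states; layers that share a KV
        # cache reuse the states written by an earlier layer instead.
        if not is_kv_shared_layer:
            self.k_proj = Linear(hidden_size, hidden_size)
            self.v_proj = Linear(hidden_size, hidden_size)


shared = TextAttention(64, is_kv_shared_layer=True)   # no k_proj / v_proj
full = TextAttention(64, is_kv_shared_layer=False)    # has all projections
print(hasattr(shared, "k_proj"), hasattr(full, "k_proj"))  # False True
```

Because the submodules are never created on shared layers, their weights simply do not exist in the model, which is what makes the memory savings possible.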

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case. On Slack.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp @Cyrilvallez

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 48be161 to 23b0072 on April 8, 2026 20:46
@lucianommartins
Contributor

Hi @Cyrilvallez / HuggingFace team,

I talked to @RyanMullins about this PR, and the team and I acknowledge that this is a critical problem to fix. But since it involves broader architectural changes (i.e. how to handle QKV, new model checkpoints, etc.), we would like to ask that merging this PR be put on hold.

We (on the Google side) want to coordinate and validate these code/checkpoint changes across all internal and external stakeholders, to make sure we don't break other customers using Gemma4 models while we sync this to other environments.

@Cyrilvallez
Member

Hey @RyanMullins @lucianommartins! I'm not sure why, but this PR was mostly a full-on copy of my PR here, #45312, which was just merged... Just seeing your message about holding off a bit, @lucianommartins; very sorry for that, I did not see it before merging. However, the architectural change is quite minimal (it only matters with `use_cache=False`, which is not very common), so it should not cause any issues. Let me know if that's still a problem, and we can revert

@Cyrilvallez
Member

I think @RyanMullins mostly wanted to update the weight conversion script here, and mistakenly brought in my changes from the other PR

@Cyrilvallez
Member

@RyanMullins @lucianommartins #45336 completely drops the weights and silently skips loading them from the hub, so the weights in the hub repos do not have to be changed, while keeping the memory benefits of dropping them when using the model
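The loading-side approach described here amounts to filtering the now-unexpected keys out of a hub checkpoint before loading. A rough sketch of that filtering follows; the key pattern and `shared_kv_layers` set are assumptions for illustration, and transformers implements this via model-level ignore attributes rather than an ad-hoc regex.

```python
import re

# Toy checkpoint: suppose layer 2 shares layer 0's KV cache, so its
# k_proj / v_proj / k_norm weights exist on the hub but are never used.
state_dict = {
    "layers.0.self_attn.q_proj.weight": 0.0,
    "layers.0.self_attn.k_proj.weight": 0.0,
    "layers.2.self_attn.q_proj.weight": 0.0,
    "layers.2.self_attn.k_proj.weight": 0.0,
    "layers.2.self_attn.k_norm.weight": 0.0,
}

shared_kv_layers = {2}  # layers that read KV from another layer's cache
drop = re.compile(r"layers\.(\d+)\.self_attn\.(k_proj|v_proj|k_norm)\.")

# Silently skip KV weights belonging to shared layers; keep everything else.
filtered = {
    key: value
    for key, value in state_dict.items()
    if (m := drop.search(key)) is None or int(m.group(1)) not in shared_kv_layers
}
print(sorted(filtered))  # layer 2's k_proj / k_norm entries are gone
```

Filtering before loading means the hub checkpoints can keep their existing layout while the in-memory model stays slim, and no "unexpected keys" warning is shown to users.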

@RyanMullins
Contributor Author

Yeah I started with #45336 and then added the conditionals to drop the modules that aren't used when shared. There was a history difference between that PR and main yesterday, so I squashed it all and rebased on mine, hence the different histories here.

Landing #45336 without coordinating with vLLM is fine. But dropping the modules needs to be coordinated.

I'll rebase this morning and we can take it from there.

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 23b0072 to c363523 on April 9, 2026 12:22
@RyanMullins RyanMullins changed the title from "fix: KV cache sharing" to "Drop unused submodules from Gemma4TextAttention layers that share KV Cache" on Apr 9, 2026
@RyanMullins
Contributor Author

@Cyrilvallez synced this branch with upstream/main, and it now shows the correct isolation of the attention submodules based on the shared KV cache values. LMK what you think.

@Cyrilvallez
Member

#45336 also contains the required properties so that unexpected weights on the hub will not trigger huge warnings for users (without them, a huge loading report pops up saying that quite a lot of weights were not used), so I'm going to merge that one instead!

@RyanMullins RyanMullins closed this Apr 9, 2026
@RyanMullins RyanMullins reopened this Apr 9, 2026
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45328&sha=a187bc

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from a187bcc to 09156ad on April 9, 2026 14:07
@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 09156ad to 2685bc7 on April 9, 2026 18:29
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma4

@RyanMullins RyanMullins changed the title from "Drop unused submodules from Gemma4TextAttention layers that share KV Cache" to "Drop unused Gemma4TextAttention weights when sharing KV Cache" on Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Gemma 4] use_cache=False corrupts attention computation, producing garbage logits