
Drop unused Gemma4TextAttention weights when sharing KV Cache#45328

Open
RyanMullins wants to merge 4 commits into huggingface:main from RyanMullins:gemma-kv-cache-sharing

Conversation

Contributor

@RyanMullins RyanMullins commented Apr 8, 2026

What does this PR do?

Fixes #45242

  • Drops k_proj, k_norm, and v_proj weights for Gemma4TextAttention modules from the checkpoint if the layer shares KV cache values.

These changes can also be adapted to Gemma 3n if that's desirable.

I have a weights update I can push to drop the unused matrices from the checkpoints, but I want to wait until this PR lands before we push those.
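The change described above can be sketched in plain Python. This is a minimal illustration, not the actual transformers implementation: `Linear` here is a stand-in for `torch.nn.Linear`, and the `is_kv_shared_layer` flag and module names are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Linear:
    """Minimal stand-in for torch.nn.Linear, tracking only weight shapes."""
    in_features: int
    out_features: int


class TextAttention:
    """Sketch of an attention block that skips K/V weights on shared layers."""

    def __init__(self, hidden_size: int, is_kv_shared_layer: bool):
        self.is_kv_shared_layer = is_kv_shared_layer
        # Query and output projections are needed by every layer.
        self.q_proj = Linear(hidden_size, hidden_size)
        self.o_proj = Linear(hidden_size, hidden_size)
        # k_proj / v_proj (and, in the real model, k_norm) only exist on
        # layers that compute their own KV states; layers that share a KV
        # cache reuse the states written by an earlier layer instead.
        if not is_kv_shared_layer:
            self.k_proj = Linear(hidden_size, hidden_size)
            self.v_proj = Linear(hidden_size, hidden_size)


shared = TextAttention(64, is_kv_shared_layer=True)   # no k_proj / v_proj
full = TextAttention(64, is_kv_shared_layer=False)    # has all projections
print(hasattr(shared, "k_proj"), hasattr(full, "k_proj"))  # False True
```

Because the submodules are never created on shared layers, their weights simply do not exist in the model, which is what makes the memory savings possible.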

Code Agent Policy

The Transformers repo is currently being overwhelmed by a large number of PRs and issue comments written by
code agents. We are currently bottlenecked by our ability to review and respond to them. As a result,
we ask that new users do not submit pure code agent PRs at this time.
You may use code agents in drafting or to help you diagnose issues. We'd also ask autonomous "OpenClaw"-like agents
not to open any PRs or issues for the moment.

PRs that appear to be fully agent-written will probably be closed without review, and we may block users who do this
repeatedly or maliciously.

This is a rapidly-evolving situation that's causing significant shockwaves in the open-source community. As a result,
this policy is likely to be updated regularly in the near future. For more information, please read CONTRIBUTING.md.

  • I confirm that this is not a pure code agent PR.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case. On Slack.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@zucchini-nlp @Cyrilvallez

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 48be161 to 23b0072 on April 8, 2026 20:46
@lucianommartins
Contributor

Hi @Cyrilvallez / HuggingFace team,

I talked to @RyanMullins about this PR, and the team and I acknowledge that this is a critical problem to fix. But since it involves broader architectural changes (i.e. how to handle QKV, new model checkpoints, etc.), we would like to ask that merging this PR be put on hold.

We (on the Google side) want to coordinate and validate these code/checkpoint changes across all internal and external stakeholders, to make sure we don't break other customers using Gemma4 models while we sync this to other environments.

@Cyrilvallez
Member

Hey @RyanMullins @lucianommartins! I'm not sure why, but this PR was mostly a full-on copy of my PR here, #45312, which was just merged... Just seeing your message about holding off a bit, @lucianommartins; very sorry for that, I did not see it before merging. However, the architectural change is quite minimal (it only matters with `use_cache=False`, which is not very common), so it should not cause any issues. Let me know if that's still a problem, and we can revert

@Cyrilvallez
Member

I think @RyanMullins mostly wanted to update the weight conversion script here, and mistakenly brought in my changes from the other PR

@Cyrilvallez
Member

@RyanMullins @lucianommartins #45336 completely drops the weights and silently skips loading them from the hub, so the weights in the hub repos do not have to be changed, while keeping the memory benefits of dropping them when using the model
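The loading-side approach described here amounts to filtering the now-unexpected keys out of a hub checkpoint before loading. A rough sketch of that filtering follows; the key pattern and `shared_kv_layers` set are assumptions for illustration, and transformers implements this via model-level ignore attributes rather than an ad-hoc regex.

```python
import re

# Toy checkpoint: suppose layer 2 shares layer 0's KV cache, so its
# k_proj / v_proj / k_norm weights exist on the hub but are never used.
state_dict = {
    "layers.0.self_attn.q_proj.weight": 0.0,
    "layers.0.self_attn.k_proj.weight": 0.0,
    "layers.2.self_attn.q_proj.weight": 0.0,
    "layers.2.self_attn.k_proj.weight": 0.0,
    "layers.2.self_attn.k_norm.weight": 0.0,
}

shared_kv_layers = {2}  # layers that read KV from another layer's cache
drop = re.compile(r"layers\.(\d+)\.self_attn\.(k_proj|v_proj|k_norm)\.")

# Silently skip KV weights belonging to shared layers; keep everything else.
filtered = {
    key: value
    for key, value in state_dict.items()
    if (m := drop.search(key)) is None or int(m.group(1)) not in shared_kv_layers
}
print(sorted(filtered))  # layer 2's k_proj / k_norm entries are gone
```

Filtering before loading means the hub checkpoints can keep their existing layout while the in-memory model stays slim, and no "unexpected keys" warning is shown to users.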

@RyanMullins
Contributor Author

Yeah I started with #45336 and then added the conditionals to drop the modules that aren't used when shared. There was a history difference between that PR and main yesterday, so I squashed it all and rebased on mine, hence the different histories here.

Landing #45336 without coordinating with vLLM is fine. But dropping the modules needs to be coordinated.

I'll rebase this morning and we can take it from there.

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 23b0072 to c363523 on April 9, 2026 12:22
@RyanMullins RyanMullins changed the title from "fix: KV cache sharing" to "Drop unused submodules from Gemma4TextAttention layers that share KV Cache" on Apr 9, 2026
@RyanMullins
Contributor Author

@Cyrilvallez synced this branch with upstream/main, and it now shows the correct isolation of the attention submodules based on the shared KV cache values. LMK what you think.

@Cyrilvallez
Member

#45336 also contains the required properties so that unexpected weights on the hub will not trigger huge warnings for users (without them, a huge loading report pops up saying that quite a lot of weights were not used), so I'm going to merge that one instead!

@RyanMullins RyanMullins closed this Apr 9, 2026
@RyanMullins RyanMullins reopened this Apr 9, 2026
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=45328&sha=a187bc

@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from a187bcc to 09156ad on April 9, 2026 14:07
@RyanMullins RyanMullins force-pushed the gemma-kv-cache-sharing branch from 09156ad to 2685bc7 on April 9, 2026 18:29
@github-actions
Contributor

github-actions bot commented Apr 9, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: gemma4

@RyanMullins RyanMullins changed the title from "Drop unused submodules from Gemma4TextAttention layers that share KV Cache" to "Drop unused Gemma4TextAttention weights when sharing KV Cache" on Apr 9, 2026


Development

Successfully merging this pull request may close these issues.

[Gemma 4] use_cache=False corrupts attention computation, producing garbage logits