Question about Gemma 4 SWA on KCpp vs LlamaCpp #2098
Hello, I'm a bit confused by the extreme disparity in the K/V cache's VRAM cost of Gemma 4 31B between KoboldCpp and llama.cpp, on an RTX 3090 24GB with a Q4_K_M version of the model.
While I don't doubt @LostRuins's competence, I'm pretty sure I'm missing something here (and I feel I'm not the only one). Why is there such a massive difference in memory consumption between the two on this particular model?
Yeah, I found it odd too. I tried including SWA in the autofit calculations to match llama.cpp. Can you see if this test build works better, worse, or the same as before? (Ready in about 1 hour.) Windows: https://github.com/LostRuins/koboldcpp/actions/runs/24068765697
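For anyone curious what "including SWA in the autofit calculations" could mean in practice, here's a minimal sketch of an SWA-aware KV cache estimate. This is not KoboldCpp's actual autofit code, and every parameter value below is illustrative, not Gemma 4's real config.

```python
# Minimal sketch of an SWA-aware KV cache estimate; NOT KoboldCpp's
# actual autofit code, and all parameter values are illustrative.

def kv_cache_bytes(n_ctx, n_layers, n_swa_layers, swa_window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache bytes when some layers use sliding-window attention.

    bytes_per_elem=2 corresponds to an F16 cache.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    full_layers = n_layers - n_swa_layers
    # Full-attention layers cache every token in the context window.
    full_bytes = full_layers * n_ctx * per_token_per_layer
    # SWA layers only need to keep the last `swa_window` tokens.
    swa_bytes = n_swa_layers * min(n_ctx, swa_window) * per_token_per_layer
    return full_bytes + swa_bytes

# Treating every layer as full attention (n_swa_layers=0) overestimates
# the cache, which is how an autofit can leave VRAM on the table.
naive = kv_cache_bytes(24576, 48, 0, 1024, 16, 128)
aware = kv_cache_bytes(24576, 48, 40, 1024, 16, 128)
print(f"naive: {naive / 2**20:.0f} MiB, SWA-aware: {aware / 2**20:.0f} MiB")
```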
Anyway, please ping me if issues arise. I have merged the latest llama.cpp changes.
I don't use autofit; I set all the layers to go to the GPU (layer count value set to 255 or whatever). I don't exactly see it as a problem to be fixed on your side, given that, you know, your version runs faster and uses less VRAM. But the disparity in VRAM usage feels a bit too big to be put under the "meh, kobold is just better optimized" umbrella, so it'd be great to understand what's happening under the hood, as there's likely a trade-off somewhere.

My assumption, since the VRAM usage disparity matches it, would be KV cache quantization. Obviously I didn't use the KV cache quant options (so I expect F16 in both backends), but with Gemma 4 31B at 24K context and the SWA switch here, I get a KV cache size that is roughly the same as if it were Q8-quantized. So maybe your version overrides the KV cache quant settings with SWA for some reason, while mainline doesn't? It's quite out of my domain of competence, to be honest. All I can do is run the respective builds against my software and check the RAM usage.

The disparity is the same in the build you linked and in the rolling one. Yours still uses a lot less VRAM (which, again, is still kinda great, just suspicious :D)

Oh, btw: CUDA 12 / Windows for reference.
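As a side note on why an F16 SWA cache can look "Q8-sized": with made-up but plausible numbers (none of these are Gemma 4's real layer counts, head counts, or window size), a model where half the layers use a small sliding window lands near half the full-attention cache size, which is the same ratio you'd get from Q8-quantizing a non-SWA cache.

```python
# Made-up numbers to show how SWA can mimic Q8 quantization; these are
# NOT Gemma 4's actual layer counts, head counts, or window size.
n_ctx, n_layers, n_kv_heads, head_dim = 24576, 48, 16, 128
per_token_layer = 2 * n_kv_heads * head_dim * 2        # K + V at F16

full_f16 = n_layers * n_ctx * per_token_layer          # no SWA, F16
full_q8  = full_f16 // 2                               # no SWA, Q8

# Half the layers on a 1024-token sliding window, still F16:
swa_layers, window = n_layers // 2, 1024
swa_f16 = ((n_layers - swa_layers) * n_ctx
           + swa_layers * window) * per_token_layer

for name, b in [("full F16", full_f16), ("full Q8", full_q8),
                ("SWA F16", swa_f16)]:
    print(f"{name:>8}: {b / 2**20:6.0f} MiB")
```

With this particular mix, the SWA F16 cache comes out within a few percent of the full Q8 cache, so the two are easy to confuse from the VRAM numbers alone.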
Okay, that's on me. I found where the discrepancy was coming from. I never bothered enabling smart cache / checkpoints on KoboldCpp. On llama-server they're enabled by default, and worse, the checkpoints and RAM cache are broken with Gemma 4, leaking VRAM and RAM by the gallon, hence the memory usage discrepancy. Once checkpoints and the cache are disabled on llama.cpp, the VRAM and RAM usage are basically the same. Yay, mystery solved!
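For anyone landing here later, a rough sketch of the server invocation that avoids both features. The flag names are from memory of recent llama.cpp builds (`--swa-checkpoints` for the context checkpoints, `--cache-ram` for the host-RAM prompt cache, with 0 disabling each) and the model path is hypothetical, so verify against `llama-server --help` on your build:

```
# Flag names from memory, model path hypothetical; check
# `llama-server --help` on your build before relying on them.
llama-server -m gemma4-31b-q4_k_m.gguf -c 24576 -ngl 99 \
    --swa-checkpoints 0 --cache-ram 0
```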