Question about Gemma 4 SWA on KCpp vs LlamaCpp #2098
Hello, I'm a bit confused by the extreme disparity in the K/V cache's VRAM cost of Gemma 4 31B between KoboldCpp and llama.cpp, on an RTX 3090 24GB with a Q4_K_M version of the model.
While I don't doubt @LostRuins's competence, I'm pretty sure I'm missing something here (and I feel I'm not the only one). Why is there such a massive difference in memory consumption between the two on this particular model?
Yeah, I found it odd too. I tried including SWA in the autofit calculations to match llama.cpp. Can you see if this test build works better, worse, or the same as before? (Ready in about 1 hour.) Windows: https://github.com/LostRuins/koboldcpp/actions/runs/24068765697
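For anyone curious what "including SWA in the autofit calculations" could mean in practice, here's a minimal sketch of an SWA-aware KV cache estimate. This is not KoboldCpp's actual autofit code, and every parameter value below is illustrative, not Gemma 4's real config.

```python
# Minimal sketch of an SWA-aware KV cache estimate; NOT KoboldCpp's
# actual autofit code, and all parameter values are illustrative.

def kv_cache_bytes(n_ctx, n_layers, n_swa_layers, swa_window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache bytes when some layers use sliding-window attention.

    bytes_per_elem=2 corresponds to an F16 cache.
    """
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    full_layers = n_layers - n_swa_layers
    # Full-attention layers cache every token in the context window.
    full_bytes = full_layers * n_ctx * per_token_per_layer
    # SWA layers only need to keep the last `swa_window` tokens.
    swa_bytes = n_swa_layers * min(n_ctx, swa_window) * per_token_per_layer
    return full_bytes + swa_bytes

# Treating every layer as full attention (n_swa_layers=0) overestimates
# the cache, which is how an autofit can leave VRAM on the table.
naive = kv_cache_bytes(24576, 48, 0, 1024, 16, 128)
aware = kv_cache_bytes(24576, 48, 40, 1024, 16, 128)
print(f"naive: {naive / 2**20:.0f} MiB, SWA-aware: {aware / 2**20:.0f} MiB")
```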
Anyway, please ping me if issues arise. I have merged the latest llama.cpp changes.
I don't use autofit; I set all the layers to go to the GPU (layer count value set to 255 or whatever). I don't exactly see it as a problem to be fixed on your side, given that, you know, your version runs faster and uses less VRAM. But the disparity in VRAM usage feels a bit too big to be put under the "meh, kobold is just better optimized" umbrella, so it'd be great to understand what's happening under the hood, as there's likely a trade-off somewhere.

My assumption, since the VRAM usage disparity matches it, would be KV cache quantization. Obviously I didn't use the KV cache quant options (so I expect F16 in both backends), but with Gemma 4 31B at 24K context and the SWA switch here, I get a KV cache size that is roughly the same as if it were Q8-quantized. So maybe your version overrides the KV cache quant settings with SWA for some reason, while mainline doesn't? It's quite out of my domain of competence, to be honest. All I can do is run the respective builds against my software and check the RAM usage.

The disparity is the same in the build you linked and in the rolling one. Yours still uses a lot less VRAM (which, again, is still kinda great, just suspicious :D)

Oh, btw: CUDA 12 / Windows for reference.
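As a side note on why an F16 SWA cache can look "Q8-sized": with made-up but plausible numbers (none of these are Gemma 4's real layer counts, head counts, or window size), a model where half the layers use a small sliding window lands near half the full-attention cache size, which is the same ratio you'd get from Q8-quantizing a non-SWA cache.

```python
# Made-up numbers to show how SWA can mimic Q8 quantization; these are
# NOT Gemma 4's actual layer counts, head counts, or window size.
n_ctx, n_layers, n_kv_heads, head_dim = 24576, 48, 16, 128
per_token_layer = 2 * n_kv_heads * head_dim * 2        # K + V at F16

full_f16 = n_layers * n_ctx * per_token_layer          # no SWA, F16
full_q8  = full_f16 // 2                               # no SWA, Q8

# Half the layers on a 1024-token sliding window, still F16:
swa_layers, window = n_layers // 2, 1024
swa_f16 = ((n_layers - swa_layers) * n_ctx
           + swa_layers * window) * per_token_layer

for name, b in [("full F16", full_f16), ("full Q8", full_q8),
                ("SWA F16", swa_f16)]:
    print(f"{name:>8}: {b / 2**20:6.0f} MiB")
```

With this particular mix, the SWA F16 cache comes out within a few percent of the full Q8 cache, so the two are easy to confuse from the VRAM numbers alone.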
Okay, that's on me. I found where the discrepancy was coming from. I never bothered enabling smart cache / checkpoints on KoboldCpp. On llama-server they're enabled by default, and worse, the checkpoints and RAM cache are broken with Gemma 4, leaking VRAM and RAM by the gallon, hence the memory usage discrepancy. Once checkpoints and the cache are disabled on llama.cpp, the VRAM and RAM usage are basically the same. Yay, mystery solved!
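For anyone landing here later, a rough sketch of the server invocation that avoids both features. The flag names are from memory of recent llama.cpp builds (`--swa-checkpoints` for the context checkpoints, `--cache-ram` for the host-RAM prompt cache, with 0 disabling each) and the model path is hypothetical, so verify against `llama-server --help` on your build:

```
# Flag names from memory, model path hypothetical; check
# `llama-server --help` on your build before relying on them.
llama-server -m gemma4-31b-q4_k_m.gguf -c 24576 -ngl 99 \
    --swa-checkpoints 0 --cache-ram 0
```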