Hi, I'm getting buggy output from Qwen3-Coder-Next on the second, third, and every subsequent call of create_completion.
from llama_cpp import Llama

model = Llama(
    "path/to/gguf/file.gguf",  # placeholder path to the Qwen3-Coder-Next GGUF
    n_gpu_layers=-1,  # offload all layers to the GPU
)
prompt = """<|im_start|>system
You're a helpful AI assistant.
<|im_end|>
<|im_start|>user
Introduce yourself.
<|im_end|>
<|im_start|>assistant"""
res = model.create_completion(prompt, stop=["<|im_end|>"])
print(res["choices"][0]["text"])
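For completeness, the repeated calls are just back-to-back invocations on the same Llama instance; a minimal loop like this (the iteration count is arbitrary) triggers it:

# Re-run the identical completion on the same model object;
# the first iteration is fine, the later ones produce the buggy output.
for i in range(3):
    res = model.create_completion(prompt, stop=["<|im_end|>"])
    print(f"--- call {i + 1} ---")
    print(res["choices"][0]["text"])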
The first call of create_completion gave me:
Hello! I'm Qwen, a large-scale language model developed by Alibaba Cloud
which I guess is okay.
But the second call gave buggy output, and the log says:
WARN: memory_seq_rm(0, 23, -1) failed. Executing fallback: memory_seq_rm(0, 0, -1)
I guess there's a bug in the KV-cache management: judging from the log, removing the cache entries past the 23-token cached prefix failed, and the fallback cleared the whole sequence instead.
Note that this also happens with the previous version, 3.23.
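If the problem really is stale cache state, one workaround sketch is to clear the model state between calls. Llama.reset() resets the cached token state in llama-cpp-python, though I haven't verified that it avoids this particular bug:

# Possible workaround (untested against this bug): drop the cached
# token state so the next call re-evaluates the prompt from scratch.
model.reset()
res = model.create_completion(prompt, stop=["<|im_end|>"])
print(res["choices"][0]["text"])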