# Full tool-calling support, inference abort fixes, XML parsing, OpenAI streaming compliance (#413)
Open · devnen wants to merge 2 commits into theroyallab:main
## What This PR Does
TabbyAPI's tool-calling system worked for simple cases but had a collection of bugs that surfaced the moment you pushed it harder — a different model family, a stricter client, or just hitting Stop at the wrong time. This PR fixes all of them in one pass, tested end-to-end against Kilo Code, Roo Code, and OpenCode with Qwen3-Coder-Next on a dual-GPU setup.
The changes fall into four areas: XML tool-call parsing, OpenAI streaming protocol compliance, a Pydantic v2 union coercion fix, and inference abort handling. In addition, `tool_choice` is now respected, and a Jinja filter fix makes HuggingFace-native chat templates work out of the box.

Built against commit `41511f5` with ExLlamaV3 v0.0.22.

## Changes
### 1. XML Tool-Call Parsing
**Problem:** Qwen3-Coder models are trained to emit tool calls in XML format (`<function=name><parameter=key>value</parameter></function>`). TabbyAPI's tool-calling system only supported JSON via constrained generation, so XML tool calls were dumped as plain text with no `tool_calls` array in the response.

**Solution:**

- Added an XML tool-call parser (`from_xml()`) alongside the existing JSON path, with a `from_auto()` dispatcher that tries JSON → JSON-in-wrapper → XML in sequence.
- Added a `tool_call_format` metadata field to the Jinja template system, allowing templates to declare whether they expect `json`, `xml`, or `auto` format tool calls.
- Added a fallback scan of the `content` field for bare `<function=` patterns when the two-pass system doesn't trigger.
- Added type coercion (`json.loads()` on string arguments) to prevent crashes in multi-turn tool conversations where clients send `arguments` as a JSON string but the Jinja template expects a dict.
- Added a new template (`templates/tool_calls/qwen3_coder.jinja`) based on the official Qwen3-Coder-Next `chat_template.jinja`, with TabbyAPI metadata.

**Design notes:**

- The parser is regex-based rather than built on an XML library, because the format uses `<function=name>` with `=` in the tag name, which is invalid XML (a minimal sketch follows below). This matches the approach taken by vLLM, llama.cpp, and the official Qwen parser.
- Argument values go through `json.loads()` with a string fallback, explicitly avoiding `eval()` (ref: CVE-2025-9141 in vLLM's parser).
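For reference, a minimal sketch of the regex approach described above, assuming the Qwen3-Coder tag format shown in the problem statement (the function and regex names here are illustrative, not the exact helpers added in `endpoints/OAI/utils/tools.py`):

```python
import json
import re
from typing import Any

FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)


def parse_xml_tool_calls(text: str) -> list[dict[str, Any]]:
    """Extract Qwen3-Coder style <function=...> blocks into OAI-shaped tool calls."""
    calls = []
    for name, body in FUNC_RE.findall(text):
        args = {}
        for key, raw in PARAM_RE.findall(body):
            raw = raw.strip()
            try:
                # Prefer typed JSON values; fall back to the raw string.
                args[key] = json.loads(raw)
            except json.JSONDecodeError:
                args[key] = raw
        calls.append({
            "type": "function",
            "function": {"name": name.strip(), "arguments": json.dumps(args)},
        })
    return calls


print(parse_xml_tool_calls(
    "<function=get_weather><parameter=city>Berlin</parameter></function>"
))
```

The output is an OAI-shaped list that can be attached to the response's `tool_calls` array; real input additionally needs think-block stripping and whitespace handling, which the sketch omits.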
### 2. OpenAI Streaming Protocol Compliance

**Problem:** After adding XML parsing, non-streaming responses worked correctly, but strict clients (OpenCode / Vercel AI SDK) rejected streaming responses with `AI_TypeValidationError`. The SSE chunks were missing the required `index` field on tool-call deltas, emitting `role: "user"` instead of `"assistant"`, merging tool-call data with the finish signal, and leaking null fields.

**Solution:**

- Added `_build_tool_call_chunks()`, implementing a two-chunk emission pattern: one chunk with complete tool-call data (`role: "assistant"`, a `tool_calls` array with `index` values, `finish_reason: null`), followed by a separate finish chunk (`delta: {}`, `finish_reason: "tool_calls"`); see the sketch below.
- Added `_serialize_stream_chunk()` for consistent serialization across all chunk types, using `exclude_none=True` while restoring the semantically meaningful `finish_reason: null` on intermediate chunks.
- Modified `stream_generate_chat_completion()` to intercept tool-call generation results, parse them, and emit spec-compliant chunks before the normal chunk-building path.
- Removed the tool-call handling from `_create_stream_chunk()`, since it is now handled upstream.
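To make the two-chunk shape concrete, here is a sketch that builds both payloads as plain dicts (the real code constructs Pydantic models; the helper name and ID generation below are assumptions for illustration):

```python
import json
import secrets
import time


def build_tool_call_chunks(tool_calls: list[dict], model: str, chunk_id: str) -> tuple[dict, dict]:
    """Build the two SSE payloads: a data chunk carrying the complete tool
    calls, then a bare finish chunk."""
    created = int(time.time())
    data_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": created,
        "model": model,
        "choices": [{
            "index": 0,
            "delta": {
                "role": "assistant",
                "tool_calls": [
                    {
                        "index": i,  # required by strict clients (Vercel AI SDK)
                        "id": f"call_{secrets.token_hex(12)}",  # call_<24-char hex>
                        "type": "function",
                        "function": {
                            "name": call["name"],
                            "arguments": json.dumps(call["arguments"]),
                        },
                    }
                    for i, call in enumerate(tool_calls)
                ],
            },
            "finish_reason": None,  # kept explicitly null on the data chunk
        }],
    }
    finish_chunk = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": created,
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}],
    }
    return data_chunk, finish_chunk


data, finish = build_tool_call_chunks(
    [{"name": "get_weather", "arguments": {"city": "Berlin"}}],
    model="Qwen3-Coder-Next", chunk_id="chatcmpl-example",
)
print(f"data: {json.dumps(data)}\n\ndata: {json.dumps(finish)}\n\ndata: [DONE]")
```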
### 3. Pydantic v2 Union Coercion Fix

**Problem:** The `index` field added for streaming compliance was silently dropped during chunk construction. Pydantic v2's smart Union coercion converts dicts passed through `Union[ChatCompletionMessage, dict]` fields into `ChatCompletionMessage` instances, which in turn coerces nested tool-call dicts into `ToolCall` instances. With `extra='ignore'` (the default), any keys not declared on the model are silently discarded.

**Solution:**

- Added `index: Optional[int] = None` directly to the `ToolCall` model in `types/tools.py`, so the field survives Pydantic's coercion rather than being treated as an extra field (see the example below).
- Tool-call IDs now follow the OpenAI `call_` prefix convention (`call_<24-char hex>`), matching the format expected by some strict clients.
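A self-contained illustration of the failure mode and the fix, using simplified stand-in models rather than the actual TabbyAPI types:

```python
from typing import Optional

from pydantic import BaseModel


class ToolFunction(BaseModel):
    name: str
    arguments: str


class BrokenToolCall(BaseModel):
    # No "index" declared: with the default extra='ignore', an incoming
    # "index" key is silently dropped when the dict is coerced to a model.
    id: str
    type: str = "function"
    function: ToolFunction


class FixedToolCall(BrokenToolCall):
    index: Optional[int] = None  # declared, so it survives validation


payload = {
    "id": "call_0123456789abcdef01234567",
    "index": 0,
    "function": {"name": "get_weather", "arguments": "{}"},
}

print("index" in BrokenToolCall.model_validate(payload).model_dump())  # False
print(FixedToolCall.model_validate(payload).index)                     # 0
```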
### 4. Inference Abort Fixes

**Problem:** When a user presses "Stop" during streaming inference, TabbyAPI did not reliably stop generating tokens. The model continued running on the GPU after the client disconnected.
Three bugs were identified and fixed:
**Bug 1 — `gen_queue.get()` blocks disconnect detection:** The consumer loop's `await gen_queue.get()` blocks indefinitely when the queue is empty (during prefill, between tokens). While blocked, the `disconnect_task.done()` check never re-executes.

**Fix:** Replaced the blocking `get()` with an `asyncio.wait()` call that races a queue-get task against the disconnect task. Applied identically in both `stream_generate_completion` and `stream_generate_chat_completion`.

**Bug 2 — ExLlamaV3 job registered too late in `active_job_ids`:** The `AsyncJob` was only assigned to `self.active_job_ids[request_id]` after the generation loop finished. During generation, the entry held `None`, which `wait_for_jobs()` skips.

**Fix:** Moved the assignment to immediately after `AsyncJob()` construction, before the generation loop.

**Bug 3 — `GeneratorExit` bypasses abort event:** When `sse_starlette` crashes on a dropped TCP connection, it injects `GeneratorExit` into the async generator. The existing `except CancelledError` and `except Exception` handlers don't catch `GeneratorExit` (a `BaseException`), so `abort_event.set()` is never called and inference continues.

**Fix:** Moved `abort_event.set()` and `disconnect_task.cancel()` into a `finally` block, which executes on all exit paths including `GeneratorExit`.
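A condensed sketch of the resulting consumer loop, combining the Bug 1 and Bug 3 fixes (the names `gen_queue`, `disconnect_task`, and `abort_event` follow the description above; the sentinel convention and surrounding plumbing are assumptions, not the verbatim patch):

```python
import asyncio
from typing import AsyncGenerator


async def consume_stream(
    gen_queue: asyncio.Queue,
    disconnect_task: asyncio.Task,
    abort_event: asyncio.Event,
) -> AsyncGenerator[object, None]:
    """Race the queue against the disconnect watcher instead of blocking on
    get(), and guarantee the abort signal fires on every exit path."""
    try:
        while True:
            get_task = asyncio.create_task(gen_queue.get())
            done, _ = await asyncio.wait(
                {get_task, disconnect_task},
                return_when=asyncio.FIRST_COMPLETED,
            )
            if disconnect_task in done:
                get_task.cancel()
                break  # client went away; stop consuming immediately
            item = get_task.result()
            if item is None:  # sentinel: generation finished normally
                break
            yield item
    finally:
        # Runs on normal completion, CancelledError, and GeneratorExit alike,
        # so the backend always sees the abort and stops generating.
        abort_event.set()
        if not disconnect_task.done():
            disconnect_task.cancel()
```

Because the cleanup lives in `finally`, it runs whether the generator finishes normally, is cancelled, or receives `GeneratorExit` from `sse_starlette`, which is exactly the path the old `except` handlers missed.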
## Files Changed

| File | Changes |
| --- | --- |
| `backends/exllamav3/model.py` | `active_job_ids` assignment moved before the generation loop |
| `common/templating.py` | `tool_call_format` field added to `TemplateMetadata`, plus validation |
| `endpoints/OAI/types/tools.py` | `index: Optional[int] = None` on `ToolCall`, `call_` prefix ID format |
| `endpoints/OAI/utils/chat_completion.py` | `asyncio.wait`-based disconnect detection, `finally` block for abort |
| `endpoints/OAI/utils/completion.py` | `asyncio.wait`-based disconnect detection, `finally` block for abort |
| `endpoints/OAI/utils/tools.py` | XML parsing (`from_xml`, `from_auto`, `parse`, `extract_content_and_tools`), think-block stripping, type coercion |
| `templates/tool_calls/qwen3_coder.jinja` | New Qwen3-Coder tool-call template |

## Testing
Validated against Kilo Code, Roo Code, and OpenCode (with Qwen3-Coder-Next).
Test scenarios: single and multiple tool calls, multi-turn tool conversations, mixed text + tool calls, streaming and non-streaming modes, client disconnect during inference.
## Environment

Commit `41511f5`, ExLlamaV3 v0.0.22, Qwen3-Coder-Next on a dual-GPU setup.
Edit: Additional improvements after initial submission:
### 5. Broader Model Compatibility
- Added a `tojson` Jinja filter override so the model's built-in HuggingFace template works out of the box (the default sandboxed filter crashes on `tojson(ensure_ascii=False)`, which Qwen3-Coder's template uses); sketched below.
- The JSON tool-call parser now accepts `{"name": ..., "arguments": ...}` dicts without the `function` wrapper, single objects instead of arrays, markdown-fenced JSON, and string-typed arguments. Previously only perfectly-formed OAI-shaped arrays were accepted.
- Normalized `token_ids` to a plain Python list, with robust handling for tensors and tuples from different ExLlamaV3 kernels.
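A sketch of the kind of filter override involved, assuming a sandboxed Jinja environment similar to the one in `common/templating.py` (the exact wiring may differ):

```python
import json

from jinja2.sandbox import ImmutableSandboxedEnvironment


def tojson(value, indent=None, ensure_ascii=False, **kwargs):
    """Permissive replacement for Jinja's built-in tojson: it accepts the
    ensure_ascii keyword that HuggingFace chat templates pass."""
    return json.dumps(value, indent=indent, ensure_ascii=ensure_ascii, **kwargs)


env = ImmutableSandboxedEnvironment(trim_blocks=True, lstrip_blocks=True)
env.filters["tojson"] = tojson

template = env.from_string('{{ {"city": "München"} | tojson(ensure_ascii=False) }}')
print(template.render())  # {"city": "München"} rather than a TypeError
```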
### 6. `tool_choice` Support

Added support for the OpenAI `tool_choice` parameter: `"none"` skips tool generation entirely, `"required"` forces a tool-call pass even when the model doesn't emit the stop string, and named function choice (`{"type": "function", "function": {"name": "..."}}`) filters results to the specified function.
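For example, a request exercising the named-function form might look like this (endpoint URL, port, and tool schema are illustrative; add your API key header if authentication is enabled):

```python
import requests

payload = {
    "model": "Qwen3-Coder-Next",
    "messages": [{"role": "user", "content": "What's the weather in Berlin?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    # Force the model to call get_weather specifically.
    "tool_choice": {"type": "function", "function": {"name": "get_weather"}},
    "stream": False,
}

resp = requests.post("http://localhost:5000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["tool_calls"])
```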
### 7. Bug Fixes and Cleanup

- Fixed a bug in `generate_tool_calls()` where the shared `prompt` variable was modified in place during the loop. With `n > 1`, each iteration would stack previous generations' text onto the prompt; it now uses a per-iteration local variable (see the toy illustration below).
- Moved `tool_end` extraction to `TemplateMetadata` so template-provided metadata isn't silently discarded.
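A toy illustration of the stacking behaviour and the per-iteration fix (not the actual `generate_tool_calls()` code):

```python
def broken(prompt: str, completions: list[str]) -> list[str]:
    rendered = []
    for text in completions:
        prompt += text           # mutates the shared prompt in place
        rendered.append(prompt)  # later iterations carry earlier outputs
    return rendered


def fixed(prompt: str, completions: list[str]) -> list[str]:
    rendered = []
    for text in completions:
        iteration_prompt = prompt + text  # fresh per-iteration local variable
        rendered.append(iteration_prompt)
    return rendered


print(broken("Q: ", ["A1", "A2"]))  # ['Q: A1', 'Q: A1A2']  <- stacked
print(fixed("Q: ", ["A1", "A2"]))   # ['Q: A1', 'Q: A2']
```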
Additional files changed:

| File | Changes |
| --- | --- |
| `endpoints/OAI/types/chat_completion.py` | `tool_choice`, `parallel_tool_calls` fields |
| `endpoints/OAI/types/tools.py` | `NamedToolChoice`, `NamedToolFunction` models |