Full tool-calling support, inference abort fixes, XML parsing, OpenAI streaming compliance #413

Open
devnen wants to merge 2 commits into theroyallab:main from devnen:full-tool-calling-support

devnen commented on Feb 14, 2026

What This PR Does

TabbyAPI's tool-calling system worked for simple cases but had a collection of bugs that surfaced the moment you pushed it harder — a different model family, a stricter client, or just hitting Stop at the wrong time. This PR fixes all of them in one pass, tested end-to-end against Kilo Code, Roo Code, and OpenCode with Qwen3-Coder-Next on a dual-GPU setup.

The changes fall into four areas:

  • Qwen3-Coder support — Qwen3-Coder emits tool calls in XML, not JSON. TabbyAPI had no XML parser, so those tool calls were silently discarded. This PR adds full XML parsing and a new official Jinja template for the model family.
  • Streaming compliance — Strict clients (OpenCode / Vercel AI SDK) were rejecting every streaming tool-call response due to missing fields and wrong chunk structure. The streaming path is now fully spec-compliant.
  • Stop actually stops — Three independent bugs meant the model kept running on the GPU after a client disconnect. All three are fixed; GPU inference now aborts reliably on Stop.
  • Broader compatibility — JSON parsing is hardened to handle the many ways real models deviate from the ideal output format, tool_choice is now respected, and a Jinja filter fix makes HuggingFace-native chat templates work out of the box.

Tested with: ExLlamaV3 v0.0.22 · Qwen3-Coder-Next-exl3-4.0bpw · Kilo Code v5.6.0 · Roo Code · OpenCode (Vercel AI SDK) · Windows 10.

Built against commit 41511f5 with ExLlamaV3 v0.0.22.


Changes

1. XML Tool-Call Parsing

Problem: Qwen3-Coder models are trained to emit tool calls in XML format (<function=name><parameter=key>value</parameter></function>). TabbyAPI's tool-calling system only supported JSON via constrained generation, causing XML tool calls to be dumped as plain text with no tool_calls array in the response.

Solution:

  • Added a regex-based XML parser (from_xml()) alongside the existing JSON path, with a from_auto() dispatcher that tries JSON → JSON-in-wrapper → XML in sequence.
  • Added a tool_call_format metadata field to the Jinja template system, allowing templates to declare whether they expect json, xml, or auto format tool calls.
  • Modified the two-pass generation flow: when XML mode is active, the second pass skips the JSON schema constraint and lets the model generate its natural XML output unconstrained.
  • Added a fallback path that scans the content field for bare <function= patterns when the two-pass system doesn't trigger.
  • Added pre-template argument conversion (json.loads() on string arguments) to prevent crashes in multi-turn tool conversations where clients send arguments as a JSON string but the Jinja template expects a dict.
  • Added a new Jinja template (templates/tool_calls/qwen3_coder.jinja) based on the official Qwen3-Coder-Next chat_template.jinja with TabbyAPI metadata.

Design notes:

  • Regex parsing (not XML parsing) is used because the format uses <function=name> with = in the tag name, which is invalid XML. This matches the approach taken by vLLM, llama.cpp, and the official Qwen parser.
  • Type coercion uses json.loads() with string fallback, explicitly avoiding eval() (ref: CVE-2025-9141 in vLLM's parser).
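
For illustration, here is a minimal sketch of the regex approach (parse_xml_tool_calls is an illustrative name, not the actual from_xml() implementation in endpoints/OAI/utils/tools.py):

```python
import json
import re
from typing import Any

# <function=name> is not well-formed XML, so a real XML parser can't be used;
# two regexes extract function blocks and their parameters instead.
FUNC_RE = re.compile(r"<function=([^>]+)>(.*?)</function>", re.DOTALL)
PARAM_RE = re.compile(r"<parameter=([^>]+)>(.*?)</parameter>", re.DOTALL)

def parse_xml_tool_calls(text: str) -> list[dict[str, Any]]:
    """Extract tool calls emitted as Qwen3-Coder-style pseudo-XML."""
    calls = []
    for name, body in FUNC_RE.findall(text):
        arguments = {}
        for key, raw_value in PARAM_RE.findall(body):
            raw_value = raw_value.strip()
            # Coerce typed values via json.loads with a string fallback --
            # never eval() (see CVE-2025-9141 in vLLM's parser).
            try:
                arguments[key] = json.loads(raw_value)
            except json.JSONDecodeError:
                arguments[key] = raw_value
        calls.append(
            {"type": "function",
             "function": {"name": name.strip(), "arguments": arguments}}
        )
    return calls
```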

2. OpenAI Streaming Protocol Compliance

Problem: After adding XML parsing, non-streaming responses worked correctly, but strict clients (OpenCode / Vercel AI SDK) rejected streaming responses with AI_TypeValidationError. The SSE chunks were missing the required index field on tool-call deltas, emitted role: "user" instead of "assistant", merged tool-call data into the finish chunk, and leaked null fields.

Solution:

  • Added _build_tool_call_chunks() implementing a two-chunk emission pattern: one chunk with complete tool-call data (role: "assistant", tool_calls array with index values, finish_reason: null), followed by a separate finish chunk (delta: {}, finish_reason: "tool_calls"); a sketch follows this list.
  • Added _serialize_stream_chunk() for consistent serialization across all chunk types, using exclude_none=True while restoring semantically meaningful finish_reason: null on intermediate chunks.
  • Restructured the streaming loop in stream_generate_chat_completion() to intercept tool-call generation results, parse them, and emit spec-compliant chunks before the normal chunk-building path.
  • Removed tool-call handling from _create_stream_chunk() since it is now handled upstream.
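
A rough sketch of the two-chunk pattern, using plain dicts in place of the typed chunk models the PR actually builds:

```python
import json
import time
import uuid

def build_tool_call_chunks(tool_calls: list[dict], model: str, request_id: str):
    """Sketch: one data chunk with full tool-call info, then a bare finish chunk."""
    base = {
        "id": request_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
    }
    # Chunk 1: complete tool-call data, per-call index, finish_reason kept null.
    data_chunk = {
        **base,
        "choices": [{
            "index": 0,
            "delta": {
                "role": "assistant",  # not "user"
                "tool_calls": [
                    {
                        "index": i,  # required by strict clients
                        "id": f"call_{uuid.uuid4().hex[:24]}",
                        "type": "function",
                        "function": {
                            "name": call["function"]["name"],
                            "arguments": json.dumps(call["function"]["arguments"]),
                        },
                    }
                    for i, call in enumerate(tool_calls)
                ],
            },
            "finish_reason": None,
        }],
    }
    # Chunk 2: empty delta carrying only the finish signal.
    finish_chunk = {
        **base,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "tool_calls"}],
    }
    return data_chunk, finish_chunk
```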

3. Pydantic v2 Union Coercion Fix

Problem: The index field added for streaming compliance was silently dropped during chunk construction. Pydantic v2's smart Union coercion converts dicts passed through Union[ChatCompletionMessage, dict] fields into ChatCompletionMessage instances, which in turn coerces nested tool-call dicts into ToolCall instances. With extra='ignore' (the default), any keys not declared on the model are silently discarded.

Solution:

  • Added index: Optional[int] = None directly to the ToolCall model in types/tools.py. This ensures the field survives Pydantic's coercion rather than being treated as an extra field.
  • Updated the tool-call ID factory to use the call_ prefix convention (call_<24-char hex>), matching the format expected by some strict clients.
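
In sketch form, the fix looks roughly like this (simplified relative to the real models in types/tools.py):

```python
from typing import Optional
from uuid import uuid4

from pydantic import BaseModel, Field

def _tool_call_id() -> str:
    # call_ prefix plus 24 hex chars, the format strict clients expect.
    return f"call_{uuid4().hex[:24]}"

class Function(BaseModel):
    name: str
    arguments: str

class ToolCall(BaseModel):
    id: str = Field(default_factory=_tool_call_id)
    type: str = "function"
    function: Function
    # Declared on the model so Pydantic v2's Union coercion preserves it
    # instead of silently dropping it as an undeclared extra key.
    index: Optional[int] = None
```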

4. Inference Abort Fixes

Problem: When a user presses "Stop" during streaming inference, TabbyAPI did not reliably stop generating tokens. The model continued running on the GPU after the client disconnected.

Three bugs were identified and fixed:

Bug 1 — gen_queue.get() blocks disconnect detection:
The consumer loop's await gen_queue.get() blocks indefinitely when the queue is empty (during prefill, between tokens). While blocked, the disconnect_task.done() check never re-executes.

Fix: Replaced the blocking get() with asyncio.wait() that races a queue get task against the disconnect task. Applied identically in both stream_generate_completion and stream_generate_chat_completion.
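
A minimal sketch of the pattern, assuming a gen_queue and a pre-created disconnect_task:

```python
import asyncio

async def consume(gen_queue: asyncio.Queue, disconnect_task: asyncio.Task):
    """Sketch: race the next-token get against client disconnect."""
    while True:
        get_task = asyncio.ensure_future(gen_queue.get())
        done, _ = await asyncio.wait(
            {get_task, disconnect_task},
            return_when=asyncio.FIRST_COMPLETED,
        )
        if disconnect_task in done:
            # Client went away while we were blocked waiting for a token
            # (e.g. during prefill); stop consuming immediately.
            get_task.cancel()
            break
        yield get_task.result()  # hand the generation chunk downstream
```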

Bug 2 — ExLlamaV3 job registered too late in active_job_ids:
The AsyncJob was only assigned to self.active_job_ids[request_id] after the generation loop finished. During generation, the entry held None, which wait_for_jobs() skips.

Fix: Moved the assignment to immediately after AsyncJob() construction, before the generation loop.

Bug 3 — GeneratorExit bypasses abort event:
When sse_starlette crashes on a dropped TCP connection, it injects GeneratorExit into the async generator. The existing except CancelledError and except Exception handlers don't catch GeneratorExit (a BaseException), so abort_event.set() is never called and inference continues.

Fix: Moved abort_event.set() and disconnect_task.cancel() into a finally block, which executes on all exit paths including GeneratorExit.
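
The shape of the fix, simplified:

```python
import asyncio

async def stream_response(chunks, abort_event: asyncio.Event,
                          disconnect_task: asyncio.Task):
    """Sketch: guarantee the abort signal fires on every exit path."""
    try:
        async for chunk in chunks:
            yield chunk
    finally:
        # except CancelledError / except Exception never see GeneratorExit
        # (a BaseException), but finally always runs -- so the backend is
        # told to abort even when sse_starlette tears the generator down.
        abort_event.set()
        disconnect_task.cancel()
```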


Files Changed

| File | Summary |
| --- | --- |
| backends/exllamav3/model.py | Moved active_job_ids assignment before generation loop |
| common/templating.py | Added tool_call_format field to TemplateMetadata, validation |
| endpoints/OAI/types/tools.py | Added index: Optional[int] = None to ToolCall, call_ prefix ID format |
| endpoints/OAI/utils/chat_completion.py | XML generation flow, streaming protocol compliance, asyncio.wait-based disconnect detection, finally block for abort |
| endpoints/OAI/utils/completion.py | asyncio.wait-based disconnect detection, finally block for abort |
| endpoints/OAI/utils/tools.py | XML parser (from_xml, from_auto, parse, extract_content_and_tools), think-block stripping, type coercion |
| templates/tool_calls/qwen3_coder.jinja | New file — official Qwen3-Coder-Next template with TabbyAPI metadata |

Testing

Validated against:

  • OpenCode (Vercel AI SDK) — strict Zod-based schema validation, previously rejected all streaming tool-call responses
  • Kilo Code v5.6.0 — OpenAI-compatible VSCode extension
  • Roo Code — OpenAI-compatible VSCode extension

Test scenarios: single and multiple tool calls, multi-turn tool conversations, mixed text + tool calls, streaming and non-streaming modes, client disconnect during inference.

Environment

  • ExLlamaV3 v0.0.22
  • Qwen3-Coder-Next-exl3-4.0bpw (80B params, 3B activated, MoE)
  • Dual GPU, 80K context window

Edit: Additional improvements after initial submission:

5. Broader Model Compatibility

  • Added a tojson Jinja filter override so the model's built-in HuggingFace template works out of the box (the default sandboxed filter crashes on tojson(ensure_ascii=False), which Qwen3-Coder's template uses).
  • Hardened JSON tool-call parsing to handle common model output variations: flat {"name": ..., "arguments": ...} dicts without the function wrapper, single objects instead of arrays, markdown-fenced JSON, and string-typed arguments. Previously only perfectly-formed OAI-shaped arrays were accepted (see the sketch after this list).
  • Generation chunks now include token_ids as a plain Python list, with robust handling for tensors and tuples from different ExLlamaV3 kernels.
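
A sketch of the hardened JSON normalization (normalize_tool_calls is an illustrative name, not the actual parser entry point):

```python
import json
import re
from typing import Any

# Markdown code fences: three backticks, optional "json" language tag.
FENCE_RE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL)

def normalize_tool_calls(raw: str) -> list[dict[str, Any]]:
    """Sketch: coerce common output variations into OAI-shaped tool calls."""
    fenced = FENCE_RE.search(raw)
    if fenced:
        raw = fenced.group(1)  # strip markdown fencing
    data = json.loads(raw)
    if isinstance(data, dict):
        data = [data]  # single object instead of an array
    calls = []
    for item in data:
        if "function" not in item:
            # Flat {"name": ..., "arguments": ...} without the wrapper.
            item = {"type": "function", "function": item}
        args = item["function"].get("arguments", {})
        if isinstance(args, str):
            # String-typed arguments: parse into a dict when possible.
            try:
                item["function"]["arguments"] = json.loads(args)
            except json.JSONDecodeError:
                pass
        calls.append(item)
    return calls
```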

6. tool_choice Support

Added support for the OpenAI tool_choice parameter: "none" skips tool generation entirely, "required" forces a tool-call pass even when the model doesn't emit the stop string, and named function choice ({"type": "function", "function": {"name": "..."}}) filters results to the specified function.
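
Roughly how the dispatch behaves (apply_tool_choice is an illustrative name):

```python
def apply_tool_choice(tool_choice, tool_calls: list[dict]) -> list[dict]:
    """Sketch of tool_choice semantics on parsed results."""
    if tool_choice == "none":
        # Tool generation is skipped entirely upstream; nothing to return.
        return []
    if isinstance(tool_choice, dict):
        # Named function choice: keep only calls to the requested function.
        wanted = tool_choice["function"]["name"]
        return [c for c in tool_calls if c["function"]["name"] == wanted]
    # "auto" passes results through; "required" additionally forces a
    # tool-call generation pass upstream even without the stop string.
    return tool_calls
```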

7. Bug Fixes and Cleanup

  • Fixed a prompt mutation bug in generate_tool_calls() where the shared prompt variable was modified in-place during the loop. With n > 1, each iteration would stack previous generations' text onto the prompt. Now uses a per-iteration local variable; see the sketch after this list.
  • Added tool_end extraction to TemplateMetadata so template-provided metadata isn't silently discarded.
  • Reduced debug logging verbosity in the XML parser — removed per-parameter and raw-text-dump log lines, kept summary-level logs and warnings.
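
The prompt-mutation fix in sketch form (run_generation is a stand-in for the real backend call):

```python
import asyncio

async def run_generation(prompt: str) -> str:
    # Stand-in for the real backend call in this sketch.
    await asyncio.sleep(0)
    return f"[generation for prompt of {len(prompt)} chars]"

async def generate_tool_calls(prompt: str, precursors: list[str]) -> list[str]:
    """Sketch of the fixed loop."""
    results = []
    for text in precursors:
        # A per-iteration local replaces the old `prompt += text`, which
        # mutated the shared variable and stacked earlier generations'
        # text onto later prompts when n > 1.
        current_prompt = prompt + text
        results.append(await run_generation(current_prompt))
    return results
```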

Additional files changed:

| File | Summary |
| --- | --- |
| endpoints/OAI/types/chat_completion.py | Added tool_choice, parallel_tool_calls fields |
| endpoints/OAI/types/tools.py | Added NamedToolChoice, NamedToolFunction models |

devnen force-pushed the full-tool-calling-support branch from 528325c to a2c7d81 on February 14, 2026 at 15:20