
Add audio, moderations, and tokenize/detokenize endpoint support #816

Open
ericcurtin wants to merge 1 commit into main from address-these-concerns

Conversation

@ericcurtin
Contributor

@ericcurtin ericcurtin commented Mar 31, 2026

Register previously missing OpenAI-compatible routes so they are no
longer rejected with 404:
/v1/audio/transcriptions, /v1/audio/translations (multipart/form-data)
/v1/audio/speech (JSON)
/v1/moderations
/tokenize and /detokenize (vLLM extension)

Add BackendModeAudio and handleAudioInference, which extracts the model
field from multipart form data. vLLM passes audio requests through
natively; llama.cpp returns a descriptive error directing users to the
chat completions input_audio content-part instead. Address review
feedback: extract shared scheduleInference helper to restore parity
(auto-install progress, preload-only, recorder, origin tracking) between
handleOpenAIInference and handleAudioInference; fix multipart temp-file
leak (defer MultipartForm.RemoveAll); tighten /v1/audio/ path matching
to explicit HasSuffix checks; make Content-Type check case-insensitive.

Replace the Go routing layer with a Rust reverse proxy (axum + tokio)
compiled as a CGo-linked staticlib (router/). The Rust router owns:

  • All route registration and path aliasing (/v1/ -> /engines/, etc.)
  • CORS middleware matching the Go CorsMiddleware semantics
  • Path normalisation (NormalizePathLayer)
  • Static routes: GET / and GET /version

The deleted Go files (pkg/routing/router.go, pkg/routing/routing.go,
pkg/middleware/alias.go) are fully replaced. pkg/routing/service.go is
trimmed; main.go registers Ollama, Anthropic, and Responses handlers
directly on the backend mux.

The in-process CGo callback path (pkg/router/handler.go) replaces the
network proxy hop: Rust calls Go's http.Handler directly via a streaming
protocol — dmr_write_chunk()/dmr_close_stream() push response chunks
into a tokio::sync::mpsc channel as Go writes them, so streaming
endpoints like POST /models/create deliver progress in real time without
buffering the full response.

Rust shared utilities are extracted into a dmr-common workspace crate
(init_tracing, unix_now_secs). model-cli Rust code is deduplicated:
shared send_and_check free function, request_timeout helper,
apply_azure_version helper, build_app/run_gateway_async extracted to
handlers.rs.

Build system:

  • Cargo workspace root (Cargo.toml) unifies all Rust crates
  • make build-router-lib compiles the Rust staticlib before go build
  • Dockerfile installs Rust and builds libdmr_router.a in the builder
    stage
  • CI test job installs Rust toolchain and builds the library so
    go test -race works with CGo enabled
  • Platform-split CGo LDFLAGS: Darwin keeps -framework flags, Linux
    uses plain -lpthread/-ldl/-lm
  • pkg/router/router_stub.go provides a no-op implementation for
    CGO_ENABLED=0 builds (lint, cross-compilation)
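The build-tag split described in the last two bullets follows a standard Go pattern; a minimal sketch (the `Start` signature and LDFLAGS paths are assumptions, not the PR's actual API) might look like:

```go
// router_cgo_darwin.go
//go:build cgo && darwin

package router

/*
#cgo LDFLAGS: -L${SRCDIR}/router/target/release -ldmr_router -framework Security
*/
import "C"

// Start hands control to the Rust axum router (illustrative signature).
func Start(addr string) error { /* calls into libdmr_router.a */ return nil }
```

```go
// router_stub.go
//go:build !cgo

package router

// Start is a no-op when CGO_ENABLED=0 (lint, cross-compilation), so
// callers still compile without the Rust staticlib being present.
func Start(addr string) error { return nil }
```

The stub keeps `go vet`, linters, and cross-compiles working on machines without a Rust toolchain, while real builds link the staticlib produced by `make build-router-lib`.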

Contributor

@sourcery-ai sourcery-ai bot left a comment


Hey - I've found 1 issue and left some high-level feedback:

  • The new handleAudioInference logic duplicates much of the backend selection / model lookup / installer wait / runner load flow from handleOpenAIInference; consider extracting a shared helper to reduce drift risk if that logic changes in the future.
  • backendModeForRequest uses strings.Contains(path, "/v1/audio/") while other modes use suffix checks; tightening this to a suffix or more precise match (e.g., specific audio endpoints) would reduce the chance of misclassifying unrelated paths that happen to contain that segment.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- The new handleAudioInference logic duplicates much of the backend selection / model lookup / installer wait / runner load flow from handleOpenAIInference; consider extracting a shared helper to reduce drift risk if that logic changes in the future.
- backendModeForRequest uses strings.Contains(path, "/v1/audio/") while other modes use suffix checks; tightening this to a suffix or more precise match (e.g., specific audio endpoints) would reduce the chance of misclassifying unrelated paths that happen to contain that segment.

## Individual Comments

### Comment 1
<location path="pkg/inference/scheduling/http_handler.go" line_range="355-356" />
<code_context>
+	var modelName string
+	var upstreamBody []byte
+
+	contentType := r.Header.Get("Content-Type")
+	if strings.HasPrefix(contentType, "multipart/form-data") {
+		// Read the entire body for buffering before proxying.
+		body, ok := readRequestBody(w, r, maximumAudioInferenceRequestSize)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Make Content-Type handling robust to case differences

HTTP media types are case-insensitive, but this check depends on the exact casing of "multipart/form-data". Normalize before checking, for example:

```go
contentType := strings.ToLower(r.Header.Get("Content-Type"))
if strings.HasPrefix(contentType, "multipart/form-data") {
    ...
}
```
so that variants like "Multipart/Form-Data; boundary=..." are still handled correctly.

```suggestion
	contentType := strings.ToLower(r.Header.Get("Content-Type"))
	if strings.HasPrefix(contentType, "multipart/form-data") {
```
</issue_to_address>


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for audio processing modes (transcription, translation, and speech synthesis) by adding a new BackendModeAudio and a specialized handleAudioInference HTTP handler to manage multipart/form-data. Review feedback identifies a critical security risk where temporary files created during multipart parsing are not cleaned up, potentially leading to resource exhaustion. Additionally, the new audio handler lacks parity with existing inference handlers regarding installation progress streaming, request recording, and preload support.

ericcurtin force-pushed the address-these-concerns branch 18 times, most recently from 63dd762 to b210e5a on March 31, 2026 at 19:36
ericcurtin force-pushed the address-these-concerns branch from b210e5a to fb2a446 on March 31, 2026 at 20:10
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
ericcurtin force-pushed the address-these-concerns branch from fb2a446 to a76c9f0 on March 31, 2026 at 20:36
