Add audio, moderations, and tokenize/detokenize endpoint support #816
Open
ericcurtin wants to merge 1 commit into main from
Conversation
Contributor
Hey - I've found 1 issue and left some high-level feedback:
- The new handleAudioInference logic duplicates much of the backend selection / model lookup / installer wait / runner load flow from handleOpenAIInference; consider extracting a shared helper to reduce drift risk if that logic changes in the future.
- backendModeForRequest uses strings.Contains(path, "/v1/audio/") while other modes use suffix checks; tightening this to a suffix or more precise match (e.g., specific audio endpoints) would reduce the chance of misclassifying unrelated paths that happen to contain that segment.
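The suffix-based matching the review suggests can be sketched as below. The mode names and helper signature are illustrative, not the PR's actual code; the point is that each audio endpoint is matched explicitly rather than via `strings.Contains(path, "/v1/audio/")`, so unrelated paths containing that segment cannot be misclassified.

```go
package main

import (
	"fmt"
	"strings"
)

// BackendMode mirrors the scheduler's mode enum; names here are illustrative.
type BackendMode int

const (
	BackendModeCompletion BackendMode = iota
	BackendModeAudio
)

// backendModeForRequest matches each audio endpoint by explicit suffix
// instead of a substring check on "/v1/audio/".
func backendModeForRequest(path string) (BackendMode, bool) {
	switch {
	case strings.HasSuffix(path, "/v1/audio/transcriptions"),
		strings.HasSuffix(path, "/v1/audio/translations"),
		strings.HasSuffix(path, "/v1/audio/speech"):
		return BackendModeAudio, true
	case strings.HasSuffix(path, "/v1/chat/completions"),
		strings.HasSuffix(path, "/v1/completions"):
		return BackendModeCompletion, true
	}
	return 0, false
}

func main() {
	mode, ok := backendModeForRequest("/engines/v1/audio/speech")
	fmt.Println(mode == BackendModeAudio, ok) // true true

	// A path that merely contains "/v1/audio/" no longer matches.
	_, ok = backendModeForRequest("/v1/audio/unknown")
	fmt.Println(ok) // false
}
```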
## Individual Comments
### Comment 1
<location path="pkg/inference/scheduling/http_handler.go" line_range="355-356" />
<code_context>
+ var modelName string
+ var upstreamBody []byte
+
+ contentType := r.Header.Get("Content-Type")
+ if strings.HasPrefix(contentType, "multipart/form-data") {
+ // Read the entire body for buffering before proxying.
+ body, ok := readRequestBody(w, r, maximumAudioInferenceRequestSize)
</code_context>
<issue_to_address>
**suggestion (bug_risk):** Make Content-Type handling robust to case differences
HTTP media types are case-insensitive, but this check depends on the exact casing of "multipart/form-data". Normalize before checking, for example:
```go
contentType := strings.ToLower(r.Header.Get("Content-Type"))
if strings.HasPrefix(contentType, "multipart/form-data") {
...
}
```
so that variants like "Multipart/Form-Data; boundary=..." are still handled correctly.
```suggestion
contentType := strings.ToLower(r.Header.Get("Content-Type"))
if strings.HasPrefix(contentType, "multipart/form-data") {
```
</issue_to_address>
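Beyond lowercasing the whole header, Go's standard `mime.ParseMediaType` handles casing, surrounding whitespace, and parameters such as the boundary in one step. A minimal sketch (the helper name is hypothetical):

```go
package main

import (
	"fmt"
	"mime"
)

// isMultipartForm reports whether a Content-Type header denotes
// multipart/form-data, tolerating case differences and parameters
// such as "boundary=...". ParseMediaType returns the media type
// already lowercased, so a plain comparison suffices.
func isMultipartForm(contentType string) bool {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return false
	}
	return mediaType == "multipart/form-data"
}

func main() {
	fmt.Println(isMultipartForm("Multipart/Form-Data; boundary=xyz")) // true
	fmt.Println(isMultipartForm("application/json"))                  // false
}
```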
Contributor
Code Review
This pull request introduces support for audio processing modes (transcription, translation, and speech synthesis) by adding a new BackendModeAudio and a specialized handleAudioInference HTTP handler to manage multipart/form-data. Review feedback identifies a critical security risk where temporary files created during multipart parsing are not cleaned up, potentially leading to resource exhaustion. Additionally, the new audio handler lacks parity with existing inference handlers regarding installation progress streaming, request recording, and preload support.
Force-pushed 63dd762 to b210e5a
Force-pushed b210e5a to fb2a446
Register previously missing OpenAI-compatible routes so they are no
longer rejected with 404:
/v1/audio/transcriptions, /v1/audio/translations (multipart/form-data)
/v1/audio/speech (JSON)
/v1/moderations
/tokenize and /detokenize (vLLM extension)
Add BackendModeAudio and handleAudioInference which extracts the model
field from multipart form data. vLLM passes audio requests through
natively; llama.cpp returns a descriptive error directing users to the
chat completions input_audio content-part instead. Address review
feedback: extract shared scheduleInference helper to restore parity
(auto-install progress, preload-only, recorder, origin tracking) between
handleOpenAIInference and handleAudioInference; fix multipart temp-file
leak (defer MultipartForm.RemoveAll); tighten /v1/audio/ path matching
to explicit HasSuffix checks; make Content-Type check case-insensitive.
Replace the Go routing layer with a Rust reverse proxy (axum + tokio)
compiled as a CGo-linked staticlib (router/). The Rust router owns:
- All route registration and path aliasing (/v1/ -> /engines/, etc.)
- CORS middleware matching the Go CorsMiddleware semantics
- Path normalisation (NormalizePathLayer)
- Static routes: GET / and GET /version
The deleted Go files (pkg/routing/router.go, pkg/routing/routing.go,
pkg/middleware/alias.go) are fully replaced. pkg/routing/service.go is
trimmed; main.go registers Ollama, Anthropic, and Responses handlers
directly on the backend mux.
The in-process CGo callback path (pkg/router/handler.go) replaces the
network proxy hop: Rust calls Go's http.Handler directly via a streaming
protocol — dmr_write_chunk()/dmr_close_stream() push response chunks
into a tokio::sync::mpsc channel as Go writes them, so streaming
endpoints like POST /models/create deliver progress in real time without
buffering the full response.
Rust shared utilities are extracted into a dmr-common workspace crate
(init_tracing, unix_now_secs). model-cli Rust code is deduplicated:
shared send_and_check free function, request_timeout helper,
apply_azure_version helper, build_app/run_gateway_async extracted to
handlers.rs.
Build system:
- Cargo workspace root (Cargo.toml) unifies all Rust crates
- make build-router-lib compiles the Rust staticlib before go build
- Dockerfile installs Rust and builds libdmr_router.a in the builder
stage
- CI test job installs Rust toolchain and builds the library so
go test -race works with CGo enabled
- Platform-split CGo LDFLAGS: Darwin keeps -framework flags, Linux
uses plain -lpthread/-ldl/-lm
- pkg/router/router_stub.go provides a no-op implementation for
CGO_ENABLED=0 builds (lint, cross-compilation)
Signed-off-by: Eric Curtin <eric.curtin@docker.com>
Force-pushed fb2a446 to a76c9f0