Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/test/http_openai_handler_test.cpp | Adds unit tests validating Responses parsing and unary response output structure. |
| src/llm/servable.cpp | Routes /v3/responses to the new endpoint and prepares inputs for it. |
| src/llm/apis/openai_completions.hpp | Adds Endpoint::RESPONSES and declares parseResponsesPart(). |
| src/llm/apis/openai_completions.cpp | Implements Responses parsing and a new unary serialization path for the Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens` / `max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.

```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
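One way to address this is to select the field name by endpoint. The helper and enum below are a hypothetical sketch, not the PR's code:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in the PR): choose the field name referenced by
// the zero-max-tokens error message based on the endpoint being served.
enum class Endpoint { CHAT_COMPLETIONS, COMPLETIONS, RESPONSES };

std::string zeroMaxTokensMessage(Endpoint endpoint) {
    // For /v3/responses the client-facing fields are max_output_tokens /
    // max_completion_tokens, so reference those rather than max_tokens.
    if (endpoint == Endpoint::RESPONSES)
        return "max_output_tokens value should be greater than 0";
    return "max_tokens value should be greater than 0";
}
```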
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints), or wording it as a suffix-based rule to avoid misleading users.

```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
```cpp
// logprobs: bool; optional - defaults to false
it = doc.FindMember("logprobs");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsBool())
        return absl::InvalidArgumentError("logprobs accepts values true or false");
    request.logprobschat = it->value.GetBool();
}
if (request.logprobschat && request.stream) {
    return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
}
```
For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
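A minimal sketch of the first option (explicitly rejecting streaming); the `Status` struct stands in for `absl::Status` so the snippet is self-contained, and the function name is illustrative:

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for absl::Status so the sketch is self-contained;
// in OVMS this would return absl::InvalidArgumentError directly.
struct Status {
    bool ok;
    std::string message;
};

// Guard that parseResponsesPart() could apply after parseCommonPart():
// reject stream=true until Responses streaming serialization exists.
Status validateResponsesStream(bool stream) {
    if (stream)
        return {false, "stream is not supported for the responses endpoint"};
    return {true, ""};
}
```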
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
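The clamp-and-validate option could look like the following sketch; the helper name is hypothetical, and the caller would turn the empty optional into an `InvalidArgumentError`:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <optional>

// Range-check a parsed uint32_t before assigning it into the
// std::optional<int> maxTokens field, so values above INT_MAX are
// rejected instead of overflowing on the narrowing conversion.
std::optional<int> toMaxTokens(uint32_t parsed) {
    if (parsed > static_cast<uint32_t>(INT_MAX))
        return std::nullopt;  // caller would return InvalidArgumentError here
    return static_cast<int>(parsed);
}
```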
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
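The snippet below illustrates the failure mode without depending on rapidjson: formatting a counter through a 64-bit unsigned path (mirroring a switch from `writer.Int(...)` to `writer.Uint64(...)`) preserves a `size_t` count above `INT_MAX` exactly, where narrowing to `int` would not. The helper is a self-contained sketch, not the serializer's code:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Serialize a usage counter as an unsigned 64-bit value so size_t
// token counts above INT_MAX survive serialization intact.
std::string usageField(const std::string& name, uint64_t tokens) {
    std::ostringstream out;
    out << '"' << name << "\":" << tokens;  // uint64_t formats without narrowing
    return out.str();
}
```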
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
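One possible shape, assuming a millisecond timestamp plus a per-process atomic counter (the `resp-<ms>-<n>` format is illustrative, not prescribed by the API):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

// Collision-resistant response id: millisecond-resolution timestamp plus a
// per-process counter, so two requests in the same instant still differ.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    return "resp-" + std::to_string(ms) + "-" + std::to_string(counter.fetch_add(1));
}
```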