Mkulakow/responses api#4039

Open
michalkulakowski wants to merge 8 commits into main from mkulakow/responses_api
Conversation

@michalkulakowski
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

Copilot AI review requested due to automatic review settings March 4, 2026 14:50
Contributor

Copilot AI left a comment


Pull request overview

This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.

Changes:

  • Introduces a new Endpoint::RESPONSES and routes /v3/responses + /v3/v1/responses to it.
  • Adds Responses-specific request parsing (input, max_output_tokens, conflict check with max_completion_tokens) and unary response JSON serialization (object: "response", output: [...]).
  • Extends unit tests to cover Responses parsing and unary response structure.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Changed files:

  • src/test/http_openai_handler_test.cpp: Adds unit tests validating Responses parsing and unary response output structure.
  • src/llm/servable.cpp: Routes /v3/responses to the new endpoint and prepares inputs for it.
  • src/llm/apis/openai_completions.hpp: Adds Endpoint::RESPONSES and declares parseResponsesPart().
  • src/llm/apis/openai_completions.cpp: Implements Responses parsing and a new unary serialization path for the Responses output format.
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:826

  • For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are max_output_tokens / max_completion_tokens. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.
    // specific part of max_tokens validation
    if (request.maxTokens == 0) {
        return absl::InvalidArgumentError("max_tokens value should be greater than 0");
    }
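A minimal sketch of the message fix this comment suggests. Endpoint::RESPONSES matches the PR, but the enum definition and the helper below are illustrative stand-ins, not the actual OVMS code:

```cpp
#include <string>

// Illustrative only: pick an endpoint-appropriate field name for the
// max-tokens validation message instead of always saying "max_tokens".
enum class Endpoint { CHAT_COMPLETIONS,
    COMPLETIONS,
    RESPONSES };

std::string maxTokensErrorMessage(Endpoint endpoint) {
    const char* field = (endpoint == Endpoint::RESPONSES) ? "max_output_tokens" : "max_tokens";
    return std::string(field) + " value should be greater than 0";
}
```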

src/llm/servable.cpp:76

  • The invalid-endpoint error message lists /v3/tokenize, but TokenizeParser::isTokenizeEndpoint() accepts any URI that ends with tokenize (including /v3/v1/tokenize). Consider listing both /v3/tokenize and /v3/v1/tokenize (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.
        return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
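The suffix rule described above can be made concrete with a small check. This is a hypothetical stand-in for the matching logic of TokenizeParser::isTokenizeEndpoint(), written only to show why /v3/v1/tokenize is also accepted:

```cpp
#include <string>

// Hypothetical suffix-based match mirroring the described behavior:
// any URI ending in "tokenize" is treated as the tokenize endpoint.
bool isTokenizeEndpoint(const std::string& uri) {
    const std::string suffix = "tokenize";
    return uri.size() >= suffix.size() &&
           uri.compare(uri.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```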

Comment on lines +773 to +782
    // logprobs: bool; optional - defaults to false
    it = doc.FindMember("logprobs");
    if (it != doc.MemberEnd() && !it->value.IsNull()) {
        if (!it->value.IsBool())
            return absl::InvalidArgumentError("logprobs accepts values true or false");
        request.logprobschat = it->value.GetBool();
    }
    if (request.logprobschat && request.stream) {
        return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
    }

Copilot AI Mar 4, 2026


For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
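The first option (rejecting stream for RESPONSES) could look roughly like this. The struct and the error convention are simplified stand-ins for the PR's OpenAIChatCompletionsRequest and absl::InvalidArgumentError:

```cpp
#include <optional>
#include <string>

// Simplified stand-in for the request type; the real code would return an
// absl::Status rather than an optional error string.
struct Request { bool stream = false; };

// Guard suggested by the review: fail fast until Responses streaming
// serialization is implemented.
std::optional<std::string> rejectResponsesStreaming(const Request& request) {
    if (request.stream)
        return "stream is not supported for the responses endpoint";
    return std::nullopt;  // no error
}
```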

Comment on lines +784 to +821
    std::optional<uint32_t> maxCompletionTokens;
    std::optional<uint32_t> maxOutputTokens;

    // max_completion_tokens: uint; optional
    it = doc.FindMember("max_completion_tokens");
    if (it != doc.MemberEnd() && !it->value.IsNull()) {
        if (!it->value.IsUint()) {
            if (it->value.IsUint64())
                return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
            return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
        }
        if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
            return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
        maxCompletionTokens = it->value.GetUint();
    }

    // max_output_tokens: uint; optional
    // OpenAI Responses API uses this field for output token limit.
    it = doc.FindMember("max_output_tokens");
    if (it != doc.MemberEnd() && !it->value.IsNull()) {
        if (!it->value.IsUint()) {
            if (it->value.IsUint64())
                return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
            return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
        }
        if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
            return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
        maxOutputTokens = it->value.GetUint();
    }

    if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
        return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
    }
    if (maxOutputTokens.has_value()) {
        request.maxTokens = maxOutputTokens.value();
    } else if (maxCompletionTokens.has_value()) {
        request.maxTokens = maxCompletionTokens.value();
    }

Copilot AI Mar 4, 2026


OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
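The clamp-and-validate option could be sketched as follows. assignMaxTokens is hypothetical, shown only to make the overflow check concrete; the real code would fold this into parseResponsesPart() and return an absl error:

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical guard: only assign the parsed uint32_t into the signed
// std::optional<int> field when it fits; otherwise report failure so the
// caller can return an InvalidArgument error instead of overflowing.
bool assignMaxTokens(uint32_t parsed, std::optional<int>& maxTokens) {
    if (parsed > static_cast<uint32_t>(std::numeric_limits<int>::max()))
        return false;
    maxTokens = static_cast<int>(parsed);
    return true;
}
```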

Comment on lines +137 to +191
    if (request.maxTokens.has_value()) {
        writer.String("max_output_tokens");
        writer.Int(request.maxTokens.value());
    }

    writer.String("output");
    writer.StartArray();
    int outputIndex = 0;
    for (const auto& parsedOutput : parsedOutputs) {
        const std::string outputId = "msg-" + std::to_string(outputIndex++);

        writer.StartObject();
        writer.String("id");
        writer.String(outputId.c_str());
        writer.String("type");
        writer.String("message");
        writer.String("role");
        writer.String("assistant");
        writer.String("status");
        writer.String("completed");
        writer.String("content");
        writer.StartArray();
        writer.StartObject();
        writer.String("type");
        writer.String("output_text");
        writer.String("text");
        writer.String(parsedOutput.content.c_str());
        writer.String("annotations");
        writer.StartArray();
        writer.EndArray();
        writer.EndObject();
        writer.EndArray();
        writer.EndObject();
    }
    writer.EndArray();

    writer.String("usage");
    writer.StartObject();
    writer.String("input_tokens");
    writer.Int(usage.promptTokens);
    writer.String("input_tokens_details");
    writer.StartObject();
    writer.String("cached_tokens");
    writer.Int(0);
    writer.EndObject();
    writer.String("output_tokens");
    writer.Int(usage.completionTokens);
    writer.String("output_tokens_details");
    writer.StartObject();
    writer.String("reasoning_tokens");
    writer.Int(0);
    writer.EndObject();
    writer.String("total_tokens");
    writer.Int(usage.calculateTotalTokens());
    writer.EndObject();

Copilot AI Mar 4, 2026


serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
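The width issue can be shown without RapidJSON. A size_t counter funneled through a 32-bit writer.Int(...) can wrap; keeping the full width (writer.Uint64 in RapidJSON terms) preserves the value, simulated here with std::to_string as a stand-in for the serializer:

```cpp
#include <cstdint>
#include <string>

// Token counters come from size_t and may exceed INT_MAX. Formatting through
// a 64-bit unsigned path keeps the exact value; a 32-bit signed path would
// wrap for large counts.
std::string serializeTokenCount(uint64_t count) {
    return std::to_string(count);  // full 64-bit range, no truncation
}
```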

Comment on lines +106 to +108
    const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
    const std::string responseId = "resp-" + std::to_string(createdAt);

Copilot AI Mar 4, 2026


responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
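One possible scheme along the lines the comment suggests (hypothetical, not from the PR): combine a nanosecond-resolution timestamp with a per-process atomic counter so that requests arriving within the same instant still get distinct IDs:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical ID generator: nanosecond timestamp plus an atomic per-process
// counter. The counter alone already guarantees uniqueness within a process;
// the timestamp keeps IDs roughly sortable across restarts.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto nowNs = std::chrono::duration_cast<std::chrono::nanoseconds>(
                           std::chrono::system_clock::now().time_since_epoch())
                           .count();
    return "resp-" + std::to_string(nowNs) + "-" + std::to_string(counter.fetch_add(1));
}
```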

@michalkulakowski force-pushed the mkulakow/responses_api branch from f60f8d9 to 7cf20dc (March 5, 2026 14:13)
@michalkulakowski force-pushed the mkulakow/responses_api branch from ea72634 to 299e2be (March 5, 2026 17:42)