Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| src/test/http_openai_handler_test.cpp | Adds unit tests validating Responses parsing and unary response output structure. |
| src/llm/servable.cpp | Routes /v3/responses to the new endpoint and prepares inputs for it. |
| src/llm/apis/openai_completions.hpp | Adds Endpoint::RESPONSES and declares parseResponsesPart(). |
| src/llm/apis/openai_completions.cpp | Implements Responses parsing and a new unary serialization path for the Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens` / `max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.

```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
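One way to address this is to select the field name by endpoint. The helper and enum below are a hypothetical sketch, not the PR's code:

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not in the PR): choose the field name referenced by
// the zero-max-tokens error message based on the endpoint being served.
enum class Endpoint { CHAT_COMPLETIONS, COMPLETIONS, RESPONSES };

std::string zeroMaxTokensMessage(Endpoint endpoint) {
    // For /v3/responses the client-facing fields are max_output_tokens /
    // max_completion_tokens, so reference those rather than max_tokens.
    if (endpoint == Endpoint::RESPONSES)
        return "max_output_tokens value should be greater than 0";
    return "max_tokens value should be greater than 0";
}
```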
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints), or wording it as a suffix-based rule to avoid misleading users.

```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
```cpp
// logprobs: bool; optional - defaults to false
it = doc.FindMember("logprobs");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsBool())
        return absl::InvalidArgumentError("logprobs accepts values true or false");
    request.logprobschat = it->value.GetBool();
}
if (request.logprobschat && request.stream) {
    return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
}
```
For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
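A minimal sketch of the first option (explicitly rejecting streaming); the `Status` struct stands in for `absl::Status` so the snippet is self-contained, and the function name is illustrative:

```cpp
#include <cassert>
#include <string>

// Minimal stand-in for absl::Status so the sketch is self-contained;
// in OVMS this would return absl::InvalidArgumentError directly.
struct Status {
    bool ok;
    std::string message;
};

// Guard that parseResponsesPart() could apply after parseCommonPart():
// reject stream=true until Responses streaming serialization exists.
Status validateResponsesStream(bool stream) {
    if (stream)
        return {false, "stream is not supported for the responses endpoint"};
    return {true, ""};
}
```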
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
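The clamp-and-validate option could look like the following sketch; the helper name is hypothetical, and the caller would turn the empty optional into an `InvalidArgumentError`:

```cpp
#include <cassert>
#include <climits>
#include <cstdint>
#include <optional>

// Range-check a parsed uint32_t before assigning it into the
// std::optional<int> maxTokens field, so values above INT_MAX are
// rejected instead of overflowing on the narrowing conversion.
std::optional<int> toMaxTokens(uint32_t parsed) {
    if (parsed > static_cast<uint32_t>(INT_MAX))
        return std::nullopt;  // caller would return InvalidArgumentError here
    return static_cast<int>(parsed);
}
```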
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
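The snippet below illustrates the failure mode without depending on rapidjson: formatting a counter through a 64-bit unsigned path (mirroring a switch from `writer.Int(...)` to `writer.Uint64(...)`) preserves a `size_t` count above `INT_MAX` exactly, where narrowing to `int` would not. The helper is a self-contained sketch, not the serializer's code:

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>

// Serialize a usage counter as an unsigned 64-bit value so size_t
// token counts above INT_MAX survive serialization intact.
std::string usageField(const std::string& name, uint64_t tokens) {
    std::ostringstream out;
    out << '"' << name << "\":" << tokens;  // uint64_t formats without narrowing
    return out.str();
}
```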
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
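One possible shape, assuming a millisecond timestamp plus a per-process atomic counter (the `resp-<ms>-<n>` format is illustrative, not prescribed by the API):

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <string>

// Collision-resistant response id: millisecond-resolution timestamp plus a
// per-process counter, so two requests in the same instant still differ.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::system_clock::now().time_since_epoch()).count();
    return "resp-" + std::to_string(ms) + "-" + std::to_string(counter.fetch_add(1));
}
```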