Fix tuple schema compatibility with OpenAI structured output API#619
… env var

Replace provider-specific environment variable checks (OPENAI_API_KEY, ANTHROPIC_API_KEY) and skip markers (requires_openai, requires_anthropic) with a single EFFECTFUL_LLM_MODEL environment variable that controls which model is used for all LLM integration tests.

- Remove hardcoded model names from all test files in favor of LLM_MODEL read from the EFFECTFUL_LLM_MODEL env var
- Replace requires_openai/requires_anthropic markers with requires_llm
- Remove cross-provider model parametrization; tests now use whichever model the env var specifies
- Use litellm.supports_vision() to conditionally skip vision tests
- Remove default model from LiteLLMProvider (make model required)
- Update CI workflow to pass EFFECTFUL_LLM_MODEL as a matrix parameter, making it easy to add parallel CI stages for different providers
- Rename/remove fixture files to match updated test names

Closes #589
…ename test

- Move LLM_MODEL and requires_llm definitions to tests/conftest.py; all four LLM test files now import from there
- Fix import ordering in test_handlers_llm_encoding.py (stdlib before local)
- Rename test_agent_tool_names_are_openai_compatible_integration to test_agent_tool_names_are_valid_integration since it's no longer OpenAI-specific
- Restore LiteLLMProvider default model via env var fallback so existing
callers (template.py, test_handlers_llm_template.py, notebook) are not
broken: model=os.environ.get("EFFECTFUL_LLM_MODEL", "gpt-4o")
- Move LLM_MODEL/requires_llm from conftest.py to tests/_llm_helpers.py
since conftest.py should not be imported directly
- Fix import placement in test_handlers_llm_provider.py (was between
module-level constants instead of grouped with imports)
- Remove redundant @requires_llm on vision tests where the skipif
condition already covers the not-LLM_MODEL case
The env var belongs in test infrastructure, not the library API. LiteLLMProvider should have a clean, explicit default.
Remove provider-specific environment variable checks and skip markers from LLM tests in favor of a single EFFECTFUL_LLM_MODEL env var.

- Add LLM_MODEL and requires_llm to tests/conftest.py; LLM_MODEL defaults to "gpt-4o-mini" and is overridable via EFFECTFUL_LLM_MODEL
- Live tests (tool calling, encoding, agent tool names) use LLM_MODEL and are gated by requires_llm, which checks for any provider API key
- Integration tests use make_provider(), which returns a live LiteLLMProvider when API keys are available and otherwise falls back to ReplayLiteLLMProvider for offline replay from fixtures
- Replay-only tests (simple_prompt_multiple_models, cross_endpoint, caching) keep hardcoded model names and always run since they never call the API
- Update CI workflow to pass EFFECTFUL_LLM_MODEL as a matrix parameter for easy parallel stages with different providers

Closes #589
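The gating described above can be sketched with plain stdlib code. This is an assumed illustration of the test helpers, not the PR's exact code; in the real suite the boolean would feed a `pytest.mark.skipif` marker named `requires_llm`.

```python
import os

# Model under test; the "gpt-4o-mini" default is taken from the
# commit description above.
LLM_MODEL = os.environ.get("EFFECTFUL_LLM_MODEL", "gpt-4o-mini")

# Live tests run only when some provider API key is present. In the
# real suite this condition backs a pytest.mark.skipif marker.
HAS_PROVIDER_KEY = any(
    os.environ.get(k) for k in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY")
)

print(LLM_MODEL)
```

With this shape, adding a new provider to CI is just a new matrix entry setting EFFECTFUL_LLM_MODEL, with no test-file changes.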
…asisresearch/effectful into dn-fully-parametric-model-test
Three fixes in encoding.py:

1. TupleEncodable.encode() now returns a TupleItems model instance (not a raw tuple), and deserialize() returns the model directly. This fixes pydantic validation in litellm integration tests for NamedTuple and fixed-tuple types.
2. Add _TupleSafeJsonSchema that overrides pydantic's tuple_schema() to produce object schemas (item_0, item_1 properties) instead of prefixItems arrays. Applied via _BoxEncoding.model_json_schema() so dataclasses containing tuple fields produce OpenAI-compatible schemas.
3. SequenceEncodable.encode() returns a list (not a tuple) to preserve encode idempotency: nested_type on a list dispatches to the sequence encoder, avoiding a mismatch with TupleEncodable.

Also adds test_handlers_llm_encoding.py back to the CI workflow.
Revert encoding.py to master and remove encoding tests from CI workflow. Tuple schema fixes will be in a dedicated PR.
Three fixes in encoding.py:

1. TupleEncodable.encode() returns a TupleItems model instance (not a raw tuple), and deserialize() returns the model directly. This fixes pydantic validation in litellm integration tests for NamedTuple and fixed-tuple types. Extract shared field access into _extract_items().
2. Add _TupleSafeJsonSchema that overrides pydantic's tuple_schema() to produce object schemas (item_0, item_1 properties) instead of prefixItems arrays. Applied via _BoxEncoding.model_json_schema() so dataclasses containing tuple fields produce OpenAI-compatible schemas. Can be removed once #584 replaces the Encodable system.
3. SequenceEncodable.encode()/deserialize() return lists (not tuples) to preserve encode idempotency: nested_type on a list dispatches to the sequence encoder, avoiding a mismatch with TupleEncodable.
Use Pydantic's Annotated extension points (PlainValidator, PlainSerializer, WithJsonSchema) to handle tuples, adapted from _pydantic_type_tuple in #584.

- _safe_tuple_type: rewrites fixed-length tuple[T1, T2, ...] into an Annotated type with custom validation (item_N dict -> tuple), serialization (tuple -> item_N dict), and JSON schema (object with item_0/item_1 properties instead of prefixItems). Single mechanism for schema, validation, and serialization.
- _rewrite_tuple_annotations: recursively applies _safe_tuple_type to dataclass fields, creating an Annotated proxy so nested tuples inside objects also get safe schemas.
- TupleEncodable.encode() returns TupleItems model instances; deserialize() returns the model directly. Shared field access via _extract_items().
- SequenceEncodable.encode()/deserialize() return lists to preserve encode idempotency with the new TupleEncodable return type.
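The schema shape change described above can be illustrated with a small dict-rewriting function. This is an assumed sketch, not the PR's code (the real fix hooks into pydantic's schema generation rather than post-processing dicts): a fixed-length tuple's prefixItems schema becomes an object schema with item_N properties, which OpenAI's structured output API accepts.

```python
def tuple_schema_to_object(schema: dict) -> dict:
    """Rewrite a prefixItems tuple schema into an item_N object schema."""
    prefix = schema.get("prefixItems")
    if prefix is None:
        return schema
    props = {f"item_{i}": s for i, s in enumerate(prefix)}
    return {
        "type": "object",
        "properties": props,
        "required": list(props),
        "additionalProperties": False,
    }

# Roughly the schema pydantic emits for tuple[int, str]:
src = {"type": "array", "prefixItems": [{"type": "integer"}, {"type": "string"}]}
print(tuple_schema_to_object(src))
```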
Use Pydantic's Annotated extension points (PlainValidator, PlainSerializer, WithJsonSchema) to handle standalone tuple types, adapted from _pydantic_type_tuple in #584.

- _safe_tuple_type: rewrites fixed-length tuple[T1, T2, ...] into an Annotated type with custom validation (item_N dict -> tuple), serialization (tuple -> item_N dict), and JSON schema (object properties instead of prefixItems). Single mechanism for schema, validation, and serialization.
- TupleEncodable.encode() returns TupleItems model instances; deserialize() returns the model directly; shared field access via _extract_items()
- _inline_refs applied to tool parameter schemas to avoid $ref rejection
- SequenceEncodable.encode()/deserialize() return lists to preserve encode idempotency with the new TupleEncodable return type

Dataclasses containing tuple fields (dc-with-tuple) are xfailed; fixing them requires recursive type rewriting that would duplicate the systematic approach in #584. Better to wait for that PR than to build a piecemeal shadow type system on top of Pydantic.
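The validation and serialization halves of the rewrite can be shown as plain functions. An assumed sketch only: the real code attaches these behaviors via PlainValidator and PlainSerializer inside an Annotated type, and the function names here are illustrative.

```python
def validate_tuple(data, length: int) -> tuple:
    # item_N dict -> tuple (the PlainValidator direction)
    if isinstance(data, dict):
        return tuple(data[f"item_{i}"] for i in range(length))
    return tuple(data)

def serialize_tuple(value: tuple) -> dict:
    # tuple -> item_N dict (the PlainSerializer direction)
    return {f"item_{i}": v for i, v in enumerate(value)}

pair = validate_tuple({"item_0": 1, "item_1": "a"}, 2)
print(pair)                   # (1, 'a')
print(serialize_tuple(pair))  # {'item_0': 1, 'item_1': 'a'}
```

Keeping both directions next to the schema override is what makes this a single mechanism: the schema promises item_N properties, and these two functions honor that promise on the way in and out.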
eb8680
left a comment
I am a bit confused by the purpose of this PR. There are no new tests that were failing without and passing with the behavior changes, and there are existing tests xfailing with the changes that previously passed. Did we not fix the OpenAI API tuple encoding already? What is new here?
It also seems to have introduced a fragile new invariant that Encodable.encode must never return tuples or values that contain tuples that are not fields of some enclosing BaseModel, but that's not documented anywhere and can't be checked statically by a type checker.
# Dataclass with tuple field: Pydantic produces prefixItems schema that
# OpenAI rejects. Proper fix requires recursive type rewriting (#584).
if case_id == "dc-with-tuple":
    marks.append(_provider_response_format_xfail)
# SequenceEncodable.enc is a generic alias, not a BaseModel — litellm rejects it.
if case_id in ("tuple-bare", "tuple-variadic"):
    marks.append(_provider_response_format_xfail)
# LLM may return a URL instead of base64 for image tuples.
if case_id == "tuple-img-str":
    marks.append(_provider_response_format_xfail)
Why are these new xfails being added? If this PR fixes the issue, I would expect the relevant tests to fail before the PR and pass after, but this seems to be saying the opposite?
Sorry I should have noted this down. During #617 I recognized that test_llm.yml (the live integration workflow with real API keys) runs test_handlers_llm_provider.py, test_handlers_llm_tool_calling_*.py, but not test_handlers_llm_encoding.py. So the provider-path tests in test_handlers_llm_encoding.py were never run with a real API key in CI.
My plan to fix these:

- dc-with-tuple: thinking of deferring to "Replace Encodable implementations with Pydantic" #584; I should say so explicitly in the xfail reason string.
- tuple-bare / tuple-variadic: SequenceEncodable.enc is a list[T] generic alias, not a BaseModel subclass, so litellm rejects it as response_format. I can create another PR for this if the structure introduced here works.
- tuple-img-str: this is supported. Let me remove this xfail.
-    return tuple(self.element_encoder.encode(elem) for elem in value)
+    # Return a list so that nested_type routes to the sequence dispatcher
+    # (not the tuple dispatcher), preserving encode idempotency.
+    return list(self.element_encoder.encode(elem) for elem in value)
Why is this change and similar ones necessary?
The encode idempotency law (encode(encode(v)) == encode(v)) requires that nested_type on the output dispatches back to the same Encodable. If encode returns a tuple, nested_type infers tuple[T1, T2, ...] and dispatches to TupleEncodable instead of SequenceEncodable, breaking idempotency. Returning a list keeps the dispatch consistent. Agreed this is fragile; I'll add a note to the Encodable.encode contract and consider whether the dispatcher should be more explicit.
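The dispatch behavior being described can be modeled with functools.singledispatch. This is a toy stand-in for the library's nested_type dispatcher, not its actual code: dispatch follows the runtime type of the encoded value, so an encoder that returns a tuple re-enters the tuple path on the second encode, while a list stays on the sequence path.

```python
from collections.abc import Sequence
from functools import singledispatch

@singledispatch
def which_encoder(value):
    return "scalar"

@which_encoder.register
def _(value: Sequence):
    return "sequence"

@which_encoder.register
def _(value: tuple):  # more specific than Sequence, wins for tuples
    return "tuple"

# An encoder whose output is a tuple gets re-dispatched to the tuple
# path; returning a list keeps re-encoding on the sequence path.
print(which_encoder((1, 2)))  # tuple
print(which_encoder([1, 2]))  # sequence
```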
(Noting that this needed to change due to the safe_tuple_type fix)
The PR splits out two distinct things, which I should have been clearer about in the description:
The invariant is a must given how nested_type dispatches on tuples:

@nested_type.register
def _(value: tuple):
    if type(value) != tuple or len(value) == 0:
        return nested_type.dispatch(collections.abc.Sequence)(value)
    else:
        return Box(tuple[tuple(nested_type(item).value for item in value)])

I'll add a docstring to …
Replace the hand-rolled Encodable ABC and its 10+ subclasses with a recursive type rewriter (TypeToPydanticType) that produces Annotated wrappers with Pydantic validators/serializers. This fixes #626 (tuple fields in dataclasses) and #631 (Callable fields in dataclasses) by recursively rewriting field annotations before Pydantic sees them.

- Add TypeEvaluator ABC to unification.py for recursive type traversal
- Rewrite encoding.py: Encodable[T] now returns Annotated types via TypeToPydanticType, used with pydantic.TypeAdapter
- Update completions.py: Encodable.define() -> TypeAdapter(Encodable[T])
- Update template.py: extract _collect_tools to completions.py
- Update all test files for new API
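The recursive rewriting idea can be sketched with typing.get_origin/get_args. This is a toy, not TypeToPydanticType itself, and dict stands in for the Annotated wrapper the real rewriter would produce for a fixed-length tuple.

```python
import typing

def rewrite(ty):
    origin = typing.get_origin(ty)
    args = typing.get_args(ty)
    # Fixed-length tuple: swap in the "safe" form (a placeholder here).
    if origin is tuple and args and args[-1] is not Ellipsis:
        return dict
    # Other generics: rewrite their arguments recursively.
    if origin is not None and args:
        return origin[tuple(rewrite(a) for a in args)]
    return ty

print(rewrite(list[tuple[int, str]]))  # list[dict]
print(rewrite(int))                    # base types pass through unchanged
```

The point of the recursion is that the rewrite reaches tuples nested anywhere inside containers and field annotations, which is exactly what the piecemeal per-Encodable fixes could not do.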
Update: Fixes #626, #631. Supersedes #584.
Why
- Add field rewriting for dataclass and BaseModel types whose fields contain special types (Callable, tuple, Image, etc.). Uses a proxy Pydantic model for validation/serialization while preserving the original type identity on roundtrip. Fixes #631.
- Add additionalProperties:false to all object schemas in _inline_refs for OpenAI strict mode compatibility.
- Update test_handlers_llm_provider.py for new encoding API: response_format=Encodable.define(T) -> response_type=T, tools= -> env=
- Add composite type regression tests for #626 and #631.
- Fix ruff formatting in test_internals_unification.py.
- Remove unused type: ignore in unification.py
- Fix _call_assistant mock signature (tools/response_format -> env/response_type)
- Fix mock tool call args to remove old {"value": ...} wrapping
- Fix test_synthesized_function_roundtrip to use dump_python(mode="json")
- Rebuild all provider test fixtures with gpt-4o for new API format
- Fix caching test fixtures for deterministic replay
- Update template formatting tests: int values no longer wrapped in {"value": N}
- Fix tool call mock args in template test (remove {"value": ...} wrapping)
- Add _strict_json_schema() to completions.py (provider boundary)
- Apply to tool specs in call_assistant
- Wrap response_format schema in object wrapper for encoding integration tests
- Apply _strict_json_schema to tool spec tests
- Enhance _strict_json_schema to inline $ref/$defs recursively
- Add items fallback for arrays (required by OpenAI strict mode)
- Handle prefixItems -> items conversion for fixed-length tuples
- Use dual type: ignore[attr-defined,unused-ignore] for CI/local mypy compat
- Fix xfail pattern matching to use startswith for precise case filtering
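The $ref/$defs inlining step might look like the following. An assumed sketch with no cycle handling; the commit's _strict_json_schema also adds the items fallback and prefixItems conversion, which are omitted here.

```python
def inline_refs(schema: dict) -> dict:
    """Replace #/$defs/... references with the referenced schemas.

    No cycle handling: self-referential schemas would recurse forever.
    """
    defs = schema.get("$defs", {})

    def walk(node):
        if isinstance(node, dict):
            ref = node.get("$ref")
            if isinstance(ref, str) and ref.startswith("#/$defs/"):
                return walk(defs[ref.rsplit("/", 1)[-1]])
            return {k: walk(v) for k, v in node.items() if k != "$defs"}
        if isinstance(node, list):
            return [walk(v) for v in node]
        return node

    return walk(schema)

src = {
    "$defs": {"Point": {"type": "object", "properties": {"x": {"type": "integer"}}}},
    "type": "object",
    "properties": {"p": {"$ref": "#/$defs/Point"}},
}
print(inline_refs(src))
```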
Image types can't roundtrip through LLM (returns URLs, not data URIs). Tool/DecodedToolCall types have schemas incompatible with OpenAI strict mode when used as nested parameter types. Apply PROVIDER_CASES (with xfails) to tool-as-param and tool-as-return tests, and extend the pattern to catch composite image types (e.g. tuple-img-str, list-img).
Return types don't affect the tool spec schema sent to OpenAI, so all cases pass without xfails. Only tool-as-param needs PROVIDER_CASES.
- response_model: xfail img/tool/dtc (LLM can't return images, Tool schema incompatible with strict mode)
- tool-as-param: xfail only tool/dtc (image types produce valid param schemas)
- tool-as-return: no xfails needed (return type not in tool spec)
The mypy type-check tests call mypy in-process. When xdist schedules
them on separate workers simultaneously, mypy's C extensions segfault.
Use xdist_group("mypy") to ensure all mypy tests run on the same
worker, and set --dist loadgroup as default to respect the grouping.
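The grouping described above relies on pytest-xdist's documented xdist_group marker and loadgroup distribution mode; a configuration sketch (assumed file layout) might be:

```ini
# pytest.ini sketch: make loadgroup scheduling the default so tests
# marked @pytest.mark.xdist_group("mypy") all land on one worker.
[pytest]
addopts = --dist loadgroup
```

Each in-process mypy test then carries the `@pytest.mark.xdist_group("mypy")` marker, and xdist guarantees the whole group is scheduled on a single worker.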
eb8680
left a comment
I don't think we should be attempting to transform arbitrary record types. One motivation for the new Encodable interface in #584 was to avoid the need for this, at least to start.
There is also a bunch of schema-munging happening that is likely not necessary.
enc: pydantic.TypeAdapter[Any] = pydantic.TypeAdapter(
    Encodable[type(tool)]  # type: ignore[misc]
)
tool_spec = _strict_json_schema(
This should happen internally when applying Encodable to Tool. Tests and user code should never have to invoke _strict_json_schema directly.
PROVIDER_CASES = _cases_with_provider_xfails(ROUNDTRIP_CASES)

# response_model: image types can't roundtrip (LLM returns URLs, not data URIs),
# and Tool/DecodedToolCall schemas are incompatible with OpenAI strict mode.
This comment about Tool seems like a bug, it should be fixed rather than xfailed.
On master, Tool/DecodedToolCall were xfailed for response_format integration tests. We maintain the same xfails. We still support encoding/deserializing them for tool calling.
Tool as response_format fails because OpenAI rejects ChatCompletionToolParam's schema (bare "type": "object" without additionalProperties: false).
Fixing it is non-trivial and gory because we'd have to write a recursive schema traversal and add that additionalProperties annotation.
I looked for an existing utility in litellm, OpenAI, and Pydantic, but none of them provides this transformation.
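For reference, the recursive traversal in question could be sketched like this. This is an assumption about shape, not the PR's code, and real strict-mode schemas also need required-field handling that is omitted here.

```python
def force_no_additional(schema):
    """Recursively set additionalProperties: false on every object schema."""
    if isinstance(schema, dict):
        out = {k: force_no_additional(v) for k, v in schema.items()}
        if out.get("type") == "object":
            out.setdefault("additionalProperties", False)
        return out
    if isinstance(schema, list):
        return [force_no_additional(v) for v in schema]
    return schema

print(force_no_additional({"type": "object", "properties": {"n": {"type": "integer"}}}))
```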
    elif hasattr(tool_spec_obj, "model_dump"):
        return dict(tool_spec_obj.model_dump())
    raise TypeError(f"Unexpected encoded tool spec type: {type(tool_spec_obj)}")

# tool-as-param: only Tool/DecodedToolCall schemas fail (nested function spec).
This comment about Tool seems like a bug, it should be fixed rather than xfailed.
)
response = litellm.completion(
    model=EFFECTFUL_LLM_MODEL,
    response_format={
This should probably be a Pydantic model rather than a raw JSON schema.
@TypeToPydanticType.register(object)
def _pydantic_type_base(ty: type) -> Any:
    if dataclasses.is_dataclass(ty) and isinstance(ty, type):
I am skeptical of the correctness of this transformation for general polymorphic dataclasses, and it has almost no test coverage. I would strongly prefer to remove it and require users to add Encodable annotations manually unless/until we come up with a more convincing specification for this kind of automated behavior.
I updated it to only handle base type (return the type) instead of rewriting record types.
def decode(self, encoded_value: str) -> str:
    return encoded_value

def _rewrite_fields(
I don't think this is correct in general, especially not shared across different record type families, and I suspect it would be very difficult to cover all the edge cases, especially once polymorphism is included. I would strongly prefer to remove it from this PR in order to keep the changes tractable and sound.
Got it. I removed it and added a dataclass encodable (_PairManual, a dataclass with a tuple field) to the tests.
@dataclass
class NamedTupleEncodable[T](TupleEncodable[T]):
    """Tuple encodable that reconstructs the original NamedTuple type on decode."""

def _strict_json_schema(schema: dict) -> dict:
There seem to be two versions of this function. I suspect this transformation is not actually necessary to begin with given that it has not been a problem before - perhaps litellm or pydantic enforces it somehow.
I found one provided by OpenAI and replaced it!
value: T

def _strict_json_schema(schema: dict) -> dict:
There seem to be two versions of this function (this one and one in encoding.py). I also suspect it is not necessary or at least is already available in some litellm utility, otherwise nothing would have worked before with the OpenAI API.
Applies all transformations required by the OpenAI strict-mode API:

* Inlines ``$ref``/``$defs`` (OpenAI does not support JSON Schema references).
We have never had issues with this before, either, which strengthens my suspicion that litellm or pydantic are doing this somewhere already. The _inline_refs I originally implemented in #584 was a workaround for a Pydantic bug, not a necessary invariant for OpenAI API compatibility.
…operties_true

Symmetric to litellm.utils._remove_additional_properties (which removes false for Vertex AI). Strips additionalProperties: true from litellm/OpenAI models that use extra="allow", then lets _ensure_strict_json_schema apply false. Also switch DecodedToolCall schema to OpenAI's ChatCompletionMessageToolCall (has actual fields: id, function, type) instead of litellm's (empty dict). Remove tool/dtc xfails; schemas are now strict-mode compatible.
Litellm's ChatCompletionMessageToolCall has no fields (extra="allow"), so model_validate is a no-op. Switch to OpenAI's type which validates id, function.name, function.arguments fields. Also fix serializer to use type="function" (OpenAI const) and json.dumps for arguments.
…codedToolCall

_ensure_strict_json_schema forces all properties into required, breaking schemas with optional fields (e.g. ChatCompletionToolParam's parameters, description, strict, cache_control). _make_strict_safe makes optional properties nullable before requiring them, so they're safe for OpenAI strict mode. Also removes _ensure_strict_json_schema from Image and Callable handlers where it was unnecessary, and provides encoded tool context in the response_model integration test so the LLM can produce valid DecodedToolCall instances.
Tool/DecodedToolCall as response_model or tool parameter are fundamentally incompatible with OpenAI strict mode (optional fields, bare "type": "object" without properties). This matches master's behavior which also xfailed these cases. Removes _make_strict_safe, restores _ensure_strict_json_schema from openai for Tool/DecodedToolCall WithJsonSchema schemas.
Only used for Tool/DecodedToolCall WithJsonSchema schemas which are already xfailed in integration tests. No test depends on it.
I did more cleaning up. Now the dataclass encodable is defined by the user; it works as long as the user's dataclass fields are themselves encodable:

@dataclass
class _PairManual:
    values: Encodable[tuple[int, str]]
    count: int

But there is still one test marked xfail (both here and on master): the roundtrip of Tool and DecodedToolCall. On master, Tool/DecodedToolCall were xfailed for response_format integration tests, and we maintain the same xfails; we still support encoding/deserializing them for tool calling.

Tool as response_format fails because OpenAI rejects ChatCompletionToolParam's schema (bare "type": "object" without additionalProperties: false). I looked for an existing utility in litellm, OpenAI, and Pydantic, but none of them provides this transformation.
Addresses the prefixItems rejection from OpenAI for fixed-length tuple types (newly rediscovered in #617) by porting the Annotated-based approach.
Changes:
- _safe_tuple_type: rewrites tuple[T1, T2, ...] into Annotated[..., PlainValidator, PlainSerializer, WithJsonSchema] so schema, validation, and serialization are handled in one place via Pydantic's extension API
- _rewrite_tuple_annotations: applies _safe_tuple_type recursively for dataclasses containing tuple fields
- TupleEncodable.encode/deserialize now use TupleItems model instances consistently; shared field access via _extract_items
- SequenceEncodable.encode/deserialize return lists to preserve encode idempotency
- _inline_refs: inlines $ref pointers in WithJsonSchema values (workaround for "TypeAdapter fails when object.__get_pydantic_json_schema__ has $ref and $defs", pydantic/pydantic#12145)

_safe_tuple_type is a self-contained port of _pydantic_type_tuple from #584, scoped to the current Encodable system. _rewrite_tuple_annotations handles the dataclass-with-tuple-field case until #584 lands.