
feat: add cache_control breakpoints to compiler for Anthropic prompt caching #37

@pitimon

Description


Problem

OpenKB compiler reuses a "base context A" (system + document) across N+M+2 LLM calls per document (summary → concepts-plan → N create + M update concept pages). Without cache_control markers, every call re-bills the full document content as input tokens.

For Anthropic Sonnet 4.5 (via OpenRouter or direct), prompt caching can cut input cost by ~90% on the cached prefix and reduce time to first token (TTFT). The minimum cacheable prefix is 1,024 tokens, easily exceeded by typical document content.

Proposal

Add cache_control: {"type": "ephemeral"} markers at two breakpoints in openkb/agent/compiler.py:

  1. End of doc_msg in compile_short_doc + compile_long_doc — caches system + doc for all downstream calls (summary, plan, every concept).
  2. End of assistant summary message in _compile_concepts (3 call sites: plan, create, update) — caches system + doc + summary for all concept generation calls.

Two breakpoints in total, well within Anthropic's limit of four cache breakpoints per request.
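A minimal sketch of how the two breakpoints could sit on LiteLLM-style messages. This is illustrative, not the actual compiler code; `system_prompt`, `document_text`, and `summary_text` are placeholder names:

```python
def build_messages(system_prompt, document_text, summary_text=None):
    """Build messages with cache_control breakpoints (illustrative sketch)."""
    doc_msg = {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": document_text,
                # Breakpoint 1: caches system + doc for all downstream calls
                # (summary, plan, every concept).
                "cache_control": {"type": "ephemeral"},
            }
        ],
    }
    messages = [{"role": "system", "content": system_prompt}, doc_msg]
    if summary_text is not None:
        messages.append({
            "role": "assistant",
            "content": [
                {
                    "type": "text",
                    "text": summary_text,
                    # Breakpoint 2: caches system + doc + summary for all
                    # concept generation calls.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        })
    return messages
```

Note that `cache_control` requires the list-of-blocks content format on the marked message, which is why `content` is a list rather than a plain string.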

Compatibility

  • Anthropic / OpenRouter→Anthropic: cache_control honored.
  • OpenAI: list-of-blocks content format is valid (Vision API uses it); cache_control silently ignored.
  • Other providers: LiteLLM normalizes/strips unknown fields.

Side fix

_llm_call_async currently does not forward **kwargs while _llm_call does (asymmetry noted in memory #82886). Add **kwargs for parity.

Out of scope

  • OpenRouter Response Caching (X-OpenRouter-Cache: true) — different mechanism, evaluated separately.
  • Refactoring messages into a dedicated builder module — keep patch surgical.

Test plan

  • Existing pytest suite passes (mocks accept *args, **kwargs).
  • New assertion: completion payload contains cache_control block on the doc_msg.
  • Manual smoke against a real Anthropic key: observe cached_tokens in prompt_tokens_details on calls 2..N.
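The new assertion could be built on a small helper that scans a captured payload for breakpoint-bearing messages. A hedged sketch (helper name and test shape are illustrative, not the existing suite's conventions):

```python
def find_cache_breakpoints(messages):
    """Return indices of messages whose content blocks carry cache_control."""
    hits = []
    for i, msg in enumerate(messages):
        content = msg.get("content")
        # cache_control only appears in the list-of-blocks content format.
        if isinstance(content, list):
            if any("cache_control" in block for block in content):
                hits.append(i)
    return hits


# Usage in a test: capture the payload passed to the mocked completion
# call, then assert the doc message carries exactly one breakpoint, e.g.
#   assert find_cache_breakpoints(captured["messages"]) == [1]
```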

References

  • Memory observation S11144 (3-5 line patch feasibility, OpenKB compiler audit).
  • CLAUDE.md compiler architecture: "Designed around prompt-cache reuse: a single base context A reused across summary → concept-plan → concept-page calls."
