How to implement LLM response caching? #4286
Replies: 1 comment 1 reply
Response from ADK Answering Agent (experimental, answer may be inaccurate)

TLDR: To implement LLM response caching, use a plugin with the before_model_callback and after_model_callback hooks.

Hello! Thank you for your question. Here is a more detailed explanation of how you can implement LLM response caching based on ADK's design patterns. The recommended approach is to use a plugin with before_model_callback, which checks the cache and can return a cached LlmResponse to skip the model call, and after_model_callback, which stores new responses.
Here is a conceptual example of a caching plugin (module paths and callback signatures follow recent ADK releases; verify them against your installed version):

```python
import hashlib
from typing import Optional

from google.adk.agents.callback_context import CallbackContext
from google.adk.models.llm_request import LlmRequest
from google.adk.models.llm_response import LlmResponse
from google.adk.plugins.base_plugin import BasePlugin


class CachingPlugin(BasePlugin):

    def __init__(self, cache):
        super().__init__(name="caching_plugin")
        self._cache = cache  # Your caching mechanism (e.g., a dictionary, Redis client).

    def _get_cache_key(self, llm_request: LlmRequest) -> str:
        # Create a stable key from the request contents.
        # A real implementation might need to be more robust, e.g. by also
        # hashing the model name and generation config.
        prompt = "".join(
            part.text or ""
            for content in llm_request.contents or []
            for part in content.parts or []
        )
        return hashlib.md5(prompt.encode()).hexdigest()

    async def before_model_callback(
        self, *, callback_context: CallbackContext, llm_request: LlmRequest
    ) -> Optional[LlmResponse]:
        cache_key = self._get_cache_key(llm_request)
        if cache_key in self._cache:
            print(f"Cache HIT for key: {cache_key}")
            return self._cache[cache_key]  # Returning a response skips the LLM call.
        print(f"Cache MISS for key: {cache_key}")
        return None  # Proceed with the LLM call.

    async def after_model_callback(
        self, *, callback_context: CallbackContext, llm_response: LlmResponse
    ) -> Optional[LlmResponse]:
        # This part is a bit tricky since the LlmRequest is not available here,
        # so the cache key cannot simply be re-generated from the request.
        # A robust solution stores the key during before_model_callback under
        # an identifier that is unique to that specific LLM call, to avoid
        # race conditions (a sketch of this pattern follows below).
        pass  # See the documentation for more advanced patterns.
```

For more details, you can refer to the official documentation on callbacks and plugins [1].
I hope this helps you build an effective caching mechanism!

This response was generated by an automated assistant.

[1] https://google.github.io/adk-docs/callbacks/
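To make the "store the key under a unique identifier" pattern from the example above concrete, here is a minimal sketch that tracks the cache key of each in-flight call on the plugin itself and writes the response back in after_model_callback. It is a sketch, not an official pattern: the class name is illustrative, and it assumes CallbackContext exposes invocation_id and agent_name and that plugin callbacks use the async, keyword-only signatures shown; check both against your installed ADK version.

```python
import hashlib
from typing import Optional

from google.adk.agents.callback_context import CallbackContext
from google.adk.models.llm_request import LlmRequest
from google.adk.models.llm_response import LlmResponse
from google.adk.plugins.base_plugin import BasePlugin


class KeyTrackingCachingPlugin(BasePlugin):
    """Caches LLM responses, tracking the cache key of each in-flight call."""

    def __init__(self, cache):
        super().__init__(name="key_tracking_caching_plugin")
        self._cache = cache
        # Maps (invocation_id, agent_name) -> cache key of the call in flight.
        # Each ParallelAgent branch runs as a different sub-agent, so
        # concurrent branches never share the same pair.
        self._pending_keys: dict[tuple[str, str], str] = {}

    def _get_cache_key(self, llm_request: LlmRequest) -> str:
        text = "".join(
            part.text or ""
            for content in llm_request.contents or []
            for part in content.parts or []
        )
        return hashlib.md5(text.encode()).hexdigest()

    async def before_model_callback(
        self, *, callback_context: CallbackContext, llm_request: LlmRequest
    ) -> Optional[LlmResponse]:
        cache_key = self._get_cache_key(llm_request)
        if cache_key in self._cache:
            return self._cache[cache_key]  # Cache hit: skip the model call.
        # Remember which key belongs to this specific call so that
        # after_model_callback can find it without the LlmRequest.
        call_id = (callback_context.invocation_id, callback_context.agent_name)
        self._pending_keys[call_id] = cache_key
        return None  # Cache miss: let the model call proceed.

    async def after_model_callback(
        self, *, callback_context: CallbackContext, llm_response: LlmResponse
    ) -> Optional[LlmResponse]:
        call_id = (callback_context.invocation_id, callback_context.agent_name)
        cache_key = self._pending_keys.pop(call_id, None)
        if cache_key is not None:
            self._cache[cache_key] = llm_response
        return None  # Keep the original response.
```

Keying the pending entry by (invocation_id, agent_name) avoids the ParallelAgent race because each parallel branch runs as a distinct sub-agent; if a single agent could have several model calls in flight at once, a dedicated per-call identifier would be needed instead.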
I see in the documentation that LLM response caching should be implemented using callbacks or plugins.
I wanted to write the LlmResponse to a cache with the LlmRequest as the key in after_model_callback, but LlmRequest is not available in that function.
My next idea was to save the LlmRequest to the callback context state in before_model_callback, but the problem is that this creates a race condition with ParallelAgent: a different LLM call can overwrite the state variable before after_model_callback is called for the intended LLM call.
Is there any example of how LLM response caching should be implemented?
Thank you for your help.
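For a concrete starting point, here is a minimal end-to-end sketch of wiring such a plugin into a runner, assuming the KeyTrackingCachingPlugin sketch above is in scope. The Runner plugins parameter, the async create_session call, and the agent definition are assumptions to verify against your installed ADK version:

```python
import asyncio

from google.adk.agents import LlmAgent
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types

# Illustrative agent; KeyTrackingCachingPlugin is the sketch shown earlier.
root_agent = LlmAgent(
    name="assistant",
    model="gemini-2.0-flash",
    instruction="Answer briefly.",
)

session_service = InMemorySessionService()
runner = Runner(
    agent=root_agent,
    app_name="caching_demo",
    session_service=session_service,
    plugins=[KeyTrackingCachingPlugin(cache={})],  # `plugins` parameter assumed; verify.
)


async def main() -> None:
    message = types.Content(role="user", parts=[types.Part(text="Hello!")])
    # Two fresh sessions sending the same message produce identical request
    # contents, so the second call can be served from the cache.
    for _ in range(2):
        session = await session_service.create_session(
            app_name="caching_demo", user_id="user_1"
        )
        async for event in runner.run_async(
            user_id="user_1", session_id=session.id, new_message=message
        ):
            if event.content:
                print(event.content)


asyncio.run(main())
```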