
Conversation


@AbgarSim AbgarSim commented Dec 18, 2025

[BEAM-34076] GitHub Issue Added TTL-based caching for BigQuery table definitions

Description

This change implements a thread-safe caching mechanism for BigQuery table definitions
in the BigQueryWrapper class to address issue BEAM-34076. The implementation uses a TTL-based caching strategy to reduce BigQuery API calls and thus optimize the flow.

Changes in the codebase

The solution separates the table metadata lookup into two distinct responsibilities:
1. An uncached lookup method that performs the actual tables.get call and is protected by retry logic with exponential backoff.
2. A cached, thread-safe wrapper method that stores table metadata in a TTL cache and reuses it for subsequent requests.

Caching is implemented using the cachetools library.
The cached method is now the primary entry point for callers. If the requested table metadata is already present in the cache and still valid, it is returned immediately. Otherwise, the uncached method is invoked and the result is stored in the cache.

Cache configurations added:

  • cache maxsize: 1024
  • TTL: 300 seconds (5 minutes)
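Under those settings, the two-method split can be sketched as follows. This is an illustrative stdlib-only stand-in (the actual change uses cachetools' TTLCache; all names below are assumptions, not Beam's code):

```python
import threading
import time

class BigQueryWrapperSketch:
    """Illustrative stand-in for the cached/uncached split; not Beam's code."""
    _CACHE_MAXSIZE = 1024
    _CACHE_TTL_SECONDS = 300  # 5 minutes

    def __init__(self):
        self._cache = {}            # (project, dataset, table) -> (expiry, metadata)
        self._cache_lock = threading.Lock()
        self.api_calls = 0          # instrumentation for the sketch only

    def _get_table_uncached(self, project, dataset, table):
        # In the real change this performs the tables.get call,
        # protected by retry logic with exponential backoff.
        self.api_calls += 1
        return {"id": f"{project}.{dataset}.{table}"}

    def get_table(self, project, dataset, table):
        key = (project, dataset, table)
        now = time.monotonic()
        with self._cache_lock:
            hit = self._cache.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]       # fresh cache entry: no API call
        result = self._get_table_uncached(project, dataset, table)
        with self._cache_lock:
            if len(self._cache) >= self._CACHE_MAXSIZE:
                self._cache.clear()  # crude eviction; TTLCache does this properly
            self._cache[key] = (now + self._CACHE_TTL_SECONDS, result)
        return result
```

A second lookup of the same table within the TTL window returns the cached metadata without touching the API.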

Additional information

  • This change is intentionally minimal and non-breaking.
  • The cache reduces API traffic and improves performance in hot paths without altering existing behavior.
  • Retry logic is isolated to the uncached method to avoid masking persistent errors while still handling transient failures gracefully.
  • Thread safety is preserved to support concurrent access scenarios.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist
Contributor

Summary of Changes

Hello @AbgarSim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the BigQueryWrapper class by integrating a Time-To-Live (TTL) caching mechanism for BigQuery table definitions. The primary goal is to minimize redundant BigQuery API calls for frequently accessed table metadata, thereby boosting application performance and efficiency. The implementation ensures thread safety and maintains existing retry logic for uncached lookups, providing a robust and optimized solution for managing BigQuery interactions.

Highlights

  • BigQuery Table Definition Caching: Introduced a TTL-based, thread-safe caching mechanism for BigQuery table definitions within the BigQueryWrapper class to reduce API calls and optimize performance.
  • Caching Implementation: The solution separates table metadata lookup into an uncached method (_get_table_uncached) with retry logic and a cached, thread-safe wrapper method (get_table) using the cachetools library.
  • Cache Configuration: The cache is configured with a maximum size of 1024 entries and a Time-To-Live (TTL) of 300 seconds (5 minutes).
  • Impact: This change is minimal and non-breaking, aiming to reduce BigQuery API traffic and improve performance in hot paths without altering existing behavior.
  • Testing: A new unit test has been added to verify that get_table correctly invokes the underlying API call only once for subsequent requests to the same table, demonstrating the caching functionality.
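A test of that shape can be sketched with unittest.mock against a minimal stand-in class (the stub below is hypothetical and only mimics the cached/uncached split; it is not Beam's BigQueryWrapper or its actual test):

```python
from unittest import mock

class StubWrapper:
    """Minimal stand-in for illustration; not Beam's BigQueryWrapper."""
    def __init__(self):
        self._cache = {}

    def _get_table_uncached(self, ref):
        # Placeholder for the real tables.get call.
        return {"tableReference": ref}

    def get_table(self, ref):
        if ref not in self._cache:
            self._cache[ref] = self._get_table_uncached(ref)
        return self._cache[ref]

# Patch the uncached lookup and check it fires exactly once for repeat lookups.
with mock.patch.object(StubWrapper, "_get_table_uncached",
                       return_value={"tableReference": "p.d.t"}) as spy:
    w = StubWrapper()
    first = w.get_table("p.d.t")
    second = w.get_table("p.d.t")

assert first is second
assert spy.call_count == 1
```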


@AbgarSim AbgarSim changed the title [BEAM-34076] Added TTL caching for BigQuery table definitions Added TTL caching for BigQuery table definitions Dec 18, 2025
@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch 3 times, most recently from 99e7b61 to 86b8300 on December 19, 2025 at 11:05
@sjvanrossum
Contributor

I think I've reviewed a similar PR before, but I can't recall which one.
This seems useful, but note that these changes only introduce a distinct cache per instance of BigQueryWrapper.
It's worth checking where those instances are created throughout the project and whether they're shared across replicas of a DoFn or not.

If every replica of a DoFn creates their own cache (e.g., deserialized from its ParDo payload, created during the setup phase of its lifecycle), then the effects of these changes are somewhat limited.
If that's the case then a class method with caching behavior could improve that, but may require some changes to the cache key to avoid clashes between distinct (by value, not by reference) BigQuery clients.
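The suggested class-level alternative might look roughly like this sketch, where the cache lives on the class and is keyed by table-reference values rather than wrapper identity, so distinct instances (and distinct-by-reference clients) share entries (all names are hypothetical):

```python
import threading
import time

class SharedCacheWrapper:
    """Sketch of a class-level shared cache; names are hypothetical."""
    _shared_cache = {}              # (project, dataset, table) -> (expiry, value)
    _shared_lock = threading.Lock()
    _TTL_SECONDS = 300
    fetch_count = 0                 # instrumentation for the sketch only

    def __init__(self, project):
        self.project = project

    def _fetch(self, dataset, table):
        # Placeholder for the real tables.get call.
        type(self).fetch_count += 1
        return {"id": f"{self.project}.{dataset}.{table}"}

    def get_table(self, dataset, table):
        # Key by value (project, dataset, table), not by wrapper identity,
        # so every instance pointing at the same table shares the entry.
        key = (self.project, dataset, table)
        now = time.monotonic()
        with self._shared_lock:
            hit = self._shared_cache.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
        value = self._fetch(dataset, table)
        with self._shared_lock:
            self._shared_cache[key] = (now + self._TTL_SECONDS, value)
        return value
```

With this shape, a second wrapper instance created by another DoFn replica hits the cache populated by the first.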

@sjvanrossum
Contributor

  • Retry logic is isolated to the uncached method to avoid masking persistent errors while still handling transient failures gracefully.

I don't think this needs to be handled any differently, since raising an exception during a cache entry refresh should prevent the result from being cached.
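In other words, as long as the cache write happens after the fetch returns, a raised exception leaves the cache untouched. A minimal sketch of that control flow (the helper name is hypothetical):

```python
def cached_lookup(cache, key, fetch):
    """Hypothetical helper: if fetch() raises, the exception propagates
    before the cache write, so a failed refresh is never stored."""
    if key in cache:
        return cache[key]
    value = fetch()        # an exception here leaves the cache untouched
    cache[key] = value
    return value
```

A failed lookup can therefore be retried immediately on the next call, while a successful one is cached as usual.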

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch from 86b8300 to f8923ba on December 22, 2025 at 21:46
@AbgarSim
Author

> I think I've reviewed a similar PR before, but I can't recall which one. This seems useful, but note that these changes only introduce a distinct cache per instance of BigQueryWrapper. It's worth checking where those instances are created throughout the project and whether they're shared across replicas of a DoFn or not.
>
> If every replica of a DoFn creates their own cache (e.g., deserialized from its ParDo payload, created during the setup phase of its lifecycle), then the effects of these changes are somewhat limited. If that's the case then a class method with caching behaviour could improve that, but may require some changes to the cache key to avoid clashes between distinct (by value, not by reference) BigQuery clients.

Hi @sjvanrossum, thanks for the review!
I checked, and BigQueryWrapper is indeed instantiated multiple times, so I moved the TTLCache instance to class level to allow a shared cache across instances, and added test_get_table_shared_cache_across_wrapper_instances to cover this particular scenario.

I've also subclassed TTLCache as a private class _NonNoneTTLCache to cover a corner case: when a table is created and get_or_create_table is called, there is sometimes a delay during which get_table returns None and then, right away, returns the new table, so caching None values must be explicitly disabled here. I'm not as experienced with the Python ecosystem and its coding conventions, so please advise whether this is a good place to define this class.
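The described _NonNoneTTLCache behavior can be sketched by overriding item assignment so None is never stored (stdlib stand-in for illustration; the actual change subclasses cachetools.TTLCache):

```python
import time

class NonNoneTTLCacheSketch:
    """Sketch of a TTL cache that refuses to store None values; names assumed."""
    def __init__(self, maxsize, ttl):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data = {}  # key -> (expiry, value)

    def __setitem__(self, key, value):
        # Skip caching None so a "table not visible yet" lookup is retried
        # on the next call instead of being pinned for the full TTL.
        if value is None:
            return
        if len(self._data) >= self.maxsize:
            self._data.pop(next(iter(self._data)))  # evict oldest insertion
        self._data[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        hit = self._data.get(key)
        if hit is not None and hit[0] > time.monotonic():
            return hit[1]
        self._data.pop(key, None)  # drop expired entry, if any
        return default
```

Storing None is a no-op, so the next get_table call for that key falls through to the uncached lookup.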

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch 3 times, most recently from b4d061e to d31719b on December 22, 2025 at 21:57
@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch from d31719b to 0d85c11 on December 22, 2025 at 21:58