
Conversation


@AbgarSim AbgarSim commented Dec 18, 2025

[BEAM-34076] GitHub Issue Added TTL-based caching for BigQuery table definitions

Description

This change implements a thread-safe caching mechanism for BigQuery table definitions
in the BigQueryWrapper class to address issue BEAM-34076. The implementation uses a TTL-based caching strategy to reduce BigQuery API calls and thus optimize the flow.

Changes in the codebase

The solution separates the table metadata lookup into two distinct responsibilities:
1. An uncached lookup method that performs the actual tables.get call and is protected by retry logic with exponential backoff.
2. A cached, thread-safe wrapper method that stores table metadata in a TTL cache and reuses it for subsequent requests.

Caching is implemented using the cachetools library.
The cached method is now the primary entry point for callers. If the requested table metadata is already present in the cache and still valid, it is returned immediately. Otherwise, the uncached method is invoked and the result is stored in the cache.

Cache configurations added:

  • cache maxsize: 1024
  • TTL: 300 seconds (5 minutes)
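Under those settings, the two-method split can be sketched as follows. This is an illustrative stdlib-only stand-in (the actual change uses cachetools' TTLCache; all names below are assumptions, not Beam's code):

```python
import threading
import time

class BigQueryWrapperSketch:
    """Illustrative stand-in for the cached/uncached split; not Beam's code."""
    _CACHE_MAXSIZE = 1024
    _CACHE_TTL_SECONDS = 300  # 5 minutes

    def __init__(self):
        self._cache = {}            # (project, dataset, table) -> (expiry, metadata)
        self._cache_lock = threading.Lock()
        self.api_calls = 0          # instrumentation for the sketch only

    def _get_table_uncached(self, project, dataset, table):
        # In the real change this performs the tables.get call,
        # protected by retry logic with exponential backoff.
        self.api_calls += 1
        return {"id": f"{project}.{dataset}.{table}"}

    def get_table(self, project, dataset, table):
        key = (project, dataset, table)
        now = time.monotonic()
        with self._cache_lock:
            hit = self._cache.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]       # fresh cache entry: no API call
        result = self._get_table_uncached(project, dataset, table)
        with self._cache_lock:
            if len(self._cache) >= self._CACHE_MAXSIZE:
                self._cache.clear()  # crude eviction; TTLCache does this properly
            self._cache[key] = (now + self._CACHE_TTL_SECONDS, result)
        return result
```

A second lookup of the same table within the TTL window returns the cached metadata without touching the API.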

Additional information

  • This change is intentionally minimal and non-breaking.
  • The cache reduces API traffic and improves performance in hot paths without altering existing behavior.
  • Retry logic is isolated to the uncached method to avoid masking persistent errors while still handling transient failures gracefully.
  • Thread safety is preserved to support concurrent access scenarios.

GitHub Actions Tests Status (on master branch)

Build python source distribution and wheels
Python tests
Java tests
Go tests

See CI.md for more information about GitHub Actions CI or the workflows README to see a list of phrases to trigger workflows.

@gemini-code-assist
Contributor

Summary of Changes

Hello @AbgarSim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances the BigQueryWrapper class by integrating a Time-To-Live (TTL) caching mechanism for BigQuery table definitions. The primary goal is to minimize redundant BigQuery API calls for frequently accessed table metadata, thereby boosting application performance and efficiency. The implementation ensures thread safety and maintains existing retry logic for uncached lookups, providing a robust and optimized solution for managing BigQuery interactions.

Highlights

  • BigQuery Table Definition Caching: Introduced a TTL-based, thread-safe caching mechanism for BigQuery table definitions within the BigQueryWrapper class to reduce API calls and optimize performance.
  • Caching Implementation: The solution separates table metadata lookup into an uncached method (_get_table_uncached) with retry logic and a cached, thread-safe wrapper method (get_table) using the cachetools library.
  • Cache Configuration: The cache is configured with a maximum size of 1024 entries and a Time-To-Live (TTL) of 300 seconds (5 minutes).
  • Impact: This change is minimal and non-breaking, aiming to reduce BigQuery API traffic and improve performance in hot paths without altering existing behavior.
  • Testing: A new unit test has been added to verify that get_table correctly invokes the underlying API call only once for subsequent requests to the same table, demonstrating the caching functionality.
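A test of that shape can be sketched with unittest.mock against a minimal stand-in class (the stub below is hypothetical and only mimics the cached/uncached split; it is not Beam's BigQueryWrapper or its actual test):

```python
from unittest import mock

class StubWrapper:
    """Minimal stand-in for illustration; not Beam's BigQueryWrapper."""
    def __init__(self):
        self._cache = {}

    def _get_table_uncached(self, ref):
        # Placeholder for the real tables.get call.
        return {"tableReference": ref}

    def get_table(self, ref):
        if ref not in self._cache:
            self._cache[ref] = self._get_table_uncached(ref)
        return self._cache[ref]

# Patch the uncached lookup and check it fires exactly once for repeat lookups.
with mock.patch.object(StubWrapper, "_get_table_uncached",
                       return_value={"tableReference": "p.d.t"}) as spy:
    w = StubWrapper()
    first = w.get_table("p.d.t")
    second = w.get_table("p.d.t")

assert first is second
assert spy.call_count == 1
```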


@AbgarSim AbgarSim changed the title [BEAM-34076] Added TTL caching for BigQuery table definitions Added TTL caching for BigQuery table definitions Dec 18, 2025
@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment "assign set of reviewers".

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch 3 times, most recently from 99e7b61 to 86b8300 on December 19, 2025 at 11:05
@sjvanrossum
Contributor

I think I've reviewed a similar PR before, but I can't recall which one.
This seems useful, but note that these changes only introduce a distinct cache per instance of BigQueryWrapper.
It's worth checking where those instances are created throughout the project and whether they're shared across replicas of a DoFn or not.

If every replica of a DoFn creates their own cache (e.g., deserialized from its ParDo payload, created during the setup phase of its lifecycle), then the effects of these changes are somewhat limited.
If that's the case then a class method with caching behavior could improve that, but may require some changes to the cache key to avoid clashes between distinct (by value, not by reference) BigQuery clients.
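The suggested class-level alternative might look roughly like this sketch, where the cache lives on the class and is keyed by table-reference values rather than wrapper identity, so distinct instances (and distinct-by-reference clients) share entries (all names are hypothetical):

```python
import threading
import time

class SharedCacheWrapper:
    """Sketch of a class-level shared cache; names are hypothetical."""
    _shared_cache = {}              # (project, dataset, table) -> (expiry, value)
    _shared_lock = threading.Lock()
    _TTL_SECONDS = 300
    fetch_count = 0                 # instrumentation for the sketch only

    def __init__(self, project):
        self.project = project

    def _fetch(self, dataset, table):
        # Placeholder for the real tables.get call.
        type(self).fetch_count += 1
        return {"id": f"{self.project}.{dataset}.{table}"}

    def get_table(self, dataset, table):
        # Key by value (project, dataset, table), not by wrapper identity,
        # so every instance pointing at the same table shares the entry.
        key = (self.project, dataset, table)
        now = time.monotonic()
        with self._shared_lock:
            hit = self._shared_cache.get(key)
            if hit is not None and hit[0] > now:
                return hit[1]
        value = self._fetch(dataset, table)
        with self._shared_lock:
            self._shared_cache[key] = (now + self._TTL_SECONDS, value)
        return value
```

With this shape, a second wrapper instance created by another DoFn replica hits the cache populated by the first.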

@sjvanrossum
Contributor

  • Retry logic is isolated to the uncached method to avoid masking persistent errors while still handling transient failures gracefully.

I don't think this needs to be handled any differently, since raising an exception during a cache entry refresh should prevent the result from being cached.
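In other words, as long as the cache write happens after the fetch returns, a raised exception leaves the cache untouched. A minimal sketch of that control flow (the helper name is hypothetical):

```python
def cached_lookup(cache, key, fetch):
    """Hypothetical helper: if fetch() raises, the exception propagates
    before the cache write, so a failed refresh is never stored."""
    if key in cache:
        return cache[key]
    value = fetch()        # an exception here leaves the cache untouched
    cache[key] = value
    return value
```

A failed lookup can therefore be retried immediately on the next call, while a successful one is cached as usual.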

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch from 86b8300 to f8923ba on December 22, 2025 at 21:46
@AbgarSim
Author

> I think I've reviewed a similar PR before, but I can't recall which one. This seems useful, but note that these changes only introduce a distinct cache per instance of BigQueryWrapper. It's worth checking where those instances are created throughout the project and whether they're shared across replicas of a DoFn or not.
>
> If every replica of a DoFn creates their own cache (e.g., deserialized from its ParDo payload, created during the setup phase of its lifecycle), then the effects of these changes are somewhat limited. If that's the case then a class method with caching behaviour could improve that, but may require some changes to the cache key to avoid clashes between distinct (by value, not by reference) BigQuery clients.

Hi @sjvanrossum, thanks for the review!
I checked, and BigQueryWrapper is indeed instantiated multiple times, so I moved the TTLCache instance to class level to allow a shared cache across instances, and added test_get_table_shared_cache_across_wrapper_instances to cover this particular scenario.

I've also subclassed TTLCache as a private class _NonNoneTTLCache to cover a corner case: when a table is created and get_or_create_table is called, there is sometimes a delay during which get_table returns None and then, right away, returns the new table, so caching None values must be explicitly disabled here. I'm not as experienced with the Python ecosystem and its coding conventions, so please advise whether this is a good place to define this class.
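The described _NonNoneTTLCache behavior can be sketched by overriding item assignment so None is never stored (stdlib stand-in for illustration; the actual change subclasses cachetools.TTLCache):

```python
import time

class NonNoneTTLCacheSketch:
    """Sketch of a TTL cache that refuses to store None values; names assumed."""
    def __init__(self, maxsize, ttl):
        self.maxsize = maxsize
        self.ttl = ttl
        self._data = {}  # key -> (expiry, value)

    def __setitem__(self, key, value):
        # Skip caching None so a "table not visible yet" lookup is retried
        # on the next call instead of being pinned for the full TTL.
        if value is None:
            return
        if len(self._data) >= self.maxsize:
            self._data.pop(next(iter(self._data)))  # evict oldest insertion
        self._data[key] = (time.monotonic() + self.ttl, value)

    def get(self, key, default=None):
        hit = self._data.get(key)
        if hit is not None and hit[0] > time.monotonic():
            return hit[1]
        self._data.pop(key, None)  # drop expired entry, if any
        return default
```

Storing None is a no-op, so the next get_table call for that key falls through to the uncached lookup.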

@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch 3 times, most recently from b4d061e to d31719b on December 22, 2025 at 21:57
@AbgarSim AbgarSim force-pushed the feat/bigquery-get-table-cache branch from d31719b to 0d85c11 on December 22, 2025 at 21:58