[FLINK-39016][Runtime/REST] Add configurable TTL for ExecutionGraph cache independent of web refresh interval #27509

Merged
reswqa merged 1 commit into apache:master from Myracle:FLINK-39016-executionGraph-cache-ttl-config
Mar 31, 2026

Conversation

Myracle (Contributor) commented Feb 3, 2026

What is the purpose of the change

This pull request introduces a new configuration option web.execution-graph.cache-ttl that allows users to configure the TTL (Time-to-Live) for the ExecutionGraph cache independently from the web.refresh-interval.
Previously, the ExecutionGraph cache TTL was implicitly tied to the web refresh interval, which made it difficult for users who need real-time job state synchronization (e.g., for monitoring dashboards or orchestration systems) to get fresh ExecutionGraph data without affecting the overall web UI refresh behavior.
With this change, users can:

  • Set web.execution-graph.cache-ttl to a small value (or 0) to always fetch fresh ExecutionGraph data
  • Keep the web UI refresh interval at a reasonable value for normal dashboard usage
  • Maintain backward compatibility by defaulting to the web.refresh-interval when not explicitly configured
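
The fallback rule described above can be illustrated with a minimal, self-contained sketch. It does not use Flink's `WebOptions` or `Configuration` classes; the class name, the map-based config, and the 3-second refresh default are assumptions made only for illustration. The two option keys are the ones named in this description.

```java
import java.time.Duration;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the fallback rule; not Flink's actual classes. */
final class CacheTtlFallbackSketch {

    // Option keys as named in the PR description.
    static final String REFRESH_INTERVAL_KEY = "web.refresh-interval";
    static final String CACHE_TTL_KEY = "web.execution-graph.cache-ttl";

    // Assumed default for the refresh interval, for illustration only.
    static final Duration ASSUMED_DEFAULT_REFRESH_INTERVAL = Duration.ofSeconds(3);

    /** Returns the explicit cache TTL if configured, otherwise the refresh interval. */
    static Duration resolveCacheTtl(Map<String, Duration> config) {
        Duration refreshInterval =
                config.getOrDefault(REFRESH_INTERVAL_KEY, ASSUMED_DEFAULT_REFRESH_INTERVAL);
        return Optional.ofNullable(config.get(CACHE_TTL_KEY)).orElse(refreshInterval);
    }

    public static void main(String[] args) {
        // TTL not set: falls back to the refresh interval.
        System.out.println(resolveCacheTtl(Map.of(REFRESH_INTERVAL_KEY, Duration.ofSeconds(10))));
        // TTL set to zero: used as-is, independent of the refresh interval.
        System.out.println(
                resolveCacheTtl(
                        Map.of(
                                REFRESH_INTERVAL_KEY, Duration.ofSeconds(10),
                                CACHE_TTL_KEY, Duration.ZERO)));
    }
}
```

The point of the design is that an explicit value, including zero, always wins, while an unset value preserves the previous behavior.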

Brief change log

  • Added new configuration option web.execution-graph.cache-ttl in WebOptions with fallback to web.refresh-interval
  • Extended RestHandlerConfiguration to parse and hold the new ExecutionGraph cache TTL parameter with non-negative validation
  • Updated RestEndpointFactory.createExecutionGraphCache() to use the dedicated TTL configuration instead of refresh interval
  • Added unit tests covering default fallback behavior, custom values, zero value, and negative value validation
  • Updated web configuration documentation to include the new option

Verifying this change

This change added tests and can be verified as follows:

  • Added RestHandlerConfigurationTest#testExecutionGraphCacheTTLDefault() to verify the default fallback to web.refresh-interval
  • Added RestHandlerConfigurationTest#testExecutionGraphCacheTTLCustomValue() to verify independent configuration of cache TTL
  • Added RestHandlerConfigurationTest#testExecutionGraphCacheTTLZeroValue() to verify zero value support for real-time synchronization scenarios
  • Added RestHandlerConfigurationTest#testExecutionGraphCacheTTLNegativeValue() to verify negative value validation throws IllegalArgumentException
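
For readers who want a picture of what the zero and negative cases check, here is a rough sketch of a non-negative validation. It is an assumption about the shape of the check, not the code in `RestHandlerConfiguration`; the class and method names are hypothetical.

```java
import java.time.Duration;

/** Hypothetical sketch of a non-negative TTL check; not Flink's actual code. */
final class CacheTtlValidationSketch {

    /** Returns the TTL unchanged if it is zero or positive, otherwise throws. */
    static Duration requireNonNegative(Duration ttl) {
        if (ttl.isNegative()) {
            throw new IllegalArgumentException(
                    "ExecutionGraph cache TTL must be non-negative, but was " + ttl);
        }
        return ttl;
    }

    public static void main(String[] args) {
        // Zero is accepted: it is the "always fetch fresh data" setting.
        System.out.println(requireNonNegative(Duration.ZERO));
        try {
            requireNonNegative(Duration.ofMillis(-1));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```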

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
  • The serializers: (no)
  • The runtime per-record code paths (performance sensitive): (no)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (no)
  • The S3 file system connector: (no)

Documentation

  • Does this pull request introduce a new feature? (yes)
  • If yes, how is the feature documented? (docs / JavaDocs)

flinkbot (Collaborator) commented Feb 3, 2026

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

* Time-to-live for cached ExecutionGraph. If not set, defaults to the value of {@link
* #REFRESH_INTERVAL}.
*
* <p>Setting this to 0 (or a very small value) means the cache will always fetch fresh data,
Contributor

In the comments it says "0 (or a very small value)" but the other comments only talk of 0. I think we should be explicit in the text as to what the smallest number is that would cause caching.

Contributor Author

Good catch! You're right that the Javadoc and withDescription were inconsistent. Looking at the DefaultExecutionGraphCache implementation, only a value of exactly 0 deterministically disables caching (since currentTime < currentTime is always false). A "very small value" would only probabilistically result in cache misses, not guarantee them. I've updated the Javadoc to remove the ambiguous "(or a very small value)" phrasing, making it consistent with the withDescription text — both now explicitly state that setting this to 0 disables caching.
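
The argument in this reply can be reproduced with a small model of a TTL cache. This is a sketch under assumptions, not the `DefaultExecutionGraphCache` source: the class name, the injected clock, and the load counter are hypothetical, and only the `now < expiresAt` comparison is taken from the reasoning above.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical single-entry TTL cache; models only the expiry comparison. */
final class TtlCacheSketch<T> {
    private final long ttlMillis;
    private final LongSupplier clock; // injected so the behavior is deterministic
    private T value;
    private long expiresAt;
    private boolean hasValue;
    private int loads; // how many times the loader was invoked

    TtlCacheSketch(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    T get(Supplier<T> loader) {
        long now = clock.getAsLong();
        // With ttlMillis == 0, expiresAt equals the load time, so this is never true.
        if (hasValue && now < expiresAt) {
            return value;
        }
        value = loader.get();
        expiresAt = now + ttlMillis;
        hasValue = true;
        loads++;
        return value;
    }

    int loadCount() {
        return loads;
    }

    public static void main(String[] args) {
        long[] now = {1_000L}; // frozen clock: both calls happen in the same millisecond

        TtlCacheSketch<String> zeroTtl = new TtlCacheSketch<>(0, () -> now[0]);
        zeroTtl.get(() -> "graph");
        zeroTtl.get(() -> "graph");
        System.out.println("ttl=0 loads: " + zeroTtl.loadCount());

        TtlCacheSketch<String> oneMilli = new TtlCacheSketch<>(1, () -> now[0]);
        oneMilli.get(() -> "graph");
        oneMilli.get(() -> "graph");
        System.out.println("ttl=1ms loads: " + oneMilli.loadCount());
    }
}
```

With the clock frozen, the zero-TTL cache reloads on every call, while the 1 ms cache serves the second call from the cached entry. Whether a small positive TTL hits or misses therefore depends on how much time passes between requests.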

<td><h5>web.execution-graph.cache-ttl</h5></td>
<td style="word-wrap: break-word;">(none)</td>
<td>Duration</td>
<td>Time-to-live for cached ExecutionGraph. If not set, defaults to the value of '<code class="highlighter-rouge">web.refresh-interval</code>'. Setting this to 0 means the cache will always fetch fresh data, which is useful for real-time state synchronization scenarios.</td>
Contributor

The only reference to web.refresh-interval I see in the docs is web.refresh-interval.

The fact that the config parameter starts with web. implies it is for the Web UI. There are no other parameters starting with web.execution-graph, and the execution graph is a Flink internal concept. In what context is this config option meant to apply?

Contributor Author

Thanks for the review! You are absolutely right. The web. prefix is reserved for Web UI-facing configurations, and execution-graph is a Flink internal concept that doesn't fit in this namespace. I've moved the config option to RestOptions with the key rest.cache.execution-graph.timeout, following the same naming convention as the existing rest.cache.checkpoint-statistics.timeout. The documentation has been updated accordingly.

github-actions bot added the community-reviewed label ("PR has been reviewed by the community.") Feb 3, 2026
Myracle force-pushed the FLINK-39016-executionGraph-cache-ttl-config branch 2 times, most recently from 2a894f6 to b13a852, March 10, 2026 08:44

* Tests that ExecutionGraph cache TTL can be set to zero for real-time state synchronization.
*/
@Test
void testExecutionGraphCacheTTLZeroValue() {

This zero value test can be merged into testExecutionGraphCacheTTLCustomValue.

Varka808 commented Mar 25, 2026

Please rebase the latest master branch and squash all commits.

Myracle force-pushed the FLINK-39016-executionGraph-cache-ttl-config branch from b13a852 to 46cf2fc, March 25, 2026 08:39
Myracle force-pushed the FLINK-39016-executionGraph-cache-ttl-config branch from 46cf2fc to 3744421, March 27, 2026 06:34

reswqa (Member) left a comment

Thanks, LGTM.

reswqa merged commit e890caf into apache:master Mar 31, 2026
