Fix Prometheus gauge collision for cache_size metric #9729

Open
brucearctor wants to merge 1 commit into temporalio:main from brucearctor:fix/prometheus-gauge-collision-cache-size

@brucearctor

What Changed

The replication progress cache was reusing MutableStateCacheTypeTagValue as its cache_type metrics tag, causing its cache_size gauge to overwrite the mutable state cache's gauge via Prometheus last-write-wins semantics.

As a result, cache_size{cache_type="mutablestate"} always reported 128000 (the replication progress cache's default size) instead of the actual mutable state cache capacity. This is particularly misleading when cacheSizeBasedLimit: true is configured.
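
The overwrite can be reproduced without a Prometheus client. The following is a minimal sketch, assuming a gauge backend that keeps one value per unique (metric name, label) pair. `gaugeStore`, `seriesKey`, `reportSharedTag`, and the capacity of 512 are hypothetical names and values chosen for illustration, not Temporal or Prometheus code. Only the 128000 default and the `mutablestate` label value come from the description above.

```go
package main

import "fmt"

// gaugeStore is a hypothetical stand-in for a metrics backend: it holds one
// value per unique (metric name, cache_type label) pair, like a gauge series.
type gaugeStore map[string]float64

// seriesKey renders the series identity in an exposition-like form.
func seriesKey(name, cacheType string) string {
	return fmt.Sprintf("%s{cache_type=%q}", name, cacheType)
}

// Set records a gauge value. A later write to the same series replaces the
// earlier one, which is the last-write-wins behavior described above.
func (g gaugeStore) Set(name, cacheType string, v float64) {
	g[seriesKey(name, cacheType)] = v
}

// reportSharedTag simulates both caches emitting cache_size under the same
// cache_type tag value, as happened before this change.
func reportSharedTag(mutableStateCapacity float64) gaugeStore {
	g := gaugeStore{}
	g.Set("cache_size", "mutablestate", mutableStateCapacity) // mutable state cache
	g.Set("cache_size", "mutablestate", 128000)               // replication progress cache, reusing the tag
	return g
}

func main() {
	g := reportSharedTag(512) // 512 is an arbitrary illustrative capacity
	for series, value := range g {
		fmt.Printf("%s %v\n", series, value)
	}
}
```

Only one series survives, and it carries the replication progress cache's value. The mutable state cache's capacity is lost regardless of what was passed in.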

Fix

  • Add a new ReplicationProgressCacheTypeTagValue = "replication_progress" constant in common/metrics/metric_defs.go
  • Use it in service/history/replication/progress_cache.go instead of MutableStateCacheTypeTagValue
  • Add a unit test verifying the progress cache uses a distinct cache_type tag value

Both caches now report their cache_size independently via distinct cache_type label values.
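
The effect of the fix can be sketched in the same spirit. This is an illustration under stated assumptions, not the code in this PR. The value of `MutableStateCacheTypeTagValue` is inferred from the `cache_type="mutablestate"` label quoted above, `sizeGauges` and `reportDistinctTags` are hypothetical helpers, and 512 is an arbitrary capacity.

```go
package main

import "fmt"

// Tag values for the cache_type label. The first is inferred from the label
// quoted in the description; the second is the constant this PR adds.
const (
	MutableStateCacheTypeTagValue        = "mutablestate"
	ReplicationProgressCacheTypeTagValue = "replication_progress"
)

// sizeGauges is a hypothetical stand-in for cache_size gauge storage,
// keyed by the cache_type label value.
type sizeGauges map[string]float64

// reportDistinctTags simulates each cache emitting cache_size under its own
// cache_type tag value, so neither write replaces the other.
func reportDistinctTags(mutableStateCapacity, progressCapacity float64) sizeGauges {
	g := sizeGauges{}
	g[MutableStateCacheTypeTagValue] = mutableStateCapacity
	g[ReplicationProgressCacheTypeTagValue] = progressCapacity
	return g
}

func main() {
	g := reportDistinctTags(512, 128000) // 512 is an arbitrary illustrative capacity
	for cacheType, size := range g {
		fmt.Printf("cache_size{cache_type=%q} %v\n", cacheType, size)
	}
}
```

With distinct tag values, two series exist and each keeps its own capacity. A test that compares the two constants, as the added unit test is described as doing, guards against the tag being reused again.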

Why

This is a one-line behavioral change (plus the new constant and test). It only affects metric label values; there are no logic, API surface, or persistence changes.

Fixes #9600

The replication progress cache was reusing MutableStateCacheTypeTagValue
as its metrics tag, causing its cache_size gauge to overwrite the mutable
state cache's gauge (Prometheus last-write-wins). This made it impossible
to monitor the actual mutable state cache capacity.

Add a dedicated ReplicationProgressCacheTypeTagValue constant and use it
in the progress cache, so both caches report their cache_size independently
via distinct cache_type label values.

Fixes temporalio#9600
brucearctor requested review from a team as code owners on March 28, 2026 01:29
Development

Successfully merging this pull request may close these issues.

Prometheus Gauge Collision for cache_size
