Improve histogram, summary performance under contention by striping observationCount #1794
Merged
zeitlinger merged 2 commits into prometheus:main on Feb 6, 2026
…bservationCount Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
jack-berg
commented
Jan 21, 2026
prometheus-metrics-core/src/main/java/io/prometheus/metrics/core/metrics/Buffer.java
Pull request overview
This pull request improves the performance of Histogram and Summary metrics under high contention by implementing striping for the `observationCount` field in the `Buffer` class. The optimization replaces a single `AtomicLong` with an array of `AtomicLong` instances, distributing contention across multiple counters based on thread ID.
Changes:
- Introduced `stripedObservationCounts` array sized by available processors to distribute contention
- Modified `append()` method to select a stripe based on thread ID for lock-free counting
- Updated `run()` method to aggregate counts across all stripes during collection
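The changes listed above can be illustrated with a minimal sketch. It assumes a hypothetical `StripedCounterSketch` class and is not the actual `Buffer` code: it shows only the stripe selection by thread ID and the aggregation across stripes, and omits the CAS signalling between the record and collect paths.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical illustration of striped counting; not the real Buffer class.
final class StripedCounterSketch {
    // One AtomicLong per available processor, mirroring the sizing in the overview.
    private final AtomicLong[] stripes;

    StripedCounterSketch() {
        stripes = new AtomicLong[Runtime.getRuntime().availableProcessors()];
        for (int i = 0; i < stripes.length; i++) {
            stripes[i] = new AtomicLong();
        }
    }

    // Record path: pick a stripe from the calling thread's ID and increment it.
    long increment() {
        int index = (int) (Thread.currentThread().getId() % stripes.length);
        return stripes[index].incrementAndGet();
    }

    // Collect path: aggregate the counts across all stripes.
    long sum() {
        long total = 0;
        for (AtomicLong stripe : stripes) {
            total += stripe.get();
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        StripedCounterSketch counter = new StripedCounterSketch();
        Thread[] threads = new Thread[8];
        for (int t = 0; t < threads.length; t++) {
            threads[t] = new Thread(() -> {
                for (int i = 0; i < 100_000; i++) {
                    counter.increment();
                }
            });
            threads[t].start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        // All writers have finished, so the sum is exact here.
        System.out.println(counter.sum()); // prints 800000
    }
}
```

Threads whose IDs map to different stripes update different `AtomicLong` instances, so they no longer compete on a single counter. Threads that map to the same stripe still contend with each other.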
zeitlinger
approved these changes
Feb 5, 2026
Member
@jack-berg one of the commits is not signed off
Signed-off-by: Jack Berg <34418638+jack-berg@users.noreply.github.com>
I was working on improving the performance of opentelemetry-java metrics under high contention and realized that the same strategy I identified there helps the Prometheus implementation as well!
The idea is to recognize that `Buffer.observationCount` is the bottleneck under contention. In contrast to the other histogram / summary `LongAdder` fields, `Buffer.observationCount` is an `AtomicLong`, which performs much worse than `LongAdder` under high contention. It's necessary that the type is `AtomicLong` because the CAS APIs accommodate the two-way communication that the record / collect paths need to signal that a collection has started and that all records have successfully completed (preventing partial writes).
However, we can "have our cake and eat it too" by striping `Buffer.observationCount` into many instances, such that the contention on any one instance is reduced. This is actually what `LongAdder` does under the covers. This implementation stripes it into `Runtime.getRuntime().availableProcessors()` instances and uses `Thread.currentThread().getId() % stripedObservationCounts.length` to select which instance any particular record thread should use.
The performance increase is substantial. Here's the before and after of `HistogramBenchmark` on my machine (Apple M4 Mac Pro w/ 48 GB RAM):
Before:
After: