Skip to content

Fix another race condition that causes lost metrics#137

Open
pwinckles wants to merge 3 commits intoperformancecopilot:mainfrom
pwinckles:132-race2
Open

Fix another race condition that causes lost metrics#137
pwinckles wants to merge 3 commits intoperformancecopilot:mainfrom
pwinckles:132-race2

Conversation

@pwinckles
Copy link
Contributor

@pwinckles pwinckles commented Feb 21, 2025

This fixes another race condition that causes metrics to be lost when they are updated while the writer is stopped and after the monitorable has been re-added to the writer. In this case, the update needs to set the initial value on PcpValueInfo so that the correct value is written when the writer is started.

Additionally, while I was working on this, I noticed that if you have thousands of monitorables, then re-adding them in PcpMonitorBridge.startMonitoring() can take a very long time. For example, on one system I was seeing it take >20 seconds to add 40k monitorables. To help alleviate this problem, I added a cache to the metric name mapper and validator.

Resolves #132

This fixes another race condition that causes metrics to be lost when
they are updated while the writer is stopped and after the monitorable
has been re-added to the writer. In this case, the update needs to set
the initial value on `PcpValueInfo` so that the correct value is
written when the writer is started.
Adds a cache to the `MetricNameMapper` and `MetricNameValidator`. This
improves performance of mapping metric names because a regex no longer
has to be executed over and over again. The validator cache helps
because the metric name no longer has to be encoded over and over
again.

Neither of these caches expire entries, but this is not an issue
because it's currently not possible to remove monitorables from
Parfait.
@pwinckles pwinckles requested a review from tallpsmith February 21, 2025 16:40
@pwinckles
Copy link
Contributor Author

@tallpsmith @natoscott

Any thoughts on a path forward here?

@natoscott
Copy link
Member

@pwinckles it conceptually makes sense to me - I'm guessing retired life is treating @tallpsmith very well and we wont hear from him, so please go ahead and merge when you are happy with your PRs (I don't have the java experience to give you deep and meaningful reviews).

@tallpsmith
Copy link
Contributor

@pwinckles it conceptually makes sense to me - I'm guessing retired life is treating @tallpsmith very well and we wont hear from him, so please go ahead and merge when you are happy with your PRs (I don't have the java experience to give you deep and meaningful reviews).

Currently a glorified kids Uber driver:

...and general house-wife. Not easy to find a chunk of time to sit and concentrate at the moment! :(

@pwinckles
Copy link
Contributor Author

@tallpsmith Is this still on your radar?

@tallpsmith
Copy link
Contributor

ok yeah totally dropped the ball (down a big hole) on this, I'll get on it, thanks.

Copy link
Contributor

@tallpsmith tallpsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just a few comments/questions there

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CachingMetricNameMapper implements MetricNameMapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a good idea, but just wondering if there's backstory/observation on why this would be good performance improvement?

I mean, every time I see a cache now, I think of the phrase "There's only 2 hard things in Computer Science: Naming, Cache Invalidation, and off-by-one errors".

Was there something in the wild that you spotted that prompted adding this cache layer in? (I honestly can't see a problem with adding it here, but would prefer to know what the driver might be).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pwinckles are you able to put some sage words to this one ? (I forgot I even wrote this that long ago, but still valid questions? At the very least some documentation here as to the reasons this is highly useful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tallpsmith Yeah, I was seeing reworld issues. Some of the software I work on runs in a resource constrained environment, and I noticed that if you have thousands of monitorables, then re-adding them in PcpMonitorBridge.startMonitoring() can take a very long time. For example, on one system I was seeing it take >20 seconds to add 40k monitorables. This related to parsing the metric names repeatedly.

Because it's not currently possible for metric to be removed from parfait after its been added, there's no risk of the cache getting out of sync or leaking. It's just the normal tradeoff of memory vs cpu.

I'll add a comment to the class later today.

return initialValue;
}

public void setInitialValue(Object initialValue) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this package-private at least? I don't want this to be public and get abused, I still like the idea of aiming to make these immutable again so limiting exposure by keeping this inside the package will protect us in the long run?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this package-private. As commented above, the main risk in regards to swapping in a new PcpValueInfo instance with an updated initial value is that perMetricByteBuffers is keyed off of it. Let me play around with it today and see if I can reasonably get it working though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said, while I too value immutability, I don't think making initialValue mutable here is that bad for two reasons.

  1. initialValue is only ever used once when the metric is first written
  2. PcpValueInfo is already mutable due to tracking the offset.

As such, I think I would be inclined to taking the path of least resistance here, and just making initialValue mutable, with the setter package-private, if that is okay with you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Race condition causes metric updates to be lost

3 participants