storage: make decode_and_mfp yield more often #22358
Conversation
// work. However, in practice, we've seen decode_and_mfp be a source of
// interactivity loss during rehydration, so we now also count each mfp
// evaluation against our fuel.
*work += 1;
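For context, a minimal sketch of the accounting this hunk changes; the names (`process_batch`, `yield_fuel`) are hypothetical and this is not the actual `persist_source` code, just the rough shape of the fuel logic:

```rust
// Hypothetical, simplified sketch of the fuel accounting under discussion.
fn process_batch<T>(
    decoded: impl IntoIterator<Item = T>,
    mfp: impl Fn(&T) -> Vec<T>,
    yield_fuel: usize,
    output: &mut Vec<T>,
) -> bool {
    let mut work = 0;
    for update in decoded {
        let results = mfp(&update);
        // New: charge one unit per mfp evaluation, so a restrictive filter
        // that emits nothing still burns fuel.
        work += 1;
        // As before, each emitted output also counts as a unit of work.
        work += results.len();
        output.extend(results);
        if work >= yield_fuel {
            // Ask the caller to yield back to the timely scheduler.
            return true;
        }
    }
    false
}
```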
Wondering if we want to move this up even further... both `FetchedPart::next` and `until.less_equal(&time)` can filter out an arbitrary number of tuples before producing a candidate for mfp to evaluate.
Good point!
Not sure what we could do about the filtering inside `FetchedPart::next`, but maybe that's okay because it's only looking at the timestamp (we always return after decoding key and val, which is presumably the expensive part).
The `until.less_equal(&time)` is probably not a terribly selective filter in the rehydration case, but might be worth it in the spirit of being defensive. We'd probably be double counting work then in the common case; no idea if we would feel bad about that. Maybe we should change it to be "1 work per decoded key/val" plus `mfp_output.len().saturating_sub(2)` (the latter basically being how many "extra" outputs are generated).
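For illustration, that alternative accounting might look roughly like this (the helper name is made up):

```rust
// Hypothetical helper: one unit of work per decoded key/val, plus one unit
// for each "extra" output an mfp evaluation produces beyond the usual couple.
fn work_for_update(mfp_output_len: usize) -> usize {
    1 + mfp_output_len.saturating_sub(2)
}

// e.g. a filtered-out update (0 outputs) and a typical update (1-2 outputs)
// both cost 1 unit, while a flat-mapped update with 5 outputs costs 4.
```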
Double counting seems fine to me; it's all an approximation right now. I think I'd feel better about that if the yield threshold were configurable.
Other thought: we could add a debug! that outputs the time spent before yielding and the number of inputs and outputs (or create additional metrics for those) to guide any future tuning.
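Something along these lines, say with `tracing`'s `debug!` and made-up field names, is roughly what that instrumentation could look like:

```rust
use std::time::Instant;
use tracing::debug;

// Hypothetical instrumentation: record how long the operator ran and how much
// it processed before yielding, to guide future tuning of the fuel value.
fn log_yield_stats(started: Instant, inputs: usize, outputs: usize) {
    debug!(
        elapsed_ms = started.elapsed().as_millis() as u64,
        inputs,
        outputs,
        "decode_and_mfp yielding",
    );
}
```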
Given that there's a decent amount of hesitation around this, I've decided to keep the change as minimal as possible. Adding something for each mfp eval is pretty critical, so I think we have to live with that change, but I think moving the filtering is a bit speculative.
If possible, I'd also love to punt on any additional metrics or introspection in this PR because I don't really have a ton of time to devote to it atm, but I really think we want the knob in prod.
I'm not qualified to speak to persist_source specifics, so just some general thoughts:
- How much is flow control a concern here? In compute operators, yielding more often means letting upstream operators push in more data, leading to potentially higher memory usage. Is this true for the `persist_source` as well? There are no upstream operators to a `persist_source`, but it does consist of multiple operators and it looks like the data fetching is not fueled.
- Have you considered using time-based yielding? That might provide more consistent results given that it's hard to know how expensive the decode-and-MFP step is for any given data schema and MFP.
Anyway, this seems like a good incremental change, provided we don't see performance regressions in tests.
My feeling is that, if flow control is a problem, we have other tools to solve it in the source. So personally I'm most interested in what the answer would be assuming flow control is not a problem.
I think the original hope of exposing this knob was that the caller (i.e. you) would consider time-based yielding... hence why this is configured in e.g. To put the question back to you: how often should an operator yield?
(That said... I think this tuning is probably an improvement since the current one seems empirically bad! But I'm not sure what's up with those CI failures...)
Yeah, changing this setting is pretty scary, because of the unknown consequences for flow control and throughput. I don't know if we have a good way to estimate these consequences before we release to prod.
Good question! I think from a user's perspective waiting for more than a second for your query to return becomes annoying. A dataflow can have a bunch of expensive operators that get scheduled in sequence, so ideally each operator would yield after way below 1 second. I would naively say 10ms sounds good, but who knows what impact that has on throughput... For joins I want to try an approach where the yielding behavior is configurable through an LD flag (#22391). I hope this will let us try different values quickly and with little risk. We could consider that for `persist_source` as well.
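For reference, a time-based yield check could look roughly like this; a sketch with hypothetical names, not tied to the existing operator code:

```rust
use std::time::{Duration, Instant};

// Hypothetical time-based yielding: instead of counting fuel, yield once a
// wall-clock budget (e.g. 10ms) has elapsed since the operator was scheduled.
struct YieldBudget {
    started: Instant,
    budget: Duration,
}

impl YieldBudget {
    fn new(budget: Duration) -> Self {
        Self { started: Instant::now(), budget }
    }

    fn should_yield(&self) -> bool {
        self.started.elapsed() >= self.budget
    }
}
```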
We've seen evidence in production of this operator often going more than 5s without yielding, especially during rehydration and large unindexed SELECTs. This affects interactivity as well as the time between when a timely-driven Future is woken and when it is scheduled again. Attempt to fix this with two tweaks to our yield heuristics. First, make the fuel between yields configurable through LD. Second, also count each mfp evaluation against our fuel, in case it contains a restrictive filter.
Okay, in an effort to get something that we can land for the release cut tomorrow, I've switched this to be a knob in LaunchDarkly, but otherwise left it as minimal as possible (including using the current value as the default for the config). This means that without an override in LD, the PR is a no-op except that we've started counting an mfp evaluation as a unit of work equivalent to an output. If we think that's still too much of a risk, we could make it something like

The test failures from before were spooky, so I think we should get in another nightlies run with the config overridden to 100_000 before using it in prod. However, I'm going to start by kicking off a nightlies run with the PR exactly as it is to sanity check that the +1 doesn't break anything.
Looks like these nightly failures are known flakes on main.
bkirwi left a comment
Looks good! If the new definition of work causes issues we can probably tune our way past them with the knob.
teskje left a comment
LGTM!
TFTRs!
We've seen evidence in production of this operator often going more than 5s without yielding, especially during rehydration and large unindexed SELECTs. This affects interactivity as well as the time between when a timely-driven Future is woken and when it is scheduled again.
Attempt to fix this with two tweaks to our yield heuristics. First, decrease the fuel between yields from 1_000_000 to 100_000. Second, also count each mfp evaluation against our fuel, in case it contains a restrictive filter.
The new 100_000 constant comes from eyeballing a replica in prod that was non-interactive. At the time, it was decoding ~300k rows per second, so 100k doesn't seem terribly low in comparison.
Touches MaterializeInc/database-issues#6497
Motivation
Tips for reviewer
Gonna sanity check any performance regression from this with a nightlies run, in particular the feature bench.
For context, one of the affected replicas is spending about 3.4 cores out of 6 to decode a little under 300k rows per second (~11 μs each).
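As a rough back-of-the-envelope check (using that ~11 μs per-row figure and ignoring parallelism across workers): 1_000_000 units of fuel corresponds to roughly 11 seconds of decode work between yields, while 100_000 corresponds to roughly 1.1 seconds.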
Checklist
If this PR evolves an existing `$T ⇔ Proto$T` mapping (possibly in a backwards-incompatible way), then it is tagged with a `T-proto` label.