@scwhittle
Contributor

Improve the map_fn test by using common utilities.
Add profiling support to the benchmarks. I originally tried to optimize the code in other ways, but most of the overhead appears to come from Cython/Python interactions, so the fast-path cache for repeated process invocations is important.
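For illustration, here is a minimal sketch of wrapping a benchmark body in a profiler. It uses only the standard-library cProfile and pstats modules; `run_with_profile` and `toy_benchmark` are hypothetical names and are not the helpers added in this PR. Note that Cython-compiled functions generally only show up in cProfile output when they are built with profiling enabled.

```python
import cProfile
import io
import pstats


def run_with_profile(benchmark_fn, top_n=10):
  """Runs benchmark_fn under cProfile and returns (result, report text)."""
  profiler = cProfile.Profile()
  result = profiler.runcall(benchmark_fn)
  stream = io.StringIO()
  stats = pstats.Stats(profiler, stream=stream)
  # Sort by cumulative time so the most expensive call chains come first.
  stats.sort_stats("cumulative").print_stats(top_n)
  return result, stream.getvalue()


def toy_benchmark():
  # Hypothetical stand-in for a real benchmark body.
  return sum(i * i for i in range(10000))


if __name__ == "__main__":
  _, report = run_with_profile(toy_benchmark)
  print(report)
```

Returning the report as a string rather than printing it inside the helper keeps it easy to attach to benchmark output or to write to a file.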



@scwhittle
Contributor Author

fixes #28776

@github-actions
Contributor

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

@scwhittle scwhittle force-pushed the global_side_input_benchmark branch 4 times, most recently from 49ce3b2 to fe6df98 Compare May 21, 2024 21:06
@github-actions
Contributor

Assigning reviewers. If you would like to opt out of this review, comment assign to next reviewer:

R: @shunping for label python.


@scwhittle scwhittle force-pushed the global_side_input_benchmark branch 3 times, most recently from 356234f to 890eb57 Compare May 22, 2024 17:28
scwhittle added 2 commits May 22, 2024 20:41
- add fixed window test
- add side input tests
- improve by using benchmark helpers
@scwhittle scwhittle force-pushed the global_side_input_benchmark branch from 890eb57 to f8009d8 Compare May 22, 2024 18:41
@scwhittle
Contributor Author

Fixed the lint issue. The remaining test failures seem unrelated, but they are rerunning anyway.

@scwhittle
Contributor Author

scwhittle commented May 22, 2024

@robertwb @tvalentyn tagging you since you have context from before and had performance concerns.
From the benchmark, clearing on finish_bundle doesn't appear to impact runtime.
You were correct that doing it every time slowed down the process time by about 25%. I tried to optimize it with a placeholder approach similar to the other placeholders, but it was still slow. I believe this is due to going back and forth between Cython and Python, as shown by cython annotate; see test.html.txt (uploaded as text to work around the GitHub filter). It might be worth investigating if we want to improve performance for process().

@tvalentyn
Copy link
Contributor

Thanks, @scwhittle!

doing it every time slowed down the process time by about 25%.

Have you observed this regression on the Dataflow runner as well?

From the benchmark, clearing on finish_bundle doesn't appear to impact runtime.

Seems like a great idea! Have you checked whether the Dataflow runtime has changed on any of the perf tests that use side inputs?

@scwhittle
Contributor Author

Without the SDK cache enabled (it is not on by default), removing the PerWindowInvoker cache means that there is a lot of FnApi traffic to fetch side inputs, which contributes to latency on the Dataflow runner.

I think that we should postpone making this always on until the SDK cache is enabled by default. If that is too far out, we could change this so that finish_bundle clears the cache only after some timeout, rather than after every bundle.

Enabling the SDK cache will let the runner control the refresh via the side-input cache token.
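As a rough sketch of the timeout idea, assuming a single cached value and an injectable clock: `AgedSideInputCache`, `max_age_secs`, and `loader` are hypothetical names for illustration and do not reflect the actual PerWindowInvoker code.

```python
import time


class AgedSideInputCache:
  """Hypothetical cache dropped at bundle end only once it is old enough."""

  def __init__(self, max_age_secs, clock=time.monotonic):
    self._max_age_secs = max_age_secs
    self._clock = clock
    self._value = None
    self._loaded_at = None

  def get(self, loader):
    # Fast path: repeated process() calls reuse the cached value.
    if self._loaded_at is None:
      self._value = loader()
      self._loaded_at = self._clock()
    return self._value

  def finish_bundle(self):
    # Only pay the refetch cost once the cached value has aged out.
    if (self._loaded_at is not None and
        self._clock() - self._loaded_at >= self._max_age_secs):
      self._value = None
      self._loaded_at = None
```

With this shape, short bundles keep hitting the fast path, and the side input is refetched at most once per `max_age_secs` instead of once per bundle.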

@github-actions
Contributor

github-actions bot commented Jun 7, 2024

Reminder, please take a look at this pr: @shunping

@robertwb
Contributor

I agree, and this will likely slow things down a lot more on a real runner without the SDK cache than the microbenchmarks show. How hard would it be to clear the cache only once it has reached a certain age (at least until the SDK cache is enabled by default)?

@scwhittle
Contributor Author

It would be pretty easy to do so, but it could still cause latency regressions on pipelines that have a global side input that is not refreshing.

I was looking into doing so only if the PCollection generating the side input is unbounded, since a non-updating side input would presumably be bounded. I was using the AsSideInput PCollection, but it looked like some additional plumbing was necessary to keep this metadata in SideInputData, which is what is unpickled and used for execution. I haven't had a chance to get it all working yet.

@robertwb Do you foresee issues with that approach before I work further on completing it? IIUC the pickling just needs to be consistent between pipeline submission and execution, so it wouldn't be an update compatibility issue.
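To make the proposal concrete, here is a minimal sketch of carrying a boundedness flag through pickling and using it to decide whether to clear the cache. The dict payload, the `source_is_bounded` key, and the function names are all hypothetical; the real SideInputData layout is not shown in this thread.

```python
import pickle


def make_side_input_metadata(tag, source_is_bounded):
  # Hypothetical payload built at pipeline submission time.
  return {"tag": tag, "source_is_bounded": source_is_bounded}


def should_clear_on_finish_bundle(metadata):
  # A bounded source cannot produce new values, so the cache is kept.
  # A missing flag is treated as bounded, so older payloads keep the cache.
  return not metadata.get("source_is_bounded", True)


def round_trip(metadata):
  # Mimics pickling at submission and unpickling at execution.
  return pickle.loads(pickle.dumps(metadata))


if __name__ == "__main__":
  unbounded = round_trip(make_side_input_metadata("side0", False))
  print(should_clear_on_finish_bundle(unbounded))
```

Defaulting a missing flag to "bounded" is one possible choice here; it preserves the current caching behavior for any payload that predates the new field.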

@tvalentyn
Contributor

waiting on author

@tvalentyn
Contributor

tvalentyn commented Jun 18, 2024

FWIW, I am planning to look into enabling the state cache again next quarter.

@github-actions
Contributor

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@github-actions github-actions bot added the stale label Aug 18, 2024
@github-actions
Contributor

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@scwhittle
Contributor Author

Recreated as #37123, since I can't reopen this PR after a force push.
