Add OOM handling for RunInference with model manager #37557
Merged

8 commits
- fd80b51 Add OOM protection handling for RunInference (AMOOOMA)
- 2eab947 Make sure we release the model regardless (AMOOOMA)
- bd8ca7e Add testing coverage (AMOOOMA)
- ab0620d Lint and make logging optional (AMOOOMA)
- 6f8103b Pass verbose logging setting from model manager to estimator (AMOOOMA)
- 71e0684 Lint (AMOOOMA)
- 3acb042 Prevent flakes on EOFError (AMOOOMA)
- fcb49c4 Enforce batch size (AMOOOMA)
This new test has a few issues that could be improved:

- Non-determinism: the use of `random.random()` makes this test non-deterministic, which can lead to flaky builds. It's better to make tests deterministic.
- Confusing structure: the `assert_that` call is inside the `with self.assertRaises(Exception):` block. This means the assertion is only checked if the test is about to fail anyway because no exception was raised. It's clearer to have separate tests for the success and failure cases.
- Incomplete test coverage: the `OOMProtectedFn` specifically looks for `'out of memory'` and `'CUDA'` in the exception string to trigger the memory cleanup logic. The `MemoryError("Simulated OOM")` raised here does not contain `'out of memory'`, so the cleanup path is not actually being tested.

I'm suggesting a change to make this test deterministic and to correctly test the OOM cleanup path by raising a more specific error. This will make the test more reliable and ensure the new functionality is properly verified.
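A deterministic version of the suggested test could look roughly like the sketch below. The class and function names here (`FakeModelManager`, `protected_inference`) are hypothetical stand-ins, not the PR's actual API; the only behavior assumed is what the comment above states: the handler matches `'out of memory'` or `'CUDA'` in the exception text before running cleanup, and the success/failure assertions live in separate tests with the `assert_that`-style check outside the `assertRaises` block.

```python
import unittest


class FakeModelManager:
  """Hypothetical stand-in that records whether the model was released."""
  def __init__(self):
    self.released = False

  def release(self):
    self.released = True


def protected_inference(batch, model_manager, fail=False):
  """Sketch of the OOM-protected call path described in the review comment."""
  try:
    if fail:
      # The message must contain 'out of memory' (or 'CUDA') for the
      # cleanup path to trigger, per the matching logic described above.
      raise RuntimeError('CUDA out of memory (simulated)')
    return [len(batch)]
  except RuntimeError as e:
    if 'out of memory' in str(e) or 'CUDA' in str(e):
      model_manager.release()
    raise


class OOMCleanupTest(unittest.TestCase):
  def test_cleanup_runs_on_oom(self):
    mm = FakeModelManager()
    with self.assertRaises(RuntimeError):
      protected_inference([1, 2, 3], mm, fail=True)
    # The post-condition check sits outside the assertRaises block,
    # so it always runs.
    self.assertTrue(mm.released)

  def test_success_path(self):
    mm = FakeModelManager()
    self.assertEqual(protected_inference([1, 2, 3], mm), [3])
    self.assertFalse(mm.released)
```

The failure is forced with a flag rather than `random.random()`, so every run exercises the same path, and the error string is one the matcher actually recognizes.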
I don't think the suggestion is good, but there is a 20% chance this test succeeds as written. Could we drop the batch size to 1? That would help, since we'd get 4 run_inference calls instead of just one.
Ah yes, good catch! I didn't notice the batch size, so I was assuming 0.2^4. Will update!
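The arithmetic behind the batch-size suggestion is easy to check: per the comments above, a single flaky call lets the test pass spuriously 20% of the time, while with batch size 1 the same input produces 4 independent run_inference calls that must all pass, dropping the spurious-pass rate to 0.2^4.

```python
# Spurious-pass probability of the flaky test, as discussed above:
# one call at 20%, versus four independent calls (batch size 1).
p_single = 0.2
p_all_four = p_single ** 4
assert abs(p_all_four - 0.0016) < 1e-12  # 0.16% flake rate
```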