fix: race condition in shared runtime services #37825
Conversation
Thanks for the pull request, @marslanabdulrauf! Once you've gone through the following steps, feel free to tag the maintainers in a comment and let them know that your changes are ready for engineering review.

🔘 Get product approval — If you haven't already, check this list to see if your contribution needs to go through the product review process.

🔘 Provide context — To help your reviewers and other members of the community understand the purpose and larger context of your changes, feel free to add as much of the following information to the PR description as you can.

🔘 Get a green build — If one or more checks are failing, continue working on your changes until this is no longer the case and your build turns green.

Where can I find more information? If you'd like to get more details on all aspects of the review process for open source pull requests (OSPRs), check out the following resources.

When can I expect my changes to be merged? Our goal is to get community contributions seen and reviewed as efficiently as possible. However, the amount of time that it takes to review and merge a PR can vary significantly based on several factors.

💡 As a result, it may take up to several weeks or months to complete a review and merge your PR.
```diff
     Create the proper runtime for this course
     """
-    services = self.services
+    services = self.services.copy()
```
I don't understand how this helps. Isn't this just a shallow copy? So the copy and the original still point at the same shared objects?
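For reference, Python's `dict.copy()` is indeed shallow: the copy is a new dict object, but the values inside it are the same shared objects. A quick illustration (generic Python, not edx-platform code):

```python
# dict.copy() produces a new dict, but the values inside are shared objects.
original = {"cache": ["shared-entry"]}
shallow = original.copy()

# Mutating a shared value object is visible through both dicts...
shallow["cache"].append("mutated")
assert original["cache"] == ["shared-entry", "mutated"]

# ...but adding or replacing keys affects only the copy.
shallow["user"] = "per-runtime user service"
assert "user" not in original
```

So a shallow copy does not isolate mutations *inside* the shared service objects, but it does isolate additions and replacements of keys, which is the behavior at issue in this PR.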
There were 3-4 common objects; I haven't explored all of them, but yeah, they might be sharing the same objects. Let me deep copy this service object.
FWIW, based on the discussions below, if you folks have verified that it's fine with a shallow copy, let's keep it at that. Thank you.
Hey @ormsbee @marslanabdulrauf, if possible, can either of you add a verbose comment about the fix and why we are using copy here?
We are adding a detailed commit message. If you still think it's needed, I can add more details here as well.
@farhaanbukhsh, @marslanabdulrauf: I think a short message above this is reasonable. How about:
A single SplitMongoModuleStore may create many SplitModuleStoreRuntimes, each of which will later modify its internal dict of services on a per-item and often per-user basis. Therefore, it's critical that we make a new copy of our baseline services dict here, so that each runtime is free to add and replace its services without impacting other runtimes.
@ormsbee @marslanabdulrauf that works, thanks a lot
One of my higher-level concerns is that I don't know if this is happening because of something weird that the XBlock runtime or grading system is specifically doing, or whether it's a more general problem with an underlying piece of infrastructure.
This was discussed in yesterday's BTR meeting, but to reiterate here: Thank you for investigating this.
So, as I understand it, it makes sense that the services dict should be shallow copied: the runtime inherits the modulestore's user-agnostic services, but cannot add user-specific services to the modulestore's services dict.
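The isolation property described above can be sketched in a few lines. This is a hypothetical toy (`FakeStore`, `FakeRuntime`, and the method bodies are simplified stand-ins, not the real modulestore API), assuming the store holds a baseline services dict that each runtime shallow-copies:

```python
# Toy model of the pattern under discussion: a store-level baseline services
# dict is shallow-copied into each runtime, so per-user additions never leak
# back into the shared baseline or into sibling runtimes.

class FakeStore:
    def __init__(self):
        # user-agnostic baseline services, shared by every runtime
        self.services = {"i18n": object(), "fs": object(), "cache": object()}

    def create_runtime(self):
        # the shallow copy is the crux of the fix under discussion
        return FakeRuntime(self.services.copy())


class FakeRuntime:
    def __init__(self, services):
        self._services = services

    def prepare_runtime_for_user(self, username):
        # the user-specific service lands in this runtime's copy only
        self._services["user"] = f"user service for {username}"


store = FakeStore()
rt_a = store.create_runtime()
rt_b = store.create_runtime()
rt_a.prepare_runtime_for_user("alice")

assert "user" in rt_a._services            # this runtime got its user service
assert "user" not in rt_b._services        # sibling runtimes are unaffected
assert "user" not in store.services        # shared baseline stays untouched
assert rt_a._services["i18n"] is store.services["i18n"]  # baseline still shared
```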
Oh, I see. I was missing the fact that we're adding services for the user (since the user_service is added at the SplitMongoModuleStore level). Then yes, shallow copy makes sense. @bradenmacdonald: If the issue is around content library content (e.g. this user-specific initialization of the library_tools service)
But another data point is that something changed around early/mid November that seemed to trigger this. I suppose it could have been content-specific (some big course opening then), but I want to poke around a bit and see if any code landed around then that could have caused this.
@ormsbee I suspect now that the reason this
Okay, if that's really it, then the shallow copy sounds fine as a short-term thing. In the longer term, I wonder if there's really even a need to keep the global around, or if we can always re-instantiate the entire
Yeah, it's doing a lot less work than it used to. Most of the time is spent setting up the MongoDB connection, and PyMongo uses its own connection pooling on the backend, so we don't even really need to worry about that. We should just yank it for Verawood.
Maybe another tangent, but there is a big refactoring PR that touches this stuff that merged at the end of October: #35523. I don't see anything in it that I think could cause this, but given that the release where MIT started seeing issues was on Nov 12th, the timing is suspicious. FYI @kdmccormick

(It's also entirely possible that it was triggered by a particular set of content that a course was using at that time, and was not linked to any recent change.)

Yet another tangent: I made a PR to play with removing the global caching of modulestore altogether. It seems to work fine--the tests that I'm seeing break right now are query-count related. It's not a serious contender for patching the release, though.
FYI for folks watching the ticket: @feoh gave an update earlier today saying that the Granian switchover didn't happen until Dec 22, which means it was not involved in these errors.
@marslanabdulrauf: Please pull in my commit from #37850. I'm planning to use the following commit message for the squashed commit:
feanil left a comment
Changes make sense, I think this is good to go once the small comment suggestion that Dave made in the comment on line 3286 has been added.
There is a singleton SplitMongoModuleStore instance that is returned whenever we call the ubiquitous modulestore() function (wrapped in a MixedModuleStore). During initialization, SplitMongoModuleStore sets up a small handful of XBlock runtime services that are intended to be shared globally: i18n, fs, cache. When we get an individual block back from the store using get_item(), SplitMongoModuleStore creates a SplitModuleStoreRuntime using SplitMongoModuleStore.create_runtime(). These runtimes are intended to be modified on a per-item, and later per-user, basis (using prepare_runtime_for_user()).

Prior to this commit, the create_runtime() method was assigning the globally shared SplitMongoModuleStore.services dict directly to the newly instantiated SplitModuleStoreRuntime. This meant that even though each block had its own _services dict, they were all in fact pointing to the same underlying object. This exposed us to a risk of multiple threads contaminating each other's SplitModuleStoreRuntime services when deployed under load in multithreaded mode. We believe this led to a race condition that caused student submissions to be mis-scored in some cases.

This commit makes a copy of the SplitMongoModuleStore.services dict for each SplitModuleStoreRuntime. The baseline global services are still shared, but other per-item and per-user services are now better isolated from each other.

This commit also includes a small modification to the PartitionService, which up until this point had relied on the (incorrect) shared-instance behavior. The details are provided in the comments in the PartitionService __init__().

It's worth noting that the historical rationale for having a singleton ModuleStore instance is that the ModuleStore used to be extremely expensive to initialize. This was because at one point, the init process required reading entire XML-based courses into memory, or pre-computing complex field inheritance caches. This is no longer the case, and SplitMongoModuleStore initialization is in the 1-2 ms range, with most of that being for PyMongo's connection setup. We should try to fully remove the global singleton in the Verawood release cycle in order to make this kind of bug less likely.
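To make the aliasing failure mode described in the commit message concrete, here is a toy reproduction in plain Python (the dict and names are illustrative, not the actual edx-platform code):

```python
# Pre-fix behavior: every "runtime" was handed the same dict object, so one
# runtime's per-user service was visible to all of them.
store_services = {"i18n": "shared-i18n"}

runtime_a = store_services           # old: direct assignment, same object
runtime_b = store_services
runtime_a["user"] = "alice"          # runtime A prepares for its user...
assert runtime_b["user"] == "alice"  # ...and runtime B sees A's user

# Post-fix behavior: each runtime copies the baseline dict, so per-user
# entries stay isolated while the baseline services remain shared.
runtime_c = store_services.copy()
runtime_c["user"] = "carol"
assert runtime_b.get("user") == "alice"          # unchanged by runtime C
assert runtime_c["i18n"] is store_services["i18n"]  # baseline object shared
```

Under multithreaded deployment, the pre-fix aliasing means two requests can interleave their per-user writes into the one shared dict, which is the race the commit addresses.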
Related ticket
https://github.com/mitodl/hq/issues/9621 (MIT Internal)
Discussion
https://discuss.openedx.org/t/recalculate-subsection-grade-v3-is-submitted-with-the-wrong-user-id/17873/12?u=muhammad_arslan
Description
This pull request makes a minor change to how services are handled when creating a runtime for a course. Instead of using the original `self.services` dictionary directly, the code now uses a copy of it to prevent unintended side effects from modifications: copying the services dictionary in `create_runtime` avoids mutating the original `self.services` when creating a runtime.

Steps to reproduce the issue:
Follow the steps mentioned in the discussion post: https://discuss.openedx.org/t/recalculate-subsection-grade-v3-is-submitted-with-the-wrong-user-id/17873/12?u=muhammad_arslan
Testing instructions
Follow the same steps; now each user should have their own submission.