fix(mm): fix COW race condition in multi-core environment #31
sunhaosheng wants to merge 6 commits into Starry-OS:dev
In an SMP environment, when multiple processes sharing the same COW page trigger write faults simultaneously, a race condition could occur:

1. Process A sees count=2, decrements to 1, releases the lock, and starts copying.
2. Process B sees count=1, decides not to copy, and releases the lock.
3. Process B calls pt.protect() to add write permission to the original page.
4. Process B starts writing to the original page.
5. Process A is copying a page that is being modified.
6. Process A gets inconsistent data -> heap corruption -> SIGSEGV.

Fix: Keep the FRAME_TABLE lock held during the entire page copy operation to ensure no other process can see count=1 and start modifying the source page while we're copying it.
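The fix described above can be modelled in user space roughly as follows. This is only a sketch under stated assumptions, not the code in this PR: `FrameTable`, `CowOutcome`, `handle_cow_fault`, and `PAGE_SIZE` are hypothetical names, `std::sync::Mutex` stands in for the kernel lock, and the source page is passed as a plain slice rather than a mapped physical frame. The point it illustrates is ordering: the count is inspected, the page is copied, and the count is decremented all under one lock acquisition, so no other faulting process can observe count=1 while the copy is in flight.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Assumed page size for the model.
const PAGE_SIZE: usize = 4096;

/// Hypothetical stand-in for the frame table: frame number -> reference count.
struct FrameTable {
    refs: HashMap<usize, usize>,
}

/// What the fault handler should do with the mapping afterwards.
#[derive(Debug, PartialEq)]
enum CowOutcome {
    /// Sole owner: the existing page can simply be made writable.
    Reuse,
    /// Shared: the caller maps this private copy instead.
    Copied(Vec<u8>),
}

/// Handles a write fault on a COW page while holding the table lock
/// for the whole check-copy-decrement sequence.
fn handle_cow_fault(table: &Mutex<FrameTable>, frame: usize, src: &[u8]) -> CowOutcome {
    debug_assert_eq!(src.len(), PAGE_SIZE);
    let mut guard = table.lock().unwrap();
    let count = guard.refs.get_mut(&frame).expect("frame not tracked");
    if *count == 1 {
        // No other sharer exists, so nobody can be copying from this page.
        return CowOutcome::Reuse;
    }
    // The lock is still held here, so the count cannot reach 1 mid-copy.
    let copy = src.to_vec();
    *count -= 1;
    CowOutcome::Copied(copy)
}
```

The cost of this ordering is that every other refcount operation waits for the page copy to finish.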
// Hold the FRAME_TABLE lock during the entire operation to avoid race conditions.
// This includes page copying to prevent another process from modifying the page
// while we're copying it.
let mut table = FRAME_TABLE.lock();
Is it necessary to take the global lock? Or is there another way that locks only this page?
There’s indeed a coarse-grained locking issue: a single global lock that covers the whole copy step blocks refcount operations for unrelated pages on SMP. To keep the existing APIs, I plan to change the FRAME_TABLE value into a structure with a per-frame lock: grab the global lock only to locate the FrameState, then lock just that frame while adjusting the count and copying. Different frames can then proceed in parallel. If that approach sounds good, I’ll implement it next—happy to hear any better ideas you might have.
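A rough user-space sketch of that per-frame locking idea, for discussion only. The layout is an assumption: the real `FrameState` and `FRAME_TABLE` may look different, and `FrameTable::new`, `frame`, `ref_count`, and `cow_fault` are hypothetical helpers built on `std::sync` types. The global lock is held just long enough to clone a handle to the frame's state; the count check, the copy, and the decrement then happen under that frame's own lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical per-frame state; the real FrameState may hold more fields.
struct FrameState {
    ref_count: usize,
}

/// One global lock guards the map; each frame has its own lock.
struct FrameTable {
    frames: Mutex<HashMap<usize, Arc<Mutex<FrameState>>>>,
}

impl FrameTable {
    /// Builds a table from (physical address, initial refcount) pairs.
    fn new(initial: &[(usize, usize)]) -> Self {
        let frames: HashMap<usize, Arc<Mutex<FrameState>>> = initial
            .iter()
            .map(|&(paddr, ref_count)| (paddr, Arc::new(Mutex::new(FrameState { ref_count }))))
            .collect();
        FrameTable { frames: Mutex::new(frames) }
    }

    /// Takes the global lock only to locate the frame, then releases it.
    fn frame(&self, paddr: usize) -> Arc<Mutex<FrameState>> {
        let map = self.frames.lock().unwrap();
        Arc::clone(map.get(&paddr).expect("frame not tracked"))
    }

    /// Reads the current reference count of one frame.
    fn ref_count(&self, paddr: usize) -> usize {
        let frame = self.frame(paddr);
        let count = frame.lock().unwrap().ref_count;
        count
    }
}

/// Returns Some(copy) if the page was shared, None if the caller is the sole owner.
fn cow_fault(table: &FrameTable, paddr: usize, src: &[u8]) -> Option<Vec<u8>> {
    let frame = table.frame(paddr); // global lock already released here
    let mut state = frame.lock().unwrap(); // only this frame is locked
    if state.ref_count == 1 {
        return None;
    }
    let copy = src.to_vec(); // faults on other frames can proceed in parallel
    state.ref_count -= 1;
    Some(copy)
}
```

With this shape, a fault on frame `0x1000` never waits on a copy in progress for frame `0x2000`; only faults on the same frame serialize.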
The suggestion that @li041 mentioned might solve this issue. Could you please take a look?
The race is caused by exposing
You're right—the refcount update can be delayed so that the copy runs while count >= 2, preventing the “only one reference left” path from ever firing mid-copy. My plan is to tweak FRAME_TABLE access like this:
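A minimal sketch of the deferred-decrement idea, not the snippet from this PR. It assumes a frame table modelled as a `Mutex<HashMap<usize, usize>>`; `must_copy`, `release_after_copy`, and `cow_fault` are hypothetical names. The count is only read before the copy and only decremented after it, so the lock does not need to be held across the copy, and a sharer can see count=1 only once every other sharer has finished copying.

```rust
use std::collections::HashMap;
use std::sync::Mutex;

/// Hypothetical frame table: frame number -> reference count.
type FrameTable = Mutex<HashMap<usize, usize>>;

/// Phase 1: decide under the lock, but leave the count untouched.
fn must_copy(table: &FrameTable, frame: usize) -> bool {
    let map = table.lock().unwrap();
    let count = *map.get(&frame).expect("frame not tracked");
    count >= 2
}

/// Phase 2: after the copy is complete, drop our reference to the source frame.
/// Returns true if that was the last reference and the frame was freed.
fn release_after_copy(table: &FrameTable, frame: usize) -> bool {
    let mut map = table.lock().unwrap();
    let count = map.get_mut(&frame).expect("frame not tracked");
    *count -= 1;
    if *count == 0 {
        map.remove(&frame);
        true
    } else {
        false
    }
}

/// Returns Some(copy) if the page was shared, None if the caller is the sole owner.
fn cow_fault(table: &FrameTable, frame: usize, src: &[u8]) -> Option<Vec<u8>> {
    if !must_copy(table, frame) {
        return None; // count == 1: safe to make the original page writable
    }
    let copy = src.to_vec(); // lock not held; count is still >= 2 from our side
    release_after_copy(table, frame);
    Some(copy)
}
```

One consequence of this ordering is that two sharers faulting at the same moment may both copy; the second release then brings the count to zero, so the release path has to be prepared to free the original frame.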