Skip to content

fix(mm): fix COW race condition in multi-core environment#31

Closed
sunhaosheng wants to merge 6 commits intoStarry-OS:devfrom
sunhaosheng:arceos-dev-pr
Closed

fix(mm): fix COW race condition in multi-core environment#31
sunhaosheng wants to merge 6 commits intoStarry-OS:devfrom
sunhaosheng:arceos-dev-pr

Conversation

@sunhaosheng
Copy link
Copy Markdown

In SMP environment, when multiple processes sharing the same COW page trigger write faults simultaneously, a race condition could occur:

  1. Process A sees count=2, decrements to 1, releases lock, starts copying
  2. Process B sees count=1, decides not to copy, releases lock
  3. Process B calls pt.protect() to add write permission to original page
  4. Process B starts writing to the original page
  5. Process A is copying a page that's being modified
  6. Process A gets inconsistent data -> heap corruption -> SIGSEGV

Fix: Keep the FRAME_TABLE lock held during the entire page copy operation to ensure no other process can see count=1 and start modifying the source page while we're copying it.

In SMP environment, when multiple processes sharing the same COW page
trigger write faults simultaneously, a race condition could occur:

1. Process A sees count=2, decrements to 1, releases lock, starts copying
2. Process B sees count=1, decides not to copy, releases lock
3. Process B calls pt.protect() to add write permission to original page
4. Process B starts writing to the original page
5. Process A is copying a page that's being modified
6. Process A gets inconsistent data -> heap corruption -> SIGSEGV

Fix: Keep the FRAME_TABLE lock held during the entire page copy operation
to ensure no other process can see count=1 and start modifying the source
page while we're copying it.
Comment thread modules/axmm/src/backend/cow.rs Outdated
// Hold the FRAME_TABLE lock during the entire operation to avoid race conditions.
// This includes page copying to prevent another process from modifying the page
// while we're copying it.
let mut table = FRAME_TABLE.lock();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IS it necessory to lock global? Or Is there another way that only lock on this page ?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There’s indeed a coarse-grained locking issue: a single global lock that covers the whole copy step blocks refcount operations for unrelated pages on SMP. To keep the existing APIs, I plan to change the FRAME_TABLE value into a structure with a per-frame lock: grab the global lock only to locate the FrameState, then lock just that frame while adjusting the count and copying. Different frames can then proceed in parallel. If that approach sounds good, I’ll implement it next—happy to hear any better ideas you might have.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The suggestion that @li041 mentioned might solve this issue. Could you please take a look?

@li041
Copy link
Copy Markdown

li041 commented Dec 17, 2025

The race is caused by exposing refcount == 1 before the COW copy completes. How about delaying the refcount decrement?

@sunhaosheng
Copy link
Copy Markdown
Author

The race is caused by exposing refcount == 1 before the COW copy completes. How about delaying the refcount decrement?

You're right—the refcount update can be delayed so that the copy runs while count >= 2, preventing the “only one reference left” path from ever firing mid-copy. My plan is to tweak FRAME_TABLE access like this:

  • First lock: read the current count and decide whether a copy is needed, but don’t decrement yet.
  • Drop the lock, allocate a new frame, and copy the old data.
  • After the copy, lock again to decrement the old frame’s count, increment the new frame’s count, and finally remap.
    That way the copy phase never exposes count == 1, and we avoid holding the global lock for the entire copy. Does this approach make sense to you? If you have a better idea I’m happy to follow your guidance.

@AsakuraMizu AsakuraMizu marked this pull request as draft January 10, 2026 10:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants