Skip to content

Fix stale PID race in ConsumerRegistry.unregister_name/1#3979

Closed
alco wants to merge 2 commits intomainfrom
alco/stale-consumer-registry
Closed

Fix stale PID race in ConsumerRegistry.unregister_name/1#3979
alco wants to merge 2 commits intomainfrom
alco/stale-consumer-registry

Conversation

@alco
Copy link
Member

@alco alco commented Mar 9, 2026

Summary

  • Replace the no-op unregister_name/1 with an atomic :ets.match_delete/2 that removes the ETS entry only if the pid matches the calling (dying) process
  • A naive lookup-then-delete approach is still racey because the entry can be overwritten between the two ETS calls; match_delete is atomic
  • Add unit tests covering: basic cleanup, skip-if-different-pid, and the full race scenario (old process unregisters after replacement has registered)

Closes #3864

Test plan

  • mix test test/electric/shapes/consumer_registry_test.exs — 16 tests pass (3 new)
  • Full test suite passes in CI

🤖 Generated with Claude Code

@alco alco added the claude label Mar 9, 2026
@pkg-pr-new
Copy link

pkg-pr-new bot commented Mar 9, 2026

Open in StackBlitz

npm i https://pkg.pr.new/@electric-sql/react@3979
npm i https://pkg.pr.new/@electric-sql/client@3979
npm i https://pkg.pr.new/@electric-sql/y-electric@3979

commit: ebdb61c

@codecov
Copy link

codecov bot commented Mar 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 88.74%. Comparing base (0e376c4) to head (537c8dd).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #3979       +/-   ##
===========================================
+ Coverage   75.75%   88.74%   +12.98%     
===========================================
  Files          11       25       +14     
  Lines         693     2425     +1732     
  Branches      172      612      +440     
===========================================
+ Hits          525     2152     +1627     
- Misses        167      271      +104     
- Partials        1        2        +1     
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client 93.93% <ø> (?)
packages/y-electric 56.05% <ø> (ø)
typescript 88.74% <ø> (+12.98%) ⬆️
unit-tests 88.74% <ø> (+12.98%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@blacksmith-sh

This comment has been minimized.

@alco alco force-pushed the alco/stale-consumer-registry branch from dbb2f9f to 44ce5d6 Compare March 9, 2026 12:28
@claude
Copy link

claude bot commented Mar 9, 2026

Claude Code Review

Summary

This PR fixes the ConsumerRegistry.unregister_name/1 race condition by replacing the deliberate no-op with an atomic :ets.match_delete/2 call, preventing stale PID removal of a replacement consumer's ETS entry. The PR is in excellent shape and all previous review issues have been addressed.

What's Working Well

  • Atomic fix: The match_delete approach is strictly better than the issue's proposed two-step lookup+delete, eliminating the TOCTOU window entirely.
  • self() correctness: Called in the dying process context by OTP cleanup machinery, so self() correctly identifies the dying PID — and the comment makes this explicit.
  • Double-removal safety: For {:shutdown, :cleanup} and {:shutdown, :suspend} paths, remove_consumer already deletes the ETS entry before unregister_name fires. match_delete on a missing key is a no-op, so there's no double-removal crash risk.
  • Clean diff: The PR now only touches the three relevant files — the race fix, its tests, and the changeset. No unrelated noise.
  • Test coverage: The three new tests cover basic cleanup, skip-if-different-pid, and the exact race scenario (old process unregisters after replacement has registered). Clear and well-structured.

Issues Found

No outstanding issues.

Issue Conformance

Issue #3864 is well-specified with root cause analysis, three affected code paths, and a proposed fix. This PR implements a strictly better version of the proposed fix (atomic match_delete vs. the TOCTOU-prone lookup+delete). Scope is appropriately narrow. Code path #3 from the issue (restore_shape_and_dependencies/4 dead-PID trust) was noted as a suggestion in the previous review — it remains intentionally deferred per the PR scope, which is fine.

Previous Review Status

All issues resolved:


Review iteration: 2 | 2026-03-16

alco and others added 2 commits March 16, 2026 12:56
The previous no-op implementation left stale PIDs in the ETS table.
A naive lookup-then-delete fix would still be racey because the entry
could be overwritten between the lookup and delete. Using
:ets.match_delete/2 atomically deletes the entry only if the pid
matches the calling (dying) process.

Closes #3864

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@alco alco changed the base branch from refactor/static-logger-strings to main March 16, 2026 14:35
@alco alco force-pushed the alco/stale-consumer-registry branch from c622202 to 537c8dd Compare March 16, 2026 14:36
@blacksmith-sh
Copy link
Contributor

blacksmith-sh bot commented Mar 16, 2026

Found 1 test failure on Blacksmith runners:

Failure

Test View Logs
Elixir.Electric.Replication.PublicationManagerTest/
test component restarts handles relation tracker restart
View Logs

Fix in Cursor

@magnetised
Copy link
Contributor

@alco the reason for the no-op unregister call was to handle race conditions between the consumer registry and the shape log collector. If you just remove the registry entry when the pid terminates but before you remove the shape from the SLC, then the SLC could receive an operation for that shape, see that the consumer process isn't there and start a new process for it. Then the SLC gets the "remove shape" for that new process and you're left with an orphan process.

@alco alco closed this Mar 16, 2026
@alco alco deleted the alco/stale-consumer-registry branch March 16, 2026 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Stale ConsumerRegistry and ShapeStatus entries after consumer exits with unhandled reasons

2 participants