Skip to content

Fix concurrent API race conditions#1574

Open
neolynx wants to merge 11 commits into
masterfrom
fix/task-race
Open

Fix concurrent API race conditions#1574
neolynx wants to merge 11 commits into
masterfrom
fix/task-race

Conversation

@neolynx
Copy link
Copy Markdown
Member

@neolynx neolynx commented May 25, 2026

Fixes #1570, #1467, #1449, #1376, #1305, #1196, #1163, #1125

Description of the Change

This fixes several race conditions when running API calls concurrently.

  • perform collection.LoadComplete inside maybeRunTaskInBackground
  • have tasks use a fresh copy of taskCollectionFactory, taskCollection
  • lock snapshots/repos on publish
  • lock pool on MultiDist=false

Checklist

  • remove / fix commented test
  • unit-test added (if change is algorithm)
  • functional test added/updated (if change is functional)

@neolynx neolynx force-pushed the fix/task-race branch 3 times, most recently from 7fd20e3 to 154615b Compare May 25, 2026 17:58
@neolynx neolynx force-pushed the fix/task-race branch 2 times, most recently from 1609873 to ddcdc79 Compare May 31, 2026 10:44
@codecov
Copy link
Copy Markdown

codecov Bot commented Jun 4, 2026

Codecov Report

❌ Patch coverage is 75.60440% with 111 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.30%. Comparing base (2974558) to head (0a98d9f).

Files with missing lines Patch % Lines
api/publish.go 73.52% 32 Missing and 22 partials ⚠️
api/repos.go 74.48% 14 Missing and 11 partials ⚠️
api/snapshot.go 79.38% 11 Missing and 9 partials ⚠️
api/mirror.go 70.00% 6 Missing and 3 partials ⚠️
task/list.go 87.50% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1574      +/-   ##
==========================================
+ Coverage   77.22%   77.30%   +0.08%     
==========================================
  Files         161      161              
  Lines       15085    15248     +163     
==========================================
+ Hits        11649    11788     +139     
  Misses       2292     2292              
- Partials     1144     1168      +24     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@neolynx neolynx added this to the 1.6.3 milestone Jun 4, 2026
@neolynx neolynx self-assigned this Jun 4, 2026
@neolynx neolynx force-pushed the fix/task-race branch 3 times, most recently from b88e1f3 to 84fd844 Compare June 7, 2026 14:23
@neolynx neolynx changed the title Fix/task race Fix concurrent API race conditions Jun 7, 2026
@neolynx neolynx added the needs review Ready for review & merge label Jun 7, 2026
@neolynx neolynx requested a review from a team June 7, 2026 14:48
@neolynx neolynx force-pushed the fix/task-race branch 3 times, most recently from 9c82a70 to cdc5610 Compare June 7, 2026 21:23
neolynx added 7 commits June 7, 2026 23:47
resource locks need to be before the background task. creating same publish endpoint at the same time is unlikely...
This fixes a flaw in async apis, which loaded the published repo from the
DB and mutated it outside the task closure, before the task lock was acquired.

* perform collection.LoadComplete inside maybeRunTaskInBackground
* have tasks use a fresh copy of taskCollectionFactory, taskCollection
Concurrent tasks were not properly locking their resources, leading to
inconsistent published indexes:

  Task A: apiPublishUpdateSwitch loads published, reads source repo/snapshot
  Request B: modifies same source repo or snapshot (add/remove packages, etc)
  Task A: Update() + Publish() reads stale/modified source -> inconsistent
           published index, or partial write if source deleted mid-task.

This changes introduces resource locking for publishing:
* SourceLocalRepo: iterate published.Sources (component -> source UUID), look up each local repo via localRepoCollection.ByUUID and append
string(repo.Key()) to resources
* SourceSnapshot: iterate b.Snapshots,look up each snapshot via snapshotCollection.ByName and append
string(snapshot.ResourceKey()) to resources.
This fixes a flaw in async apis, which loaded data from the
DB and mutated it outside the task closure, before the task lock was acquired.

    * perform collection.LoadComplete inside maybeRunTaskInBackground
    * have tasks use a fresh copy of taskCollectionFactory, taskCollection
This fixes a flaw in async apis, which loaded data from the
DB and mutated it outside the task closure, before the task lock was acquired.

 * perform collection.LoadComplete inside maybeRunTaskInBackground
 * have tasks use a fresh copy of taskCollectionFactory, taskCollection
This fixes a flaw in async apis, which loaded data from the
DB and mutated it outside the task closure, before the task lock was acquired.

 * perform collection.LoadComplete inside maybeRunTaskInBackground
 * have tasks use a fresh copy of taskCollectionFactory, taskCollection
Race condition iexisted where task State, err, and processReturnValue fields
were written by consumer goroutine and read by concurrent accessors without
proper synchronization, causing torn reads and data races.

Implemented single-lock model with optimal lock scope:

- Removed 8 accessor methods (direct field access is simpler)
- Lock only during brief state transitions (IDLE→RUNNING, RUNNING→SUCCEEDED/FAILED)
- Release lock during task.process() execution to enable full concurrency
- Readers hold list.Lock() only during atomic struct copy
- Moved State = RUNNING before goroutine spawn for clearer semantics
- task/list.go: RunTaskInBackground() copies *task before unlock,
    returns the pre-made copy instead of dereferencing after unlock
neolynx added 2 commits June 8, 2026 13:56
 ### api/snapshot.go — fix apiSnapshotsCreate

 The function was building the new snapshot's ref list only from b.PackageRefs, completely ignoring SourceSnapshots package contents (they were stored as provenance metadata only). The fix
 mirrors the approach from apiSnapshotsMerge:

 1. Start with source snapshots: merge all freshSources[i].RefList() together using Merge(overrideMatching=true)
 2. Layer explicit PackageRefs on top: only enter the package-loading loop if b.PackageRefs is non-empty, then merge the result into refList
 3. Pass the combined refList to NewSnapshotFromRefList

 This means an empty snapshot (SourceSnapshots: [], PackageRefs: []) still correctly produces an empty ref list, single-source and multi-source snapshot-of-snapshot cases are now handled,
 and PackageRefs can still augment or override on top of the merged sources.

 ### system/t12_api/publish.py — fix PublishSwitchAPITestSnapshot

 - Removed the # FIXME comment
 - Changed check_not_exists → check_exists for pyspi-0.6.1-1.3.stripped.dsc after the publish-switch to snapshot2, which is now the correct expectation since snapshot2 inherits all packages
   from snapshot1
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

needs review Ready for review & merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

getFileLock in files/public.go leaks memory via unbounded fileLocks map

1 participant