fix(import): merge events into existing bucket instead of failing #575

TimeToBuildBob wants to merge 4 commits into ActivityWatch:master
Conversation
When importing a bucket that already exists, insert the events into the existing bucket instead of returning a 500 error. This enables the common workflow of re-importing data (e.g. from Android) without having to manually delete existing buckets first. Closes ActivityWatch/activitywatch#1213
@greptileai review
Greptile Summary

This PR fixes a long-standing UX issue where importing a bucket that already exists would return a 500 error and abort the entire import. The fix intercepts the `BucketAlreadyExists` error and merges the imported events into the existing bucket instead.
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ImportEndpoint
    participant Datastore
    Client->>ImportEndpoint: POST /import (bucket + events)
    ImportEndpoint->>Datastore: create_bucket(bucket)
    alt Bucket does not exist
        Datastore-->>ImportEndpoint: Ok(())
        ImportEndpoint->>Datastore: insert_events(bucket_id, events)
        Datastore-->>ImportEndpoint: Ok(())
    else BucketAlreadyExists
        Datastore-->>ImportEndpoint: Err(BucketAlreadyExists)
        ImportEndpoint->>Datastore: get_events_unclipped(bucket_id, start, end)
        Datastore-->>ImportEndpoint: Vec<existing_events>
        Note over ImportEndpoint: Build HashSet of (timestamp, duration_ns, data_json) identities
        Note over ImportEndpoint: Filter incoming events — keep only those not in HashSet
        ImportEndpoint->>Datastore: insert_events(bucket_id, new_events_only)
        Datastore-->>ImportEndpoint: Ok(())
    else Other error
        Datastore-->>ImportEndpoint: Err(e)
        ImportEndpoint-->>Client: 500 Internal Server Error
    end
    ImportEndpoint-->>Client: 200 OK
```
```rust
Err(DatastoreError::BucketAlreadyExists(_)) => {
    // Bucket already exists — merge events into it instead of failing
    info!("Bucket '{}' already exists, merging events", bucket.id);
    if let Some(events) = bucket.events.take() {
        let events_vec = events.take_inner();
        if !events_vec.is_empty() {
            if let Err(e) = datastore.insert_events(&bucket.id, &events_vec) {
                let err_msg = format!(
                    "Failed to merge events into existing bucket '{}': {e:?}",
                    bucket.id
                );
                warn!("{}", err_msg);
                return Err(HttpErrorJson::new(Status::InternalServerError, err_msg));
            }
        }
    }
}
```
Duplicate events on re-import without event IDs
The PR description claims idempotency via INSERT OR REPLACE "matching by id if present", but the implementation has a gap that can cause data duplication.
Looking at the events table schema:

```sql
id INTEGER PRIMARY KEY AUTOINCREMENT
```

And the insert statement:

```sql
INSERT OR REPLACE INTO events(bucketrow, id, starttime, endtime, data) VALUES (...)
```

When an imported event has no id field (`event.id = None`), SQLite receives NULL for the primary key column. For an INTEGER PRIMARY KEY AUTOINCREMENT column, inserting NULL always triggers AUTOINCREMENT to generate a fresh row ID — it does not match any existing row. This means every re-import of ID-less events (which is very common in exports from ActivityWatch, including Android exports) will insert duplicate rows rather than replacing existing ones.
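The failure mode is easy to reproduce in miniature. Below is a toy in-memory stand-in for the table (the struct and method names are invented for illustration, not the real datastore code): with a `None` id, `insert_or_replace` can never hit an existing row, because a fresh id is allocated on every call.

```rust
use std::collections::BTreeMap;

// Toy stand-in for the events table. `insert_or_replace` mimics SQLite's
// INSERT OR REPLACE on an INTEGER PRIMARY KEY AUTOINCREMENT column:
// a None id behaves like NULL and always allocates a fresh id.
struct EventsTable {
    rows: BTreeMap<i64, String>,
    next_id: i64,
}

impl EventsTable {
    fn new() -> Self {
        EventsTable { rows: BTreeMap::new(), next_id: 1 }
    }

    fn insert_or_replace(&mut self, id: Option<i64>, data: &str) -> i64 {
        let id = id.unwrap_or_else(|| {
            // NULL primary key: AUTOINCREMENT generates a new row id,
            // so the insert never matches (and never replaces) an old row.
            let fresh = self.next_id;
            self.next_id += 1;
            fresh
        });
        self.rows.insert(id, data.to_string());
        id
    }
}

fn main() {
    let mut table = EventsTable::new();
    // "Re-import" the exact same ID-less event twice: two rows, not one.
    table.insert_or_replace(None, r#"{"app":"firefox"}"#);
    table.insert_or_replace(None, r#"{"app":"firefox"}"#);
    println!("rows after ID-less re-import: {}", table.rows.len()); // 2

    // With a stable id, the second insert replaces the first.
    table.insert_or_replace(Some(1), r#"{"app":"updated"}"#);
    println!("rows after keyed replace: {}", table.rows.len()); // still 2
}
```

The same two inserts against the real schema behave identically, which is why ID-less imports need content-based matching rather than relying on the primary key.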
The idempotency guarantee only holds if the source events carry stable, non-NULL IDs that map back to the same rows in the destination. However:
- Events from the Android app exported as JSON often omit IDs, or carry local SQLite IDs that may differ from the desktop DB's IDs for the same logical events.
- The test uses two different timestamps (2000-01-01 vs 2000-01-02) for the two imports, so it never exercises the re-import-same-event path that would expose this.
A user who follows the documented workflow ("Re-import data from Android to desktop") twice in a row will silently accumulate duplicate events.
```diff
@@ -310,18 +309,24 @@ mod api_tests {
         "client": "client",
         "hostname": "hostname",
         "events": [{
-            "timestamp":"2000-01-01T00:00:00Z",
+            "timestamp":"2000-01-02T00:00:00Z",
             "duration":1.0,
             "data": {}
         }]
     }}}"#,
     )
     .dispatch();
-    assert_eq!(res.status(), rocket::http::Status::InternalServerError);
-    assert_eq!(
-        res.into_string().unwrap(),
-        r#"{"message":"Failed to import bucket: BucketAlreadyExists(\"id1\")"}"#
-    );
+    assert_eq!(res.status(), rocket::http::Status::Ok);
+
+    // Verify events were merged — bucket should now have 2 events
+    let res = client
+        .get("/api/0/buckets/id1/events")
+        .header(ContentType::JSON)
+        .header(Header::new("Host", "127.0.0.1:5600"))
+        .dispatch();
+    assert_eq!(res.status(), rocket::http::Status::Ok);
+    let events: serde_json::Value = serde_json::from_str(&res.into_string().unwrap()).unwrap();
+    assert_eq!(events.as_array().unwrap().len(), 2);
```
Missing idempotency test for re-importing the same events
The new test imports a bucket with a different event (different timestamp: 2000-01-02 vs 2000-01-01) and verifies the count rises from 1 → 2. This correctly validates the merge path, but it does not cover the idempotency scenario that the PR description promises:
> Re-importing the same events should be idempotent via `INSERT OR REPLACE` (matching by `id` if present).
Consider adding a third import with the exact same event payload (same timestamp, same data, no explicit id) and asserting the count stays at 2, not growing to 3. This would immediately expose the duplication behaviour described above for ID-less events, and give a clear regression target once/if deduplication logic is added.
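Stripped of the HTTP layer, the suggested three-import check has this shape (a plain-Rust model; `merge_import` and the tuple event type are invented for illustration, and the real test would go through Rocket's test client like the existing ones):

```rust
use std::collections::HashSet;

// Minimal event identity for the model: (timestamp, duration, data),
// i.e. full-content matching, since no stable id is available.
type Event = (String, u64, String);

// Invented helper standing in for the import endpoint's merge path:
// insert only the events that are not already present in the bucket.
fn merge_import(bucket: &mut Vec<Event>, import: &[Event]) {
    let existing: HashSet<&Event> = bucket.iter().collect();
    let new_events: Vec<Event> = import
        .iter()
        .filter(|e| !existing.contains(e))
        .cloned()
        .collect();
    drop(existing); // release the borrow before mutating the bucket
    bucket.extend(new_events);
}

fn main() {
    let ev1 = ("2000-01-01T00:00:00Z".to_string(), 1, "{}".to_string());
    let ev2 = ("2000-01-02T00:00:00Z".to_string(), 1, "{}".to_string());

    // First import created the bucket with one event.
    let mut bucket = vec![ev1];

    // Second import: a different event, so the count rises to 2.
    merge_import(&mut bucket, &[ev2.clone()]);
    assert_eq!(bucket.len(), 2);

    // Third import: the exact same event again. An idempotent merge
    // keeps the count at 2 instead of growing it to 3.
    merge_import(&mut bucket, &[ev2]);
    assert_eq!(bucket.len(), 2);
    println!("idempotent: {} events", bucket.len());
}
```

The third import is the assertion the current test is missing; without deduplication it would fail with a count of 3.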
Bucket metadata silently ignored on merge
When the bucket already exists, the new code skips create_bucket entirely and only inserts events. This means bucket metadata present in the import payload — hostname, client, type, and any custom data — is silently discarded.
For the described use case ("re-import from Android to desktop"), the user may expect the import to at minimum update stale metadata (e.g., a changed hostname). It is worth documenting this behaviour in a comment or, if updating metadata is desirable, adding an `update_bucket` call alongside the event insertion.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #575      +/-   ##
==========================================
- Coverage   70.81%   68.02%   -2.80%
==========================================
  Files          51       54       +3
  Lines        2916     3193     +277
==========================================
+ Hits         2065     2172     +107
- Misses        851     1021     +170
```

☔ View full report in Codecov by Sentry.
Quick status update after re-reading the review: Greptile is right that my earlier description overstated the behavior here. The current patch does fix the immediate 500-on-duplicate-bucket failure, but it does not make repeated imports idempotent for exports where events have no explicit `id`. So the accurate scope of this PR is:

Given that, I think this needs either a narrower claim plus explicit docs about the limitation, or a stronger dedup story before merge. Posting this so future review doesn't rely on the earlier over-strong description.
…tency

When importing into an existing bucket, events without an explicit ID were silently duplicated on each re-import, because SQLite AUTOINCREMENT always assigns a new rowid for NULL-id inserts.

Fix: fetch existing events in the import time range before inserting, then filter out events already present (matched by timestamp, duration, and data). This makes re-import idempotent for the common Android/JSON export case where events lack explicit IDs.

Add idempotency test that re-imports the same event and asserts the event count stays at 2 (not 3).

Co-authored-by: Bob <bob@superuserlabs.org>
Fixed the idempotency bug raised in the Greptile review.

Problem: Events without an explicit ID (the common case for Android exports and JSON exports) were duplicated on each re-import because SQLite AUTOINCREMENT always assigns a new rowid for NULL-id inserts.

Fix (21b788e): Before inserting events into an existing bucket, fetch existing events in the import time range and filter out any that are already present. Matching uses the full event content: timestamp, duration, and data.

Test added: Re-imports the same ID-less event and asserts the count stays at 2, not 3. All 14 aw-server tests pass.
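A sketch of that filtering step (the `Event` struct, field names, and `filter_new_events` are simplified assumptions for illustration; the actual code works with `aw_models::Event` through the datastore API):

```rust
use std::collections::HashSet;

// Simplified stand-in for aw_models::Event: the real type has an optional
// id plus chrono/serde_json fields; plain strings are used here instead.
#[derive(Debug, Clone)]
struct Event {
    timestamp: String, // RFC 3339 start time
    duration_ns: i64,
    data_json: String, // canonical JSON serialization of the payload
}

impl Event {
    // Content identity used for matching when no stable id exists.
    fn identity(&self) -> (String, i64, String) {
        (self.timestamp.clone(), self.duration_ns, self.data_json.clone())
    }
}

// Keep only the incoming events that are not already in `existing`.
// Building a HashSet first makes the check O(n + m) rather than an
// O(n * m) scan of the existing events for every incoming event.
fn filter_new_events(existing: &[Event], incoming: Vec<Event>) -> Vec<Event> {
    let seen: HashSet<_> = existing.iter().map(Event::identity).collect();
    incoming
        .into_iter()
        .filter(|e| !seen.contains(&e.identity()))
        .collect()
}

fn main() {
    let existing = vec![Event {
        timestamp: "2000-01-01T00:00:00Z".into(),
        duration_ns: 1_000_000_000,
        data_json: "{}".into(),
    }];
    let incoming = vec![
        existing[0].clone(), // duplicate of an existing event: dropped
        Event {
            timestamp: "2000-01-02T00:00:00Z".into(),
            duration_ns: 1_000_000_000,
            data_json: "{}".into(),
        },
    ];
    let new_events = filter_new_events(&existing, incoming);
    println!("{} new event(s) to insert", new_events.len()); // 1
}
```

The set-based lookup is what keeps the merge cheap even when the overlapping time window contains many existing events.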
@greptileai review
@TimeToBuildBob can the performance issue be easily addressed? Any risk of duplicate events being inserted? How about only inserting events newer than the last event or such? What would be most intuitive and least likely to lead to a corrupted/messed up bucket?
Follow-up on the performance question: I addressed the obvious bottleneck. What changed:

So the current tradeoff looks pretty reasonable:

My take: matching on full event identity in the overlapping time window is the most intuitive and least corruption-prone behavior. It's conservative, but it does the right thing.

Validation:
@greptileai review
Addressed the correctness hole Greptile surfaced and pushed it to this PR. What changed:
Why this matters:
On Erik's performance/safety question:
Validation:
Commit pushed:
@greptileai review |
Problem
Importing a bucket that already exists returns a 500 error and aborts the entire import. This affects users who want to:
Reported in ActivityWatch/activitywatch#1213.
Fix
When `create_bucket` returns `BucketAlreadyExists`, instead of failing, extract the events from the import payload and insert them into the existing bucket. This implements the "merge" semantics the issue requested.

Before: importing a bucket that exists → 500

```
Failed to import bucket: BucketAlreadyExists("id1")
```

After: importing a bucket that exists → 200 OK, events merged in
Changes
- `aw-server/src/endpoints/import.rs`: Handle `BucketAlreadyExists` by inserting events into the existing bucket
- `aw-server/tests/api.rs`: Update test to verify merge behaviour (was asserting the 500 error); also verify merged event count

Notes

Re-importing the same events should be idempotent via `INSERT OR REPLACE` (matching by `id` if present).