fix(import): merge events into existing bucket instead of failing #575

TimeToBuildBob wants to merge 4 commits into ActivityWatch:master
Conversation
When importing a bucket that already exists, insert the events into the existing bucket instead of returning a 500 error. This enables the common workflow of re-importing data (e.g. from Android) without having to manually delete existing buckets first. Closes ActivityWatch/activitywatch#1213
@greptileai review
Greptile Summary

This PR fixes a long-standing UX issue where importing a bucket that already exists would return a 500 error and abort the entire import. The fix intercepts the `BucketAlreadyExists` error and merges the imported events into the existing bucket instead.
Confidence Score: 3/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant ImportEndpoint
    participant Datastore
    Client->>ImportEndpoint: POST /import (bucket + events)
    ImportEndpoint->>Datastore: create_bucket(bucket)
    alt Bucket does not exist
        Datastore-->>ImportEndpoint: Ok(())
        ImportEndpoint->>Datastore: insert_events(bucket_id, events)
        Datastore-->>ImportEndpoint: Ok(())
    else BucketAlreadyExists
        Datastore-->>ImportEndpoint: Err(BucketAlreadyExists)
        ImportEndpoint->>Datastore: get_events_unclipped(bucket_id, start, end)
        Datastore-->>ImportEndpoint: Vec<existing_events>
        Note over ImportEndpoint: Build HashSet of (timestamp, duration_ns, data_json) identities
        Note over ImportEndpoint: Filter incoming events — keep only those not in HashSet
        ImportEndpoint->>Datastore: insert_events(bucket_id, new_events_only)
        Datastore-->>ImportEndpoint: Ok(())
    else Other error
        Datastore-->>ImportEndpoint: Err(e)
        ImportEndpoint-->>Client: 500 Internal Server Error
    end
    ImportEndpoint-->>Client: 200 OK
```
```rust
Err(DatastoreError::BucketAlreadyExists(_)) => {
    // Bucket already exists — merge events into it instead of failing
    info!("Bucket '{}' already exists, merging events", bucket.id);
    if let Some(events) = bucket.events.take() {
        let events_vec = events.take_inner();
        if !events_vec.is_empty() {
            if let Err(e) = datastore.insert_events(&bucket.id, &events_vec) {
                let err_msg = format!(
                    "Failed to merge events into existing bucket '{}': {e:?}",
                    bucket.id
                );
                warn!("{}", err_msg);
                return Err(HttpErrorJson::new(Status::InternalServerError, err_msg));
            }
        }
    }
}
```
Duplicate events on re-import without event IDs
The PR description claims idempotency via INSERT OR REPLACE "matching by id if present", but the implementation has a gap that can cause data duplication.
Looking at the events table schema:

```sql
id INTEGER PRIMARY KEY AUTOINCREMENT
```

And the insert statement:

```sql
INSERT OR REPLACE INTO events(bucketrow, id, starttime, endtime, data) VALUES (...)
```

When an imported event has no id field (`event.id = None`), SQLite receives NULL for the primary key column. For an INTEGER PRIMARY KEY AUTOINCREMENT column, inserting NULL always triggers AUTOINCREMENT to generate a fresh row ID — it does not match any existing row. This means every re-import of ID-less events (which is very common in exports from ActivityWatch, including Android exports) will insert duplicate rows rather than replacing existing ones.
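The failure mode is easy to reproduce in miniature. Below is a toy in-memory stand-in for the table (the struct and method names are invented for illustration, not the real datastore code): with a `None` id, `insert_or_replace` can never hit an existing row, because a fresh id is allocated on every call.

```rust
use std::collections::BTreeMap;

// Toy stand-in for the events table. `insert_or_replace` mimics SQLite's
// INSERT OR REPLACE on an INTEGER PRIMARY KEY AUTOINCREMENT column:
// a None id behaves like NULL and always allocates a fresh id.
struct EventsTable {
    rows: BTreeMap<i64, String>,
    next_id: i64,
}

impl EventsTable {
    fn new() -> Self {
        EventsTable { rows: BTreeMap::new(), next_id: 1 }
    }

    fn insert_or_replace(&mut self, id: Option<i64>, data: &str) -> i64 {
        let id = id.unwrap_or_else(|| {
            // NULL primary key: AUTOINCREMENT generates a new row id,
            // so the insert never matches (and never replaces) an old row.
            let fresh = self.next_id;
            self.next_id += 1;
            fresh
        });
        self.rows.insert(id, data.to_string());
        id
    }
}

fn main() {
    let mut table = EventsTable::new();
    // "Re-import" the exact same ID-less event twice: two rows, not one.
    table.insert_or_replace(None, r#"{"app":"firefox"}"#);
    table.insert_or_replace(None, r#"{"app":"firefox"}"#);
    println!("rows after ID-less re-import: {}", table.rows.len()); // 2

    // With a stable id, the second insert replaces the first.
    table.insert_or_replace(Some(1), r#"{"app":"updated"}"#);
    println!("rows after keyed replace: {}", table.rows.len()); // still 2
}
```

The same two inserts against the real schema behave identically, which is why ID-less imports need content-based matching rather than relying on the primary key.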
The idempotency guarantee only holds if the source events carry stable, non-NULL IDs that map back to the same rows in the destination. However:
- Events from the Android app exported as JSON often omit IDs, or carry local SQLite IDs that may differ from the desktop DB's IDs for the same logical events.
- The test uses two different timestamps (2000-01-01 vs 2000-01-02) for the two imports, so it never exercises the re-import-same-event path that would expose this.
A user who follows the documented workflow ("Re-import data from Android to desktop") twice in a row will silently accumulate duplicate events.
```diff
@@ -310,18 +309,24 @@ mod api_tests {
         "client": "client",
         "hostname": "hostname",
         "events": [{
-            "timestamp":"2000-01-01T00:00:00Z",
+            "timestamp":"2000-01-02T00:00:00Z",
             "duration":1.0,
             "data": {}
         }]
     }}}"#,
     )
     .dispatch();
-    assert_eq!(res.status(), rocket::http::Status::InternalServerError);
-    assert_eq!(
-        res.into_string().unwrap(),
-        r#"{"message":"Failed to import bucket: BucketAlreadyExists(\"id1\")"}"#
-    );
+    assert_eq!(res.status(), rocket::http::Status::Ok);
+
+    // Verify events were merged — bucket should now have 2 events
+    let res = client
+        .get("/api/0/buckets/id1/events")
+        .header(ContentType::JSON)
+        .header(Header::new("Host", "127.0.0.1:5600"))
+        .dispatch();
+    assert_eq!(res.status(), rocket::http::Status::Ok);
+    let events: serde_json::Value = serde_json::from_str(&res.into_string().unwrap()).unwrap();
+    assert_eq!(events.as_array().unwrap().len(), 2);
```
Missing idempotency test for re-importing the same events
The new test imports a bucket with a different event (different timestamp: 2000-01-02 vs 2000-01-01) and verifies the count rises from 1 → 2. This correctly validates the merge path, but it does not cover the idempotency scenario that the PR description promises:
> Re-importing the same events should be idempotent via `INSERT OR REPLACE` (matching by `id` if present).
Consider adding a third import with the exact same event payload (same timestamp, same data, no explicit id) and asserting the count stays at 2, not growing to 3. This would immediately expose the duplication behaviour described above for ID-less events, and give a clear regression target once/if deduplication logic is added.
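Stripped of the HTTP layer, the suggested three-import check has this shape (a plain-Rust model; `merge_import` and the tuple event type are invented for illustration, and the real test would go through Rocket's test client like the existing ones):

```rust
use std::collections::HashSet;

// Minimal event identity for the model: (timestamp, duration, data),
// i.e. full-content matching, since no stable id is available.
type Event = (String, u64, String);

// Invented helper standing in for the import endpoint's merge path:
// insert only the events that are not already present in the bucket.
fn merge_import(bucket: &mut Vec<Event>, import: &[Event]) {
    let existing: HashSet<&Event> = bucket.iter().collect();
    let new_events: Vec<Event> = import
        .iter()
        .filter(|e| !existing.contains(e))
        .cloned()
        .collect();
    drop(existing); // release the borrow before mutating the bucket
    bucket.extend(new_events);
}

fn main() {
    let ev1 = ("2000-01-01T00:00:00Z".to_string(), 1, "{}".to_string());
    let ev2 = ("2000-01-02T00:00:00Z".to_string(), 1, "{}".to_string());

    // First import created the bucket with one event.
    let mut bucket = vec![ev1];

    // Second import: a different event, so the count rises to 2.
    merge_import(&mut bucket, &[ev2.clone()]);
    assert_eq!(bucket.len(), 2);

    // Third import: the exact same event again. An idempotent merge
    // keeps the count at 2 instead of growing it to 3.
    merge_import(&mut bucket, &[ev2]);
    assert_eq!(bucket.len(), 2);
    println!("idempotent: {} events", bucket.len());
}
```

The third import is the assertion the current test is missing; without deduplication it would fail with a count of 3.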
Bucket metadata silently ignored on merge
When the bucket already exists, the new code skips create_bucket entirely and only inserts events. This means bucket metadata present in the import payload — hostname, client, type, and any custom data — is silently discarded.
For the described use case ("re-import from Android to desktop"), the user may expect the import to at minimum update stale metadata (e.g., a changed hostname). It is worth documenting this behaviour in a comment or, if updating metadata is desirable, adding an `update_bucket` call alongside the event insertion.
Codecov Report

❌ Patch coverage is

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master     #575      +/-   ##
==========================================
- Coverage   70.81%   68.02%   -2.80%
==========================================
  Files          51       54       +3
  Lines        2916     3193     +277
==========================================
+ Hits         2065     2172     +107
- Misses        851     1021     +170
```

☔ View full report in Codecov by Sentry.
Quick status update after re-reading the review: Greptile is right that my earlier description overstated the behavior here. The current patch does fix the immediate 500-on-duplicate-bucket failure, but it does not make repeated imports idempotent for exports where events have no explicit `id`. So the accurate scope of this PR is:

Given that, I think this needs either a narrower claim plus explicit docs about the limitation, or a stronger dedup story before merge. Posting this so future review doesn't rely on the earlier over-strong description.
…tency

When importing into an existing bucket, events without an explicit ID were silently duplicated on each re-import, because SQLite AUTOINCREMENT always assigns a new rowid for NULL-id inserts.

Fix: fetch existing events in the import time range before inserting, then filter out events already present (matched by timestamp, duration, and data). This makes re-import idempotent for the common Android/JSON export case where events lack explicit IDs.

Add idempotency test that re-imports the same event and asserts the event count stays at 2 (not 3).

Co-authored-by: Bob <bob@superuserlabs.org>
Fixed the idempotency bug raised in the Greptile review.

Problem: Events without an explicit ID (the common case for Android exports and JSON exports) were duplicated on each re-import because SQLite AUTOINCREMENT always assigns a new rowid for NULL-id inserts.

Fix (21b788e): Before inserting events into an existing bucket, fetch existing events in the import time range and filter out any that are already present. Matching uses the full event content: timestamp, duration, and data.

Test added: Re-imports the same ID-less event and asserts the count stays at 2, not 3. All 14 aw-server tests pass.
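A sketch of that filtering step (the `Event` struct, field names, and `filter_new_events` are simplified assumptions for illustration; the actual code works with `aw_models::Event` through the datastore API):

```rust
use std::collections::HashSet;

// Simplified stand-in for aw_models::Event: the real type has an optional
// id plus chrono/serde_json fields; plain strings are used here instead.
#[derive(Debug, Clone)]
struct Event {
    timestamp: String, // RFC 3339 start time
    duration_ns: i64,
    data_json: String, // canonical JSON serialization of the payload
}

impl Event {
    // Content identity used for matching when no stable id exists.
    fn identity(&self) -> (String, i64, String) {
        (self.timestamp.clone(), self.duration_ns, self.data_json.clone())
    }
}

// Keep only the incoming events that are not already in `existing`.
// Building a HashSet first makes the check O(n + m) rather than an
// O(n * m) scan of the existing events for every incoming event.
fn filter_new_events(existing: &[Event], incoming: Vec<Event>) -> Vec<Event> {
    let seen: HashSet<_> = existing.iter().map(Event::identity).collect();
    incoming
        .into_iter()
        .filter(|e| !seen.contains(&e.identity()))
        .collect()
}

fn main() {
    let existing = vec![Event {
        timestamp: "2000-01-01T00:00:00Z".into(),
        duration_ns: 1_000_000_000,
        data_json: "{}".into(),
    }];
    let incoming = vec![
        existing[0].clone(), // duplicate of an existing event: dropped
        Event {
            timestamp: "2000-01-02T00:00:00Z".into(),
            duration_ns: 1_000_000_000,
            data_json: "{}".into(),
        },
    ];
    let new_events = filter_new_events(&existing, incoming);
    println!("{} new event(s) to insert", new_events.len()); // 1
}
```

The set-based lookup is what keeps the merge cheap even when the overlapping time window contains many existing events.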
@greptileai review
@TimeToBuildBob can the performance issue be easily addressed? Any risk of duplicate events being inserted? How about only inserting events newer than the last event or such? What would be most intuitive and least likely to lead to a corrupted/messed up bucket?
Follow-up on the performance question: I addressed the obvious bottleneck. What changed:

So the current tradeoff looks pretty reasonable:

My take: matching on full event identity in the overlapping time window is the most intuitive and least corruption-prone behavior. It's conservative, but it does the right thing.

Validation:
@greptileai review
Addressed the correctness hole Greptile surfaced and pushed it to this PR. What changed:
Why this matters:
On Erik's performance/safety question:
Validation:
Commit pushed:
@greptileai review |
Problem
Importing a bucket that already exists returns a 500 error and aborts the entire import. This affects users who want to:
Reported in ActivityWatch/activitywatch#1213.
Fix
When `create_bucket` returns `BucketAlreadyExists`, instead of failing, extract the events from the import payload and insert them into the existing bucket. This implements the "merge" semantics the issue requested.

Before: importing a bucket that exists → 500

```
Failed to import bucket: BucketAlreadyExists("id1")
```

After: importing a bucket that exists → 200 OK, events merged in
Changes
- `aw-server/src/endpoints/import.rs`: Handle `BucketAlreadyExists` by inserting events into the existing bucket
- `aw-server/tests/api.rs`: Update test to verify merge behaviour (was asserting the 500 error); also verify merged event count

Notes

Re-importing the same events should be idempotent via `INSERT OR REPLACE` (matching by `id` if present).