Skip to content

fix(sync-service): Handle errors when collecting statistics#4002

Open
magnetised wants to merge 1 commit intomainfrom
magnetised/fix-telemetry-errors
Open

fix(sync-service): Handle errors when collecting statistics#4002
magnetised wants to merge 1 commit intomainfrom
magnetised/fix-telemetry-errors

Conversation

@magnetised
Copy link
Contributor

@magnetised magnetised commented Mar 12, 2026

Telemetry is currently started before the rest of the stack, which means that it starts trying to poll information after a few seconds, but there's no guarantee that the rest of the stack is available within the given timeframe.

Wrap all telemetry collection calls in try..catch to avoid spurious, transient, errors.

https://electricsql-04.sentry.io/issues/102382605/?alert_rule_id=130196&alert_timestamp=1773223876687&alert_type=email&notification_uuid=8fa23ea6-1e04-4a07-98bd-bd0765895424&project=4508410462404688&referrer=digest_email

@codecov
Copy link

codecov bot commented Mar 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 75.75%. Comparing base (f7f6bc7) to head (5821ed5).
✅ All tests successful. No failed tests found.

❗ There is a different number of reports uploaded between BASE (f7f6bc7) and HEAD (5821ed5). Click for more details.

HEAD has 3 uploads less than BASE
Flag BASE (f7f6bc7) HEAD (5821ed5)
unit-tests 5 4
typescript 5 4
packages/typescript-client 1 0
Additional details and impacted files
@@             Coverage Diff             @@
##             main    #4002       +/-   ##
===========================================
- Coverage   88.74%   75.75%   -12.99%     
===========================================
  Files          25       11       -14     
  Lines        2425      693     -1732     
  Branches      611      171      -440     
===========================================
- Hits         2152      525     -1627     
+ Misses        271      167      -104     
+ Partials        2        1        -1     
Flag Coverage Δ
packages/experimental 87.73% <ø> (ø)
packages/react-hooks 86.48% <ø> (ø)
packages/start 82.83% <ø> (ø)
packages/typescript-client ?
packages/y-electric 56.05% <ø> (ø)
typescript 75.75% <ø> (-12.99%) ⬇️
unit-tests 75.75% <ø> (-12.99%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@magnetised magnetised requested a review from msfstef March 12, 2026 11:13
@msfstef
Copy link
Contributor

msfstef commented Mar 12, 2026

I'm pretty sure we had at some point moved the telemetry at the top of the tree for some reason - can we verify that this doesn't conflict with any previous decisions made and ens up causing more problems? We should have left comments on it

@magnetised magnetised force-pushed the magnetised/fix-telemetry-errors branch from fad4531 to 219670e Compare March 12, 2026 11:17
@blacksmith-sh

This comment has been minimized.

@msfstef
Copy link
Contributor

msfstef commented Mar 12, 2026

Found the relevant PR #2722 @icehaunter - it was moved to the top because we wanted telemetry to shut down last, not to start it first.

Otherwise there's various telemetry operations happening during shutdowns that don't find the telemetry processes alive to complete I think.

@msfstef
Copy link
Contributor

msfstef commented Mar 12, 2026

Shouldn't we just keep it at the top but make the polling resilient to the stack not being present?

@magnetised magnetised force-pushed the magnetised/fix-telemetry-errors branch from 219670e to d0a291d Compare March 12, 2026 11:47
@magnetised magnetised requested a review from msfstef March 12, 2026 11:49
@magnetised magnetised changed the title fix(sync-service): Move telemetry to bottom of supervisor children fix(sync-service): Handle errors when collecting statistics Mar 12, 2026
@magnetised magnetised force-pushed the magnetised/fix-telemetry-errors branch from d0a291d to 863f935 Compare March 12, 2026 11:58
Copy link
Contributor

@msfstef msfstef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since all of these functions exist explicitly for the purpose of collecting telemetry, I would rather we make them return appropriate tuples so in case there's actual errors we report them as normal instead of almost silently stop reporting (I get there's a warning log but we don't want to make it an error log cause that path also includes errors that are expected due to boot order)

  • For ShapeDb.statistics it's a GenServer call so I suppose we do need to try-catch - although we could make the function itself encapsulate the missing genserver in its output as an error tuple
  • For ShapeDb.pending_buffer_size it's an ETS query, so we should just guard for missing ETS internally
  • For ShapeCache.count_shapes it drills down to Query.count_shapes which already returns a tuple we handle and a WriteBuffer.pending_count_diff which is an ETS check - although I'm assuming if Query.count_shapes passes then we can assume the ETS should exist, so we would just need to return the error tuple to handle it

I think this would be cleaner and still allows us to catch unexpected errors in our logic? The GenServer one is probably the more annoying one.

)

_ ->
try do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not make these ShapeDb function return tuples and handle them appropriately? I would rather we don't swallow all errors in the telemetry since some of these are exclusively called here, and any actual errors we want to catch and log would now be lost and we silently stop collecting telemetry

@magnetised
Copy link
Contributor Author

@msfstef i wrote a long disagreement with you but actually I think you're right. As you say, these functions are purely for telemetry and there should be a small range of "expected" errors from them.

@magnetised magnetised force-pushed the magnetised/fix-telemetry-errors branch from 863f935 to 5821ed5 Compare March 12, 2026 14:43
@magnetised magnetised requested a review from msfstef March 12, 2026 14:44
Copy link
Contributor

@msfstef msfstef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Minor comment to confirm that the non-error reporting stat is actually correct (sanity check)

def pending_operations_count(stack_id) do
:ets.info(operations_table_name(stack_id), :size)
rescue
ArgumentError -> 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember the exact semantics of this buffer, but I assume this is an in-memory buffer so if the ETS table is not present then the operations are indeed 0 - is this correct or is there a chance of some perissted data not having been loaded and 0 being wrong?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants