Open
Labels
- bugfixing: Fixing defects or unexpected behavior in existing code
- smartem-backend: Core backend services, messaging, and persistence layer
- smartem-backend:api: REST API endpoints and HTTP interface changes
- smartem-devtools:e2e-test: End-to-end testing infrastructure and scenarios
- testing: Writing, updating, or fixing automated tests
Description
Problem
During E2E test runs with high-speed playback (45.6x compression), the RabbitMQ connection is reset after ~3 minutes due to missed heartbeats.
Root Cause
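The RabbitMQ server closes the connection after receiving no heartbeats for 60 seconds. The API server's main thread becomes too busy processing rapid requests to service the pika heartbeat mechanism: a BlockingConnection only sends heartbeats while its owning thread is inside a pika call that processes I/O, such as process_data_events().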
Evidence
From RabbitMQ logs:
2026-01-29 19:00:01.095545+00:00 [error] closing AMQP connection (duration: '3M, 0s'):
missed heartbeats from client, timeout: 60s
From API logs:
pika.adapters.blocking_connection - ERROR - Unexpected connection close detected:
StreamLostError: ("Stream connection lost: ConnectionResetError(104, 'Connection reset by peer')",)
Impact
- Failed to publish grid.created events
- Cascading HTTP 500 errors in agent (~791 errors)
- Events lost during the reconnection window
Proposed Fix Options
- Use threaded heartbeat handling in pika - Configure BlockingConnection with threaded heartbeat processing
- Switch to async pika (aio-pika) - Better heartbeat handling with async I/O
- Implement connection recovery/retry logic - Graceful reconnection with message buffering
- Increase heartbeat timeout or disable for local dev - Quick fix for dev/test environments
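As a rough illustration of the first option, the sketch below gives the connection its own thread that repeatedly calls process_data_events(), so heartbeats are serviced regardless of how busy the request handlers are. Other threads hand their publishes over with add_callback_threadsafe(). The names HeartbeatSafePublisher and pika_connection_factory are hypothetical and not taken from smartem_backend. This is a minimal sketch under those assumptions, not a drop-in fix.

```python
import threading


def pika_connection_factory(host="localhost", heartbeat=60):
    """Hypothetical helper: open a BlockingConnection with an explicit heartbeat."""
    import pika  # imported lazily so the class below can be tested with a fake

    return pika.BlockingConnection(
        pika.ConnectionParameters(host=host, heartbeat=heartbeat)
    )


class HeartbeatSafePublisher:
    """Owns a pika BlockingConnection on a dedicated I/O thread (hypothetical name)."""

    def __init__(self, connection_factory, exchange="", ready_timeout=10.0):
        self._connection_factory = connection_factory
        self._exchange = exchange
        self._ready_timeout = ready_timeout
        self._ready = threading.Event()
        self._stop = threading.Event()
        self._connection = None
        self._channel = None
        self._thread = threading.Thread(target=self._io_loop, daemon=True)
        self._thread.start()

    def _io_loop(self):
        # The connection is created and used only on this thread.
        self._connection = self._connection_factory()
        self._channel = self._connection.channel()
        self._ready.set()
        while not self._stop.is_set():
            # Services socket I/O, heartbeats, and callbacks queued by publish().
            self._connection.process_data_events(time_limit=1)
        self._connection.close()

    def publish(self, routing_key, body):
        """Callable from any thread; the actual publish runs on the I/O thread."""
        if not self._ready.wait(timeout=self._ready_timeout):
            raise RuntimeError("RabbitMQ connection was not established in time")
        self._connection.add_callback_threadsafe(
            lambda: self._channel.basic_publish(
                exchange=self._exchange, routing_key=routing_key, body=body
            )
        )

    def close(self):
        self._stop.set()
        self._thread.join()
```

A caller would construct it once at startup, for example `HeartbeatSafePublisher(pika_connection_factory)`, and call `publish("grid.created", payload)` from request handlers. Note that publish() returns before the message is sent, so failures surface on the I/O thread rather than in the caller.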
Affected Component
smartem_backend - RabbitMQ event publisher
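For the recovery option, one possible shape for the publisher is a small wrapper that keeps events in a bounded buffer and retries them after a reconnect, so a dropped connection delays events instead of losing them. The class name BufferingPublisher and the `send` and `reconnect` hooks are hypothetical. The sketch assumes `send` raises one of the `retriable` exception types when the connection is down.

```python
from collections import deque


class BufferingPublisher:
    """Buffers events and retries them after a connection loss (hypothetical name).

    `send(routing_key, body)` publishes one event and raises on failure;
    `reconnect()` re-establishes the connection. Both are caller-supplied hooks.
    """

    def __init__(self, send, reconnect, retriable=(ConnectionError,), max_buffered=10_000):
        self._send = send
        self._reconnect = reconnect
        self._retriable = retriable
        # Bounded buffer: the oldest events are dropped once it is full.
        self._buffer = deque(maxlen=max_buffered)

    def publish(self, routing_key, body):
        """Queue the event, then try to send everything that is pending."""
        self._buffer.append((routing_key, body))
        return self.flush()

    def flush(self):
        """Send buffered events in order; return True if the buffer is now empty."""
        while self._buffer:
            routing_key, body = self._buffer[0]
            try:
                self._send(routing_key, body)
            except self._retriable:
                try:
                    self._reconnect()
                except self._retriable:
                    pass  # broker still unreachable; events stay buffered
                return False  # the next publish() or flush() retries
            # Remove the event only after a successful send.
            self._buffer.popleft()
        return True

    @property
    def pending(self):
        return len(self._buffer)
```

With pika, `retriable` would probably be set to `pika.exceptions.AMQPConnectionError`, which is assumed here to cover the StreamLostError seen in the API logs. Because an event is removed only after `send` returns, a failure partway through a send can produce a duplicate, so consumers should tolerate at-least-once delivery.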
Reproduction
Run E2E test with compressed playback:
./repos/DiamondLightSource/smartem-devtools/tests/e2e/run-e2e-test.sh
The issue manifests after ~3 minutes of sustained high-throughput ingestion.