Check for deadlocks in telemetry buffers (logs) #2898

@sl0thentr0py

Description

Context

The Python SDK hit a deadlock in the logs telemetry buffer (Slack thread). The Ruby SDK has a very similar buffer implementation, so we should audit it for the same class of issue.

Problem (from Python)

  • The logs buffer acquires a lock when adding a log and when flushing/clearing the buffer
  • During a flush (lock held), GC ran and emitted a log → the logging integration tried to add it to the buffer → attempted to acquire the already-held lock → deadlock
  • Re-entrant locks wouldn't have helped: re-entrancy only covers acquisition from the same thread, and the GC callback ran on a different thread
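
A minimal Ruby sketch of the hazard (class and method names are hypothetical, not the SDK's actual buffer): flush runs a side-effect while still holding the mutex, so a log added from another thread during that window blocks on the held lock. Try-joining the writer thread with a timeout makes the deadlock observable without hanging the script:

```ruby
# Hypothetical buffer illustrating the problematic pattern.
class NaiveLogBuffer
  def initialize
    @mutex = Mutex.new
    @items = []
  end

  def add(item)
    @mutex.synchronize { @items << item }
  end

  # BAD: the callback (standing in for envelope construction or anything
  # that can emit a log) fires inside the critical section.
  def flush(&on_flush)
    @mutex.synchronize do
      batch = @items.dup
      @items.clear
      on_flush&.call(batch)
      batch
    end
  end
end

buffer = NaiveLogBuffer.new
buffer.add("log 1")

deadlocked = false
buffer.flush do |_batch|
  # Simulate a GC/instrumentation callback on another thread that tries
  # to add a log mid-flush; it blocks on the mutex held by this flush.
  t = Thread.new { buffer.add("log emitted during flush") }
  deadlocked = t.join(0.5).nil? # join times out => writer is stuck
  t.kill
end

puts deadlocked # => true
```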

What to check in Ruby

  1. Audit the telemetry buffer lock usage — ensure no code path can trigger a re-entrant lock acquisition (e.g. via callbacks, GC, instrumentation side-effects during flush)
  2. Minimize critical sections — the lock should only protect fetch/pop/clear/add operations on the buffer data structure. Envelope construction and other side-effects should happen outside the lock
  3. Consider the .NET approach — their implementation is mostly lock-free (atomic increments/decrements), only locking briefly during flush to extract a copy of the buffer array before releasing
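
Points 2 and 3 above can be sketched as follows (a hypothetical buffer, not the actual SDK implementation): the mutex guards only the swap of the internal array, and envelope construction happens after the lock is released, so a log emitted during it can safely re-enter add:

```ruby
# Hypothetical buffer with a minimized critical section.
class LogBuffer
  def initialize
    @mutex = Mutex.new
    @items = []
  end

  def add(item)
    @mutex.synchronize { @items << item }
  end

  def flush
    # Critical section: just detach the current batch.
    batch = @mutex.synchronize do
      drained = @items
      @items = []
      drained
    end
    # Outside the lock: safe even if this emits logs that call #add,
    # because the mutex is free again.
    build_envelope(batch)
  end

  private

  # Stands in for real envelope construction / serialization.
  def build_envelope(batch)
    { item_count: batch.size, items: batch }
  end
end

buffer = LogBuffer.new
3.times { |i| buffer.add("log #{i}") }
envelope = buffer.flush
puts envelope[:item_count] # => 3
puts buffer.flush[:item_count] # => 0
```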

Related

  • Python fix by Ivana: moved side-effect-producing work outside the locked section
  • Java was checked and looks fine (Alexander)
  • .NET uses lock-free atomics, had a separate recursion issue with Debug=true
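
Ruby has no stdlib atomics, but a similar mostly-lock-free shape could be approximated with the thread-safe Thread::Queue (a sketch with hypothetical names, not a drop-in replacement): the caller never holds a user-visible lock, so a log emitted mid-flush simply lands in the next batch:

```ruby
# Hypothetical queue-backed buffer with no re-entrancy hazard.
class QueueLogBuffer
  def initialize
    @queue = Queue.new # Thread::Queue: internally synchronized
  end

  def add(item)
    @queue << item # never blocks; no lock the caller can re-enter
  end

  # Drain only the items present when flush starts; concurrent adds
  # (e.g. from a GC log callback) go into the next batch instead of
  # contending for a held lock.
  def flush
    batch = []
    @queue.size.times { batch << @queue.pop }
    batch
  end
end

buf = QueueLogBuffer.new
buf.add("log a")
buf.add("log b")
puts buf.flush.size # => 2
puts buf.flush.size # => 0
```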
