
LLM Observability context lost when using Kotlin coroutines #10474

@grantlerduck

Tracer Version(s)

1.58.2

Java Version(s)

21.0.7

JVM Vendor

Amazon Corretto

Bug Report

LLM Observability context lost when using Kotlin coroutines

When using LLM Observability with Kotlin coroutines, the LLMObsState context is lost when coroutines suspend and resume on different threads. This breaks parent-child span relationships and can cause spans to have incorrect or missing parent IDs.

Environment

  • dd-trace-java version: 1.58.2
  • Kotlin with coroutines
  • Spring WebFlux / suspend functions

Root Cause

LLMObsState relies on datadog.context.Context, which is backed by ThreadLocalContextManager:

// ThreadLocalContextManager.java
final class ThreadLocalContextManager implements ContextManager {
  private static final ThreadLocal<Context[]> CURRENT_HOLDER =
      ThreadLocal.withInitial(() -> new Context[] {EmptyContext.INSTANCE});

  @Override
  public Context current() {
    return CURRENT_HOLDER.get()[0];  // Plain ThreadLocal - not coroutine-safe
  }
}

When DDLLMObsSpan is created, it attaches to the current context and sets the parent:

// DDLLMObsSpan.java constructor
AgentSpanContext parent = LLMObsState.getLLMObsParentContext();  // Returns null on different thread
// ...
this.scope = LLMObsState.attach();
LLMObsState.setLLMObsParentContext(this.span.context());
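
The thread affinity described above can be reproduced without the tracer. The following is a minimal JDK-only sketch, not dd-trace-java code: `CURRENT` is a hypothetical stand-in for the thread-bound holder, and the string values are placeholders for real context objects. A value set on the calling thread is not visible on a worker thread, which only sees the holder's initial value.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ThreadLocalLossDemo {
    // Hypothetical stand-in for a thread-bound context holder (assumption, not the real class).
    static final ThreadLocal<String> CURRENT = ThreadLocal.withInitial(() -> "empty");

    // Sets a value on the calling thread, then reads the holder from a worker thread.
    static String readFromWorker() throws Exception {
        CURRENT.set("parent-span");
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<String> read = CURRENT::get; // runs on the worker thread
            return worker.submit(read).get();
        } finally {
            worker.shutdown();
            CURRENT.remove();
        }
    }

    public static void main(String[] args) throws Exception {
        // The worker never saw "parent-span"; it gets the initial value instead.
        System.out.println("worker sees: " + readFromWorker()); // worker sees: empty
    }
}
```

A coroutine that resumes on a different dispatcher thread is in the same position as the worker here, unless something copies the context across.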

Impact

This is particularly problematic for:

  • Batch LLM operations (parallel embeddings, parallel completions)
  • Agent workflows with parallel tool execution
  • Any async/concurrent LLM workloads using Kotlin coroutines or reactive frameworks such as Spring WebFlux

Potential Solutions

  1. Leverage existing Kotlin coroutine instrumentation - dd-trace-java has coroutine support for APM traces. Could similar context propagation be applied to LLMObsState?

  2. Use CoroutineContext elements - Allow users to propagate LLM Obs context explicitly via coroutine context, similar to how MDC context can be propagated.

  3. Integrate with datadog.context.Context propagation - If there's existing machinery for propagating Context across async boundaries, ensure LLMObsState benefits from it.
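
Options 2 and 3 both come down to capturing the context where work is scheduled and reinstalling it where the work runs. The sketch below shows that pattern with plain JDK types, under stated assumptions: `PARENT` and `propagating` are hypothetical names, and nothing here is an existing dd-trace-java API. In Kotlin, the same capture and restore steps could likely be expressed with kotlinx.coroutines' `ThreadContextElement`, which is invoked each time a coroutine resumes on a thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class ContextCaptureDemo {
    // Hypothetical stand-in for the thread-bound LLM Obs parent (assumption).
    static final ThreadLocal<String> PARENT = new ThreadLocal<>();

    // Captures the caller's value now and reinstalls it around the task
    // on whichever thread eventually runs it.
    static <T> Callable<T> propagating(Callable<T> task) {
        final String captured = PARENT.get();
        return () -> {
            String previous = PARENT.get();
            PARENT.set(captured);
            try {
                return task.call();
            } finally {
                // Put back the worker's earlier state so pooled threads do not leak context.
                if (previous == null) {
                    PARENT.remove();
                } else {
                    PARENT.set(previous);
                }
            }
        };
    }

    // Sets a parent on the calling thread and reads it from a child task on a worker thread.
    static String runChild() throws Exception {
        PARENT.set("agent-span");
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> child = propagating(() -> "parent=" + PARENT.get());
            return pool.submit(child).get();
        } finally {
            pool.shutdown();
            PARENT.remove();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runChild()); // parent=agent-span
    }
}
```

The capture has to happen at scheduling time, on the thread that still holds the parent. Restoring the previous value in `finally` matters because dispatcher threads are reused across unrelated tasks.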

Contribution

I would be happy to contribute a fix for this if the approach is welcomed. Please let me know the preferred direction and any guidance on implementation.

Expected Behavior

Child LLM spans created within coroutines should correctly reference their parent span, maintaining the trace hierarchy.

Actual Behavior

  • LLMObsState.getLLMObsParentContext() returns null when called from a different thread
  • Child spans get parent_id="undefined" instead of the actual parent span ID
  • Trace hierarchy is broken in the Datadog LLM Observability UI

Reproduction Code

Reproduction Scenario

// Parent span created on Thread-1
val parentSpan = LLMObs.startAgentSpan("agent", mlApp, sessionId)

coroutineScope {
    // Child spans created in parallel coroutines
    // These may run on Thread-2, Thread-3, etc.
    listOf(task1, task2, task3).map { task ->
        async(Dispatchers.IO) {
            // LLMObsState.getLLMObsParentContext() returns null here
            // because ThreadLocal from Thread-1 is not available
            val childSpan = LLMObs.startLLMSpan("llm.inference", model, provider, mlApp, sessionId)
            // ... do work
            childSpan.finish()
        }
    }.awaitAll()
}

parentSpan.finish()

Labels: type: bug (Bug report and fix)
