Description
Tracer Version(s)
1.58.2
Java Version(s)
21.0.7
JVM Vendor
Amazon Corretto
Bug Report
LLM Observability context lost when using Kotlin coroutines
When using LLM Observability with Kotlin coroutines, the LLMObsState context is lost when coroutines suspend and resume on different threads. This breaks parent-child span relationships and can cause spans to have incorrect or missing parent IDs.
Environment
- dd-trace-java version: 1.58.2
- Kotlin with coroutines
- Spring WebFlux / suspend functions
Root Cause
LLMObsState relies on datadog.context.Context which is backed by ThreadLocalContextManager:
// ThreadLocalContextManager.java
final class ThreadLocalContextManager implements ContextManager {
  private static final ThreadLocal<Context[]> CURRENT_HOLDER =
      ThreadLocal.withInitial(() -> new Context[] {EmptyContext.INSTANCE});

  @Override
  public Context current() {
    return CURRENT_HOLDER.get()[0]; // Plain ThreadLocal - not coroutine-safe
  }
}

When DDLLMObsSpan is created, it attaches to the current context and sets the parent:
// DDLLMObsSpan.java constructor
AgentSpanContext parent = LLMObsState.getLLMObsParentContext(); // Returns null on different thread
// ...
this.scope = LLMObsState.attach();
LLMObsState.setLLMObsParentContext(this.span.context());

Impact
This is particularly problematic for:
- Batch LLM operations (parallel embeddings, parallel completions)
- Agent workflows with parallel tool execution
- Any async/concurrent LLM workloads using Kotlin coroutines or reactive frameworks such as Spring WebFlux
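
The thread-local behavior behind these cases can be seen without the tracer. The following is a minimal Java sketch, not dd-trace-java code: `PARENT` is a hypothetical stand-in for the thread-local context holder, and a single-thread executor plays the role of a coroutine resuming on a different thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadLocalLossDemo {
    // Hypothetical stand-in for the tracer's thread-local parent context.
    static final ThreadLocal<String> PARENT = new ThreadLocal<>();

    /** Reads PARENT on a freshly created worker thread and returns what it sees. */
    static String readOnWorker() {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> read = PARENT::get;
            return pool.submit(read).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        PARENT.set("agent-span"); // set on the calling thread only
        System.out.println("caller sees: " + PARENT.get());   // agent-span
        System.out.println("worker sees: " + readOnWorker()); // null
    }
}
```

The worker thread has its own, empty slot for `PARENT`, which matches the `null` parent reported above when a child span is started on another thread.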
Potential Solutions
- Leverage existing Kotlin coroutine instrumentation - dd-trace-java has coroutine support for APM traces. Could similar context propagation be applied to LLMObsState?
- Use CoroutineContext elements - Allow users to propagate LLM Obs context explicitly via coroutine context, similar to how MDC context can be propagated.
- Integrate with datadog.context.Context propagation - If there's existing machinery for propagating Context across async boundaries, ensure LLMObsState benefits from it.
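
All three options share one idea: capture the context where the work is scheduled and reinstall it where the work runs. Below is a minimal plain-Java sketch of that capture-and-restore pattern, assuming a hypothetical thread-local `PARENT` and a hypothetical helper `propagating`. It is not the tracer's API and not a proposed implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ContextCapture {
    // Hypothetical stand-in for the thread-local parent context.
    static final ThreadLocal<String> PARENT = new ThreadLocal<>();

    /** Captures the caller's value now and installs it around the task later. */
    static <T> Callable<T> propagating(Callable<T> task) {
        final String captured = PARENT.get(); // read on the submitting thread
        return () -> {
            String previous = PARENT.get();   // whatever the worker already had
            PARENT.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's prior state so nothing leaks between tasks.
                if (previous == null) {
                    PARENT.remove();
                } else {
                    PARENT.set(previous);
                }
            }
        };
    }

    /** Reads PARENT on a worker thread, with or without propagation. */
    static String readOnWorker(boolean propagate) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> read = PARENT::get;
            Callable<String> task = propagate ? propagating(read) : read;
            return pool.submit(task).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        PARENT.set("agent-span");
        System.out.println("without propagation: " + readOnWorker(false)); // null
        System.out.println("with propagation:    " + readOnWorker(true));  // agent-span
    }
}
```

On the Kotlin side, kotlinx.coroutines exposes the same set-on-resume, restore-on-suspend pattern through `ThreadContextElement`, and `ThreadLocal.asContextElement()` wraps an existing thread-local that way. Whether `LLMObsState` could be wired through such an element is the open question here.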
Contribution
I would be happy to contribute a fix for this if the approach is welcomed. Please let me know the preferred direction and any guidance on implementation.
Expected Behavior
Child LLM spans created within coroutines should correctly reference their parent span, maintaining the trace hierarchy.
Actual Behavior
- LLMObsState.getLLMObsParentContext() returns null when called from a different thread
- Child spans get parent_id="undefined" instead of the actual parent span ID
- Trace hierarchy is broken in the Datadog LLM Observability UI
Reproduction Code
Reproduction Scenario
// Parent span created on Thread-1
val parentSpan = LLMObs.startAgentSpan("agent", mlApp, sessionId)
coroutineScope {
    // Child spans created in parallel coroutines
    // These may run on Thread-2, Thread-3, etc.
    listOf(task1, task2, task3).map { task ->
        async(Dispatchers.IO) {
            // LLMObsState.getLLMObsParentContext() returns null here
            // because ThreadLocal from Thread-1 is not available
            val childSpan = LLMObs.startLLMSpan("llm.inference", model, provider, mlApp, sessionId)
            // ... do work
            childSpan.finish()
        }
    }.awaitAll()
}
parentSpan.finish()