Draft: Add DatadogBridge for real-time APM span context propagation #15309
The DatadogExporter creates LLMObs spans retroactively (after execution
completes), which means dd-trace auto-instrumented APM spans from tools
and processors get parented to the request handler instead of the correct
Mastra span. This is because no dd-trace span is active in scope during
execution.
The new DatadogBridge solves this by creating dd-trace APM spans eagerly
via tracer.startSpan() at span creation time, and activating them in
dd-trace's scope via tracer.scope().activate() during execution. This
means auto-instrumented HTTP/DB calls from MCP tools, guardrail
processors, etc. are correctly nested under their parent Mastra spans.
LLMObs annotation (model info, token usage, I/O) is still emitted
retroactively through dd-trace's own LLMObs pipeline using the existing
nested llmobs.trace() callback pattern.
Usage:
    import { DatadogBridge } from '@mastra/datadog';

    new Mastra({
      observability: {
        configs: {
          default: {
            bridge: new DatadogBridge({ mlApp: 'my-app' }),
          }
        }
      }
    });
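The parenting difference described above can be illustrated with a toy model. This is a minimal sketch, not dd-trace: `ToyTracer`, `ToyScope`, and `autoInstrumentedFetch` are hypothetical names, and the scope here is a synchronous stack, whereas dd-trace tracks the active span across async boundaries. It only shows why a span that is activated eagerly becomes the parent of instrumented calls made during execution.

```typescript
// Toy model of scope-based span parenting. All names are hypothetical.
interface ToySpan {
  name: string;
  parent: string | undefined;
}

class ToyScope {
  private stack: ToySpan[] = [];

  // The innermost activated span, if any.
  active(): ToySpan | undefined {
    return this.stack[this.stack.length - 1];
  }

  // Run fn with span as the active span, then restore the previous one.
  activate<T>(span: ToySpan, fn: () => T): T {
    this.stack.push(span);
    try {
      return fn();
    } finally {
      this.stack.pop();
    }
  }
}

class ToyTracer {
  private readonly scopeManager = new ToyScope();

  scope(): ToyScope {
    return this.scopeManager;
  }

  // In this toy, a new span is parented to whatever span is active.
  startSpan(name: string): ToySpan {
    return { name, parent: this.scope().active()?.name };
  }
}

// Stands in for an auto-instrumented HTTP or DB call made by a tool.
function autoInstrumentedFetch(tracer: ToyTracer): ToySpan {
  return tracer.startSpan('http.request');
}

const tracer = new ToyTracer();
const requestSpan = tracer.startSpan('request.handler');

// Retroactive approach: no Mastra span is active while the tool runs,
// so the instrumented call attaches to the request handler.
const retroactive = tracer
  .scope()
  .activate(requestSpan, () => autoInstrumentedFetch(tracer));

// Eager approach: the tool span is started and activated before the
// tool runs, so the instrumented call nests under it.
const eager = tracer.scope().activate(requestSpan, () => {
  const toolSpan = tracer.startSpan('mastra.tool');
  return tracer.scope().activate(toolSpan, () => autoInstrumentedFetch(tracer));
});
```

In this model `retroactive.parent` is the request handler while `eager.parent` is the tool span, which mirrors the nesting the bridge aims for.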
https://claude.ai/code/session_01Q7w4QfZvEXyUvyY2y4XQe1
Walkthrough
Adds a new DatadogBridge that integrates with dd-trace to create APM spans eagerly for real-time context propagation, buffers Mastra spans for retroactive LLM Observability emission via dd-trace, and includes tests, docs, exports, lifecycle (flush/shutdown) and cleanup logic.
Estimated code review effort: 4 (Complex), ~45 minutes
Pre-merge checks: 2 passed, 1 failed (1 warning)
🦋 Changeset detected
Latest commit: ce8b6da
The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package.
🧹 Nitpick comments (1)
observability/datadog/src/bridge.test.ts (1)
32-97: Consider resetting `apmSpanCounter` in `beforeEach` for test isolation.
The `apmSpanCounter` variable continues incrementing across tests since it's not reset in the `beforeEach` hook. While current tests don't depend on specific counter values, this could cause fragile tests if future tests expect specific span IDs.
♻️ Suggested improvement
Add a reset mechanism to the hoisted mock:
     const {
       mockAnnotate,
       mockTrace,
       // ... other exports
       capturedAPMSpans,
    +  resetApmSpanCounter,
     } = vi.hoisted(() => {
       let currentScopeSpan: any = undefined;
       const parents: any[] = [];
       const llmobsSpans: any[] = [];
       let apmSpanCounter = 0;
       // ...
       return {
         // ... existing returns
    +    resetApmSpanCounter: () => { apmSpanCounter = 0; },
       };
     });
Then in `beforeEach`:
     beforeEach(() => {
       vi.clearAllMocks();
       traceParents.length = 0;
       capturedLLMObsSpans.length = 0;
       capturedAPMSpans.length = 0;
    +  resetApmSpanCounter();
       // ...
     });
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@observability/datadog/src/bridge.test.ts` around lines 32 - 97, The hoisted mock keeps apmSpanCounter incrementing across tests which can leak state; expose a reset function from the vi.hoisted return (e.g., resetApmSpanCounter) that sets apmSpanCounter = 0 and then call that reset in the test suite's beforeEach to ensure test isolation; reference the existing apmSpanCounter and mockStartSpan in your change so the counter reset is clearly tied to the span factory used by mockStartSpan.
📒 Files selected for processing (3)
- observability/datadog/src/bridge.test.ts
- observability/datadog/src/bridge.ts
- observability/datadog/src/index.ts
…ss to false
- Add guide page at docs/observability/tracing/bridges/datadog explaining when to use the bridge, how it works, setup with dd-trace, agent vs agentless mode, and trace hierarchy.
- Add reference page at reference/observability/tracing/bridges/datadog documenting DatadogBridgeConfig, methods, usage examples, span mapping, and environment variables.
- Update both docs and reference sidebars to include the new pages.
- Cross-reference the bridge from the Datadog exporter pages so users with dd-trace APM are pointed at the right tool.
- Default DatadogBridge agentless to false. Bridge users almost always have a local Datadog Agent (required for dd-trace APM data), so agentless mode would split LLMObs traffic away from APM traffic. The exporter remains agentless-by-default for LLMObs-only use cases.
https://claude.ai/code/session_01Q7w4QfZvEXyUvyY2y4XQe1
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@observability/datadog/src/bridge.ts`:
- Around line 431-437: buildSpanTree()/emitSpanTree()/tryEmitReadySpans
currently clear the entire state.buffer after attempting to build/emit a tree,
which drops child spans whose parent arrived later; change the logic to preserve
unresolved spans by only removing spans that were actually emitted.
Specifically, modify tryEmitReadySpans (and the branch that calls
buildSpanTree/emitSpanTree) to: 1) let buildSpanTree/emitSpanTree return the
set/list of span IDs that were successfully emitted (or the root subtree nodes),
2) remove only those emitted IDs from state.buffer, and 3) keep any spans whose
parent was unresolved in state.buffer and do not set state.treeEmitted true
unless the intended root emission completed; apply the same change for the other
occurrence mentioned (around the 495-500 block) so unresolved children remain
buffered until their parent is emitted.
- Around line 531-538: Late-arriving MODEL_STEP spans are missing inherited
model/provider metadata because the late-span path calls buildSpanOptions(span)
without the parent-derived context; update the logic that computes
childInheritedModelAttrs and the late-span emission (locations around
childInheritedModelAttrs, buildSpanOptions, emitSingleSpan) to persist or
re-derive effective model/provider attributes from the parent (e.g., store
effective attrs in state.contexts keyed by trace/span id or walk to the parent
span to extract its ModelGenerationAttributes) and pass those into
buildSpanOptions so emitted LLMObs spans include modelName/modelProvider; apply
the same fix to the other late-span areas noted (around lines referenced by the
reviewer).
📒 Files selected for processing (9)
- .changeset/datadog-bridge.md
- docs/src/content/en/docs/observability/tracing/bridges/datadog.mdx
- docs/src/content/en/docs/observability/tracing/exporters/datadog.mdx
- docs/src/content/en/docs/sidebars.js
- docs/src/content/en/reference/observability/tracing/bridges/datadog.mdx
- docs/src/content/en/reference/observability/tracing/exporters/datadog.mdx
- docs/src/content/en/reference/sidebars.js
- observability/datadog/src/bridge.test.ts
- observability/datadog/src/bridge.ts
✅ Files skipped from review due to trivial changes (5)
- docs/src/content/en/reference/sidebars.js
- docs/src/content/en/reference/observability/tracing/exporters/datadog.mdx
- .changeset/datadog-bridge.md
- docs/src/content/en/docs/observability/tracing/bridges/datadog.mdx
- docs/src/content/en/docs/sidebars.js
…ate spans in DatadogBridge
Two related bugs in tryEmitReadySpans/emitSpanTree/emitSingleSpan:
1. Orphan span drop: buildSpanTree() only links a child to its parent if the parent is present in the buffer at tree-build time. Previously, after the initial tree was emitted, state.buffer.clear() discarded any unresolved spans, including children whose parent simply hadn't ended yet (e.g., child ended early, root ended before parent, parent is still in flight). The parent would then emit successfully via the late-arrival path, but the orphaned child was gone forever.
   Fix: remove only spans that actually landed in state.contexts during tree emission. Unresolved spans stay buffered. The late-arrival phase (now run unconditionally, not just in an else branch) emits them once their parent's context exists, iterating to a fixed point so chains of late spans all flush.
2. Missing model/provider on late MODEL_STEP spans: emitSpanTree propagates MODEL_GENERATION's model/provider down to MODEL_STEP descendants via childInheritedModelAttrs. But emitSingleSpan (the late-arrival path) called buildSpanOptions(span) with no inherited attrs, so a late MODEL_STEP would render without modelName/modelProvider in LLMObs.
   Fix: store childInheritedModelAttrs in state.contexts alongside each emitted ddSpan. When the late-arrival path emits a span, it looks up the parent's stored attrs and passes them through. emitSingleSpan also computes its own childInheritedModelAttrs for any further descendants.
Added regression tests:
- preserves unresolved children when root ends before their parent
- passes MODEL_GENERATION model/provider to a late-arriving MODEL_STEP child
https://claude.ai/code/session_01Q7w4QfZvEXyUvyY2y4XQe1
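The two fixes in this commit can be sketched together in a simplified model. This is an illustration under stated assumptions, not the bridge's code: `emitReadySpans`, `BufferedSpan`, and the flat `contexts` map are hypothetical stand-ins for tryEmitReadySpans and state.contexts, and the model/provider values are placeholders. A span is emitted only once its parent has a stored context, only emitted spans leave the buffer, and each stored context carries the model attributes its children inherit.

```typescript
// Simplified, hypothetical shapes; the real bridge types differ.
type SpanType = 'MODEL_GENERATION' | 'MODEL_STEP' | 'OTHER';

interface ModelAttrs {
  model?: string | undefined;
  provider?: string | undefined;
}

interface BufferedSpan extends ModelAttrs {
  id: string;
  parentId?: string | undefined;
  type: SpanType;
}

interface EmittedSpan extends ModelAttrs {
  id: string;
}

// contexts maps an emitted span id to the attrs its children inherit.
function emitReadySpans(
  buffer: Map<string, BufferedSpan>,
  contexts: Map<string, ModelAttrs>,
  out: EmittedSpan[],
): void {
  let progressed = true;
  // Repeat until a full pass emits nothing (fixed point), so chains of
  // late spans flush in one call.
  while (progressed) {
    progressed = false;
    for (const span of Array.from(buffer.values())) {
      let inherited: ModelAttrs = {};
      if (span.parentId !== undefined) {
        const parentAttrs = contexts.get(span.parentId);
        // Parent not emitted yet: keep the span buffered, do not drop it.
        if (parentAttrs === undefined) continue;
        inherited = parentAttrs;
      }
      const isGeneration = span.type === 'MODEL_GENERATION';
      const own: ModelAttrs = isGeneration
        ? { model: span.model, provider: span.provider }
        : {};
      // A MODEL_STEP renders with the attrs inherited from its ancestors.
      const emittedAttrs = span.type === 'MODEL_STEP' ? inherited : own;
      out.push({ id: span.id, ...emittedAttrs });
      // Store what descendants should inherit alongside the emitted span.
      contexts.set(span.id, isGeneration ? own : inherited);
      // Remove only the span that was actually emitted.
      buffer.delete(span.id);
      progressed = true;
    }
  }
}

const buffer = new Map<string, BufferedSpan>();
const contexts = new Map<string, ModelAttrs>();
const out: EmittedSpan[] = [];

// The step ends first, then the root; the generation is still in flight.
buffer.set('step', { id: 'step', parentId: 'gen', type: 'MODEL_STEP' });
buffer.set('root', { id: 'root', type: 'OTHER' });
emitReadySpans(buffer, contexts, out);
const stillBuffered = Array.from(buffer.keys());

// The generation arrives late; the buffered step can now follow it.
buffer.set('gen', {
  id: 'gen',
  parentId: 'root',
  type: 'MODEL_GENERATION',
  model: 'example-model',
  provider: 'example-provider',
});
emitReadySpans(buffer, contexts, out);
```

After the first call only the step remains buffered; after the second, spans have been emitted in the order root, gen, step, and the step carries the generation's model and provider.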
…rthand
"LLMObs" is a code-level shorthand (matches dd-trace's tracer.llmobs API namespace and env vars like DD_LLMOBS_ML_APP), but Datadog's user-facing product name is "LLM Observability". Update prose in both the new bridge docs and the existing exporter docs to use the full product name consistently so readers aren't left decoding an acronym.
https://claude.ai/code/session_01Q7w4QfZvEXyUvyY2y4XQe1
Actionable comments posted: 2
📒 Files selected for processing (2)
- observability/datadog/src/bridge.test.ts
- observability/datadog/src/bridge.ts
    switch (event.type) {
      case 'span_started':
        this.captureTraceContext(span);
        return;

      case 'span_updated':
        return;

      case 'span_ended':
        this.finishApmSpan(span);
        this.enqueueSpan(span);
Refresh root trace context on updates and end events.
traceContext is latched from the first root span_started payload and never updated. Because spans can be enriched later (span_updated exists, and span_ended carries the final metadata), a root that gets userId or sessionId after start will export the whole tree with stale or empty identifiers.
🐛 Proposed fix
       case 'span_started':
         this.captureTraceContext(span);
         return;
       case 'span_updated':
    -    return;
    +    this.captureTraceContext(span);
    +    return;
       case 'span_ended':
    +    this.captureTraceContext(span);
         this.finishApmSpan(span);
         this.enqueueSpan(span);
         return;

     private captureTraceContext(span: AnyExportedSpan): void {
    -  if (span.isRootSpan && !this.traceContext.has(span.traceId)) {
    -    this.traceContext.set(span.traceId, {
    -      userId: span.metadata?.userId,
    -      sessionId: span.metadata?.sessionId,
    -    });
    -  }
    +  if (!span.isRootSpan) return;
    +
    +  const existing = this.traceContext.get(span.traceId);
    +  this.traceContext.set(span.traceId, {
    +    userId: span.metadata?.userId ?? existing?.userId,
    +    sessionId: span.metadata?.sessionId ?? existing?.sessionId,
    +  });
     }
Also applies to: 408-414
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@observability/datadog/src/bridge.ts` around lines 354 - 364, The root
traceContext is only captured on 'span_started' and stays stale; update it when
a root span is enriched by calling captureTraceContext(span) for root spans in
the 'span_updated' branch and again before finishApmSpan(span) in the
'span_ended' branch so the latest user/session identifiers are used; apply the
same changes to the other occurrence around finishApmSpan/enqueueSpan (the
second switch block referenced near the other range) and ensure you only refresh
for root spans (use the same root detection logic used elsewhere).
    if (this.isDisabled || !(tracer as any).llmobs) return;

    if (tracer.llmobs?.flush) {
      try {
        await tracer.llmobs.flush();
        this.logger.debug('Datadog llmobs flushed');
      } catch (e) {
        this.logger.error('Error flushing llmobs', { error: e });
      }
    } else if ((tracer as any).flush) {
Don't return before the tracer.flush() fallback.
The !(tracer as any).llmobs guard makes the fallback branch unreachable, so flush() and shutdown() do nothing in the exact case the fallback is supposed to cover.
🐛 Proposed fix
     async flush(): Promise<void> {
    -  if (this.isDisabled || !(tracer as any).llmobs) return;
    +  if (this.isDisabled) return;
       if (tracer.llmobs?.flush) {
         try {
           await tracer.llmobs.flush();
📝 Committable suggestion
    -  if (this.isDisabled || !(tracer as any).llmobs) return;
    -  if (tracer.llmobs?.flush) {
    -    try {
    -      await tracer.llmobs.flush();
    -      this.logger.debug('Datadog llmobs flushed');
    -    } catch (e) {
    -      this.logger.error('Error flushing llmobs', { error: e });
    -    }
    -  } else if ((tracer as any).flush) {
    +async flush(): Promise<void> {
    +  if (this.isDisabled) return;
    +  if (tracer.llmobs?.flush) {
    +    try {
    +      await tracer.llmobs.flush();
    +      this.logger.debug('Datadog llmobs flushed');
    +    } catch (e) {
    +      this.logger.error('Error flushing llmobs', { error: e });
    +    }
    +  } else if ((tracer as any).flush) {
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@observability/datadog/src/bridge.ts` around lines 833 - 842, The early return
condition `if (this.isDisabled || !(tracer as any).llmobs) return;` prevents the
fallback `tracer.flush()`/`tracer.shutdown()` from ever running; change the
guard to only bail when `this.isDisabled` so the code can fall through to the
`tracer.llmobs` branch or the fallback branch. Concretely, update the checks in
the methods handling flushing/shutdown so they first `if (this.isDisabled)
return;` and then: if `tracer.llmobs?.flush` call `tracer.llmobs.flush()`, else
if `(tracer as any).flush` call `tracer.flush()` (and mirror the same pattern
for `shutdown`), preserving the existing try/catch and logging around
`tracer.llmobs` and the fallback.
🧹 Nitpick comments (1)
docs/src/content/en/reference/observability/tracing/bridges/datadog.mdx (1)
284-284: Optional wording polish for readability.
This line repeats “Tags” multiple times in close succession; a small rewrite would read more smoothly.
✍️ Suggested wording tweak
-Tags supplied via `tracingOptions.tags` are converted into structured LLM Observability annotation tags. Tags formatted as `key:value` are split into separate entries; tags without a colon are set with a `true` value.
+Values supplied via `tracingOptions.tags` are converted into structured LLM Observability annotation tags. Entries formatted as `key:value` are split into separate key/value pairs; entries without a colon are set to `true`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@docs/src/content/en/reference/observability/tracing/bridges/datadog.mdx` at line 284, Reword the sentence that begins "Tags supplied via `tracingOptions.tags`..." to improve readability and reduce repetition: rewrite it to a single clear sentence describing that entries in tracingOptions.tags become structured LLM Observability annotation tags, that items containing a colon (key:value) are split into separate key and value entries, and that items without a colon are interpreted as boolean true; update the sentence containing `tracingOptions.tags` and the examples of `key:value` and "true" to reflect this clearer phrasing.
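The tag conversion that this docs sentence describes could look roughly like the sketch below. The function name `toAnnotationTags` is hypothetical, and splitting on the first colon only is an assumption; the docs sentence does not say how a tag with several colons is handled.

```typescript
// Hypothetical helper illustrating the documented behavior:
// "key:value" tags become key/value pairs, bare tags are set to true.
function toAnnotationTags(tags: string[]): Record<string, string | boolean> {
  const result: Record<string, string | boolean> = {};
  for (const tag of tags) {
    const separator = tag.indexOf(':');
    if (separator === -1) {
      // No colon: treat the whole tag as a boolean flag.
      result[tag] = true;
    } else {
      // Assumption: split on the first colon only, so values may contain colons.
      result[tag.slice(0, separator)] = tag.slice(separator + 1);
    }
  }
  return result;
}

const annotationTags = toAnnotationTags(['env:prod', 'beta']);
```

Here `annotationTags` holds `env` set to `'prod'` and `beta` set to `true`.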
📒 Files selected for processing (4)
- docs/src/content/en/docs/observability/tracing/bridges/datadog.mdx
- docs/src/content/en/docs/observability/tracing/exporters/datadog.mdx
- docs/src/content/en/reference/observability/tracing/bridges/datadog.mdx
- docs/src/content/en/reference/observability/tracing/exporters/datadog.mdx
🚧 Files skipped from review as they are similar to previous changes (2)
- docs/src/content/en/docs/observability/tracing/exporters/datadog.mdx
- docs/src/content/en/docs/observability/tracing/bridges/datadog.mdx
Description
Introduces DatadogBridge, a new observability bridge that solves a critical issue with APM span parenting in Datadog integrations. The bridge creates native dd-trace APM spans eagerly during execution (rather than retroactively) so that auto-instrumented operations (HTTP requests, database queries, etc.) made by tools and processors have the correct parent span context.
Problem Solved
The existing DatadogExporter creates LLMObs spans retroactively after execution completes. This means when tools and processors make outbound calls, there is no active dd-trace span in scope, causing dd-trace's auto-instrumentation to fall back to the nearest active span (typically the request handler) instead of the actual parent span.
Solution
DatadogBridge uses a dual-API approach:
- tracer.startSpan() for eager APM span creation and activation in dd-trace's scope during execution
- tracer.llmobs.trace() for retroactive LLMObs annotation and export after spans complete
This ensures:
Key Features
Configuration
Type of Change
Checklist
https://claude.ai/code/session_01Q7w4QfZvEXyUvyY2y4XQe1
ELI5 Explanation
This PR makes Mastra create live Datadog APM spans while tasks run (instead of only reporting afterwards), so auto-instrumented work such as HTTP and DB calls is shown as a child of the correct Mastra task. It still emits LLM Observability annotations after spans finish, so Datadog receives the same LLM-specific metadata.
Overview
Adds DatadogBridge: an observability bridge that eagerly creates and activates native dd-trace APM spans during execution so dd-trace auto-instrumentation is parented under the correct Mastra span. It preserves retroactive LLM Observability emission by routing annotations through dd-trace’s llmobs API after spans complete. This resolves incorrect parenting that occurred when using the DatadogExporter alone.
DatadogBridge uses a dual API:
Key Features
Changes
New: observability/datadog/src/bridge.ts
New: observability/datadog/src/bridge.test.ts
Modified: observability/datadog/src/index.ts
Docs & Changeset
Commit fixes / notable behavior changes
Technical Notes
Review Impact