feat(logs): [SVLS-8582] Hold logs and add durable context to durable function logs (#1053)
## Summary
If the function is a durable function, then add two attributes to every
log:
- `lambda.durable_function.execution_id`
- `lambda.durable_function.execution_name`
## Background
- In previous PRs
(DataDog/datadog-lambda-python#728,
DataDog/datadog-lambda-js#730), the tracer was changed to add the
attributes `aws_lambda.durable_function.execution_id` and
`aws_lambda.durable_function.execution_name` to the `aws.lambda` span
## Details
### Data flow
1. `TraceAgent::handle_traces()` detects an `aws.lambda` span carrying
`request_id`, `durable_function_execution_id`, and
`durable_function_execution_name` in its meta tags
2. It sends a `ProcessorCommand::ForwardDurableContext { request_id,
execution_id, execution_name }` to `InvocationProcessorService`
3. `Processor::forward_durable_context()` in the lifecycle processor
relays this as a `DurableContextUpdate` to the logs pipeline via an mpsc
channel, using `send().await` to guarantee delivery
4. `LogsAgent::spin()` receives the update and calls
`LogsProcessor::process_durable_context_update()`, which inserts the
entry into `LambdaProcessor::durable_context_map` and drains any held
logs for that `request_id`
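
The relay in steps 2 to 4 can be sketched as follows. This is a minimal illustration, not the extension's code: it uses the standard library's bounded `sync_channel` as a stand-in for the async mpsc channel, and the `Relay` type and `durable_context_channel` helper are hypothetical names introduced here. Only `DurableContextUpdate` and its three fields come from the description above.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender};

/// Message relayed from the trace agent to the logs pipeline.
#[derive(Debug, Clone, PartialEq)]
pub struct DurableContextUpdate {
    pub request_id: String,
    pub execution_id: String,
    pub execution_name: String,
}

/// Hypothetical stand-in for the lifecycle processor's relay step.
pub struct Relay {
    tx: SyncSender<DurableContextUpdate>,
}

impl Relay {
    /// Blocks until the update is queued (the synchronous analogue of
    /// `send().await`). Returns false only if the receiver was dropped.
    pub fn forward_durable_context(
        &self,
        request_id: &str,
        execution_id: &str,
        execution_name: &str,
    ) -> bool {
        self.tx
            .send(DurableContextUpdate {
                request_id: request_id.to_string(),
                execution_id: execution_id.to_string(),
                execution_name: execution_name.to_string(),
            })
            .is_ok()
    }
}

/// Assumed helper: builds the bounded channel between the two pipelines.
pub fn durable_context_channel(capacity: usize) -> (Relay, Receiver<DurableContextUpdate>) {
    let (tx, rx) = sync_channel(capacity);
    (Relay { tx }, rx)
}
```

A blocking (or awaited) send on a bounded channel applies backpressure instead of dropping updates, which is presumably why the PR uses `send().await` rather than a non-blocking `try_send`.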
### Log holding and draining
- After a cold start, the logs processor holds all incoming logs without
flushing them, because it does not yet know whether this is a durable
function
- Held logs are stored in `held_logs: HashMap<String, Vec<IntakeLog>>`,
keyed by `request_id`
- Logs without a `request_id` (e.g. in managed instance mode) are pushed
directly to `ready_logs` and never held, since they cannot carry durable
context
- `durable_context_map: HashMap<String, DurableExecutionContext>` maps
`request_id` to `(execution_id, execution_name)`. It has a fixed
capacity (500 entries) with FIFO eviction
- When the logs processor receives a `PlatformInitStart` event, it
learns whether the function is a durable function:
  - If **not** a durable function: drain all held logs (mark them ready
  for aggregation and flush)
  - If **durable**: drain held logs whose `request_id` is already in
  `durable_context_map` (tag them with
  `lambda.durable_function.execution_id` and
  `lambda.durable_function.execution_name`); keep the rest held until
  their context arrives
- When an entry is inserted into `durable_context_map`, any held logs
for that `request_id` are drained immediately
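
The hold-and-drain rules above can be sketched roughly as below. This is a simplified model under stated assumptions: `HoldingProcessor`, `Log`, `push_log`, `on_init_start`, and `on_durable_context` are hypothetical names, the real `IntakeLog` carries more fields, and capacity limits are left out. Only the map names, `DurableExecutionContext`, and the two attribute keys are taken from the description.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, PartialEq)]
pub struct DurableExecutionContext {
    pub execution_id: String,
    pub execution_name: String,
}

/// Simplified stand-in for `IntakeLog` (assumption: the real type differs).
#[derive(Debug, Clone, PartialEq)]
pub struct Log {
    pub message: String,
    pub request_id: Option<String>,
    pub tags: Vec<(String, String)>,
}

#[derive(Default)]
pub struct HoldingProcessor {
    is_durable: Option<bool>, // None until the init event has been seen
    held_logs: HashMap<String, Vec<Log>>,
    durable_context_map: HashMap<String, DurableExecutionContext>,
    pub ready_logs: Vec<Log>,
}

/// Attach the two durable attributes to a log.
fn tag(mut log: Log, ctx: &DurableExecutionContext) -> Log {
    log.tags.push(("lambda.durable_function.execution_id".to_string(), ctx.execution_id.clone()));
    log.tags.push(("lambda.durable_function.execution_name".to_string(), ctx.execution_name.clone()));
    log
}

impl HoldingProcessor {
    pub fn push_log(&mut self, log: Log) {
        let request_id = match log.request_id.clone() {
            Some(id) => id,
            None => {
                // No request_id: cannot carry durable context, never held.
                self.ready_logs.push(log);
                return;
            }
        };
        match (self.is_durable, self.durable_context_map.get(&request_id).cloned()) {
            (Some(false), _) => self.ready_logs.push(log),
            (Some(true), Some(ctx)) => self.ready_logs.push(tag(log, &ctx)),
            // Durability unknown, or context not yet received: hold.
            _ => self.held_logs.entry(request_id).or_default().push(log),
        }
    }

    /// Called once the init event reveals whether the function is durable.
    pub fn on_init_start(&mut self, is_durable: bool) {
        self.is_durable = Some(is_durable);
        if !is_durable {
            // Order across request ids is unspecified in this sketch.
            for (_, logs) in self.held_logs.drain() {
                self.ready_logs.extend(logs);
            }
            return;
        }
        let known: Vec<String> = self
            .held_logs
            .keys()
            .filter(|id| self.durable_context_map.contains_key(*id))
            .cloned()
            .collect();
        for id in known {
            self.drain_request(&id);
        }
    }

    /// Insert context, then drain any logs held for that request.
    pub fn on_durable_context(&mut self, request_id: &str, ctx: DurableExecutionContext) {
        self.durable_context_map.insert(request_id.to_string(), ctx);
        self.drain_request(request_id);
    }

    fn drain_request(&mut self, request_id: &str) {
        let ctx = match self.durable_context_map.get(request_id) {
            Some(ctx) => ctx.clone(),
            None => return, // keep logs held until context arrives
        };
        if let Some(logs) = self.held_logs.remove(request_id) {
            self.ready_logs.extend(logs.into_iter().map(|log| tag(log, &ctx)));
        }
    }
}
```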
### Memory safety and resilience
- `held_logs` is capped at **50 keys** (intentionally small — see
below). Insertion order is tracked in `held_logs_order:
VecDeque<String>` for FIFO eviction
- When `held_logs` is at capacity and a new `request_id` arrives, the
**oldest key is evicted**: its logs are serialized and pushed to
`ready_logs` without durable context tags. This ensures logs are always
eventually sent to Datadog even if the tracer is not installed and
context never arrives
- The cap is kept small (50) to limit the size of the batch flushed at
shutdown, reducing the risk of the final flush timing out when held logs
are drained without durable context
- At **shutdown**, after draining the telemetry channel:
  1. The `durable_context_rx` channel is drained to apply any pending
  context updates, maximising the chance logs are decorated before
  flushing
  2. All remaining `held_logs` are drained to `ready_logs` without durable
  context tags, so no logs are lost
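
The capped, FIFO-evicting store described above could look roughly like this. It is a sketch under assumptions: logs are modelled as plain strings, `HeldLogs`, `hold`, and `drain_all` are hypothetical names, and `max_keys` is assumed to be at least 1 (the PR uses 50). Evicted and shutdown-drained logs go to the ready queue untagged, as the description states.

```rust
use std::collections::{HashMap, VecDeque};

pub struct HeldLogs {
    max_keys: usize,
    held: HashMap<String, Vec<String>>,
    order: VecDeque<String>, // request ids in first-seen order
    pub ready: Vec<String>,
}

impl HeldLogs {
    pub fn new(max_keys: usize) -> Self {
        HeldLogs {
            max_keys,
            held: HashMap::new(),
            order: VecDeque::new(),
            ready: Vec::new(),
        }
    }

    /// Hold a log; if a new request id would exceed the cap, the oldest
    /// id's logs are released first, without durable context tags.
    pub fn hold(&mut self, request_id: &str, log: String) {
        if !self.held.contains_key(request_id) {
            if self.held.len() >= self.max_keys {
                self.evict_oldest();
            }
            self.order.push_back(request_id.to_string());
        }
        self.held.entry(request_id.to_string()).or_default().push(log);
    }

    fn evict_oldest(&mut self) {
        // Skip ids whose logs were already removed by other means.
        while let Some(oldest) = self.order.pop_front() {
            if let Some(logs) = self.held.remove(&oldest) {
                self.ready.extend(logs);
                return;
            }
        }
    }

    /// Shutdown path: release everything still held, oldest id first.
    pub fn drain_all(&mut self) {
        while let Some(id) = self.order.pop_front() {
            if let Some(logs) = self.held.remove(&id) {
                self.ready.extend(logs);
            }
        }
    }
}
```

Tracking insertion order in a separate `VecDeque` keeps eviction at amortized constant time, at the cost of tolerating stale ids in the queue, which the eviction loop skips.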
### Types
- `DurableContextUpdate { request_id, execution_id, execution_name }` —
message sent from trace agent through lifecycle processor to logs
pipeline
- `DurableExecutionContext { execution_id, execution_name }` — value
type stored in `durable_context_map`
## Test plan
### Manual test
#### Steps
Build a layer, install it on a function, and invoke it.
#### Result
1. In Datadog, all the logs for this durable execution have the two new
attributes
<img width="734" height="421" alt="image"
src="https://github.com/user-attachments/assets/173e3be2-8bb1-4e08-be97-521c63679bf1"
/>
2. The logs query
> source:lambda
@lambda.arn:"arn:aws:lambda:us-east-2:425362996713:function:yiming-durable-py-custom-tracer"
@lambda.durable_function.execution_name:c949fb3d-a8f5-4ae6-a802-b1458149a4b2
returns all the logs for two invocations of this durable execution. It
returns 98 logs, equal to 49 logs for the first invocation + 49 logs for
the second invocation. ([query
link](https://ddserverless.datadoghq.com/logs?query=source%3Alambda%20%40lambda.arn%3A%22arn%3Aaws%3Alambda%3Aus-east-2%3A425362996713%3Afunction%3Ayiming-durable%22%20%40lambda.durable_execution_name%3A2f492839-75df-4acb-9f2a-30b1b36d5c8f&agg_m=count&agg_m_source=base&agg_t=count&cols=host%2Cservice&fromUser=true&messageDisplay=inline&refresh_mode=paused&storage=flex_tier&stream_sort=time%2Cdesc&viz=stream&from_ts=1772655235033&to_ts=1772655838416&live=false))
### Unit tests
The added unit tests pass.
---------
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>