Skip to content

feat(discovery): Rust-to-Go logging bridge for libdd_discovery#49395

Draft
martavicentenavarro wants to merge 2 commits intomainfrom
marta/logs_spl
Draft

feat(discovery): Rust-to-Go logging bridge for libdd_discovery#49395
martavicentenavarro wants to merge 2 commits intomainfrom
marta/logs_spl

Conversation

@martavicentenavarro
Copy link
Copy Markdown
Contributor

@martavicentenavarro martavicentenavarro commented Apr 15, 2026

What does this PR do?

Forwards Rust log records from libdd_discovery to Go's pkg/util/log so they appear in the same log files as the rest of the agent, with the same format and level filtering.

Two new Go files are added under pkg/discovery/module/:

  • impl_services_rust_logger_linux.go — exports goDiscoveryLogCallback to C. Called by the Rust library for each log record; routes by level to log.Errorf/Warnf/Infof/Debugf/Tracef with a [dd_discovery] prefix.
  • impl_services_rust_init_linux.go — registers the callback at package init() time via dd_discovery_init_logger.

On the Rust side (pkg/discovery/module/rust/src/ffi.rs), a GoLogger struct implements log::Log and forwards every record to the registered callback stored in a OnceLock. log::set_max_level(LevelFilter::Trace) is set so all records reach the Go callback; Go's own level filter then decides what is actually emitted. The exported C function dd_discovery_init_logger is idempotent: only the first call has effect, so concurrent init() calls at start-up are safe.

dd_discovery.h gains the dd_log_fn typedef and the dd_discovery_init_logger declaration. dd_log_fn is explicitly added to cbindgen's [export] include list (type aliases require explicit listing); dd_discovery_init_logger is auto-exported by cbindgen as a pub extern "C" function.

Motivation

Without this bridge, Rust log output from libdd_discovery is silently discarded when the library is linked into system-probe via CGo. Operators have no visibility into what the library is doing, which makes debugging in production impossible.

This follows the same pattern already used in the repo for the Python runtime (pkg/collector/python), where LogMessage is exported to C and registered via set_log_cb at initialisation time.

Describe how you validated your changes

Manually verified that log records emitted by the Rust library appear in the agent log output at the expected levels.

The existing unit tests in ffi.rs cover the pre-existing FFI layer (services_response_to_result, dd_discovery_free, pids_from_c). The new logger bridge (GoLogger, dd_discovery_init_logger) does not have dedicated unit tests: testing dd_discovery_init_logger in a Rust unit test would permanently mutate the process-global log::logger static, which would interfere with other tests in the same binary.

Additional Notes

The two-file split (impl_services_rust_logger_linux.go + impl_services_rust_init_linux.go) is required by a CGo constraint: a file containing //export cannot have C function definitions in its preamble, only declarations. The static helper getGoLogCallback() (which returns the function pointer) must therefore live in a separate file.

@martavicentenavarro martavicentenavarro added changelog/no-changelog No changelog entry needed qa/done QA done before merge and regressions are covered by tests team/agent-discovery labels Apr 15, 2026
@dd-octo-sts dd-octo-sts bot added the internal Identify a non-fork PR label Apr 15, 2026
@github-actions github-actions bot added the medium review PR review might take time label Apr 15, 2026
@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts bot commented Apr 15, 2026

Files inventory check summary

File checks results against ancestor 56888f8d:

Results for datadog-agent_7.79.0~devel.git.755.79bc052.pipeline.108251497-1_amd64.deb:

No change detected

@cit-pr-commenter-54b7da
Copy link
Copy Markdown

cit-pr-commenter-54b7da bot commented Apr 15, 2026

Regression Detector

Regression Detector Results

Metrics dashboard
Target profiles
Run ID: 9169dbe4-46d4-4d41-a55d-727749392f0e

Baseline: 56888f8
Comparison: 79bc052
Diff

Optimization Goals: ✅ No significant changes detected

Experiments ignored for regressions

Regressions in experiments with settings containing erratic: true are ignored.

perf experiment goal Δ mean % Δ mean % CI trials links
docker_containers_cpu % cpu utilization -1.78 [-4.77, +1.20] 1 Logs

Fine details of change detection per experiment

perf experiment goal Δ mean % Δ mean % CI trials links
tcp_syslog_to_blackhole ingress throughput +0.56 [+0.39, +0.73] 1 Logs
quality_gate_metrics_logs memory utilization +0.46 [+0.22, +0.70] 1 Logs bounds checks dashboard
ddot_metrics_sum_cumulative memory utilization +0.35 [+0.19, +0.50] 1 Logs
ddot_metrics_sum_cumulativetodelta_exporter memory utilization +0.28 [+0.05, +0.50] 1 Logs
ddot_logs memory utilization +0.25 [+0.18, +0.31] 1 Logs
quality_gate_logs % cpu utilization +0.19 [-1.45, +1.83] 1 Logs bounds checks dashboard
quality_gate_idle_all_features memory utilization +0.15 [+0.11, +0.18] 1 Logs bounds checks dashboard
otlp_ingest_logs memory utilization +0.08 [-0.02, +0.17] 1 Logs
otlp_ingest_metrics memory utilization +0.05 [-0.11, +0.20] 1 Logs
file_to_blackhole_0ms_latency egress throughput +0.05 [-0.49, +0.58] 1 Logs
ddot_metrics_sum_delta memory utilization +0.04 [-0.14, +0.22] 1 Logs
uds_dogstatsd_to_api_v3 ingress throughput +0.01 [-0.19, +0.21] 1 Logs
tcp_dd_logs_filter_exclude ingress throughput -0.00 [-0.11, +0.11] 1 Logs
uds_dogstatsd_to_api ingress throughput -0.01 [-0.22, +0.20] 1 Logs
quality_gate_idle memory utilization -0.02 [-0.07, +0.03] 1 Logs bounds checks dashboard
file_to_blackhole_100ms_latency egress throughput -0.02 [-0.12, +0.08] 1 Logs
file_to_blackhole_500ms_latency egress throughput -0.05 [-0.46, +0.35] 1 Logs
file_to_blackhole_1000ms_latency egress throughput -0.07 [-0.51, +0.37] 1 Logs
uds_dogstatsd_20mb_12k_contexts_20_senders memory utilization -0.15 [-0.21, -0.09] 1 Logs
docker_containers_memory memory utilization -0.17 [-0.26, -0.09] 1 Logs
ddot_metrics memory utilization -0.22 [-0.40, -0.05] 1 Logs
file_tree memory utilization -0.30 [-0.35, -0.24] 1 Logs
docker_containers_cpu % cpu utilization -1.78 [-4.77, +1.20] 1 Logs

Bounds Checks: ❌ Failed

perf experiment bounds_check_name replicates_passed observed_value links
docker_containers_cpu simple_check_run 10/10 583 ≥ 26
docker_containers_memory memory_usage 10/10 275.85MiB ≤ 370MiB
docker_containers_memory simple_check_run 10/10 677 ≥ 26
file_to_blackhole_0ms_latency memory_usage 10/10 0.19GiB ≤ 1.20GiB
file_to_blackhole_0ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_1000ms_latency memory_usage 10/10 0.24GiB ≤ 1.20GiB
file_to_blackhole_1000ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_100ms_latency memory_usage 10/10 0.20GiB ≤ 1.20GiB
file_to_blackhole_100ms_latency missed_bytes 10/10 0B = 0B
file_to_blackhole_500ms_latency memory_usage 10/10 0.22GiB ≤ 1.20GiB
file_to_blackhole_500ms_latency missed_bytes 10/10 0B = 0B
quality_gate_idle intake_connections 0/10 4 > 3 bounds checks dashboard
quality_gate_idle memory_usage 10/10 173.77MiB ≤ 181MiB bounds checks dashboard
quality_gate_idle_all_features intake_connections 2/10 4 > 3 bounds checks dashboard
quality_gate_idle_all_features memory_usage 10/10 500.12MiB ≤ 550MiB bounds checks dashboard
quality_gate_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_logs memory_usage 10/10 208.61MiB ≤ 220MiB bounds checks dashboard
quality_gate_logs missed_bytes 10/10 0B = 0B bounds checks dashboard
quality_gate_metrics_logs cpu_usage 10/10 351.70 ≤ 2000 bounds checks dashboard
quality_gate_metrics_logs intake_connections 10/10 4 ≤ 6 bounds checks dashboard
quality_gate_metrics_logs memory_usage 10/10 418.25MiB ≤ 475MiB bounds checks dashboard
quality_gate_metrics_logs missed_bytes 10/10 0B = 0B bounds checks dashboard

Explanation

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

Performance changes are noted in the perf column of each table:

  • ✅ = significantly better comparison variant performance
  • ❌ = significantly worse comparison variant performance
  • ➖ = no significant change in performance

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

  1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.

  2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.

  3. Its configuration does not mark it "erratic".

CI Pass/Fail Decision

Failed. Some Quality Gates were violated.

  • quality_gate_idle_all_features, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle_all_features, bounds check intake_connections: 2/10 replicas passed. Failed 8 which is > 0. Gate FAILED.
  • quality_gate_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check intake_connections: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check cpu_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check missed_bytes: 10/10 replicas passed. Gate passed.
  • quality_gate_metrics_logs, bounds check memory_usage: 10/10 replicas passed. Gate passed.
  • quality_gate_idle, bounds check intake_connections: 0/10 replicas passed. Failed 10 which is > 0. Gate FAILED.
  • quality_gate_idle, bounds check memory_usage: 10/10 replicas passed. Gate passed.

@martavicentenavarro martavicentenavarro force-pushed the marta/logs_spl branch 2 times, most recently from c1ec169 to 70710a7 Compare April 16, 2026 15:29
@martavicentenavarro
Copy link
Copy Markdown
Contributor Author

@codex review

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4da33278d6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread pkg/discovery/module/rust/src/ffi.rs Outdated
);
return;
}
log::set_max_level(log::LevelFilter::Trace);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor configured log level before forwarding Rust records

Setting log::set_max_level(log::LevelFilter::Trace) here causes the Rust log facade to build and dispatch every debug!/trace! record, and GoLogger::enabled returns true, so filtering only happens later in Go after string allocation and callback crossing. In discovery hot paths (for example trace! in procfs/fd.rs:get_log_files), this introduces avoidable per-record overhead even when the agent is running at info level, which is a production CPU regression. The Rust max level should be derived from the effective agent log level (and updated when that level changes) rather than hardcoded to Trace.

Useful? React with 👍 / 👎.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. dd_discovery_init_logger now accepts the initial Go log level as a uint32_t parameter and passes it to log::set_max_level, so Rust discards records below the configured threshold before formatting or crossing the FFI.

Runtime level changes are not yet propagated (would require a separate dd_discovery_set_log_level FFI call hooked into ChangeLogLevel) , tracking that as a follow-up.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a Rust→Go logging bridge for libdd_discovery so Rust log records emitted inside the discovery static library are forwarded into the agent’s standard Go logger (pkg/util/log) with consistent formatting/level handling.

Changes:

  • Implement a Rust log::Log backend (GoLogger) that forwards records to a Go callback registered via an idempotent dd_discovery_init_logger.
  • Extend the generated C header (dd_discovery.h) and cbindgen config to expose the callback typedef and init function.
  • Add Linux-only Go/cgo glue to export goDiscoveryLogCallback and register it during package initialization.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
pkg/discovery/module/rust/src/ffi.rs Adds Rust-side logger implementation, callback storage, and dd_discovery_init_logger FFI export.
pkg/discovery/module/rust/include/dd_discovery.h Exposes dd_log_fn typedef and dd_discovery_init_logger declaration for consumers.
pkg/discovery/module/rust/cbindgen.toml Ensures the new callback typedef is included in generated headers.
pkg/discovery/module/impl_services_rust_logger_linux.go Exports the Go callback and maps Rust log levels into pkg/util/log.
pkg/discovery/module/impl_services_rust_init_linux.go Registers the callback at init-time and links against the Rust static library.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/discovery/module/rust/src/ffi.rs Outdated
Comment on lines +353 to +370
pub unsafe extern "C" fn dd_discovery_init_logger(callback: dd_log_fn) {
// SAFETY: Wrapping in catch_unwind prevents a Rust panic from unwinding
// across the C ABI boundary, which would be undefined behaviour.
let _ = panic::catch_unwind(AssertUnwindSafe(|| {
if LOGGER_CALLBACK.set(callback).is_err() {
return;
}
// `set_logger` returns Err if another logger was already registered
// (e.g. in a test harness). In that case LOGGER_CALLBACK is set but
// GoLogger is not installed, so all Rust log records would be silently
// dropped. Surface the conflict so it is not invisible to the caller.
if log::set_logger(&GO_LOGGER).is_err() {
eprintln!(
"[dd_discovery] dd_discovery_init_logger: a Rust logger was already \
registered; Rust log records will not be forwarded to the Go logger"
);
return;
}
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dd_discovery_init_logger returns void and, when log::set_logger fails (logger already registered), it only prints a message to stderr. In the system-probe context this can be invisible to operators, leaving Rust logs silently dropped with no structured signal to the caller. Consider changing the C ABI to return a status code (success / already-initialized / logger-conflict) so Go can log/report it via pkg/util/log, and/or use the provided callback to emit this error into the Go logger instead of eprintln!.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Replaced the eprintln! with a direct call to the provided callback at error level (1), so the conflict is reported through the Go structured logger instead of stderr.

Comment on lines +114 to +115
void dd_discovery_init_logger(dd_log_fn callback);

Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The exported API dd_discovery_init_logger currently returns void, so callers cannot programmatically detect whether logger installation succeeded (e.g., if a Rust logger was already registered). Since the Rust implementation can fall back to stderr today, consider updating this declaration to return an explicit status code so Go callers can surface failures in agent logs and handle them deterministically.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stderr fallback is gone due to the fix in the previous comment.

Comment on lines +23 to +30
func goDiscoveryLogCallback(level C.uint32_t, msg *C.char, msgLen C.size_t) {
// Reject nil pointers, empty messages (nothing useful to log), and messages
// larger than MaxInt32 to prevent overflow when casting msgLen to C.int below.
if msg == nil || msgLen == 0 || msgLen > math.MaxInt32 {
return
}
message := C.GoStringN(msg, C.int(msgLen))
// Level mapping mirrors dd_log_fn in dd_discovery.h: 1=Error, 2=Warn, 3=Info, 4=Debug, 5+=Trace.
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

goDiscoveryLogCallback always converts the C buffer to a Go string (C.GoStringN) before checking whether the target Go log level is enabled. This incurs an allocation/copy even when the agent is configured to drop that level (e.g., Trace/Debug). Consider mapping the incoming level to pkg/util/log's LogLevel and using log.ShouldLog(...) to early-return before calling C.GoStringN.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Added a log.ShouldLog() check before C.GoStringN so the string allocation is skipped entirely for records below the configured level.

Comment on lines +312 to +372
impl log::Log for GoLogger {
fn enabled(&self, _metadata: &log::Metadata) -> bool {
true // All records pass through; Go-side filtering applies.
}

fn log(&self, record: &log::Record) {
let Some(&callback) = LOGGER_CALLBACK.get() else {
return;
};
let level: u32 = match record.level() {
log::Level::Error => 1,
log::Level::Warn => 2,
log::Level::Info => 3,
log::Level::Debug => 4,
log::Level::Trace => 5,
};
let message = record.args().to_string();
// SAFETY: `callback` was registered by the Go runtime and is valid for
// the process lifetime. `message` is live for the duration of this call.
unsafe { callback(level, message.as_ptr() as *const c_char, message.len()) };
}

fn flush(&self) {}
}

static GO_LOGGER: GoLogger = GoLogger;

/// Register a Go logging callback. Idempotent: only the first call has effect;
/// subsequent calls (e.g. from racing goroutines at start-up) are ignored.
///
/// All log levels are forwarded to the callback; the Go side applies its own
/// level filter so no records are emitted above the configured level.
///
/// # Panic safety
/// Wrapped in `catch_unwind` following the convention in this file: unwinding
/// across a C ABI boundary is undefined behaviour per the Rust nomicon.
///
/// # Safety
/// `callback` must be a valid function pointer that remains valid for the
/// lifetime of the process.
#[unsafe(no_mangle)]
pub unsafe extern "C" fn dd_discovery_init_logger(callback: dd_log_fn) {
// SAFETY: Wrapping in catch_unwind prevents a Rust panic from unwinding
// across the C ABI boundary, which would be undefined behaviour.
let _ = panic::catch_unwind(AssertUnwindSafe(|| {
if LOGGER_CALLBACK.set(callback).is_err() {
return;
}
// `set_logger` returns Err if another logger was already registered
// (e.g. in a test harness). In that case LOGGER_CALLBACK is set but
// GoLogger is not installed, so all Rust log records would be silently
// dropped. Surface the conflict so it is not invisible to the caller.
if log::set_logger(&GO_LOGGER).is_err() {
eprintln!(
"[dd_discovery] dd_discovery_init_logger: a Rust logger was already \
registered; Rust log records will not be forwarded to the Go logger"
);
return;
}
log::set_max_level(log::LevelFilter::Trace);
}));
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dd_discovery_init_logger sets the Rust global max level to Trace and GoLogger.enabled() always returns true. This forces every Rust log call (including Trace) to cross the FFI boundary and allocate/format even when Go is configured to drop that level, which can be a significant overhead in hot paths. Consider accepting a max level from Go (or another mechanism to reflect Go’s current level) and setting log::set_max_level accordingly, and/or implementing enabled() to filter by that level before formatting/forwarding.

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed based on a previous comment.

@github-actions github-actions bot added long review PR is complex, plan time to review it and removed medium review PR review might take time labels Apr 17, 2026
- Pass the current Go log level to dd_discovery_init_logger so Rust sets
  log::set_max_level accordingly, discarding records below the configured
  threshold before formatting or crossing the FFI boundary.
- Add log.ShouldLog() guard in goDiscoveryLogCallback before C.GoStringN
  to skip the string allocation for records that will be dropped.
- Replace eprintln! with a direct callback invocation at error level when
  log::set_logger fails, so the conflict reaches the Go structured logger
  instead of being lost to stderr.

Runtime propagation of log level changes is left as a follow-up.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

changelog/no-changelog No changelog entry needed component/system-probe internal Identify a non-fork PR long review PR is complex, plan time to review it qa/done QA done before merge and regressions are covered by tests team/agent-discovery

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants