
[processor/memorylimiter] Fix degenerate GC loop when exporter has problems#15053

Open
Krishnachaitanyakc wants to merge 5 commits into open-telemetry:main from Krishnachaitanyakc:fix/issue-4981

Conversation

@Krishnachaitanyakc

Description

When a downstream exporter target crashes, the memory limiter processor can enter a permanent degenerate state where forced GC runs on every check interval tick. Since min_gc_interval_when_hard_limited defaults to 0, runtime.GC() fires on every tick when memory exceeds the hard limit. The GC cannot reclaim memory held by live references in exporter queues and retry goroutines, so each cycle wastes CPU without freeing memory. This CPU starvation prevents exporter consumer goroutines from draining the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking with exponential backoff:

  • After each forced GC, compare memory before and after. If GC reclaimed less than 5%, it is considered "ineffective."
  • On consecutive ineffective GCs, the interval between forced GCs doubles exponentially (capped at 2 minutes).
  • This preserves CPU for the exporter consumer goroutines to drain their queues, enabling recovery.
  • The backoff resets when memory drops below the soft limit (i.e., the system has recovered).

Additionally, the default min_gc_interval_when_hard_limited is changed from 0 (no minimum interval, meaning GC on every tick) to 10 seconds, consistent with the existing min_gc_interval_when_soft_limited default.

Link to tracking issue

Fixes #4981

Testing

  • Updated existing TestCallGCWhenSoftLimit to simulate effective GC (so the new backoff logic doesn't interfere with the interval assertions)
  • Added TestGCBackoffWhenIneffective — verifies GC backs off when it cannot reclaim memory (reproduces the scenario from #4981, "Degenerate collector performance when exporter has problems")
  • Added TestGCBackoffResetOnRecovery — verifies backoff resets when memory drops below soft limit
  • Added TestEffectiveGCInterval — unit tests for the exponential backoff calculation with cap
  • All existing tests pass (internal/memorylimiter, processor/memorylimiterprocessor, extension/memorylimiterextension)
  • TestNoDataLoss integration test continues to pass (real memory pressure with MockExporter/MockReceiver)

Documentation

Changelog entry added via .chloggen/fix-degenerate-gc-loop-4981.yaml.

…oblems

When a downstream exporter target crashes, the memory limiter could enter
a permanent degenerate state where forced GC runs on every check interval
tick. The GC cannot reclaim memory held by exporter queues and retry
goroutines, so each GC cycle wastes CPU without freeing any memory. This
CPU starvation prevents the exporter consumer goroutines from draining
the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking: after each forced GC, the memory
limiter compares memory usage before and after. If GC reclaimed less than
5%, the interval between forced GCs is doubled exponentially (capped at
2 minutes). This preserves CPU for queue draining and enables recovery.
The backoff resets when memory drops below the soft limit.

Also changes the default min_gc_interval_when_hard_limited from 0 (GC on
every tick) to 10 seconds, consistent with the soft limit default.

Fixes open-telemetry#4981

Assisted-by: Claude Opus 4.6
P1: Revert the default min_gc_interval_when_hard_limited change. Setting it
to 10s broke existing configs where min_gc_interval_when_soft_limited is
set to a value smaller than 10s (e.g., 1s) — the validator rejects configs
where the soft-limit interval is smaller than the hard-limit interval,
causing startup failures on deployed collectors.

P2: Add lastAllocAfterGC tracking to detect recovery before the backoff
expires. After an exporter recovers and its queue drains, garbage exists
but Alloc stays high until GC runs. With a large backoff, forced GC was
delayed while the heap was already reclaimable. Now, on each check, if
Alloc has dropped >5% vs lastAllocAfterGC, the backoff resets and a
forced GC fires promptly. Also reduced maxGCBackoffInterval from 2min
to 30s for faster recovery detection.

Assisted-by: Claude Opus 4.6
…ecovery

P2a: The maxGCBackoffInterval cap (30s) could be shorter than the user's
configured min_gc_interval, causing GC to fire more often during outages
than the user intended. Now the cap is max(maxGCBackoffInterval,
baseInterval), so a configured 60s minimum is always respected.

P2b: A GC that freed <5% of memory but brought usage below the soft
limit was incorrectly marked as ineffective. This left stale backoff
state that throttled the next pressure event. Now a GC is considered
effective if it either reclaimed >=5% OR resolved the pressure (below
soft limit), preventing stale backoff from carrying across recovery
boundaries.

Assisted-by: Claude Opus 4.6
@Krishnachaitanyakc Krishnachaitanyakc marked this pull request as ready for review April 3, 2026 00:26
@Krishnachaitanyakc Krishnachaitanyakc requested a review from a team as a code owner April 3, 2026 00:26
@Krishnachaitanyakc Krishnachaitanyakc requested a review from jmacd April 3, 2026 00:26
@Krishnachaitanyakc
Author

@jmacd @bogdandrutu could you review this memory-limiter fix? It changes forced-GC behavior in internal/memorylimiter to back off when GC is ineffective and adds tests for recovery/backoff paths.
