
[processor/memorylimiter] Fix degenerate GC loop when exporter has problems#15053

Open
Krishnachaitanyakc wants to merge 5 commits into open-telemetry:main from Krishnachaitanyakc:fix/issue-4981

Conversation

@Krishnachaitanyakc

Description

When a downstream exporter target crashes, the memory limiter processor can enter a permanent degenerate state where forced GC runs on every check interval tick. Since min_gc_interval_when_hard_limited defaults to 0, runtime.GC() fires on every tick when memory exceeds the hard limit. The GC cannot reclaim memory held by live references in exporter queues and retry goroutines, so each cycle wastes CPU without freeing memory. This CPU starvation prevents exporter consumer goroutines from draining the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking with exponential backoff:

  • After each forced GC, compare memory before and after. If GC reclaimed less than 5%, it is considered "ineffective."
  • On consecutive ineffective GCs, the interval between forced GCs doubles exponentially (capped at 2 minutes).
  • This preserves CPU for the exporter consumer goroutines to drain their queues, enabling recovery.
  • The backoff resets when memory drops below the soft limit (i.e., the system has recovered).

Additionally, the default min_gc_interval_when_hard_limited is changed from 0 (no minimum interval, meaning GC on every tick) to 10 seconds, consistent with the existing min_gc_interval_when_soft_limited default.

Link to tracking issue

Fixes #4981

Testing

  • Updated existing TestCallGCWhenSoftLimit to simulate effective GC (so the new backoff logic doesn't interfere with the interval assertions)
  • Added TestGCBackoffWhenIneffective — verifies GC backs off when it cannot reclaim memory (reproduces the scenario from #4981, "Degenerate collector performance when exporter has problems")
  • Added TestGCBackoffResetOnRecovery — verifies backoff resets when memory drops below soft limit
  • Added TestEffectiveGCInterval — unit tests for the exponential backoff calculation with cap
  • All existing tests pass (internal/memorylimiter, processor/memorylimiterprocessor, extension/memorylimiterextension)
  • TestNoDataLoss integration test continues to pass (real memory pressure with MockExporter/MockReceiver)

Documentation

Changelog entry added via .chloggen/fix-degenerate-gc-loop-4981.yaml.

…oblems

When a downstream exporter target crashes, the memory limiter could enter
a permanent degenerate state where forced GC runs on every check interval
tick. The GC cannot reclaim memory held by exporter queues and retry
goroutines, so each GC cycle wastes CPU without freeing any memory. This
CPU starvation prevents the exporter consumer goroutines from draining
the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking: after each forced GC, the memory
limiter compares memory usage before and after. If GC reclaimed less than
5%, the interval between forced GCs is doubled exponentially (capped at
2 minutes). This preserves CPU for queue draining and enables recovery.
The backoff resets when memory drops below the soft limit.

Also changes the default min_gc_interval_when_hard_limited from 0 (GC on
every tick) to 10 seconds, consistent with the soft limit default.

Fixes open-telemetry#4981

Assisted-by: Claude Opus 4.6
P1: Revert the default min_gc_interval_when_hard_limited change. Setting it
to 10s broke existing configs where min_gc_interval_when_soft_limited is
set to a value smaller than 10s (e.g., 1s) — the validator rejects configs
where the soft-limit interval is smaller than the hard-limit interval,
causing startup failures on deployed collectors.

P2: Add lastAllocAfterGC tracking to detect recovery before the backoff
expires. After an exporter recovers and its queue drains, garbage exists
but Alloc stays high until GC runs. With a large backoff, forced GC was
delayed while the heap was already reclaimable. Now, on each check, if
Alloc has dropped >5% vs lastAllocAfterGC, the backoff resets and a
forced GC fires promptly. Also reduced maxGCBackoffInterval from 2min
to 30s for faster recovery detection.

Assisted-by: Claude Opus 4.6
…ecovery

P2a: The maxGCBackoffInterval cap (30s) could be shorter than the user's
configured min_gc_interval, causing GC to fire more often during outages
than the user intended. Now the cap is max(maxGCBackoffInterval,
baseInterval), so a configured 60s minimum is always respected.

P2b: A GC that freed <5% of memory but brought usage below the soft
limit was incorrectly marked as ineffective. This left stale backoff
state that throttled the next pressure event. Now a GC is considered
effective if it either reclaimed >=5% OR resolved the pressure (below
soft limit), preventing stale backoff from carrying across recovery
boundaries.

Assisted-by: Claude Opus 4.6
@Krishnachaitanyakc Krishnachaitanyakc marked this pull request as ready for review April 3, 2026 00:26
@Krishnachaitanyakc Krishnachaitanyakc requested a review from a team as a code owner April 3, 2026 00:26
@Krishnachaitanyakc Krishnachaitanyakc requested a review from jmacd April 3, 2026 00:26
@Krishnachaitanyakc
Author

@jmacd @bogdandrutu could you review this memory-limiter fix? It changes forced-GC behavior in internal/memorylimiter to back off when GC is ineffective and adds tests for recovery/backoff paths.
