[processor/memorylimiter] Fix degenerate GC loop when exporter has problems#15053
Open
Krishnachaitanyakc wants to merge 5 commits into open-telemetry:main from
Conversation
When a downstream exporter target crashes, the memory limiter could enter a permanent degenerate state where forced GC runs on every check interval tick. The GC cannot reclaim memory held by exporter queues and retry goroutines, so each GC cycle wastes CPU without freeing any memory. This CPU starvation prevents the exporter consumer goroutines from draining the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking: after each forced GC, the memory limiter compares memory usage before and after. If GC reclaimed less than 5%, the interval between forced GCs is doubled exponentially (capped at 2 minutes). This preserves CPU for queue draining and enables recovery. The backoff resets when memory drops below the soft limit.

Also changes the default `min_gc_interval_when_hard_limited` from 0 (GC on every tick) to 10 seconds, consistent with the soft limit default.

Fixes open-telemetry#4981

Assisted-by: Claude Opus 4.6
P1: Revert the default `min_gc_interval_when_hard_limited` change. Setting it to 10s broke existing configs where `min_gc_interval_when_soft_limited` is set to a value smaller than 10s (e.g., 1s): the validator rejects soft < hard, causing startup failures on deployed collectors.

P2: Add `lastAllocAfterGC` tracking to detect recovery before the backoff expires. After an exporter recovers and its queue drains, garbage exists but Alloc stays high until GC runs. With a large backoff, forced GC was delayed while the heap was already reclaimable. Now, on each check, if Alloc has dropped more than 5% versus `lastAllocAfterGC`, the backoff resets and a forced GC fires promptly. Also reduced `maxGCBackoffInterval` from 2 minutes to 30 seconds for faster recovery detection.

Assisted-by: Claude Opus 4.6
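The P2 recovery check above reduces to a single comparison. A minimal sketch, assuming a helper (hypothetical name `recovered`) that compares the current heap Alloc against the Alloc recorded right after the previous forced GC:

```go
package main

import "fmt"

// recovered reports whether the live heap has dropped more than 5% since the
// last forced GC finished, i.e. the exporter queue has likely drained and the
// backoff should reset so a forced GC can fire promptly.
// Illustrative sketch; the PR's actual identifiers may differ.
func recovered(lastAllocAfterGC, allocNow uint64) bool {
	if allocNow >= lastAllocAfterGC {
		return false // heap grew or held steady: no recovery signal
	}
	drop := lastAllocAfterGC - allocNow
	return drop*20 > lastAllocAfterGC // drop > 5% of the post-GC baseline
}

func main() {
	fmt.Println(recovered(1000, 990)) // only 1% drop: still under pressure
	fmt.Println(recovered(1000, 900)) // 10% drop: queue drained, reset backoff
}
```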
P2a: The `maxGCBackoffInterval` cap (30s) could be shorter than the user's configured `min_gc_interval`, causing GC to fire more often during outages than the user intended. Now the cap is max(maxGCBackoffInterval, baseInterval), so a configured 60s minimum is always respected.

P2b: A GC that freed less than 5% of memory but brought usage below the soft limit was incorrectly marked as ineffective. This left stale backoff state that throttled the next pressure event. Now a GC is considered effective if it either reclaimed >=5% or resolved the pressure (brought usage below the soft limit), preventing stale backoff from carrying across recovery boundaries.

Assisted-by: Claude Opus 4.6
Author
@jmacd @bogdandrutu could you review this memory-limiter fix? It changes forced-GC behavior in …
Description
When a downstream exporter target crashes, the memory limiter processor can enter a permanent degenerate state where forced GC runs on every check interval tick. Since `min_gc_interval_when_hard_limited` defaults to 0, `runtime.GC()` fires on every tick when memory exceeds the hard limit. The GC cannot reclaim memory held by live references in exporter queues and retry goroutines, so each cycle wastes CPU without freeing memory. This CPU starvation prevents exporter consumer goroutines from draining the queue, blocking recovery even after the downstream service is restored.

The fix adds GC effectiveness tracking with exponential backoff:

- After each forced GC, the memory limiter compares memory usage before and after the collection.
- If GC reclaimed less than 5%, the interval between forced GCs is doubled exponentially, up to a cap.
- The backoff resets when memory drops below the soft limit, preserving CPU for queue draining and enabling recovery.
Additionally, the default `min_gc_interval_when_hard_limited` is changed from 0 (no minimum interval, meaning GC on every tick) to 10 seconds, consistent with the existing `min_gc_interval_when_soft_limited` default.

Link to tracking issue
Fixes #4981
Testing
- Modified `TestCallGCWhenSoftLimit` to simulate effective GC (so the new backoff logic doesn't interfere with interval testing)
- `TestGCBackoffWhenIneffective`: verifies GC backs off when it cannot reclaim memory (reproduces the "Degenerate collector performance when exporter has problems" #4981 scenario)
- `TestGCBackoffResetOnRecovery`: verifies backoff resets when memory drops below the soft limit
- `TestEffectiveGCInterval`: unit tests for the exponential backoff calculation with cap
- The `TestNoDataLoss` integration test continues to pass (real memory pressure with MockExporter/MockReceiver)

Documentation
Changelog entry added via `.chloggen/fix-degenerate-gc-loop-4981.yaml`.