Pre-check
Apache Dubbo Component
Java SDK (apache/dubbo)
Dubbo Version
Dubbo Java 3.3.6, OpenJDK 21.0.5+11-LTS, Ubuntu (ZGC Generational: -XX:+UseZGC -XX:+ZGenerational, Heap: -Xms65g -Xmx65g)
Steps to reproduce this issue
When the Dubbo thread pool is exhausted, `AbortPolicyWithReport.rejectedExecution()` invokes `dumpJStack()` → `JVMUtil.jstack()` → `ThreadMXBean.dumpAllThreads(true, true)`.
The second `true` parameter (`lockedSynchronizers`) forces the JVM to scan the entire Java heap at a safepoint to find all `AbstractOwnableSynchronizer` instances via `HeapInspection::find_instances_at_safepoint()`. On ZGC with large heaps, every object reference during this scan passes through ZGC's load barrier (color bit check → potential relocate → forwarding table lookup → remap), resulting in a ~37-second safepoint pause that freezes the entire application.
Environment
- JDK: OpenJDK 21.0.5+11-LTS
- GC: ZGC Generational (`-XX:+UseZGC -XX:+ZGenerational`)
- Heap: 65GB (`-Xms65g -Xmx65g`)
- Threads: ~1950
- Dubbo version: 3.3.6
Reproduction Steps
- Deploy a Dubbo application with ZGC and a large heap (≥32GB)
- Drive enough traffic to exhaust the Dubbo thread pool
- `AbortPolicyWithReport.rejectedExecution()` fires and calls `dumpJStack()`
- Observe a ~37-second full application freeze in GC logs (safepoint duration)
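The trigger path can be reproduced in miniature with a saturated `ThreadPoolExecutor` whose rejection handler makes the same `dumpAllThreads(true, true)` call. This is a standalone sketch with illustrative pool sizes, not Dubbo's `AbortPolicyWithReport` itself:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class RejectionDumpRepro {

    /** Saturates a 1-thread/1-slot pool; the third submission is rejected and dumped. */
    static int run() throws InterruptedException {
        AtomicInteger rejections = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                (task, executor) -> {
                    rejections.incrementAndGet();
                    // Same call JVMUtil.jstack() ends up making; the second argument
                    // (lockedSynchronizers=true) is the expensive one on large ZGC heaps.
                    ThreadInfo[] infos = ManagementFactory.getThreadMXBean()
                            .dumpAllThreads(true, true);
                    System.out.println("rejected; dumped " + infos.length + " threads");
                });
        for (int i = 0; i < 3; i++) {  // 1 running + 1 queued + 1 rejected
            pool.execute(() -> {
                try { release.await(); } catch (InterruptedException ignored) { }
            });
        }
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return rejections.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("rejections=" + run());
    }
}
```

On a small development heap the dump inside the handler completes quickly; the freeze only appears at the heap sizes described above.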
Relevant Code
`JVMUtil.java`, line 36:

```java
public static void jstack(OutputStream stream) throws Exception {
    ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
    for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(true, true)) {
        // ^^^^ lockedSynchronizers=true triggers heap scan
        stream.write(getThreadDumpString(threadInfo).getBytes(StandardCharsets.UTF_8));
    }
}
```
What you expected to happen
JVMUtil.jstack() should complete without causing a significant safepoint pause on ZGC with large heaps.
What actually happens
A 36.83-second safepoint pause freezes all ~1950 application threads. The freeze cascades: requests queued during the pause immediately exhaust the thread pool again on release, triggering another dump. We observed 4 consecutive ThreadDumps producing ~150 seconds of near-total service unavailability.
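The gap between the two variants can be timed directly on any JVM; absolute numbers depend on GC, heap size, and thread count, and the difference only becomes pathological at the heap sizes above (a standalone sketch, not Dubbo code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

public class DumpCost {

    /** Times one dumpAllThreads call; locked monitors are always requested here. */
    static long timeDump(boolean lockedSynchronizers) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long t0 = System.nanoTime();
        mx.dumpAllThreads(true, lockedSynchronizers);
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        timeDump(false);  // warm up both paths once before measuring
        timeDump(true);
        System.out.printf("lockedSynchronizers=false: %.2f ms%n", timeDump(false) / 1e6);
        System.out.printf("lockedSynchronizers=true:  %.2f ms%n", timeDump(true) / 1e6);
    }
}
```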
GC Log Evidence (safepoint entries)
| Event | Timestamp | At Safepoint Duration | Thread Count |
|---|---|---|---|
| ThreadDump #1 | 10:41:04.466 | 38,589,643,169 ns (38.59s) | 1954 |
| ThreadDump #2 | 10:51:15.645 | 36,833,709,974 ns (36.83s) | 1947 |
For comparison, normal ZGC safepoint operations (Mark Start, Mark End, Relocate Start) complete in 0.1–0.8ms.
async-profiler Wall-Mode Stack Trace (VM Thread during freeze)
```
VM_ThreadDump::doit()
  → ConcurrentLocksDump::dump_at_safepoint()
    → HeapInspection::find_instances_at_safepoint()
      → ZHeap::object_iterate()
        → ZHeapIterator::drain_and_steal<true>()
          → FindInstanceClosure::do_object()
            → ZBarrier::load_barrier_on_oop_field_preloaded
```
ZGC barrier frames account for 55% of VM Thread samples during the freeze:
| Frame | % of VM Thread samples |
|---|---|
| `FindInstanceClosure::do_object()` | 25.0% |
| `ZBarrier::load_barrier_on_oop_field_preloaded` | 19.8% |
| `OopOopIterateDispatch::oop_oop_iterate` | 17.0% |
| `ZBarrierSet::AccessBarrier::oop_access_barrier` | 8.4% |
Anything else
Root Cause Analysis
dumpAllThreads(lockedMonitors, lockedSynchronizers) with lockedSynchronizers=true requires the JVM to find all heap instances of AbstractOwnableSynchronizer. This is implemented via HeapInspection::find_instances_at_safepoint(), which iterates the entire heap at a safepoint.
On ZGC, every object reference must pass through the load barrier (color bit check, potential relocate, forwarding table lookup). For a 65GB heap, this results in a ~37-second safepoint pause as observed in our production environment.
This is a known JVM-level characteristic. The OpenJDK community has addressed it on the tooling side:
- JDK-8262098: "jhsdb jstack can be very slow" — identified the heap scan for `AbstractOwnableSynchronizer` as expensive
- JDK-8324066: "clhsdb jstack should not by default scan for j.u.c locks" — fixed by disabling lock scanning by default
However, the programmatic API (ThreadMXBean.dumpAllThreads) has no such protection — callers passing lockedSynchronizers=true still trigger the full heap scan unconditionally.
Suggested Fix
Option A (minimal, recommended): Change `dumpAllThreads(true, true)` to `dumpAllThreads(true, false)`:

```java
// Before
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, true);

// After — avoids heap scan for AbstractOwnableSynchronizer
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, false);
```
This retains locked monitor information (cheap, derived from thread stacks) but skips `lockedSynchronizers` (which triggers the expensive heap scan). Lock contention on `synchronized` blocks remains fully visible; only `java.util.concurrent.locks` ownership is lost from the dump output.
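What Option A gives up can be checked directly: per the `ThreadMXBean` javadoc, with `lockedSynchronizers=false` each `ThreadInfo.getLockedSynchronizers()` returns an empty array, even for a thread that currently holds a `ReentrantLock`, while monitor and stack information is unchanged (a standalone sketch, not Dubbo code):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.concurrent.locks.ReentrantLock;

public class DumpContentCheck {

    /** While holding a ReentrantLock, count the synchronizers reported for this thread. */
    static int reportedSynchronizers(boolean lockedSynchronizers) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();  // the current thread now owns a j.u.c synchronizer
        try {
            for (ThreadInfo info : ManagementFactory.getThreadMXBean()
                    .dumpAllThreads(true, lockedSynchronizers)) {
                if (info.getThreadId() == Thread.currentThread().getId()) {
                    return info.getLockedSynchronizers().length;
                }
            }
            return -1;  // current thread missing from the dump (unexpected)
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        // With the heap scan, the held ReentrantLock should be reported;
        // without it, the count is 0 — that is the only information lost.
        System.out.println("with heap scan:    " + reportedSynchronizers(true));
        System.out.println("without heap scan: " + reportedSynchronizers(false));
    }
}
```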
Option B (adaptive): Detect the GC type and heap size at runtime; only pass lockedSynchronizers=true when the cost is acceptable.
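Option B could be sketched roughly as below. The "ZGC" substring match against `GarbageCollectorMXBean` names and the 32 GB cutoff are assumptions for illustration (HotSpot's ZGC collector bean names, e.g. "ZGC Pauses", vary by release), not a tested policy:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class AdaptiveDumpPolicy {

    // Illustrative cutoff: the report suggests the scan gets risky at >= 32 GB heaps.
    static final long LARGE_HEAP_BYTES = 32L * 1024 * 1024 * 1024;

    /** Pure decision rule: skip the synchronizer scan on ZGC with a large heap. */
    static boolean acceptable(boolean zgcInUse, long maxHeapBytes) {
        return !(zgcInUse && maxHeapBytes >= LARGE_HEAP_BYTES);
    }

    /** Applies the rule to the running JVM. */
    public static boolean synchronizerScanAcceptable() {
        boolean zgc = false;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Assumed heuristic: ZGC's collector beans carry "ZGC" in their names.
            if (gc.getName().contains("ZGC")) {
                zgc = true;
            }
        }
        return acceptable(zgc, Runtime.getRuntime().maxMemory());
    }

    public static void main(String[] args) {
        System.out.println("synchronizer scan acceptable on this JVM: "
                + synchronizerScanAcceptable());
    }
}
```

The caller would then pass `synchronizerScanAcceptable()` as the second argument to `dumpAllThreads`.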
Option C (JDK 21+): Use the `jcmd <pid> Thread.dump_to_file` equivalent or `HotSpotDiagnosticMXBean` for thread dumps that do not require a safepoint heap scan.
Cascade Effect
The existing rate-limiting fix (#14468) prevents repeated dumps within 10 minutes but does not reduce the per-dump cost. When a single dump takes 37 seconds, the cascade is:
- Thread pool exhausted → `dumpAllThreads(true, true)` → 37s freeze
- During freeze, all incoming requests queue up
- On safepoint release, the queued requests immediately exhaust the pool again
- If the 10-minute window has passed (or on first occurrence), another 37s freeze
In our production incident, this produced ~150 seconds of near-total unavailability.
Related Issues
- `AbortPolicyWithReport` causes JVM pause (same symptom, reported in 2020 for Dubbo 2.6.8, no resolution)

Are you willing to submit a pull request to fix on your own?