
[Bug] JVMUtil.jstack() causes ~37s safepoint pause on ZGC with large heaps due to dumpAllThreads(true, true) #16194

@eddieran

Pre-check

  • I am sure that all the content I provide is in English.
  • I have searched the issues and found no similar ones.

Apache Dubbo Component

Java SDK (apache/dubbo)

Dubbo Version

Dubbo Java 3.3.6, OpenJDK 21.0.5+11-LTS, Ubuntu (ZGC Generational: -XX:+UseZGC -XX:+ZGenerational, Heap: -Xms65g -Xmx65g)

Steps to reproduce this issue

When the Dubbo thread pool is exhausted, AbortPolicyWithReport.rejectedExecution() invokes dumpJStack() → JVMUtil.jstack() → ThreadMXBean.dumpAllThreads(true, true).

The second true parameter (lockedSynchronizers) forces the JVM to scan the entire Java heap at a safepoint to find all AbstractOwnableSynchronizer instances via HeapInspection::find_instances_at_safepoint(). On ZGC with large heaps, every object reference during this scan passes through ZGC's load barrier (color bit check → potential relocate → forwarding table lookup → remap), resulting in a ~37-second safepoint pause that freezes the entire application.

Environment

  • JDK: OpenJDK 21.0.5+11-LTS
  • GC: ZGC Generational (-XX:+UseZGC -XX:+ZGenerational)
  • Heap: 65GB (-Xms65g -Xmx65g)
  • Threads: ~1950
  • Dubbo version: 3.3.6

Reproduction Steps

  1. Deploy a Dubbo application with ZGC and a large heap (≥32GB)
  2. Drive enough traffic to exhaust the Dubbo thread pool
  3. AbortPolicyWithReport.rejectedExecution() fires and calls dumpJStack()
  4. Observe a ~37-second full application freeze in GC logs (safepoint duration)
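
To measure the effect outside production, the two flag combinations can be timed directly on a test JVM started with the same GC and heap settings. The following is a minimal sketch, not part of Dubbo: the class name DumpCostProbe and the method timeDump are hypothetical, and it only uses the standard ThreadMXBean API. On a small or mostly empty heap the difference will be modest; the ~37s figure above was observed only on a populated 65GB ZGC heap.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper for comparing the cost of the two dumpAllThreads variants.
public class DumpCostProbe {

    /** Runs one dumpAllThreads call and returns the elapsed wall time in nanoseconds. */
    static long timeDump(boolean lockedMonitors, boolean lockedSynchronizers) {
        ThreadMXBean bean = ManagementFactory.getThreadMXBean();
        long start = System.nanoTime();
        ThreadInfo[] infos = bean.dumpAllThreads(lockedMonitors, lockedSynchronizers);
        long elapsed = System.nanoTime() - start;
        if (infos.length == 0) {
            throw new IllegalStateException("expected at least one live thread");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // Same call that JVMUtil.jstack() makes today.
        long withSync = timeDump(true, true);
        // Variant proposed in this issue: no synchronizer lookup.
        long withoutSync = timeDump(true, false);
        System.out.printf("dumpAllThreads(true, true):  %d ms%n", withSync / 1_000_000);
        System.out.printf("dumpAllThreads(true, false): %d ms%n", withoutSync / 1_000_000);
    }
}
```

Running it with -XX:+UseZGC -XX:+ZGenerational and safepoint logging enabled should let the wall time of the first call be compared with the safepoint duration in the GC log.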

Relevant Code

JVMUtil.java Line 36:

public static void jstack(OutputStream stream) throws Exception {
    ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
    for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(true, true)) {
        //                                                         ^^^^ lockedSynchronizers=true triggers heap scan
        stream.write(getThreadDumpString(threadInfo).getBytes(StandardCharsets.UTF_8));
    }
}

What you expected to happen

JVMUtil.jstack() should complete without causing a significant safepoint pause on ZGC with large heaps.

What actually happens: A 36.83-second safepoint pause freezes all ~1950 application threads. The freeze cascades: requests queued during the pause immediately exhaust the thread pool again on release, triggering another dump. We observed 4 consecutive ThreadDumps producing ~150 seconds of near-total service unavailability.

GC Log Evidence (safepoint entries)

Event         | Timestamp    | At Safepoint Duration       | Thread Count
ThreadDump #1 | 10:41:04.466 | 38,589,643,169 ns (38.59s)  | 1954
ThreadDump #2 | 10:51:15.645 | 36,833,709,974 ns (36.83s)  | 1947

For comparison, normal ZGC safepoint operations (Mark Start, Mark End, Relocate Start) complete in 0.1–0.8ms.

async-profiler Wall-Mode Stack Trace (VM Thread during freeze)

VM_ThreadDump::doit()
  → ConcurrentLocksDump::dump_at_safepoint()
    → HeapInspection::find_instances_at_safepoint()
      → ZHeap::object_iterate()
        → ZHeapIterator::drain_and_steal<true>()
          → FindInstanceClosure::do_object()
            → ZBarrier::load_barrier_on_oop_field_preloaded

ZGC barrier frames account for 55% of VM Thread samples during the freeze:

Frame                                          | % of VM Thread samples
FindInstanceClosure::do_object()               | 25.0%
ZBarrier::load_barrier_on_oop_field_preloaded  | 19.8%
OopOopIterateDispatch::oop_oop_iterate         | 17.0%
ZBarrierSet::AccessBarrier::oop_access_barrier | 8.4%

Anything else

Root Cause Analysis

dumpAllThreads(lockedMonitors, lockedSynchronizers) with lockedSynchronizers=true requires the JVM to find all heap instances of AbstractOwnableSynchronizer. This is implemented via HeapInspection::find_instances_at_safepoint(), which iterates the entire heap at a safepoint.

On ZGC, every object reference must pass through the load barrier (color bit check, potential relocate, forwarding table lookup). For a 65GB heap, this results in a ~37-second safepoint pause as observed in our production environment.

This is a known JVM-level characteristic. The OpenJDK community has addressed it on the tooling side:

  • JDK-8262098: "jhsdb jstack can be very slow" — identified heap scan for AbstractOwnableSynchronizer as expensive
  • JDK-8324066: "clhsdb jstack should not by default scan for j.u.c locks" — fixed by disabling lock scanning by default

However, the programmatic API (ThreadMXBean.dumpAllThreads) has no such protection — callers passing lockedSynchronizers=true still trigger the full heap scan unconditionally.

Suggested Fix

Option A (minimal, recommended): Change dumpAllThreads(true, true) to dumpAllThreads(true, false):

// Before
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, true);

// After — avoids heap scan for AbstractOwnableSynchronizer
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, false);

This retains locked monitor information (cheap, because it is derived from thread stacks) but skips lockedSynchronizers, which is what triggers the expensive heap scan. Lock contention on synchronized blocks remains fully visible; only java.util.concurrent.locks ownership is lost from the dump output.
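
To make concrete what Option A gives up, here is a small sketch showing that a held ReentrantLock appears in ThreadInfo.getLockedSynchronizers() only when lockedSynchronizers=true. The class SynchronizerVisibility and the method ownedSynchronizers are hypothetical names used for illustration; the API calls are the standard java.lang.management ones.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical demo: what the dump reports about j.u.c. lock ownership.
public class SynchronizerVisibility {

    /**
     * Takes a ReentrantLock on the calling thread, dumps all threads, and returns
     * how many owned synchronizers the dump reports for the calling thread.
     */
    static int ownedSynchronizers(boolean lockedSynchronizers) {
        ReentrantLock lock = new ReentrantLock();
        lock.lock();
        try {
            ThreadMXBean bean = ManagementFactory.getThreadMXBean();
            long self = Thread.currentThread().getId();
            for (ThreadInfo info : bean.dumpAllThreads(true, lockedSynchronizers)) {
                if (info.getThreadId() == self) {
                    return info.getLockedSynchronizers().length;
                }
            }
            return -1; // calling thread not found in the dump (not expected)
        } finally {
            lock.unlock();
        }
    }

    public static void main(String[] args) {
        System.out.println("lockedSynchronizers=true:  " + ownedSynchronizers(true));
        System.out.println("lockedSynchronizers=false: " + ownedSynchronizers(false));
    }
}
```

With false, the array is empty even though the lock is held, so a deadlock on j.u.c. locks would have to be diagnosed from the stack frames alone.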

Option B (adaptive): Detect the GC type and heap size at runtime; only pass lockedSynchronizers=true when the cost is acceptable.
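
One possible shape for Option B is sketched below. Everything here is an assumption made for illustration: the class AdaptiveDumpPolicy, the 8GB threshold, and the heuristic of detecting ZGC by looking for "ZGC" in the GarbageCollectorMXBean names. A real implementation would need its own measured threshold and probably a configuration override.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical policy: skip the synchronizer scan on large ZGC heaps.
public class AdaptiveDumpPolicy {

    // Assumed threshold; not derived from any measurement in this issue.
    static final long HEAP_LIMIT_BYTES = 8L * 1024 * 1024 * 1024;

    /** Pure decision function so the policy can be tested without a real JVM setup. */
    static boolean shouldScanSynchronizers(List<String> gcNames, long maxHeapBytes, long limitBytes) {
        // Assumption: ZGC collector beans carry "ZGC" in their names.
        boolean zgc = gcNames.stream().anyMatch(name -> name.contains("ZGC"));
        return !(zgc && maxHeapBytes > limitBytes);
    }

    /** Applies the policy to the running JVM. */
    static boolean shouldScanSynchronizers() {
        List<String> gcNames = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .collect(Collectors.toList());
        return shouldScanSynchronizers(gcNames, Runtime.getRuntime().maxMemory(), HEAP_LIMIT_BYTES);
    }

    public static void main(String[] args) {
        System.out.println("scan synchronizers: " + shouldScanSynchronizers());
    }
}
```

A caller would then use dumpAllThreads(true, AdaptiveDumpPolicy.shouldScanSynchronizers()). Because the heap scan cost grows with heap size on any collector, a size-only check might be a simpler alternative.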

Option C (JDK 21+): Use HotSpotDiagnosticMXBean.dumpThreads(), the programmatic equivalent of jcmd <pid> Thread.dump_to_file, for thread dumps that do not require a safepoint heap scan.
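
A minimal sketch of Option C follows, assuming JDK 21 or later. The class FileThreadDump and the method dumpToFreshTempFile are hypothetical; HotSpotDiagnosticMXBean.dumpThreads and its ThreadDumpFormat enum are the JDK 21 API. The target must be an absolute path to a file that does not yet exist, so the sketch creates a fresh temporary directory instead of a temporary file. Whether this path avoids the long safepoint in the scenario above should be confirmed with safepoint logging before relying on it.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: write a thread dump to a file via HotSpotDiagnosticMXBean (JDK 21+).
public class FileThreadDump {

    /** Writes a plain-text thread dump into a new temp directory and returns the file path. */
    static Path dumpToFreshTempFile() {
        try {
            Path dir = Files.createTempDirectory("thread-dump");
            // dumpThreads requires an absolute path to a file that does not exist yet.
            Path target = dir.toAbsolutePath().resolve("threads.txt");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            bean.dumpThreads(target.toString(), HotSpotDiagnosticMXBean.ThreadDumpFormat.TEXT_PLAIN);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("thread dump written to " + dumpToFreshTempFile());
    }
}
```

Unlike JVMUtil.jstack(OutputStream), this API writes to a file, so Dubbo would need to either point it at the existing dump location or copy the file contents into the stream afterwards.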

Cascade Effect

The existing rate-limiting fix (#14468) prevents repeated dumps within 10 minutes but does not reduce the per-dump cost. When a single dump takes 37 seconds, the cascade is:

  1. Thread pool exhausted → dumpAllThreads(true, true) → 37s freeze
  2. During freeze, all incoming requests queue up
  3. On safepoint release, the queued requests immediately exhaust the pool again
  4. If the 10-minute window has passed (or on first occurrence), another 37s freeze

In our production incident, this produced ~150 seconds of near-total unavailability.

Are you willing to submit a pull request to fix on your own?

  • Yes, I am willing to submit a pull request on my own!
