[Bug] JVMUtil.jstack() causes ~37s safepoint pause on ZGC with large heaps due to dumpAllThreads(true, true)

### Pre-check

- [X] I am sure that all the content I provide is in English.
- [X] I had searched in the [issues](https://github.com/apache/dubbo/issues?q=is%3Aissue) and found no similar issues.

### Apache Dubbo Component

Java SDK (apache/dubbo)

### Dubbo Version

Dubbo Java 3.3.6, OpenJDK 21.0.5+11-LTS, Ubuntu (ZGC Generational: `-XX:+UseZGC -XX:+ZGenerational`, Heap: `-Xms65g -Xmx65g`)

### Steps to reproduce this issue

When the Dubbo thread pool is exhausted, `AbortPolicyWithReport.rejectedExecution()` invokes `dumpJStack()` → `JVMUtil.jstack()` → `ThreadMXBean.dumpAllThreads(true, true)`.

The second `true` parameter (`lockedSynchronizers`) forces the JVM to scan the **entire Java heap** at a safepoint to find all `AbstractOwnableSynchronizer` instances via `HeapInspection::find_instances_at_safepoint()`. On ZGC with large heaps, every object reference during this scan passes through ZGC's load barrier (color bit check → potential relocate → forwarding table lookup → remap), resulting in a **~37-second safepoint pause** that freezes the entire application.

#### Environment
- **JDK**: OpenJDK 21.0.5+11-LTS
- **GC**: ZGC Generational (`-XX:+UseZGC -XX:+ZGenerational`)
- **Heap**: 65GB (`-Xms65g -Xmx65g`)
- **Threads**: ~1950
- **Dubbo version**: 3.3.6

#### Reproduction Steps

1. Deploy a Dubbo application with ZGC and a large heap (≥32GB)
2. Drive enough traffic to exhaust the Dubbo thread pool
3. `AbortPolicyWithReport.rejectedExecution()` fires and calls `dumpJStack()`
4. Observe a **~37-second full application freeze** in GC logs (safepoint duration)

#### Relevant Code

[`JVMUtil.java` Line 36](https://github.com/apache/dubbo/blob/3.3/dubbo-common/src/main/java/org/apache/dubbo/common/utils/JVMUtil.java#L36):

```java
public static void jstack(OutputStream stream) throws Exception {
    ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
    for (ThreadInfo threadInfo : threadMxBean.dumpAllThreads(true, true)) {
        //                                                         ^^^^ lockedSynchronizers=true triggers heap scan
        stream.write(getThreadDumpString(threadInfo).getBytes(StandardCharsets.UTF_8));
    }
}
```

### What you expected to happen

`JVMUtil.jstack()` should complete without causing a significant safepoint pause on ZGC with large heaps.

**What actually happens:** A **36.83-second safepoint pause** freezes all ~1950 application threads. The freeze cascades: queued requests during the pause immediately exhaust the thread pool again on release, triggering another dump. We observed **4 consecutive ThreadDumps producing ~150 seconds of near-total service unavailability**.

#### GC Log Evidence (safepoint entries)

| Event | Timestamp | At Safepoint Duration | Thread Count |
|-------|-----------|----------------------|--------------|
| ThreadDump #1 | 10:41:04.466 | 38,589,643,169 ns (**38.59s**) | 1954 |
| ThreadDump #2 | 10:51:15.645 | 36,833,709,974 ns (**36.83s**) | 1947 |

For comparison, normal ZGC safepoint operations (Mark Start, Mark End, Relocate Start) complete in **0.1–0.8ms**.

#### async-profiler Wall-Mode Stack Trace (VM Thread during freeze)

```
VM_ThreadDump::doit()
  → ConcurrentLocksDump::dump_at_safepoint()
    → HeapInspection::find_instances_at_safepoint()
      → ZHeap::object_iterate()
        → ZHeapIterator::drain_and_steal<true>()
          → FindInstanceClosure::do_object()
            → ZBarrier::load_barrier_on_oop_field_preloaded
```

ZGC barrier frames account for **55%** of VM Thread samples during the freeze:

| Frame | % of VM Thread samples |
|-------|----------------------|
| `FindInstanceClosure::do_object()` | 25.0% |
| `ZBarrier::load_barrier_on_oop_field_preloaded` | 19.8% |
| `OopOopIterateDispatch::oop_oop_iterate` | 17.0% |
| `ZBarrierSet::AccessBarrier::oop_access_barrier` | 8.4% |

### Anything else

#### Root Cause Analysis

`dumpAllThreads(lockedMonitors, lockedSynchronizers)` with `lockedSynchronizers=true` requires the JVM to find all heap instances of `AbstractOwnableSynchronizer`. This is implemented via `HeapInspection::find_instances_at_safepoint()`, which iterates the **entire heap** at a safepoint.

On ZGC, every object reference must pass through the load barrier (color bit check, potential relocate, forwarding table lookup). For a 65GB heap, this results in a **~37-second safepoint pause** as observed in our production environment.

This is a known JVM-level characteristic. The OpenJDK community has addressed it on the tooling side:
- [JDK-8262098](https://bugs.openjdk.org/browse/JDK-8262098): "jhsdb jstack can be very slow" — identified heap scan for `AbstractOwnableSynchronizer` as expensive
- [JDK-8324066](https://bugs.openjdk.org/browse/JDK-8324066): "clhsdb jstack should not by default scan for j.u.c locks" — fixed by disabling lock scanning by default

However, the **programmatic API** (`ThreadMXBean.dumpAllThreads`) has no such protection — callers passing `lockedSynchronizers=true` still trigger the full heap scan unconditionally.

#### Suggested Fix

**Option A (minimal, recommended):** Change `dumpAllThreads(true, true)` to `dumpAllThreads(true, false)`:

```java
// Before
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, true);

// After — avoids heap scan for AbstractOwnableSynchronizer
ThreadInfo[] infos = threadMxBean.dumpAllThreads(true, false);
```

This retains locked monitor information (cheap — derived from thread stacks) but skips `lockedSynchronizers` (triggers expensive heap scan). Lock contention on `synchronized` blocks remains fully visible; only `java.util.concurrent.locks` ownership is lost from the dump output.

**Option B (adaptive):** Detect the GC type and heap size at runtime; only pass `lockedSynchronizers=true` when the cost is acceptable.

**Option C (JDK 21+):** Use `jcmd <pid> Thread.dump_to_file` equivalent or `HotSpotDiagnosticMXBean` for thread dumps that do not require a safepoint heap scan.

#### Cascade Effect

The existing rate-limiting fix (#14468) prevents repeated dumps within 10 minutes but does **not** reduce the per-dump cost. When a single dump takes 37 seconds, the cascade is:
1. Thread pool exhausted → `dumpAllThreads(true, true)` → 37s freeze
2. During freeze, all incoming requests queue up
3. On safepoint release, the queued requests immediately exhaust the pool again
4. If the 10-minute window has passed (or on first occurrence), another 37s freeze

In our production incident, this produced ~150 seconds of near-total unavailability.

#### Related Issues

- #6696 — `AbortPolicyWithReport` causes JVM pause (same symptom, reported in 2020 for Dubbo 2.6.8, no resolution)
- #14467 / #14468 — Fix repeated jstack when thread pool exhausted (addresses frequency, not per-call cost)
- [JDK-8262098](https://bugs.openjdk.org/browse/JDK-8262098) — jhsdb jstack can be very slow
- [JDK-8324066](https://bugs.openjdk.org/browse/JDK-8324066) — clhsdb jstack should not scan for j.u.c locks by default
- [JDK-8221153](https://bugs.openjdk.org/browse/JDK-8221153) — ZGC heap iteration and verification pollutes GC statistics

### Are you willing to submit a pull request to fix on your own?

- [X] Yes I am willing to submit a pull request on my own!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] JVMUtil.jstack() causes ~37s safepoint pause on ZGC with large heaps due to dumpAllThreads(true, true) #16194

Pre-check

Apache Dubbo Component

Dubbo Version

Steps to reproduce this issue

Environment

Reproduction Steps

Relevant Code

What you expected to happen

GC Log Evidence (safepoint entries)

async-profiler Wall-Mode Stack Trace (VM Thread during freeze)

Anything else

Root Cause Analysis

Suggested Fix

Cascade Effect

Related Issues

Are you willing to submit a pull request to fix on your own?

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Event	Timestamp	At Safepoint Duration	Thread Count
ThreadDump #1	10:41:04.466	38,589,643,169 ns (38.59s)	1954
ThreadDump #2	10:51:15.645	36,833,709,974 ns (36.83s)	1947

Frame	% of VM Thread samples
`FindInstanceClosure::do_object()`	25.0%
`ZBarrier::load_barrier_on_oop_field_preloaded`	19.8%
`OopOopIterateDispatch::oop_oop_iterate`	17.0%
`ZBarrierSet::AccessBarrier::oop_access_barrier`	8.4%

[Bug] JVMUtil.jstack() causes ~37s safepoint pause on ZGC with large heaps due to dumpAllThreads(true, true) #16194

Description

Pre-check

Apache Dubbo Component

Dubbo Version

Steps to reproduce this issue

Environment

Reproduction Steps

Relevant Code

What you expected to happen

GC Log Evidence (safepoint entries)

async-profiler Wall-Mode Stack Trace (VM Thread during freeze)

Anything else

Root Cause Analysis

Suggested Fix

Cascade Effect

Related Issues

Are you willing to submit a pull request to fix on your own?

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions