
Commit 3c7d553

Improve README examples with NUMA-aware CPU pinning guide, fix mockless+pidstat crash, add BACKLOG

- README: replace toy examples with practical ones showing lscpu topology analysis, perf stat interpretation, and NUMA-aware cpuset allocation
- Fix MOCK_PID unbound variable when running mockless mode with pidstat enabled
- Add BACKLOG.md with investigation findings on kernel CPU accounting gap
1 parent a969de1 commit 3c7d553

File tree

3 files changed: +195 -20 lines changed


BACKLOG.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Backlog

## 1. Kernel CPU accounting under-reports vs hardware PMU counters

**Priority**: Low (understanding only — use `perf stat` as ground truth)
**Status**: Root cause narrowed, further experiments possible

### Problem

The Linux kernel's CPU time accounting (`/proc/pid/stat`, `schedstat`) consistently under-reports CPU utilization compared to `perf stat` (hardware PMU counters) for the Netty custom scheduler workload.

**Measured on NETTY_SCHEDULER with 4 carrier threads pinned to 4 physical cores (Ryzen 9 7950X):**

| Source | CPUs utilized | Notes |
|--------|---------------|-------|
| `perf stat` (PMU task-clock) | **3.96** | Hardware counter ground truth |
| `/proc/pid/stat` (utime+stime) | **3.19** | Kernel accounting |
| `schedstat` (sum of all thread run_ns) | **3.19** | CFS-level accounting |
| `pidstat` process-level | **2.84** | Even lower (pidstat's own sampling) |
| `pidstat` per-thread carrier sum | **2.72** | 4 x ~68% |
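For reference, the `/proc/pid/stat` row is derived by sampling `utime+stime` (fields 14 and 15, in `USER_HZ` clock ticks) at the start and end of the measurement window. A minimal sketch; the tick counts below are illustrative values chosen to land on the 3.19 figure above, and `USER_HZ=100` is an assumption (the usual x86 default):

```bash
# Average CPUs used = (delta in utime+stime ticks) / USER_HZ / elapsed seconds.
# On a live system the two samples come from:
#   awk '{print $14 + $15}' "/proc/$PID/stat"
cpus_utilized() {
  local ticks_start=$1 ticks_end=$2 elapsed_s=$3 hz=${4:-100}
  awk -v d=$((ticks_end - ticks_start)) -v hz="$hz" -v t="$elapsed_s" \
      'BEGIN { printf "%.2f\n", d / hz / t }'
}

cpus_utilized 100000 195700 300   # 95700 ticks over 300 s -> 3.19
```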
### Key findings

- **pidstat is NOT lying** — it faithfully reports what the kernel provides
- `/proc/pid/stat` and `schedstat` agree perfectly (both at 3.19 CPUs)
- The kernel itself under-counts by ~0.77 CPUs (19%) vs PMU hardware counters
- This is NOT caused by `CONFIG_TICK_CPU_ACCOUNTING` — the kernel uses `CONFIG_VIRT_CPU_ACCOUNTING_GEN=y` (full dyntick, precise at context switch boundaries)
- All 31 kernel-visible threads were accounted for — no "hidden threads" from `-Djdk.trackAllThreads=false` (VTs are invisible to the kernel by design; their CPU time is charged to carrier threads)
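The matching `schedstat` number comes from summing field 1 (cumulative run time in nanoseconds) of each thread's schedstat file. A minimal sketch; the per-thread values below are illustrative, chosen to reproduce the same 3.19:

```bash
# Each line mimics one thread's /proc/PID/task/TID/schedstat:
#   <run_ns> <wait_ns> <timeslices>
# Live usage: cat /proc/$PID/task/*/schedstat
samples='239250000000 12000000 50000
239250000000 11000000 49000
239250000000 13000000 51000
239250000000 10000000 48000'

elapsed_s=300
# Sum run_ns across all threads, convert to CPU-seconds, divide by wall time.
awk -v t="$elapsed_s" '{ run_ns += $1 } END { printf "%.2f CPUs\n", run_ns / 1e9 / t }' <<<"$samples"
```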
### Hypotheses for the 0.77 CPU gap

1. **Kernel scheduling overhead**: `__schedule()` / `finish_task_switch()` runs during thread transitions. Some of this CPU time may be attributed to idle/swapper rather than the thread being switched in/out.

2. **Interrupt handling**: hardware interrupts (NIC, timer) steal cycles from the process. `perf stat` counts all cycles on cores used by the process (task-clock includes time when the PMU is inherited by children or interrupted contexts), while `/proc/stat` only counts time explicitly attributed to the thread.

3. **`task-clock` semantics**: `perf stat`'s `task-clock` measures wall-clock time during which at least one thread of the process was running. With 4 threads on 4 cores, task-clock closely approximates 4.0 * elapsed. This includes interrupt handling time on those cores that `/proc/stat` charges elsewhere.

4. **Carrier thread park/unpark transitions**: even with VIRT_CPU_ACCOUNTING_GEN, the accounting happens at `schedule()` boundaries. CPU cycles consumed during the entry/exit paths of `LockSupport.park()` (before the actual `schedule()` call and after the wakeup) may be partially lost.
### Further experiments (if desired)

1. **Compare `perf stat -e task-clock` vs `perf stat -e cpu-clock`**: `task-clock` counts per-thread time, `cpu-clock` counts wall time. If they differ, that reveals interrupt overhead.

2. **Run with `nohz_full=4-7` (isolated CPUs)**: removes timer tick interrupts from server cores. If the gap shrinks, interrupt overhead is the cause.

3. **Spin-wait instead of park**: replace `LockSupport.park()` with `Thread.onSpinWait()` in `FifoEventLoopScheduler`. If the gap shrinks, park/unpark accounting is lossy.

4. **Check the `/proc/interrupts` delta during the benchmark**: quantify how many interrupts hit cores 4-7 and estimate their CPU cost.

5. **`perf stat` per-thread (`-t TID`) for each carrier**: compare PMU task-clock per carrier vs schedstat per carrier to see whether the gap is evenly distributed.
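Experiment 4 amounts to diffing two `/proc/interrupts` snapshots and summing the columns for cores 4-7. A minimal sketch, with hardcoded sample snapshots (an 8-CPU layout is assumed, so CPU4-CPU7 are awk fields 6-9; the counts are invented for illustration):

```bash
# On a live system:  before=$(cat /proc/interrupts); <run benchmark>; after=$(cat /proc/interrupts)
sum_cores_4_7() {
  # Skip the CPU header row, then sum the CPU4..CPU7 columns of every IRQ line
  # ($1 is the IRQ label, $2..$9 are the 8 per-CPU counts).
  awk 'NR > 1 { s += $6 + $7 + $8 + $9 } END { print s }'
}

before='      CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
 29:    100    0    0    0   50   60   70   80  eth0-rx-0
LOC:   1000 1000 1000 1000 2000 2000 2000 2000  Local timer interrupts'

after='      CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
 29:    200    0    0    0  150  160  170  180  eth0-rx-0
LOC:   2000 2000 2000 2000 4100 4100 4100 4100  Local timer interrupts'

delta=$(( $(sum_cores_4_7 <<<"$after") - $(sum_cores_4_7 <<<"$before") ))
echo "interrupts on cores 4-7 during run: $delta"
```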
### Conclusion

For benchmarking purposes, **always use `perf stat` as the ground truth** for CPU utilization. pidstat is still useful for relative thread balance analysis and for monitoring non-server components (mock server, load generator), where the gap is less significant.

---

## 2. Add spin-wait phase before carrier thread parking

**Priority**: Medium (performance optimization)
**Status**: Not started

In `FifoEventLoopScheduler.virtualThreadSchedulerLoop()`, the carrier thread parks immediately when the queue drains. Adding a brief spin-wait phase (e.g., 100-1000 iterations of `Thread.onSpinWait()`) before calling `LockSupport.park()` could:

- Reduce wake-up latency for incoming work (avoid kernel schedule/deschedule)
- Reduce context-switch count (currently ~20/sec, could go to near-zero)
- Trade-off: slightly higher idle CPU consumption

### Key file

- `core/src/main/java/io/netty/loom/FifoEventLoopScheduler.java`, line ~199

benchmark-runner/README.md

Lines changed: 94 additions & 15 deletions
@@ -151,47 +151,126 @@ perf stat uses `PROFILING_DELAY_SECONDS` and `PROFILING_DURATION_SECONDS`.
## Example Runs

### Choosing CPU pinning with `lscpu -e`

Good benchmarking requires NUMA-aware CPU pinning. Start by inspecting your topology:

```bash
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0      yes   # NUMA 0, physical core 0
  1    0      0    1 1:1:1:0      yes   # NUMA 0, physical core 1
...
  8    1      0    8 8:8:8:1      yes   # NUMA 1, physical core 8
...
 16    0      0    0 0:0:0:0      yes   # NUMA 0, SMT sibling of core 0
 17    0      0    1 1:1:1:0      yes   # NUMA 0, SMT sibling of core 1
```

Key rules:

- **Keep all benchmark components on the same NUMA node** to avoid cross-node memory latency
- **Use physical cores only** (avoid SMT siblings) for more stable results
- **Isolate noisy processes** (IDEs, browsers) on the other NUMA node

Example layout for a 16-core/2-NUMA system with 4 server threads:

| Component | CPUs | Rationale |
|-----------|------|-----------|
| Load generator | 0-1 | 2 physical cores, enough to saturate |
| Mock server | 2-3 | 2 physical cores for backend simulation |
| Handoff server | 4-7 | 4 physical cores, one per event loop thread |
| Other processes | 8-15 | Isolated on NUMA node 1 |
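The "physical cores only, same NUMA node" rule can be checked mechanically. A minimal sketch, assuming the `CPU NODE SOCKET CORE ...` column order shown above; the here-string stands in for real `lscpu -e` output:

```bash
# List one CPU ID per physical core on NUMA node 0 -- the IDs that are safe to
# hand to --server-cpuset/--mock-cpuset/--load-cpuset without landing on an
# SMT sibling. The first CPU listed for a CORE value is the physical one;
# later CPUs repeating that CORE value are its SMT siblings.
topo='CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0      yes
1   0    0      1    1:1:1:0      yes
8   1    0      8    8:8:8:1      yes
16  0    0      0    0:0:0:0      yes
17  0    0      1    1:1:1:0      yes'

# $1 = CPU, $2 = NODE, $4 = CORE; keep the first CPU seen per core on node 0.
awk 'NR > 1 && $2 == 0 && !seen[$4]++ { print $1 }' <<<"$topo"
```

On a live system, pipe `lscpu -e` straight into the same awk program instead of the here-string.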
### NETTY_SCHEDULER with 4 threads

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --io nio \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --jvm-args "-Xms8g -Xmx8g" \
    --connections 10000 --load-threads 4 \
    --mock-think-time 30 --mock-threads 4 \
    --perf-stat
```
### Analyzing bottlenecks with perf stat

Use `--perf-stat` to get reliable hardware-level metrics. The `perf-stat.txt` output is the ground truth for CPU utilization — pidstat per-thread numbers can be misleading with virtual threads.

```
Performance counter stats for process id '95868':

    39,741,757,754      task-clock        #    3.970 CPUs utilized
               806      context-switches  #   20.281 /sec
   199,114,762,646      instructions      #     1.17 insn per cycle
     1,338,722,757      branch-misses     #     3.08% of all branches
```

Key metrics to watch:

- **CPUs utilized**: how many cores the server is actually using (3.97 of 4 = fully saturated)
- **Context switches/sec**: lower is better; the custom scheduler typically achieves 20-80/sec
- **IPC (insn per cycle)**: higher is better; >1.0 is good, <0.5 suggests memory stalls
- **Branch misses**: >5% suggests unpredictable control flow

If CPUs utilized equals your allocated core count, the server is CPU-bound — add more cores. If context switches are high (>10K/sec), the scheduler or OS is thrashing.

pidstat is still useful for spotting **mock server or load generator bottlenecks**: check `pidstat-mock.log` and `pidstat-loadgen.log` to ensure they aren't saturated.
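When comparing many runs, the headline numbers can be scraped out of `perf-stat.txt` rather than read by eye. A minimal sketch; the here-string mirrors the sample output above, and the awk field positions are an assumption about perf's column layout on this machine:

```bash
# Pull "CPUs utilized" and the context-switch rate from perf stat output.
# On a live run, replace the here-string with the perf-stat.txt file.
perf_out="    39,741,757,754      task-clock        #    3.970 CPUs utilized
               806      context-switches  #   20.281 /sec"

cpus=$(awk '/CPUs utilized/ { print $(NF-2) }' <<<"$perf_out")
cswitch=$(awk '/context-switches/ { print $(NF-1) }' <<<"$perf_out")
echo "CPUs utilized: $cpus, context switches/sec: $cswitch"
```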
### NON_VIRTUAL_NETTY (default mode)

```bash
./run-benchmark.sh --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --connections 10000 --mock-think-time 30
```

### VIRTUAL_NETTY mode

```bash
./run-benchmark.sh --mode VIRTUAL_NETTY --threads 4 --io nio \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --connections 10000 --mock-think-time 30
```

### Mockless mode (skip HTTP call to mock, inline Jackson work)

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --mockless \
    --server-cpuset "4-7" --load-cpuset "0-1" \
    --connections 10000
```

### With async-profiler

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --profiler --profiler-path /path/to/async-profiler \
    --warmup 15s --total-duration 45s
```

### Rate-limited test with wrk2

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --rate 120000 --connections 10000 --total-duration 60s --warmup 15s
```

### With JFR events

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --jfr --jfr-events NettyRunIo,VirtualThreadTaskRuns
```

### Mixed: CLI flags + env vars

```bash
SERVER_JVM_ARGS="-XX:+PrintGCDetails" ./run-benchmark.sh --mode VIRTUAL_NETTY --threads 2
```

## Output

benchmark-runner/scripts/run-benchmark.sh

Lines changed: 7 additions & 5 deletions
```diff
@@ -581,14 +581,16 @@ start_pidstat() {
     log "pidstat running (PID: $PIDSTAT_PID)"
 
-    log "Starting pidstat for mock server (PID: $MOCK_PID)..."
+    if [[ -n "${MOCK_PID:-}" ]]; then
+        log "Starting pidstat for mock server (PID: $MOCK_PID)..."
 
-    local mock_output_file="$OUTPUT_DIR/$PIDSTAT_MOCK_OUTPUT"
+        local mock_output_file="$OUTPUT_DIR/$PIDSTAT_MOCK_OUTPUT"
 
-    pidstat -p "$MOCK_PID" "$PIDSTAT_INTERVAL" > "$mock_output_file" 2>&1 &
-    PIDSTAT_MOCK_PID=$!
+        pidstat -p "$MOCK_PID" "$PIDSTAT_INTERVAL" > "$mock_output_file" 2>&1 &
+        PIDSTAT_MOCK_PID=$!
 
-    log "pidstat running for mock server (PID: $PIDSTAT_MOCK_PID)"
+        log "pidstat running for mock server (PID: $PIDSTAT_MOCK_PID)"
+    fi
 }
 
 stop_pidstat() {
```
