
Commit 3c7d553

Improve README examples with NUMA-aware CPU pinning guide, fix mockless+pidstat crash, add BACKLOG

- README: replace toy examples with practical ones showing lscpu topology analysis, perf stat interpretation, and NUMA-aware cpuset allocation
- Fix MOCK_PID unbound variable when running mockless mode with pidstat enabled
- Add BACKLOG.md with investigation findings on kernel CPU accounting gap
1 parent a969de1 commit 3c7d553

File tree

3 files changed: +195 -20 lines changed


BACKLOG.md

Lines changed: 94 additions & 0 deletions
@@ -0,0 +1,94 @@
# Backlog

## 1. Kernel CPU accounting under-reports vs hardware PMU counters

**Priority**: Low (understanding only — use `perf stat` as ground truth)
**Status**: Root cause narrowed, further experiments possible

### Problem

The Linux kernel's CPU time accounting (`/proc/pid/stat`, `schedstat`) consistently under-reports CPU utilization compared to `perf stat` (hardware PMU counters) for the Netty custom scheduler workload.

**Measured on NETTY_SCHEDULER with 4 carrier threads pinned to 4 physical cores (Ryzen 9 7950X):**

| Source | CPUs utilized | Notes |
|--------|---------------|-------|
| `perf stat` (PMU task-clock) | **3.96** | Hardware counter ground truth |
| `/proc/pid/stat` (utime+stime) | **3.19** | Kernel accounting |
| `schedstat` (sum of all thread run_ns) | **3.19** | CFS-level accounting |
| `pidstat` process-level | **2.84** | Even lower (pidstat's own sampling) |
| `pidstat` per-thread carrier sum | **2.72** | 4 x ~68% |
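For reference, the `/proc/pid/stat` row is derived by sampling `utime+stime` (fields 14 and 15, in `USER_HZ` clock ticks) at the start and end of the measurement window. A minimal sketch; the tick counts below are illustrative values chosen to land on the 3.19 figure above, and `USER_HZ=100` is an assumption (the usual x86 default):

```bash
# Average CPUs used = (delta in utime+stime ticks) / USER_HZ / elapsed seconds.
# On a live system the two samples come from:
#   awk '{print $14 + $15}' "/proc/$PID/stat"
cpus_utilized() {
  local ticks_start=$1 ticks_end=$2 elapsed_s=$3 hz=${4:-100}
  awk -v d=$((ticks_end - ticks_start)) -v hz="$hz" -v t="$elapsed_s" \
      'BEGIN { printf "%.2f\n", d / hz / t }'
}

cpus_utilized 100000 195700 300   # 95700 ticks over 300 s -> 3.19
```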
### Key findings

- **pidstat is NOT lying** — it faithfully reports what the kernel provides
- `/proc/pid/stat` and `schedstat` agree perfectly (both at 3.19 CPUs)
- The kernel itself under-counts by ~0.77 CPUs (19%) vs PMU hardware counters
- This is NOT caused by `CONFIG_TICK_CPU_ACCOUNTING` — the kernel uses `CONFIG_VIRT_CPU_ACCOUNTING_GEN=y` (full dyntick, precise at context switch boundaries)
- All 31 kernel-visible threads were accounted for — no "hidden threads" from `-Djdk.trackAllThreads=false` (VTs are invisible to the kernel by design; their CPU time is charged to carrier threads)
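The matching `schedstat` number comes from summing field 1 (cumulative run time in nanoseconds) of each thread's schedstat file. A minimal sketch; the per-thread values below are illustrative, chosen to reproduce the same 3.19:

```bash
# Each line mimics one thread's /proc/PID/task/TID/schedstat:
#   <run_ns> <wait_ns> <timeslices>
# Live usage: cat /proc/$PID/task/*/schedstat
samples='239250000000 12000000 50000
239250000000 11000000 49000
239250000000 13000000 51000
239250000000 10000000 48000'

elapsed_s=300
# Sum run_ns across all threads, convert to CPU-seconds, divide by wall time.
awk -v t="$elapsed_s" '{ run_ns += $1 } END { printf "%.2f CPUs\n", run_ns / 1e9 / t }' <<<"$samples"
```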
### Hypotheses for the 0.77 CPU gap

1. **Kernel scheduling overhead**: `__schedule()` / `finish_task_switch()` runs during thread transitions. Some of this CPU time may be attributed to idle/swapper rather than the thread being switched in/out.

2. **Interrupt handling**: hardware interrupts (NIC, timer) steal cycles from the process. `perf stat` counts all cycles on cores used by the process (task-clock includes time when the PMU is inherited by children or interrupted contexts), while `/proc/stat` only counts time explicitly attributed to the thread.

3. **`task-clock` semantics**: `perf stat`'s `task-clock` measures wall-clock time during which at least one thread of the process was running. With 4 threads on 4 cores, task-clock closely approximates 4.0 * elapsed. This includes interrupt handling time on those cores that `/proc/stat` charges elsewhere.

4. **Carrier thread park/unpark transitions**: even with VIRT_CPU_ACCOUNTING_GEN, the accounting happens at `schedule()` boundaries. CPU cycles consumed during the entry/exit paths of `LockSupport.park()` (before the actual `schedule()` call and after the wakeup) may be partially lost.
### Further experiments (if desired)

1. **Compare `perf stat -e task-clock` vs `perf stat -e cpu-clock`**: `task-clock` counts per-thread time, `cpu-clock` counts wall time. If they differ, that reveals interrupt overhead.

2. **Run with `nohz_full=4-7` (isolated CPUs)**: removes timer tick interrupts from server cores. If the gap shrinks, interrupt overhead is the cause.

3. **Spin-wait instead of park**: replace `LockSupport.park()` with `Thread.onSpinWait()` in `FifoEventLoopScheduler`. If the gap shrinks, park/unpark accounting is lossy.

4. **Check the `/proc/interrupts` delta during the benchmark**: quantify how many interrupts hit cores 4-7 and estimate their CPU cost.

5. **`perf stat` per-thread (`-t TID`) for each carrier**: compare PMU task-clock per carrier vs schedstat per carrier to see whether the gap is evenly distributed.
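Experiment 4 amounts to diffing two `/proc/interrupts` snapshots and summing the columns for cores 4-7. A minimal sketch, with hardcoded sample snapshots (an 8-CPU layout is assumed, so CPU4-CPU7 are awk fields 6-9; the counts are invented for illustration):

```bash
# On a live system:  before=$(cat /proc/interrupts); <run benchmark>; after=$(cat /proc/interrupts)
sum_cores_4_7() {
  # Skip the CPU header row, then sum the CPU4..CPU7 columns of every IRQ line
  # ($1 is the IRQ label, $2..$9 are the 8 per-CPU counts).
  awk 'NR > 1 { s += $6 + $7 + $8 + $9 } END { print s }'
}

before='      CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
 29:    100    0    0    0   50   60   70   80  eth0-rx-0
LOC:   1000 1000 1000 1000 2000 2000 2000 2000  Local timer interrupts'

after='      CPU0 CPU1 CPU2 CPU3 CPU4 CPU5 CPU6 CPU7
 29:    200    0    0    0  150  160  170  180  eth0-rx-0
LOC:   2000 2000 2000 2000 4100 4100 4100 4100  Local timer interrupts'

delta=$(( $(sum_cores_4_7 <<<"$after") - $(sum_cores_4_7 <<<"$before") ))
echo "interrupts on cores 4-7 during run: $delta"
```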
### Conclusion

For benchmarking purposes, **always use `perf stat` as the ground truth** for CPU utilization. pidstat is still useful for relative thread balance analysis and for monitoring non-server components (mock server, load generator), where the gap is less significant.

---

## 2. Add spin-wait phase before carrier thread parking

**Priority**: Medium (performance optimization)
**Status**: Not started

In `FifoEventLoopScheduler.virtualThreadSchedulerLoop()`, the carrier thread parks immediately when the queue drains. Adding a brief spin-wait phase (e.g., 100-1000 iterations of `Thread.onSpinWait()`) before calling `LockSupport.park()` could:

- Reduce wake-up latency for incoming work (avoid kernel schedule/deschedule)
- Reduce context-switch count (currently ~20/sec, could go to near-zero)
- Trade-off: slightly higher idle CPU consumption

### Key file

- `core/src/main/java/io/netty/loom/FifoEventLoopScheduler.java`, line ~199

benchmark-runner/README.md

Lines changed: 94 additions & 15 deletions
@@ -151,47 +151,126 @@ perf stat uses `PROFILING_DELAY_SECONDS` and `PROFILING_DURATION_SECONDS`.
## Example Runs

### Choosing CPU pinning with `lscpu -e`

Good benchmarking requires NUMA-aware CPU pinning. Start by inspecting your topology:

```bash
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0      yes   # NUMA 0, physical core 0
  1    0      0    1 1:1:1:0      yes   # NUMA 0, physical core 1
...
  8    1      0    8 8:8:8:1      yes   # NUMA 1, physical core 8
...
 16    0      0    0 0:0:0:0      yes   # NUMA 0, SMT sibling of core 0
 17    0      0    1 1:1:1:0      yes   # NUMA 0, SMT sibling of core 1
```

Key rules:

- **Keep all benchmark components on the same NUMA node** to avoid cross-node memory latency
- **Use physical cores only** (avoid SMT siblings) for more stable results
- **Isolate noisy processes** (IDEs, browsers) on the other NUMA node

Example layout for a 16-core/2-NUMA system with 4 server threads:

| Component | CPUs | Rationale |
|-----------|------|-----------|
| Load generator | 0-1 | 2 physical cores, enough to saturate |
| Mock server | 2-3 | 2 physical cores for backend simulation |
| Handoff server | 4-7 | 4 physical cores, one per event loop thread |
| Other processes | 8-15 | Isolated on NUMA node 1 |
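The "physical cores only, same NUMA node" rule can be checked mechanically. A minimal sketch, assuming the `CPU NODE SOCKET CORE ...` column order shown above; the here-string stands in for real `lscpu -e` output:

```bash
# List one CPU ID per physical core on NUMA node 0 -- the IDs that are safe to
# hand to --server-cpuset/--mock-cpuset/--load-cpuset without landing on an
# SMT sibling. The first CPU listed for a CORE value is the physical one;
# later CPUs repeating that CORE value are its SMT siblings.
topo='CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
0   0    0      0    0:0:0:0      yes
1   0    0      1    1:1:1:0      yes
8   1    0      8    8:8:8:1      yes
16  0    0      0    0:0:0:0      yes
17  0    0      1    1:1:1:0      yes'

# $1 = CPU, $2 = NODE, $4 = CORE; keep the first CPU seen per core on node 0.
awk 'NR > 1 && $2 == 0 && !seen[$4]++ { print $1 }' <<<"$topo"
```

On a live system, pipe `lscpu -e` straight into the same awk program instead of the here-string.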
### NETTY_SCHEDULER with 4 threads

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --io nio \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --jvm-args "-Xms8g -Xmx8g" \
    --connections 10000 --load-threads 4 \
    --mock-think-time 30 --mock-threads 4 \
    --perf-stat
```
### Analyzing bottlenecks with perf stat

Use `--perf-stat` to get reliable hardware-level metrics. The `perf-stat.txt` output is the ground truth for CPU utilization — pidstat per-thread numbers can be misleading with virtual threads.

```
Performance counter stats for process id '95868':

    39,741,757,754      task-clock        #    3.970 CPUs utilized
               806      context-switches  #   20.281 /sec
   199,114,762,646      instructions      #     1.17 insn per cycle
     1,338,722,757      branch-misses     #     3.08% of all branches
```

Key metrics to watch:

- **CPUs utilized**: how many cores the server is actually using (3.97 of 4 = fully saturated)
- **Context switches/sec**: lower is better; the custom scheduler typically achieves 20-80/sec
- **IPC (insn per cycle)**: higher is better; >1.0 is good, <0.5 suggests memory stalls
- **Branch misses**: >5% suggests unpredictable control flow

If CPUs utilized equals your allocated core count, the server is CPU-bound — add more cores. If context switches are high (>10K/sec), the scheduler or OS is thrashing.

pidstat is still useful for spotting **mock server or load generator bottlenecks**: check `pidstat-mock.log` and `pidstat-loadgen.log` to ensure they aren't saturated.
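When comparing many runs, the headline numbers can be scraped out of `perf-stat.txt` rather than read by eye. A minimal sketch; the here-string mirrors the sample output above, and the awk field positions are an assumption about perf's column layout on this machine:

```bash
# Pull "CPUs utilized" and the context-switch rate from perf stat output.
# On a live run, replace the here-string with the perf-stat.txt file.
perf_out="    39,741,757,754      task-clock        #    3.970 CPUs utilized
               806      context-switches  #   20.281 /sec"

cpus=$(awk '/CPUs utilized/ { print $(NF-2) }' <<<"$perf_out")
cswitch=$(awk '/context-switches/ { print $(NF-1) }' <<<"$perf_out")
echo "CPUs utilized: $cpus, context switches/sec: $cswitch"
```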
### NON_VIRTUAL_NETTY (default mode)

```bash
./run-benchmark.sh --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --connections 10000 --mock-think-time 30
```

### VIRTUAL_NETTY mode

```bash
./run-benchmark.sh --mode VIRTUAL_NETTY --threads 4 --io nio \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --connections 10000 --mock-think-time 30
```

### Mockless mode (skip HTTP call to mock, inline Jackson work)

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --mockless \
    --server-cpuset "4-7" --load-cpuset "0-1" \
    --connections 10000
```

### With async-profiler

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --profiler --profiler-path /path/to/async-profiler \
    --warmup 15s --total-duration 45s
```

### Rate-limited test with wrk2

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --rate 120000 --connections 10000 --total-duration 60s --warmup 15s
```

### With JFR events

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
    --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
    --jfr --jfr-events NettyRunIo,VirtualThreadTaskRuns
```

### Mixed: CLI flags + env vars

```bash
SERVER_JVM_ARGS="-XX:+PrintGCDetails" ./run-benchmark.sh --mode VIRTUAL_NETTY --threads 2
```

## Output

benchmark-runner/scripts/run-benchmark.sh

Lines changed: 7 additions & 5 deletions
```diff
@@ -581,14 +581,16 @@ start_pidstat() {
     log "pidstat running (PID: $PIDSTAT_PID)"
 
-    log "Starting pidstat for mock server (PID: $MOCK_PID)..."
+    if [[ -n "${MOCK_PID:-}" ]]; then
+        log "Starting pidstat for mock server (PID: $MOCK_PID)..."
 
-    local mock_output_file="$OUTPUT_DIR/$PIDSTAT_MOCK_OUTPUT"
+        local mock_output_file="$OUTPUT_DIR/$PIDSTAT_MOCK_OUTPUT"
 
-    pidstat -p "$MOCK_PID" "$PIDSTAT_INTERVAL" > "$mock_output_file" 2>&1 &
-    PIDSTAT_MOCK_PID=$!
+        pidstat -p "$MOCK_PID" "$PIDSTAT_INTERVAL" > "$mock_output_file" 2>&1 &
+        PIDSTAT_MOCK_PID=$!
 
-    log "pidstat running for mock server (PID: $PIDSTAT_MOCK_PID)"
+        log "pidstat running for mock server (PID: $PIDSTAT_MOCK_PID)"
+    fi
 }
 
 stop_pidstat() {
```
