franz1981
diff --git a/‎benchmark-runner/analysis/FINDINGS.md‎
Lines changed: 140 additions & 0 deletions b/‎benchmark-runner/analysis/FINDINGS.md‎
diff --git a/‎benchmark-runner/analysis/REPORT-120K.md‎
Lines changed: 105 additions & 0 deletions b/‎benchmark-runner/analysis/REPORT-120K.md‎
diff --git a/‎benchmark-runner/analysis/REPORT.md‎
Lines changed: 123 additions & 0 deletions b/‎benchmark-runner/analysis/REPORT.md‎
@@ -0,0 +1,140 @@
+# CPU Efficiency Investigation — Findings
+
+## Setup
+
+Deep profiling with perf stat (6 passes, ≤5 HW events each, no multiplexing) and perf mem (AMD IBS sampling, ~300K samples, JIT symbol resolution via `libperf-jvmti.so` + `perf inject --jit`).
+
+Four configs at 120K fixed-rate. All hit ~119.7K ± 0.1% across all passes. affinity_8 and fj_8_8 also profiled at max throughput.
+
+All server cores (8-15) are on CCD1 sharing the same 32MB L3. L3 is the last level cache — an L3 miss goes to DRAM.
+
+> **Glossary**
+> - **EL** — Event Loop (Netty I/O thread)
+> - **FJ** — ForkJoinPool (virtual thread scheduler)
+> - **IPC** — Instructions Per Cycle
+> - **nvcswch** — non-voluntary context switches
+> - **IBS** — Instruction Based Sampling (AMD hardware profiling, tags each sample with exact data source)
+> - **CCD** — Core Complex Die (8 cores sharing L3)
+> - **DRAM** — off-chip main memory
+
+---
+
+## Question 1: Why does custom_8_nio use less CPU than FJ-based configs at 120K?
+
+### Per-request metrics
+
+|  | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
+|--|-------------|--------------|-----------|-------|
+| CPUs utilized | 5.94 | 6.95 | 6.95 | 6.82 |
+| IPC | 0.997 | 0.981 | 0.970 | 1.015 |
+| instructions/req | 215,386 | 225,781 | 225,894 | 231,115 |
+| DRAM misses/req | 2,041 | 3,202 | 2,858 | 2,390 |
+| context switches/10s | 333K | 1,033K | 1,000K | 1,071K |
+| cpu-migrations/10s | 46K | 35K | 31K | **178K** |
+
+Delta vs custom_8_nio:
+
+| Metric | no_affinity_8 | affinity_8 | fj_8_8 |
+|--------|--------------|-----------|-------|
+| instructions/req | +4.8% | +4.9% | **+7.3%** |
+| DRAM misses/req | **+56.9%** | **+40.0%** | +17.1% |
+| context switches | **3.1x** | **3.0x** | **3.2x** |
+| cpu-migrations | -26% | -34% | **+281%** |
+
+Two sources of the CPU gap: FJ configs execute more instructions/req (scheduling overhead) and have more DRAM misses/req.
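The percentage deltas in the table above can be reproduced from the per-request values. The following is a minimal sketch: the dictionary layout and the helper name `delta_vs_baseline` are illustrative and not part of the benchmark tooling.

```python
# Per-request metrics at 120K, copied from the table above.
METRICS = {
    "instructions/req": {
        "custom_8_nio": 215_386, "no_affinity_8": 225_781,
        "affinity_8": 225_894, "fj_8_8": 231_115,
    },
    "DRAM misses/req": {
        "custom_8_nio": 2_041, "no_affinity_8": 3_202,
        "affinity_8": 2_858, "fj_8_8": 2_390,
    },
}


def delta_vs_baseline(row, baseline="custom_8_nio"):
    """Percentage change of each config relative to the baseline config."""
    base = row[baseline]
    return {
        cfg: round((value / base - 1) * 100, 1)
        for cfg, value in row.items()
        if cfg != baseline
    }


deltas = {name: delta_vs_baseline(row) for name, row in METRICS.items()}
for name, row in deltas.items():
    print(name, row)
```

Only the two metrics with exact table values are included. The context-switch and migration rows are rounded to the nearest thousand in the table, so recomputing them from those rounded values can differ by a few points from the deltas quoted.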
+
+### Where DRAM accesses happen at 120K (perf mem, IBS)
+
+The following table shows IBS sample counts tagged as "RAM hit" per category. These are not absolute DRAM miss counts — IBS samples memory accesses at a fixed rate and tags each sample with its data source (L1/L2/L3/RAM). The relative distribution across configs is valid (same sampling rate, same duration, same throughput).
+
+| Category | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
+|----------|:-----------:|:------------:|:---------:|:-----:|
+| Netty pipeline (write, channelRead, handler) | 171 | 386 | 311 | 393 |
+| Continuation (thaw, prepare, StackChunkFrame, run) | 25 | 85 | 60 | 153 |
+| HTTP client (MainClientExec, KeepAlive, HttpHost, ...) | 156 | 233 | 211 | 200 |
+| Kernel networking (sock_poll, epoll, tcp_*, lock_sock) | 421 | 459 | 459 | 635 |
+| FJ handoff (LinkedBlockingQueue, unparkVirtualThread) | — | — | — | 174 |
+| Other | 822 | 670 | 709 | 898 |
+| **Total** | **1,595** | **1,833** | **1,750** | **2,453** |
+
+The DRAM increase in FJ configs is broad rather than concentrated in a few hotspots. Every named category shows more DRAM samples than custom: Netty pipeline (+82-130%), continuation (+140-512%), HTTP client (+28-49%), kernel networking (+9-51%). Only the unclassified "Other" bucket is lower in no_affinity_8 and affinity_8 (670 and 709 vs 822). fj_8_8 additionally pays 174 samples for the EL→FJ handoff (LinkedBlockingQueue + unparkVirtualThread), absent in all other configs.
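As a cross-check of the per-category ranges, the relative change of each FJ config versus custom_8_nio can be computed directly from the sample counts. This is a sketch over the table values only; the shortened category labels and the helper `increase_range` are illustrative names.

```python
# IBS "RAM hit" sample counts at 120K, copied from the table above.
# Column order: custom_8_nio, no_affinity_8, affinity_8, fj_8_8.
SAMPLES = {
    "Netty pipeline":    (171, 386, 311, 393),
    "Continuation":      (25, 85, 60, 153),
    "HTTP client":       (156, 233, 211, 200),
    "Kernel networking": (421, 459, 459, 635),
    "Other":             (822, 670, 709, 898),
}


def increase_range(counts):
    """Min and max percentage change of the FJ configs versus custom_8_nio."""
    custom, *fj_configs = counts
    changes = [round((c / custom - 1) * 100) for c in fj_configs]
    return min(changes), max(changes)


ranges = {category: increase_range(counts) for category, counts in SAMPLES.items()}
for category, (low, high) in ranges.items():
    print(f"{category}: {low:+d}% to {high:+d}%")
```

The "Other" row is the only one whose range starts below zero, which is why the broad-increase statement is best read as applying to the named categories.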
+
+**Continuation thaw** accesses `stackChunkOopDesc` fields via pointer-chasing — each load depends on the previous load's result, so the CPU cannot prefetch ahead. IBS data source tagging shows these misses go straight from L2 to DRAM (near-zero L3 hits), meaning the data has been evicted from the entire cache hierarchy between thaw cycles.
+
+**Netty pipeline** traverses a linked list of `ChannelHandlerContext` nodes, which is also pointer-chasing. The specific functions differ between configs (e.g. `CombinedChannelDuplexHandler.write` appears only in affinity_8, `SimpleChannelInboundHandler.channelRead` only in no_affinity_8), but the total pipeline DRAM sample count is consistently 1.8-2.3x that of custom across all FJ configs (311-393 samples vs 171).
+
+**fj_8_8** has the highest total DRAM samples (2,453, +54% vs custom). Its 16 threads on 8 cores cause 178K cpu-migrations — 4-6x more than 8-thread configs. Each migration moves a thread to a core where its working set is not in L1/L2.
+
+---
+
+## Question 2: What changes between 120K and max throughput?
+
+### affinity_8: 120K vs max
+
+| Metric | @ 120K | @ max (152-163K*) | Delta |
+|--------|--------|-------------|-------|
+| CPUs utilized | 6.95 | 8.00 | saturated |
+| IPC | 0.970 | 1.039 | **+7.1%** |
+| instructions/req | 225,894 | 224,723 | -0.5% (same) |
+| DRAM misses/req | 2,858 | 1,317 | **-54.0%** |
+| context switches/10s | 1,000K | 10K | **-99%** |
+
+Same instructions/req at both load levels. At max, carriers are saturated (8.0 CPUs), DRAM misses/req drop 54%, and context switches drop 99%. No IBS data at max to show where the DRAM reduction comes from.
+
+### fj_8_8: 120K vs max
+
+| Metric | @ 120K | @ max (143-149K*) | Delta |
+|--------|--------|-------------|-------|
+| CPUs utilized | 6.82 | 7.90 | near-saturated |
+| IPC | 1.015 | 0.870 | **-14.3%** |
+| instructions/req | 231,115 | 199,160 | **-13.8%** |
+| DRAM misses/req | 2,390 | 1,260 | **-47.3%** |
+| context switches/10s | 1,071K | 170K | **-84.2%** |
+| cpu-migrations/10s | 178K | 6K | **-96.6%** |
+
+Different pattern from affinity_8: instructions/req drop 14% at max while IPC also drops 14%, so the two cancel out and cycles/req stay roughly flat. The 16 threads largely stop migrating (-97%) and DRAM misses/req drop 47%. The IPC drop coincides with 16 threads sharing 8 near-saturated cores.
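The "canceling out" can be made concrete by deriving cycles/req as instructions/req divided by IPC. The sketch below uses only the values from the two tables above; `cycles_per_req` and `change` are illustrative helper names.

```python
def cycles_per_req(instructions_per_req, ipc):
    """cycles/req = instructions/req divided by instructions per cycle."""
    return instructions_per_req / ipc


# (instructions/req, IPC) from the 120K-vs-max tables above.
RUNS = {
    ("affinity_8", "120K"): (225_894, 0.970),
    ("affinity_8", "max"):  (224_723, 1.039),
    ("fj_8_8", "120K"):     (231_115, 1.015),
    ("fj_8_8", "max"):      (199_160, 0.870),
}

cycles = {key: cycles_per_req(*values) for key, values in RUNS.items()}


def change(config):
    """Percentage change in cycles/req going from 120K to max throughput."""
    return (cycles[(config, "max")] / cycles[(config, "120K")] - 1) * 100


for config in ("affinity_8", "fj_8_8"):
    print(f"{config}: {change(config):+.1f}% cycles/req at max")
```

Under this derivation, affinity_8 spends about 7% fewer cycles per request at max, while fj_8_8 stays within 1% of its 120K value.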
+
+### L3 miss rate (same-run)
+
+| Config | L3 miss/req | L3 miss rate |
+|--------|------------|-------------|
+| custom_8_nio @ 120K | 2,145 | **11.8%** |
+| no_affinity_8 @ 120K | 2,590 | **13.1%** |
+| affinity_8 @ 120K | 2,576 | **14.0%** |
+| fj_8_8 @ 120K | 2,455 | **14.2%** |
+| affinity_8 @ max | 1,554 | **8.7%** |
+| fj_8_8 @ max | 1,252 | **7.4%** |
+
+L3 miss rate is 1.6x higher at 120K than at max for affinity_8 (14.0% vs 8.7%) and 1.9x higher for fj_8_8 (14.2% vs 7.4%). At 120K, IBS shows continuation data is not in any cache level by the time it's needed again.
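One way to read the table is to back out the implied L3 references per request. This is a sketch that assumes the miss rate is `cache-misses` divided by `cache-references`, as the Data quality notes indicate; the variable names are illustrative.

```python
# (L3 misses/req, L3 miss rate) from the table above.
L3 = {
    "custom_8_nio @ 120K":  (2_145, 0.118),
    "no_affinity_8 @ 120K": (2_590, 0.131),
    "affinity_8 @ 120K":    (2_576, 0.140),
    "fj_8_8 @ 120K":        (2_455, 0.142),
    "affinity_8 @ max":     (1_554, 0.087),
    "fj_8_8 @ max":         (1_252, 0.074),
}

# Assumption: miss rate = misses / references, so references = misses / rate.
references_per_req = {
    name: misses / rate for name, (misses, rate) in L3.items()
}

for name, refs in references_per_req.items():
    print(f"{name}: ~{refs:,.0f} L3 references/req")
```

If that assumption holds, the implied references/req stay in a fairly narrow band (roughly 17K-20K) across all six rows, so the lower miss rate at max comes from fewer misses rather than from more references.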
+
+---
+
+## Summary
+
+|  | custom_8_nio | affinity_8 | no_affinity_8 | fj_8_8 |
+|--|-------------|-----------|--------------|-------|
+| Max throughput | 174K | 168K | 159K | 161K |
+| CPUs at 120K | 5.94 | 6.95 | 6.95 | 6.82 |
+| DRAM misses/req @ 120K | 2,041 | 2,858 | 3,202 | 2,390 |
+| instructions/req @ 120K | 215,386 | 225,894 | 225,781 | 231,115 |
+| context switches @ 120K | 333K | 1,000K | 1,033K | 1,071K |
+| IBS DRAM samples @ 120K | 1,595 | 1,750 | 1,833 | 2,453 |
+
+custom_8_nio uses 13-15% less CPU at 120K, executing fewer instructions/req and incurring fewer DRAM misses/req. The DRAM increase in FJ configs is broad: every named category (Netty pipeline, continuation, HTTP client, kernel networking) shows more IBS DRAM samples than custom.
+
+affinity_8 and no_affinity_8 have similar metrics at 120K. At max throughput, affinity_8 achieves 168K vs 159K for no_affinity_8.
+
+fj_8_8 executes the most instructions/req (+7.3% vs custom), has the most IBS DRAM samples (2,453, +54% vs custom) including 174 from the EL→FJ handoff (absent in other configs), and 4-6x more cpu-migrations from 16 threads on 8 cores.
+
+At max throughput, both affinity_8 and fj_8_8 show substantially fewer DRAM misses/req (47-54% less) and context switches (84-99% less) compared to their 120K values. L3 miss rate drops from 14% to 7-9%. We do not have IBS data at max to identify where the DRAM reduction occurs.
+
+---
+
+## Data quality
+
+- All perf stat values are exact (no multiplexing) from 6-pass collection with ≤5 HW events per pass on AMD Zen4 (5 available GP counters after NMI watchdog).
+- L3 miss rate from pass E: `cache-references` and `cache-misses` in the same run.
+- perf mem uses AMD IBS sampling (~300K samples per run) with JIT symbol resolution.
+- All 120K runs hit 119.7K ± 0.1%.
+- *Max throughput during profiling runs is lower than REPORT.md due to perf stat overhead. fj_8_8: 143-149K (vs 161K). affinity_8: 152-163K (vs 168K).
+- perf c2c: ~760 HITMs at both load levels, ruling out false sharing.
@@ -0,0 +1,105 @@
+# Sustained Load Efficiency Analysis — 120K req/s Fixed Load
+
+## 1. Test Setup
+
+**Objective:** Compare scheduler configurations under a fixed 120K req/s load (roughly 69-76% of each config's max throughput) to measure per-request efficiency when the system has headroom.
+
+| Parameter | Value |
+|-----------|-------|
+| Load | 120,000 req/s fixed rate |
+| Connections | 10,000 |
+| Mock think time | 30ms |
+| Load-gen threads | 4 |
+| Duration | 10s warmup + 20s measurement |
+| CPU pinning | server=8-15, mock=4-7, loadgen=0-3 |
+
+All server cores on CCD1, sharing 32MB L3.
+
+> **Glossary**
+> - **EL** — Event Loop (Netty I/O thread)
+> - **FJ** — ForkJoinPool (virtual thread scheduler)
+> - **IPC** — Instructions Per Cycle
+> - **nvcswch** — non-voluntary context switches (thread was preempted rather than giving up the CPU itself)
+
+## 2. Configurations tested
+
+| Config | Event Loop | Scheduler | I/O | Threads | Affinity | Poller |
+|--------|-----------|-----------|-----|---------|----------|--------|
+| **custom_8_nio** | VirtualMultithreadIoELG | NettyScheduler | NIO | 8 | structural | POLLER_PER_CARRIER |
+| **affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | roundRobin + inherit | VTHREAD_POLLERS |
+| **no_affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | none | VTHREAD_POLLERS |
+| **fj_8_8** | MultiThreadIoELG | ForkJoinPool | NIO | 8+8 | none | VTHREAD_POLLERS |
+
+All configs achieved the target rate (119,695-119,810 req/s, within 0.3% of the 120,000 target and within 0.1% of each other).
+
+---
+
+## 3. Latency
+
+At 120K, latency does not differentiate the configs. All four are indistinguishable up to p90 (~30.5-32.8ms = mock delay + overhead). Tail latency above p90 is not stable across runs on this machine, so it is not compared.
+
+---
+
+## 4. CPU Usage and Per-Request Cost
+
+From deep profiling (6 perf stat passes, ≤5 HW events each). See [FINDINGS.md](FINDINGS.md) for full methodology.
+
+|  | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
+|--|-------------|--------------|-----------|-------|
+| **CPUs utilized** | **5.94** | 6.95 | 6.95 | 6.82 |
+| **IPC** | 0.997 | 0.981 | 0.970 | 1.015 |
+| **instructions/req** | **215,386** | 225,781 | 225,894 | 231,115 |
+| **DRAM misses/req** | **2,041** | 3,202 | 2,858 | 2,390 |
+
+custom_8_nio uses 13-15% less CPU to serve the same 120K req/s. Two sources ([FINDINGS.md](FINDINGS.md)):
+1. Fewer instructions/req — no FJ scheduling overhead
+2. Fewer DRAM misses/req — perf mem (IBS) shows the DRAM increase in FJ configs is broad across every category (Netty pipeline, continuation, HTTP client, kernel networking)
+
+fj_8_8 has the highest IPC (1.015) but executes the most instructions/req (+7.3% vs custom). See section 7 for fj_8_8-specific costs.
+
+---
+
+## 5. Context Switches
+
+| Config | context switches (max) | context switches (120K) | nvcswch/s @ 120K | nvcswch spread @ 120K |
+|--------|-------------|---------------|-----------------|---------------------|
+| custom_8_nio | 1,151 | 333,017 | 291 | 1.14x |
+| affinity_8 | 11,727 | 1,000,118 | 1,556 | 1.09x |
+| no_affinity_8 | 164,813 | 1,033,285 | 1,299 | 1.15x |
+| fj_8_8 | 144,578 | 1,071,409 | 1,631 (FJ workers) | 1.04x |
+
+At 120K, all configs show dramatically more context switches as threads park between requests. custom_8_nio still has 3x fewer than others.
+
+The non-voluntary context switch imbalance observed at max load (no_affinity_8: 8.6x spread) has vanished at 120K: all configs are balanced (1.04-1.15x). custom_8_nio still has 4.5-5.6x fewer non-voluntary context switches/s (291 vs 1,299-1,631).
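The ratios quoted for custom_8_nio can be recomputed from the 120K columns of the table. A minimal sketch, with `ratio_vs_custom` as an illustrative helper; note that the fj_8_8 nvcswch/s figure covers its FJ workers only, as the table states.

```python
# 120K columns of the context switch table above.
CTX_SWITCHES_120K = {
    "custom_8_nio": 333_017, "affinity_8": 1_000_118,
    "no_affinity_8": 1_033_285, "fj_8_8": 1_071_409,
}
NVCSWCH_PER_S_120K = {
    "custom_8_nio": 291, "affinity_8": 1_556,
    "no_affinity_8": 1_299, "fj_8_8": 1_631,  # fj_8_8: FJ workers only
}


def ratio_vs_custom(table):
    """How many times larger each config's value is than custom_8_nio's."""
    base = table["custom_8_nio"]
    return {
        cfg: round(value / base, 1)
        for cfg, value in table.items()
        if cfg != "custom_8_nio"
    }


ctx_ratio = ratio_vs_custom(CTX_SWITCHES_120K)
nvcswch_ratio = ratio_vs_custom(NVCSWCH_PER_S_120K)
print("context switches:", ctx_ratio)
print("nvcswch/s:", nvcswch_ratio)
```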
+
+---
+
+## 6. Affinity at 120K vs Max Load
+
+**At 120K, affinity has no measurable effect on CPU cost.** affinity_8 and no_affinity_8 use the same 6.95 CPUs, and their IPC, instructions/req, and total context switches are within about 3% of each other, which is within run-to-run variance.
+
+At 120K, carriers are not saturated (6.95 CPUs out of 8). The larger differences are DRAM misses/req (12% higher for no_affinity_8) and nvcswch/s (1,556 vs 1,299), neither of which shows up in CPUs utilized. At max throughput, affinity_8 achieves 168K vs 159K for no_affinity_8. See [FINDINGS.md](FINDINGS.md) for affinity_8's 120K-vs-max comparison.
+
+---
+
+## 7. fj_8_8 at 120K
+
+fj_8_8 (standard Netty, 8 EL + 8 FJ workers) has unique overhead at 120K:
+- **+7.3% instructions/req** vs custom — EL→FJ handoff adds scheduling work
+- **178K cpu-migrations** — 4-6x more than 8-thread configs (16 threads on 8 cores)
+- **174 IBS DRAM samples** in LinkedBlockingQueue + unparkVirtualThread — the handoff queue cost, absent in other configs
+- **Highest total IBS DRAM samples** (2,453, +54% vs custom) — increase is broad across all categories (see [FINDINGS.md](FINDINGS.md))
+
+At max throughput, migrations drop 97% and DRAM misses/req drop 47%, but IPC drops 14% (16 threads on 8 cores). fj_8_8 reaches 161K, below affinity_8 (168K) and custom_8_nio (174K) and only slightly above no_affinity_8 (159K).
+
+---
+
+## 8. Key Takeaways
+
+1. **custom_8_nio is the most efficient of the four configs at 120K**: 13-15% less CPU, fewer instructions/req (no FJ overhead), fewer DRAM misses/req. It also reaches the highest max throughput of the four (174K).
+
+2. **Affinity has no measurable effect at sub-maximal load.** affinity_8 and no_affinity_8 produce similar metrics at 120K. At max throughput, affinity_8 achieves higher throughput (168K vs 159K).
+
+3. **Non-voluntary context switch imbalance disappears at sub-maximal load.** The 8.6x spread seen at max load drops to 1.04-1.15x at 120K.
+
+4. **fj_8_8 is the least efficient 8-EL FJ config on most counts**: most instructions/req, most migrations, unique handoff queue DRAM costs. Its max throughput (161K) trails affinity_8 (168K) and custom_8_nio (174K), though it is slightly above no_affinity_8 (159K).
@@ -0,0 +1,123 @@
+# Netty Virtual Thread Scheduler — Max Load Benchmark Report
+**Date:** 2026-03-02 | **Machine:** AMD Ryzen 9 7950X, Fedora 43, Linux 6.18 | **JDK:** Custom OpenJDK build (loom branch)
+
+## 1. Test Setup
+
+| Parameter | Value |
+|-----------|-------|
+| Load | max throughput |
+| Connections | 10,000 |
+| Mock think time | 30ms |
+| Load-gen threads | 4 |
+| Duration | 10s warmup + 20s measurement |
+| CPU pinning | server=8-15, mock=4-7, loadgen=0-3 |
+
+All server cores on CCD1, sharing 32MB L3.
+
+> **Glossary**
+> - **EL** — Event Loop (Netty I/O thread)
+> - **FJ** — ForkJoinPool (virtual thread scheduler)
+> - **IPC** — Instructions Per Cycle
+> - **nvcswch** — non-voluntary context switches (thread was preempted rather than giving up the CPU itself)
+
+---
+
+## 2. Configurations tested
+
+| Config | Event Loop | Scheduler | I/O | Threads | Affinity | Poller |
+|--------|-----------|-----------|-----|---------|----------|--------|
+| **custom_8_epoll** | VirtualMultithreadIoELG | NettyScheduler | epoll | 8 | structural | POLLER_PER_CARRIER |
+| **custom_8_nio** | VirtualMultithreadIoELG | NettyScheduler | NIO | 8 | structural | POLLER_PER_CARRIER |
+| **affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | roundRobin + inherit | VTHREAD_POLLERS |
+| **no_affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | none | VTHREAD_POLLERS |
+| **fj_8_8** | MultiThreadIoELG | ForkJoinPool | NIO | 8+8 | none | VTHREAD_POLLERS |
+| **fj_4_4** | MultiThreadIoELG | ForkJoinPool | NIO | 4+4 | none | VTHREAD_POLLERS |
+
+---
+
+## 3. Throughput
+
+| Config | Requests/sec |
+|--------|-------------|
+| **custom_8_epoll** | **183,041** |
+| **custom_8_nio** | 174,374 |
+| **affinity_8** | 168,189 |
+| **fj_8_8** | 161,368 |
+| **no_affinity_8** | 158,798 |
+| **fj_4_4** | 136,362 |
+
+---
+
+## 4. CPU Usage and Per-Request Cost
+
+| Metric | custom_8_epoll | custom_8_nio | affinity_8 | fj_8_8 | no_affinity_8 | fj_4_4 |
+|--------|---------------|-------------|-----------|--------|--------------|--------|
+| **CPUs utilized** | 7.99 | 8.00 | 7.99 | 7.91 | 7.96 | 6.66 |
+| **IPC** | 1.08 | 1.09 | 1.05 | 0.99 | 1.03 | 0.99 |
+| **CPU migrations** | 271 | 138 | 1,836 | 4,158 | 181 | 1,562 |
+
+At max load, all 8-thread configs saturate the available cores (~8.0 CPUs). The efficiency difference shows in throughput per CPU: custom_8_epoll gets 22,938 req/s per CPU vs roughly 19,950-21,050 for the FJ configs (no_affinity_8 lowest, affinity_8 highest).
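Throughput per CPU can be recomputed from the throughput and CPU tables above. This sketch uses the "CPUs utilized" values as printed, which are rounded to two decimals, so results can differ by a few tens of req/s from figures computed on the unrounded counters.

```python
# Requests/sec (section 3) and CPUs utilized (section 4) at max load.
THROUGHPUT = {
    "custom_8_epoll": 183_041, "custom_8_nio": 174_374,
    "affinity_8": 168_189, "fj_8_8": 161_368,
    "no_affinity_8": 158_798, "fj_4_4": 136_362,
}
CPUS_UTILIZED = {
    "custom_8_epoll": 7.99, "custom_8_nio": 8.00,
    "affinity_8": 7.99, "fj_8_8": 7.91,
    "no_affinity_8": 7.96, "fj_4_4": 6.66,
}

# req/s per utilized CPU, based on the rounded table values.
per_cpu = {
    cfg: THROUGHPUT[cfg] / CPUS_UTILIZED[cfg] for cfg in THROUGHPUT
}

for cfg, value in sorted(per_cpu.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cfg}: ~{value:,.0f} req/s per CPU")
```

On these rounded inputs the ordering is custom_8_epoll, custom_8_nio, affinity_8, then the remaining FJ configs, with no_affinity_8 last.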
+
+The IPC gap (1.09 custom vs 0.99 fj_8_8) is attributed to DRAM misses: deep profiling at 120K ([FINDINGS.md](FINDINGS.md)) shows ManualEL FJ configs have 40-57% more DRAM misses/req than custom_8_nio, and fj_8_8 has 17% more plus additional costs from its EL→FJ handoff queue. The DRAM increase is broad across all named categories (Netty pipeline, continuation, HTTP client, kernel networking).
+
+---
+
+## 5. Context Switches
+
+| Config | context switches | nvcswch/s range | nvcswch spread |
+|--------|---------------------|----------------|----------------------|
+| custom_8_epoll | 1,667 | 11-20 | 1.5x |
+| custom_8_nio | 1,151 | 15-25 | 1.3x |
+| affinity_8 | 11,727 | 120-288 | 1.6x |
+| no_affinity_8 | 164,813 | 257-2,214 | **8.6x** |
+| fj_8_8 | 144,578 | (EL: 95-280, FJ: 110-240) | 2.5x |
+| fj_4_4 | 37,220 | — | — |
+
+The custom scheduler produces roughly 87-143x fewer context switches than no_affinity_8 and fj_8_8 (7-10x fewer than affinity_8), with near-zero non-voluntary context switches. no_affinity_8 shows a large non-voluntary context switch imbalance (8.6x spread); affinity flattens this to 1.6x. This imbalance disappears at sub-maximal load (see [REPORT-120K.md](REPORT-120K.md)).
+
+---
+
+## 6. Affinity at Max Load (affinity_8 vs no_affinity_8)
+
+Same event loop, same FJ pool, only difference is affinity hints:
+
+| Metric | affinity_8 | no_affinity_8 | Delta |
+|--------|-----------|--------------|-------|
+| Requests/sec | 168,189 | 158,798 | **+6%** |
+| Context switches | 11,727 | 164,813 | **-93%** |
+| nvcswch spread | 1.6x | 8.6x | **-81%** |
+
+Affinity provides +6% throughput, 14x fewer context switches, and balanced worker load at max throughput. At sub-maximal load (120K), affinity has no measurable effect — affinity_8 and no_affinity_8 produce similar metrics ([FINDINGS.md](FINDINGS.md)).
+
+---
+
+## 7. Why Custom Beats FJ
+
+| Metric | custom_8_nio | affinity_8 | fj_8_8 |
+|--------|-------------|-----------|-------|
+| Requests/sec | 174,374 | 168,189 | 161,368 |
+| Context switches | 1,151 | 11,727 | 144,578 |
+| IPC | 1.09 | 1.05 | 0.99 |
+| nvcswch/s (avg) | 20 | 228 | ~175 |
+
+The custom scheduler runs I/O events and virtual thread tasks on the same carrier thread. A virtual thread resumes on the same carrier that received the I/O event for its connection — implicit data locality without affinity hints.
+
+Deep profiling ([FINDINGS.md](FINDINGS.md)) identified two sources of the efficiency gap:
+1. **Fewer DRAM misses/req** — perf mem (IBS) shows the DRAM increase in FJ configs is broad across every category (Netty pipeline, continuation, HTTP client, kernel networking)
+2. **Fewer instructions/req** — no FJ scheduling overhead, no EL→FJ handoff
+
+fj_8_8 (standard Netty, 8 EL + 8 FJ) additionally pays for the EL→FJ handoff queue (LinkedBlockingQueue + unparkVirtualThread) and, at 120K, 4-6x more cpu-migrations from 16 threads on 8 cores. It has the highest total IBS DRAM samples at 120K (+54% vs custom); see [FINDINGS.md](FINDINGS.md).
+
+---
+
+## 8. Key Takeaways
+
+1. **custom_8_epoll is the most efficient config**: 183K req/s and the highest throughput per CPU, with an IPC of 1.08 (custom_8_nio is marginally higher at 1.09).
+
+2. **epoll vs NIO on custom scheduler:** epoll wins on throughput (+5%) and latency. Both achieve near-zero context switches.
+
+3. **Affinity helps FJ at max load** — +6% throughput, 14x fewer context switches, balanced worker load. But it cannot match custom's architectural advantage, and has no effect at sub-maximal load.
+
+4. **fj_8_8 is among the least efficient configs**: 16 threads cause high context switches (145K) and the most cpu-migrations (4.2K) at max load, plus unique DRAM costs from the EL→FJ handoff. Its throughput (161K) is below both custom configs and affinity_8, and only about 1.6% above no_affinity_8 (159K).
+
+5. **The IPC gap is DRAM misses**, not branch prediction or frontend stalls. In the 120K profiling, ManualEL FJ configs have 40-57% more DRAM misses/req; fj_8_8 has 17% more, with additional handoff queue costs. The increase is broad across all named categories rather than concentrated in a few hotspots.