Commit 825a3c8 (parent 3d9ee77) — Add ForkJoinPool VT scheduling JMH benchmarks and benchmark analysis reports

5 files changed, 769 additions
# CPU Efficiency Investigation — Findings

## Setup

Deep profiling with `perf stat` (6 passes, ≤5 HW events each, no multiplexing) and `perf mem` (AMD IBS sampling, ~300K samples, JIT symbol resolution via `libperf-jvmti.so` + `perf inject --jit`).

Four configs profiled at a fixed 120K req/s. All hit ~119.7K ± 0.1% across all passes. affinity_8 and fj_8_8 were also profiled at max throughput.

All server cores (8-15) are on CCD1 and share the same 32MB L3. L3 is the last-level cache — an L3 miss goes to DRAM.
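The passes described above can be reproduced with standard perf invocations along these lines. This is a sketch, not the exact scripts from this run — `$SERVER_PID` and the event grouping are illustrative, and the JVM must be started with `-agentpath:.../libperf-jvmti.so` for JIT symbols to resolve:

```shell
# One perf stat pass: few enough hardware events that nothing is multiplexed.
perf stat -e instructions,cycles,cache-references,cache-misses,context-switches \
    -p "$SERVER_PID" -- sleep 20

# perf mem pass: sample memory accesses (IBS on AMD) and tag each with its data source.
perf mem record -o perf.data -p "$SERVER_PID" -- sleep 20

# Fold in JIT-compiled symbols, then break samples down by memory level.
perf inject --jit -i perf.data -o perf.jit.data
perf mem report -i perf.jit.data --sort=mem,symbol
```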
> **Glossary**
> - **EL** — Event Loop (Netty I/O thread)
> - **FJ** — ForkJoinPool (virtual thread scheduler)
> - **IPC** — Instructions Per Cycle
> - **nvcswch** — non-voluntary context switches
> - **IBS** — Instruction-Based Sampling (AMD hardware profiling; tags each sample with its exact data source)
> - **CCD** — Core Complex Die (8 cores sharing an L3)
> - **DRAM** — off-chip main memory
---
## Question 1: Why does custom_8_nio use less CPU than FJ-based configs at 120K?

### Per-request metrics
| | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
|--|-------------|--------------|-----------|-------|
| CPUs utilized | 5.94 | 6.95 | 6.95 | 6.82 |
| IPC | 0.997 | 0.981 | 0.970 | 1.015 |
| instructions/req | 215,386 | 225,781 | 225,894 | 231,115 |
| DRAM misses/req | 2,041 | 3,202 | 2,858 | 2,390 |
| context switches/10s | 333K | 1,033K | 1,000K | 1,071K |
| cpu-migrations/10s | 46K | 35K | 31K | **178K** |
Delta vs custom_8_nio:

| Metric | no_affinity_8 | affinity_8 | fj_8_8 |
|--------|--------------|-----------|-------|
| instructions/req | +4.8% | +4.9% | **+7.3%** |
| DRAM misses/req | **+56.9%** | **+40.0%** | +17.1% |
| context switches | **3.1x** | **3.0x** | **3.2x** |
| cpu-migrations | -26% | -34% | **+281%** |

The CPU gap has two sources: FJ configs execute more instructions/req (scheduling overhead) and incur more DRAM misses/req.
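The delta percentages are straight ratios of the per-request counts against the custom_8_nio baseline. A minimal sketch of the arithmetic, with values copied from the table above:

```java
public class PerRequestDeltas {
    // Percentage delta of a config's per-request count vs the custom_8_nio baseline.
    static double deltaPct(double custom, double other) {
        return (other / custom - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // instructions/req and DRAM misses/req at 120K, from the table above.
        System.out.printf("no_affinity_8 instructions/req: %+.1f%%%n", deltaPct(215_386, 225_781)); // +4.8%
        System.out.printf("fj_8_8 instructions/req:        %+.1f%%%n", deltaPct(215_386, 231_115)); // +7.3%
        System.out.printf("no_affinity_8 DRAM misses/req:  %+.1f%%%n", deltaPct(2_041, 3_202));     // +56.9%
        System.out.printf("affinity_8 DRAM misses/req:     %+.1f%%%n", deltaPct(2_041, 2_858));     // +40.0%
    }
}
```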
### Where DRAM accesses happen at 120K (perf mem, IBS)

The following table shows IBS sample counts tagged as "RAM hit" per category. These are not absolute DRAM miss counts — IBS samples memory accesses at a fixed rate and tags each sample with its data source (L1/L2/L3/RAM). The relative distribution across configs is valid (same sampling rate, same duration, same throughput).

| Category | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
|----------|:-----------:|:------------:|:---------:|:-----:|
| Netty pipeline (write, channelRead, handler) | 171 | 386 | 311 | 393 |
| Continuation (thaw, prepare, StackChunkFrame, run) | 25 | 85 | 60 | 153 |
| HTTP client (MainClientExec, KeepAlive, HttpHost, ...) | 156 | 233 | 211 | 200 |
| Kernel networking (sock_poll, epoll, tcp_*, lock_sock) | 421 | 459 | 459 | 635 |
| FJ handoff (LinkedBlockingQueue, unparkVirtualThread) | — | — | — | 174 |
| Other | 822 | 670 | 709 | 898 |
| **Total** | **1,595** | **1,833** | **1,750** | **2,453** |

The DRAM increase in FJ configs is broad — not concentrated in a few hotspots. Every category shows more DRAM samples than custom: Netty pipeline (+82-130%), continuation (+140-512%), HTTP client (+28-49%), kernel networking (+9-51%). fj_8_8 additionally pays 174 samples for the EL→FJ handoff (LinkedBlockingQueue + unparkVirtualThread), absent from all other configs.
**Continuation thaw** accesses `stackChunkOopDesc` fields via pointer-chasing — each load depends on the previous load's result, so the CPU cannot prefetch ahead. IBS data-source tagging shows these misses go straight from L2 to DRAM (near-zero L3 hits), meaning the data has been evicted from the entire cache hierarchy between thaw cycles.

**Netty pipeline** traverses a linked list of `ChannelHandlerContext` nodes — also pointer-chasing. The specific functions differ between configs (e.g. `CombinedChannelDuplexHandler.write` appears only in affinity_8, `SimpleChannelInboundHandler.channelRead` only in no_affinity_8), but the total pipeline DRAM cost is consistently 2-2.3x higher than custom across all FJ configs.

**fj_8_8** has the highest total DRAM samples (2,453, +54% vs custom). Its 16 threads on 8 cores cause 178K cpu-migrations — 4-6x more than the 8-thread configs. Each migration moves a thread to a core where its working set is not in L1/L2.
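The dependent-load pattern behind both hotspots can be pictured with a plain linked-list walk. This is an illustrative sketch, not the JDK's continuation code or Netty's pipeline: the address for step i+1 is the data loaded at step i, so no load can issue until its predecessor completes, and the prefetcher cannot run ahead the way it can over a flat array.

```java
// Each node's successor is only known after loading the node itself.
final class Node {
    final int value;
    Node next;
    Node(int value) { this.value = value; }
}

public class PointerChase {
    // A chain of dependent loads: reading n.next requires n to be loaded first.
    static long traverse(Node head) {
        long sum = 0;
        for (Node n = head; n != null; n = n.next) sum += n.value;
        return sum;
    }

    public static void main(String[] args) {
        Node head = new Node(0);
        Node tail = head;
        for (int i = 1; i < 1_000; i++) { tail.next = new Node(i); tail = tail.next; }
        System.out.println(traverse(head)); // prints 499500
    }
}
```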
---
## Question 2: What changes between 120K and max throughput?

### affinity_8: 120K vs max

| Metric | @ 120K | @ max (152-163K*) | Delta |
|--------|--------|-------------|-------|
| CPUs utilized | 6.95 | 8.00 | saturated |
| IPC | 0.970 | 1.039 | **+7.1%** |
| instructions/req | 225,894 | 224,723 | -0.5% (same) |
| DRAM misses/req | 2,858 | 1,317 | **-54.0%** |
| context switches/10s | 1,000K | 10K | **-99%** |

instructions/req is the same at both load levels. At max, the carriers are saturated (8.0 CPUs), DRAM misses/req drop 54%, and context switches drop 99%. There is no IBS data at max to show where the DRAM reduction comes from.
### fj_8_8: 120K vs max

| Metric | @ 120K | @ max (143-149K*) | Delta |
|--------|--------|-------------|-------|
| CPUs utilized | 6.82 | 7.90 | near-saturated |
| IPC | 1.015 | 0.870 | **-14.3%** |
| instructions/req | 231,115 | 199,160 | **-13.8%** |
| DRAM misses/req | 2,390 | 1,260 | **-47.3%** |
| context switches/10s | 1,071K | 170K | **-84.2%** |
| cpu-migrations/10s | 178K | 6K | **-96.6%** |

The pattern differs from affinity_8: at max, instructions/req drop 14% but IPC also drops 14%, canceling out. The 16 threads stop migrating (-97%) and DRAM misses/req drop 47%; the IPC loss comes from running 16 threads on 8 cores.
### L3 miss rate (same-run)

| Config | L3 misses/req | L3 miss rate |
|--------|------------|-------------|
| custom_8_nio @ 120K | 2,145 | **11.8%** |
| no_affinity_8 @ 120K | 2,590 | **13.1%** |
| affinity_8 @ 120K | 2,576 | **14.0%** |
| fj_8_8 @ 120K | 2,455 | **14.2%** |
| affinity_8 @ max | 1,554 | **8.7%** |
| fj_8_8 @ max | 1,252 | **7.4%** |

L3 miss rate nearly doubles from max load to 120K for both configs. At 120K, IBS shows continuation data is not in any cache level by the time it is needed again.
---

## Summary

| | custom_8_nio | affinity_8 | no_affinity_8 | fj_8_8 |
|--|-------------|-----------|--------------|-------|
| Max throughput | 174K | 168K | 159K | 161K |
| CPUs at 120K | 5.94 | 6.95 | 6.95 | 6.82 |
| DRAM misses/req @ 120K | 2,041 | 2,858 | 3,202 | 2,390 |
| instructions/req @ 120K | 215,386 | 225,894 | 225,781 | 231,115 |
| context switches @ 120K | 333K | 1,000K | 1,033K | 1,071K |
| IBS DRAM samples @ 120K | 1,595 | 1,750 | 1,833 | 2,453 |

custom_8_nio uses 13-15% less CPU at 120K, executing fewer instructions/req with fewer DRAM misses/req. The DRAM increase in FJ configs is broad — every category (Netty pipeline, continuation, HTTP client, kernel networking) shows more IBS DRAM samples than custom.

affinity_8 and no_affinity_8 have similar metrics at 120K. At max throughput, affinity_8 achieves 168K vs 159K for no_affinity_8.

fj_8_8 executes the most instructions/req (+7.3% vs custom), has the most IBS DRAM samples (2,453, +54% vs custom) — including 174 from the EL→FJ handoff, absent in other configs — and 4-6x more cpu-migrations from running 16 threads on 8 cores.

At max throughput, both affinity_8 and fj_8_8 show substantially fewer DRAM misses/req (47-54% less) and context switches (84-99% less) than at 120K. L3 miss rate drops from 14% to 7-9%. We do not have IBS data at max to identify where the DRAM reduction occurs.
---
## Data quality

- All perf stat values are exact (no multiplexing), from a 6-pass collection with ≤5 HW events per pass on AMD Zen 4 (5 general-purpose counters available after the NMI watchdog claims one).
- L3 miss rate is from pass E: `cache-references` and `cache-misses` measured in the same run.
- perf mem uses AMD IBS sampling (~300K samples per run) with JIT symbol resolution.
- All 120K runs hit 119.7K ± 0.1%.
- *Max throughput during profiling runs is lower than in REPORT.md due to perf stat overhead. fj_8_8: 143-149K (vs 161K). affinity_8: 152-163K (vs 168K).
- perf c2c: ~760 HITMs at both load levels, ruling out false sharing.
---
# Sustained Load Efficiency Analysis — 120K req/s Fixed Load

## 1. Test Setup

**Objective:** Compare scheduler configurations under a fixed 120K req/s load (≈70% of max throughput) to measure per-request efficiency when the system has headroom.

| Parameter | Value |
|-----------|-------|
| Load | 120,000 req/s fixed rate |
| Connections | 10,000 |
| Mock think time | 30ms |
| Load-gen threads | 4 |
| Duration | 10s warmup + 20s measurement |
| CPU pinning | server=8-15, mock=4-7, loadgen=0-3 |

All server cores are on CCD1, sharing a 32MB L3.

> **Glossary**
> - **EL** — Event Loop (Netty I/O thread)
> - **FJ** — ForkJoinPool (virtual thread scheduler)
> - **IPC** — Instructions Per Cycle
> - **nvcswch** — non-voluntary context switches (the thread was preempted rather than yielding)
## 2. Configurations tested

| Config | Event Loop | Scheduler | I/O | Threads | Affinity | Poller |
|--------|-----------|-----------|-----|---------|----------|--------|
| **custom_8_nio** | VirtualMultithreadIoELG | NettyScheduler | NIO | 8 | structural | POLLER_PER_CARRIER |
| **affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | roundRobin + inherit | VTHREAD_POLLERS |
| **no_affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | none | VTHREAD_POLLERS |
| **fj_8_8** | MultiThreadIoELG | ForkJoinPool | NIO | 8+8 | none | VTHREAD_POLLERS |

All configs achieved the target rate (119,695-119,810 req/s, within 0.1%).
---

## 3. Latency

At 120K, latency does not differentiate the configs. All four are identical up to p90 (~30.5-32.8ms = mock delay + overhead). Tail latency beyond p90 is not stable across runs on this machine.
---
## 4. CPU Usage and Per-Request Cost

From deep profiling (6 perf stat passes, ≤5 HW events each). See [FINDINGS.md](FINDINGS.md) for the full methodology.
| | custom_8_nio | no_affinity_8 | affinity_8 | fj_8_8 |
|--|-------------|--------------|-----------|-------|
| **CPUs utilized** | **5.94** | 6.95 | 6.95 | 6.82 |
| **IPC** | 0.997 | 0.981 | 0.970 | 1.015 |
| **instructions/req** | **215,386** | 225,781 | 225,894 | 231,115 |
| **DRAM misses/req** | **2,041** | 3,202 | 2,858 | 2,390 |

custom_8_nio uses 13-15% less CPU to serve the same 120K req/s, for two reasons ([FINDINGS.md](FINDINGS.md)):
1. Fewer instructions/req — no FJ scheduling overhead
2. Fewer DRAM misses/req — perf mem (IBS) shows the DRAM increase in FJ configs is broad, spanning every category (Netty pipeline, continuation, HTTP client, kernel networking)

fj_8_8 has the highest IPC (1.015) but executes the most instructions/req (+7.3% vs custom). See section 7 for fj_8_8-specific costs.
---
## 5. Context Switches

| Config | context switches (max) | context switches (120K) | nvcswch/s @ 120K | nvcswch spread @ 120K |
|--------|-------------|---------------|-----------------|---------------------|
| custom_8_nio | 1,151 | 333,017 | 291 | 1.14x |
| affinity_8 | 11,727 | 1,000,118 | 1,556 | 1.09x |
| no_affinity_8 | 164,813 | 1,033,285 | 1,299 | 1.15x |
| fj_8_8 | 144,578 | 1,071,409 | 1,631 (FJ workers) | 1.04x |

At 120K, every config shows far more context switches than at max load, as threads park between requests. custom_8_nio still has 3x fewer than the others.

The non-voluntary context-switch imbalance observed at max load (no_affinity: 8.6x spread) vanishes at 120K — all configs are balanced (1.04-1.15x). custom_8_nio still has 4-5x fewer non-voluntary context switches/s.
---
## 6. Affinity at 120K vs Max Load

**At 120K, affinity has no measurable effect.** affinity_8 and no_affinity_8 are within 2-3% on every metric — inside run-to-run variance. Carriers are not saturated (6.95 CPUs out of 8), both configs use the same 6.95 CPUs, and their DRAM misses/req are within 12%.

At max throughput the picture changes: affinity_8 achieves 168K vs 159K for no_affinity_8. See [FINDINGS.md](FINDINGS.md) for affinity_8's 120K-vs-max comparison.
---
## 7. fj_8_8 at 120K

fj_8_8 (standard Netty, 8 EL + 8 FJ workers) has overhead at 120K that no other config pays:
- **+7.3% instructions/req** vs custom — the EL→FJ handoff adds scheduling work
- **178K cpu-migrations** — 4-6x more than the 8-thread configs (16 threads on 8 cores)
- **174 IBS DRAM samples** in LinkedBlockingQueue + unparkVirtualThread — the handoff-queue cost, absent in other configs
- **Highest total IBS DRAM samples** (2,453, +54% vs custom) — the increase is broad across all categories (see [FINDINGS.md](FINDINGS.md))

At max throughput, migrations drop 97% and DRAM misses drop 47%, but IPC drops 14% (16 threads on 8 cores), giving fj_8_8 the lowest max throughput among the 8-EL configs (161K vs 168K affinity, 174K custom).
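The handoff cost above can be pictured with a minimal sketch. This is illustrative only — the real path is Netty's event loop unparking a virtual thread through the JDK scheduler — but the mechanics are the same: the event-loop thread enqueues the task on a `LinkedBlockingQueue`, and a separate worker dequeues and runs it, so the task state and queue nodes are written on one core and read on another.

```java
import java.util.concurrent.Future;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class HandoffSketch {
    public static void main(String[] args) throws Exception {
        // The handoff queue: written by the "event loop" thread, read by the worker.
        LinkedBlockingQueue<Runnable> handoff = new LinkedBlockingQueue<>();
        ThreadPoolExecutor worker = new ThreadPoolExecutor(1, 1, 0L, TimeUnit.SECONDS, handoff);
        worker.prestartAllCoreThreads(); // worker is running, so every submit flows through the queue

        Thread eventLoop = Thread.currentThread();
        // submit() enqueues; the pool's worker thread dequeues and runs the task.
        Future<Boolean> ranElsewhere = worker.submit(() -> Thread.currentThread() != eventLoop);
        System.out.println("task ran on a different thread: " + ranElsewhere.get()); // true

        worker.shutdown();
    }
}
```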
---
## 8. Key Takeaways

1. **custom_8_nio is the most efficient at every load level** — 13-15% less CPU, fewer instructions/req (no FJ overhead), fewer DRAM misses/req.

2. **Affinity has no measurable effect at sub-maximal load.** affinity_8 and no_affinity_8 produce similar metrics at 120K. At max throughput, affinity_8 achieves higher throughput (168K vs 159K).

3. **Non-voluntary context-switch imbalance disappears at sub-maximal load.** The 8.6x spread seen at max load drops to 1.04-1.15x at 120K.

4. **fj_8_8 is the least efficient 8-EL FJ config** — most instructions/req, most migrations, unique handoff-queue DRAM costs, lowest max throughput among 8-EL configs.
---
# Netty Virtual Thread Scheduler — Max Load Benchmark Report

**Date:** 2026-03-02 | **Machine:** AMD Ryzen 9 7950X, Fedora 43, Linux 6.18 | **JDK:** Custom OpenJDK build (loom branch)

## 1. Test Setup

| Parameter | Value |
|-----------|-------|
| Load | max throughput |
| Connections | 10,000 |
| Mock think time | 30ms |
| Load-gen threads | 4 |
| Duration | 10s warmup + 20s measurement |
| CPU pinning | server=8-15, mock=4-7, loadgen=0-3 |

All server cores are on CCD1, sharing a 32MB L3.

> **Glossary**
> - **EL** — Event Loop (Netty I/O thread)
> - **FJ** — ForkJoinPool (virtual thread scheduler)
> - **IPC** — Instructions Per Cycle
> - **nvcswch** — non-voluntary context switches (the thread was preempted rather than yielding)
---
## 2. Configurations tested

| Config | Event Loop | Scheduler | I/O | Threads | Affinity | Poller |
|--------|-----------|-----------|-----|---------|----------|--------|
| **custom_8_epoll** | VirtualMultithreadIoELG | NettyScheduler | epoll | 8 | structural | POLLER_PER_CARRIER |
| **custom_8_nio** | VirtualMultithreadIoELG | NettyScheduler | NIO | 8 | structural | POLLER_PER_CARRIER |
| **affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | roundRobin + inherit | VTHREAD_POLLERS |
| **no_affinity_8** | ManualIoELG | ForkJoinPool | NIO | 8 | none | VTHREAD_POLLERS |
| **fj_8_8** | MultiThreadIoELG | ForkJoinPool | NIO | 8+8 | none | VTHREAD_POLLERS |
| **fj_4_4** | MultiThreadIoELG | ForkJoinPool | NIO | 4+4 | none | VTHREAD_POLLERS |
---
## 3. Throughput

| Config | Requests/sec |
|--------|-------------|
| **custom_8_epoll** | **183,041** |
| **custom_8_nio** | 174,374 |
| **affinity_8** | 168,189 |
| **fj_8_8** | 161,368 |
| **no_affinity_8** | 158,798 |
| **fj_4_4** | 136,362 |
---
## 4. CPU Usage and Per-Request Cost

| Metric | custom_8_epoll | custom_8_nio | affinity_8 | fj_8_8 | no_affinity_8 | fj_4_4 |
|--------|---------------|-------------|-----------|--------|--------------|--------|
| **CPUs utilized** | 7.99 | 8.00 | 7.99 | 7.91 | 7.96 | 6.66 |
| **IPC** | 1.08 | 1.09 | 1.05 | 0.99 | 1.03 | 0.99 |
| **CPU migrations** | 271 | 138 | 1,836 | 4,158 | 181 | 1,562 |

At max load, all 8-thread configs saturate the available cores (~8.0 CPUs). The efficiency difference shows up in throughput per CPU: custom_8_epoll gets 22,938 req/s per CPU vs 19,950-20,449 for the FJ configs.

The IPC gap (1.09 custom vs 0.99 fj_8_8) is driven by DRAM misses — deep profiling ([FINDINGS.md](FINDINGS.md)) shows ManualEL FJ configs have 40-57% more DRAM misses/req. fj_8_8 has +17% more DRAM misses/req, plus additional costs from its EL→FJ handoff queue. The DRAM increase is broad across all categories (Netty pipeline, continuation, HTTP client, kernel networking).
---
## 5. Context Switches

| Config | context switches | nvcswch/s range | nvcswch spread |
|--------|---------------------|----------------|----------------------|
| custom_8_epoll | 1,667 | 11-20 | 1.5x |
| custom_8_nio | 1,151 | 15-25 | 1.3x |
| affinity_8 | 11,727 | 120-288 | 1.6x |
| no_affinity_8 | 164,813 | 257-2,214 | **8.6x** |
| fj_8_8 | 144,578 | EL: 95-280, FJ: 110-240 | 2.5x |
| fj_4_4 | 37,220 | — | — |

The custom scheduler produces 100-140x fewer context switches and near-zero non-voluntary context switches. no_affinity_8 shows massive non-voluntary imbalance (8.6x spread) — affinity flattens this to 1.6x. The imbalance disappears at sub-maximal load (see [REPORT-120K.md](REPORT-120K.md)).
---
## 6. Affinity at Max Load (affinity_8 vs no_affinity_8)

Same event loop, same FJ pool; the only difference is the affinity hints:

| Metric | affinity_8 | no_affinity_8 | Delta |
|--------|-----------|--------------|-------|
| Requests/sec | 168,189 | 158,798 | **+6%** |
| Context switches | 11,727 | 164,813 | **-93%** |
| nvcswch spread | 1.6x | 8.6x | **-81%** |

Affinity provides +6% throughput, 14x fewer context switches, and balanced worker load at max throughput. At sub-maximal load (120K), affinity has no measurable effect — affinity_8 and no_affinity_8 produce similar metrics ([FINDINGS.md](FINDINGS.md)).
---
## 7. Why Custom Beats FJ

| Metric | custom_8_nio | affinity_8 | fj_8_8 |
|--------|-------------|-----------|-------|
| Requests/sec | 174,374 | 168,189 | 161,368 |
| Context switches | 1,151 | 11,727 | 144,578 |
| IPC | 1.09 | 1.05 | 0.99 |
| nvcswch/s (avg) | 20 | 228 | ~175 |

The custom scheduler runs I/O events and virtual-thread tasks on the same carrier thread. A virtual thread resumes on the same carrier that received the I/O event for its connection — implicit data locality without affinity hints.

Deep profiling ([FINDINGS.md](FINDINGS.md)) identified two sources of the efficiency gap:
1. **Fewer DRAM misses/req** — perf mem (IBS) shows the DRAM increase in FJ configs is broad, spanning every category (Netty pipeline, continuation, HTTP client, kernel networking)
2. **Fewer instructions/req** — no FJ scheduling overhead, no EL→FJ handoff

fj_8_8 (standard Netty, 8 EL + 8 FJ) additionally pays for the EL→FJ handoff queue (LinkedBlockingQueue + unparkVirtualThread) and 4x more cpu-migrations from 16 threads on 8 cores. It has the highest total IBS DRAM samples (+54% vs custom) — see [FINDINGS.md](FINDINGS.md).
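The structural difference can be sketched with a direct executor. This is illustrative only — the real NettyScheduler is far more involved — but it shows the same-carrier idea: the task runs on the thread that just handled the I/O event, so there is no queue transfer, no wakeup, and no cross-core migration of the request's working set.

```java
import java.util.concurrent.Executor;

public class SameCarrierSketch {
    // Direct execution: the task runs on the calling (I/O) thread itself,
    // so the request state it touches is still warm in that core's L1/L2.
    static final Executor SAME_THREAD = Runnable::run;

    public static void main(String[] args) {
        Thread eventLoop = Thread.currentThread();
        SAME_THREAD.execute(() ->
            System.out.println("ran on the event-loop thread: "
                + (Thread.currentThread() == eventLoop))); // true
    }
}
```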
---

## 8. Key Takeaways

1. **custom_8_epoll is the most efficient config** — 183K req/s at 1.08 IPC.

2. **epoll vs NIO on the custom scheduler:** epoll wins on throughput (+5%) and latency. Both achieve near-zero context switches.

3. **Affinity helps FJ at max load** — +6% throughput, 14x fewer context switches, balanced worker load. But it cannot match custom's architectural advantage, and it has no effect at sub-maximal load.

4. **fj_8_8 is the least efficient config** — 16 threads cause high context switches (145K) and cpu-migrations (4.2K) at max load, plus unique DRAM costs from the EL→FJ handoff. Lowest throughput among 8-EL configs.

5. **The IPC gap is DRAM misses** — not branch prediction or frontend stalls. ManualEL FJ configs have 40-57% more DRAM misses/req; fj_8_8 has +17% with additional handoff-queue costs. The increase is broad across all categories — not concentrated in a few hotspots.
