perf stat uses `PROFILING_DELAY_SECONDS` and `PROFILING_DURATION_SECONDS`.

## Example Runs

### Choosing CPU pinning with `lscpu -e`

Good benchmarking requires NUMA-aware CPU pinning. Start by inspecting your topology:

```bash
$ lscpu -e
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE
  0    0      0    0 0:0:0:0       yes   # NUMA 0, physical core 0
  1    0      0    1 1:1:1:0       yes   # NUMA 0, physical core 1
...
  8    1      0    8 8:8:8:1       yes   # NUMA 1, physical core 8
...
 16    0      0    0 0:0:0:0       yes   # NUMA 0, SMT sibling of core 0
 17    0      0    1 1:1:1:0       yes   # NUMA 0, SMT sibling of core 1
```

Key rules:
- **Keep all benchmark components on the same NUMA node** to avoid cross-node memory latency
- **Use physical cores only** (avoid SMT siblings) for more stable results
- **Isolate noisy processes** (IDEs, browsers) on the other NUMA node

Example layout for a 16-core/2-NUMA system with 4 server threads:

| Component | CPUs | Rationale |
|-----------|------|-----------|
| Load generator | 0-1 | 2 physical cores, enough to saturate |
| Mock server | 2-3 | 2 physical cores for backend simulation |
| Handoff server | 4-7 | 4 physical cores, one per event loop thread |
| Other processes | 8-15 | Isolated on NUMA node 1 |
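
The "physical cores only" rule can be automated. The sketch below (illustrative, not part of `run-benchmark.sh`) derives a physical-cores-only CPU list from `lscpu -p=CPU,NODE,CORE` style output by keeping only the first CPU seen for each (node, core) pair; the CSV here is canned sample data standing in for real `lscpu` output.

```shell
# Keep only the first CPU listed for each (NODE,CORE) pair;
# later CPUs with the same pair are SMT siblings and are dropped.
# The here-doc is sample data in `lscpu -p=CPU,NODE,CORE` format.
physical_cpus=$(awk -F, '!/^#/ && !seen[$2 "," $3]++ {print $1}' <<'EOF'
# CPU,NODE,CORE
0,0,0
1,0,1
8,1,8
16,0,0
17,0,1
EOF
)
echo "$physical_cpus"
```

On a real machine, replace the here-doc with `lscpu -p=CPU,NODE,CORE` and feed the resulting list into the `--server-cpuset`/`--mock-cpuset` flags.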

### NETTY_SCHEDULER with 4 threads

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --io nio \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --jvm-args "-Xms8g -Xmx8g" \
  --connections 10000 --load-threads 4 \
  --mock-think-time 30 --mock-threads 4 \
  --perf-stat
```

### Analyzing bottlenecks with perf stat

Use `--perf-stat` to get reliable hardware-level metrics. The `perf-stat.txt` output is the
ground truth for CPU utilization; pidstat per-thread numbers can be misleading with virtual threads.

```
Performance counter stats for process id '95868':

    39,741,757,754      task-clock          #    3.970 CPUs utilized
               806      context-switches    #   20.281 /sec
   199,114,762,646      instructions        #    1.17  insn per cycle
     1,338,722,757      branch-misses       #    3.08% of all branches
```

Key metrics to watch:
- **CPUs utilized**: how many cores the server is actually using (3.97 of 4 = fully saturated)
- **Context switches/sec**: lower is better; the custom scheduler typically achieves 20-80/sec
- **IPC (insn per cycle)**: higher is better; >1.0 is good, <0.5 suggests memory stalls
- **Branch misses**: >5% suggests unpredictable control flow

If CPUs utilized equals your allocated core count, the server is CPU-bound; add more cores.
If context switches are high (>10K/sec), the scheduler or OS is thrashing.

pidstat is still useful for spotting **mock server or load generator bottlenecks**:
check `pidstat-mock.log` and `pidstat-loadgen.log` to ensure they aren't saturated.
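
When comparing runs, it helps to pull the headline numbers out of the perf stat text automatically. The sketch below is an assumption-laden helper, not part of the benchmark scripts: the inlined `perf_txt` mimics the sample output above, and the awk field positions assume that exact layout.

```shell
# Extract "CPUs utilized" and IPC from perf stat text output.
# perf_txt stands in for the contents of perf-stat.txt.
perf_txt='    39,741,757,754      task-clock        #    3.970 CPUs utilized
               806      context-switches  #   20.281 /sec
   199,114,762,646      instructions      #    1.17  insn per cycle'

# "3.970" sits two fields before the end of its line; "1.17" three fields.
cpus=$(printf '%s\n' "$perf_txt" | awk '/CPUs utilized/ {print $(NF-2)}')
ipc=$(printf '%s\n' "$perf_txt" | awk '/insn per cycle/ {print $(NF-3)}')
echo "CPUs utilized: $cpus  IPC: $ipc"
```

In practice you would replace the literal with `cat perf-stat.txt` and log the two values alongside the run's throughput numbers.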

### NON_VIRTUAL_NETTY (default mode)

```bash
./run-benchmark.sh --threads 4 \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --connections 10000 --mock-think-time 30
```

### VIRTUAL_NETTY mode

```bash
./run-benchmark.sh --mode VIRTUAL_NETTY --threads 4 --io nio \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --connections 10000 --mock-think-time 30
```

### Mockless mode (skip HTTP call to mock, inline Jackson work)

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 --mockless \
  --server-cpuset "4-7" --load-cpuset "0-1" \
  --connections 10000
```

### With async-profiler

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --profiler --profiler-path /path/to/async-profiler \
  --warmup 15s --total-duration 45s
```

### Rate-limited test with wrk2

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --rate 120000 --connections 10000 --total-duration 60s --warmup 15s
```

### With JFR events

```bash
./run-benchmark.sh --mode NETTY_SCHEDULER --threads 4 \
  --server-cpuset "4-7" --mock-cpuset "2-3" --load-cpuset "0-1" \
  --jfr --jfr-events NettyRunIo,VirtualThreadTaskRuns
```

### Mixed: CLI flags + env vars

```bash
SERVER_JVM_ARGS="-XX:+PrintGCDetails" ./run-benchmark.sh --mode VIRTUAL_NETTY --threads 2
```

## Output