PPC64 ASM: AES-ECB/CBC/CTR/GCM #9852
PPC64 assembly code generated with PR: |
fcf8f3e to b606231
|
retest this please |
|
Initial benchmarks on an NXP T2080 (e6500) core with 1.8GHz core clock: With PR 9852: With master: |
Oh I did not try with |
|
-O3, AES GCM Table, SHA256 C
Master:
PR 9852 with WOLFSSL_PPC64_ASM WOLFSSL_PPC64_ASM_INLINE WOLFSSL_PPC64_ASM_SMALL WOLFSSL_PPC64_ASM_AES_NO_HARDEN WOLFSSL_PPC32_ASM WOLFSSL_PPC32_ASM_INLINE WOLFSSL_PPC32_ASM_SMALL
PR 9852 with WOLFSSL_PPC64_ASM WOLFSSL_PPC64_ASM_INLINE WOLFSSL_PPC64_ASM_AES_NO_HARDEN WOLFSSL_PPC64_ASM_AES_NO_HARDEN WOLFSSL_PPC32_ASM WOLFSSL_PPC32_ASM_INLINE |
dgarske left a comment:
Benchmarks posted. Marking approved, but won't consider merge until you have a chance to evaluate results. I will also work on running on an e5500 core.
|
Excellent — PPC64 ASM AES is something we have been wanting. We use TLS extensively for our RustChain blockchain attestation nodes and Ergo anchor transactions. Available for testing:
Would be happy to benchmark AES-GCM and AES-CTR throughput on POWER8 before and after this PR. Let us know if test results from real hardware would be useful for review. |
|
Hi David, Please run the performance numbers with the latest version of the code. Thanks! |
|
Hi @Scottcjn, I have implemented AES-ECB/CBC/CTR/GCM. Thanks, |
|
retest this please |
|
retest this please |
I needed this patch:
Here are the results on an NXP T1040 e5500 at 1.4GHz running Linux.
Symmetric Ciphers (MiB/s)
|
POWER8 S824 AES Benchmark Results
Hardware: IBM Power System S824 (8286-42A), dual 8-core POWER8, 512GB RAM, Ubuntu 20.04
Observations
Analysis
This PR uses scalar T-table AES with GPR instructions rather than the hardware We verified that The GCM decryption improvement suggests the baseline C path had a performance issue there that the ASM correctly addresses. Happy to run additional configurations or ECB benchmarks ( Tested on real iron: Elyan Labs POWER8 infrastructure. |
Update: POWER8 hardware AES implementation submitted
Following up on my benchmark results above: I've submitted PR #9932 which uses POWER8's hardware AES crypto instructions (
Quick comparison (AES-128 on POWER8 S824):
The key insight: POWER8 (ISA 2.07, 2013) introduced The hardware crypto approach is also inherently side-channel resistant (no data-dependent memory accesses), so no cache-line preloading is needed. I also found a GMAC correctness bug: Happy to collaborate on merging the approaches — your key expansion and GCM GHASH table work could complement the hardware crypto path nicely. @SparkiDev |
|
Hi Sean,
Built and benchmarked your latest code on our IBM POWER8 S824 (dual 8-core POWER8, 512 GB RAM, Ubuntu 20.04, GCC 9.4).
Build Notes
Compiled with:
Important: The assembly uses
Benchmark Results (POWER8 S824,
|
| Mode | This PR (T-table ASM) | PR #9932 (vcipher HW) | Speedup |
|---|---|---|---|
| AES-128-CBC-enc | 95 MiB/s | 960 MiB/s | 10.1x |
| AES-128-CBC-dec | 191 MiB/s | 5,550 MiB/s | 29.1x |
| AES-128-CTR | 94 MiB/s | 5,217 MiB/s | 55.5x |
| AES-256-CTR | 67 MiB/s | 3,866 MiB/s | 57.7x |
| AES-128-ECB | — | 5,819 MiB/s | — |
The POWER8 ISA 2.07 vcipher/vcipherlast instructions execute AES rounds in the vector crypto unit — single-cycle throughput with 7-cycle latency, which an 8-way interleaved pipeline fills completely. The hardware path also eliminates side-channel risk from T-table lookups.
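As a rough illustration of why interleaving matters, here is a back-of-the-envelope model that takes the latency and throughput figures quoted above at face value. It is an idealized sketch, not a measurement: the function name `pipeline_utilization` is made up for this example, and it ignores loads, stores, key scheduling, and every other instruction competing for issue slots.

```python
def pipeline_utilization(latency_cycles: int, interleave: int) -> float:
    """Fraction of issue slots kept busy by `interleave` independent blocks.

    Assumes a unit that can start one operation per cycle but needs
    `latency_cycles` before the result can feed the next dependent round.
    Within one block each AES round depends on the previous one, so a
    single block can start only one round every `latency_cycles` cycles.
    """
    if latency_cycles <= 0 or interleave <= 0:
        raise ValueError("latency_cycles and interleave must be positive")
    return min(1.0, interleave / latency_cycles)


if __name__ == "__main__":
    # Latency of 7 cycles is the figure quoted in the comment above.
    for ways in (1, 2, 4, 8):
        share = pipeline_utilization(7, ways)
        print(f"{ways}-way interleave: {share:.0%} of issue slots used")
```

Under this model any interleave depth of at least the latency saturates the unit, which is consistent with the 8-way claim. Chained modes such as CBC encryption cannot interleave blocks of a single stream, which may explain why CBC-enc shows the smallest speedup in the table.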
Happy to run any additional tests or configurations. Would be great to see the T-table approach used as a fallback for pre-POWER8 chips (e6500, etc.) with the hardware crypto path for POWER8+.
— Scott
Lagniappe: vec_perm AES on Power Mac G4 (no hardware crypto needed)
A little something extra: we ran a pure AltiVec vec_perm AES implementation on a 2002 Power Mac G4 Dual (7450 @ 1.25 GHz, Mac OS X Tiger 10.4, GCC 4.0.1). This uses
G4 Results (NIST FIPS-197 test vector verified ✅)
POWER8 S824 Results (same code, no vcipher used)
Why this matters
The
This is the unoptimized "half-table" method (16 vec_perm passes per SubBytes). The Hamburg algebraic decomposition (GF(2^4) tower field via vec_perm) would reduce this to ~6 vec_perm ops, roughly 2.5x faster.
Technique
Code (standalone, ~250 lines, compiles with
|
|
Hi Sean, Thank you — that means a lot coming from the wolfSSL team. I'll reach out to support for the contributor agreement right away. Happy to continue testing on our POWER8 S824 and vintage PowerPC hardware as needed. We do a lot of work with hardware-level crypto and SIMD optimization at Elyan Labs and would welcome the opportunity to contribute further to wolfSSL's PowerPC support down the road. Looking forward to getting the CLA sorted and this merged. Best, |
c9e119d to 960449d
|
Added XTS. |
960449d to 1c753bd
|
retest this please |
POWER8 S824 Benchmark -- Latest PR (2a3c940, Mar 12)
Hardware: IBM Power System S824, dual 8-core POWER8 @ 4.15GHz, 512GB RAM, Ubuntu 20.04, GCC 9.4.0
Built three configurations:
All three with:
Results (MiB/s)
Key Observations
The Hardening Dilemma on POWER8
The T-table hardening approach (preloading all cache lines before each round to prevent timing side channels) creates a fundamental tension: on POWER8's deep out-of-order pipeline with large L2/L3 caches, the preloading overhead dominates. This is exactly the problem that ISA 2.07's
For comparison, our PR #9932 using hardware
All constant-time by design, no hardening flag needed. Happy to run additional tests — XTS, different block sizes, etc. @SparkiDev |
Clarification: No existing PPC64 AES assembly in master
To be clear, wolfSSL master has zero PPC64 AES assembly. The
So the hardened T-table assembly introduced here is actually roughly 8.5x slower than the existing C fallback on POWER8 (23 vs 196 MiB/s for AES-128-CBC-enc).
Offer to benchmark our vcipher implementation
We'd be happy to run the full AES benchmark suite on POWER8 using our PR #9932 (
The hardware crypto path would give wolfSSL a POWER8 AES implementation that is both fast (3,000+ MiB/s) and inherently constant-time, with no hardening flags or cache-line preloading needed. |
|
dgarske left a comment:
I'm very happy with this performance now. Thank you Sean! Let's see what the customer says before merging; it will also have to wait until after the release.
2a3c940 to 52a5b00
|
Made the prefetch happen once per block. |
a60f34b to 34cb577
|
retest this please |
wolfSSL PR #9852 Benchmark Report
PPC64 AES-ECB/CBC/CTR/GCM Assembly Optimization
Symmetric Ciphers (MiB/s)
AES-GCM Table vs 4bit Comparison (pr9852 only)
The
Summary
PR #9852 delivers large improvements to AES operations:
The |
a480fbf to f7b9144
|
retest this please |
4b659fb to 9cbaa49
POWER8 S824 Benchmark -- Latest PR (9cbaa49, inline C variant)
Hardware: IBM POWER8 S824 (8286-42A) -- Dual 8-core @ 4.15GHz, 512GB RAM, Ubuntu 20.04, GCC 9.4.0
Solid 12-18% improvement across the board on CBC.
Bug: AES-GCM segfaults on POWER8
ECB, CBC, and CTR all work fine. GCM crashes with SIGSEGV (exit 139) immediately. Happens with both --enable-aesgcm=table and default 4bit. Want me to grab a stack trace?
Build note
The .S assembly variant hits a PPC64 relocation error on our POWER8:
The inline C variant (--enable-ppc64-asm=inline) builds and runs clean. Probably needs -mcmodel=large or position-independent code in the .S file for large-address POWER8 setups. |
Fix: .S assembly relocation overflow on POWER8
Found and fixed the relocation issue. The .S file uses 32-bit absolute addressing (lis REG, SYM@ha / la REG, SYM@l(REG)) which overflows when the binary loads above 4GB on POWER8. Fix is switching to TOC-relative addressing -- standard PPC64 ELFv2 approach. 18 references across 6 symbols:
- lis REG, L_AES_PPC64_te@ha
- la REG, L_AES_PPC64_te@l(REG)
+ addis REG, 2, L_AES_PPC64_te@toc@ha
+ addi REG, REG, L_AES_PPC64_te@toc@l
r2 holds the TOC base in ELFv2 ABI. @toc@ha / @toc@l generates TOC-relative relocations that work regardless of where the binary loads. With this fix the .S variant builds and runs clean on POWER8. Including GCM -- the GCM segfault was the same relocation bug (L_GCM_gmult_len_r also used bare @ha/@l).
.S Assembly Benchmark (with fix applied)
All modes working. Want me to submit a PR against your branch with the fix, or regenerate via your Ruby script with TOC addressing? The fix is mechanical -- sed one-liner replaces lis/la @ha/@l with addis/addi @toc@ha/@toc@l across all 18 occurrences. |
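For anyone who wants to see the shape of that mechanical rewrite, here is a minimal Python sketch of the substitution described in the comment above. It is a stand-in for the sed one-liner, not the command that was actually run. The names `LIS_HA`, `LA_L`, and `to_toc_relative` are hypothetical, and the patterns assume registers are written as bare numbers or `rN` tokens, which may not match the generated .S file exactly.

```python
import re

# lis REG, SYM@ha       ->  addis REG, 2, SYM@toc@ha   (r2 is the TOC base)
LIS_HA = re.compile(r"\blis(\s+)(\w+)\s*,\s*([\w.$]+)@ha\b")

# la REG, SYM@l(BASE)   ->  addi REG, BASE, SYM@toc@l
# (`la Rx, D(Ry)` is an extended mnemonic for `addi Rx, Ry, D`)
LA_L = re.compile(r"\bla(\s+)(\w+)\s*,\s*([\w.$]+)@l\((\w+)\)")


def to_toc_relative(line: str) -> str:
    """Rewrite one assembly line from absolute to TOC-relative addressing.

    Lines that match neither pattern are returned unchanged, and already
    rewritten lines are left alone, so the function is safe to re-run.
    """
    line = LIS_HA.sub(r"addis\1\2, 2, \3@toc@ha", line)
    line = LA_L.sub(r"addi\1\2, \4, \3@toc@l", line)
    return line


if __name__ == "__main__":
    sample = [
        "    lis 3, L_AES_PPC64_te@ha",
        "    la 3, L_AES_PPC64_te@l(3)",
        "    lwz 5, 0(3)",
    ]
    for asm_line in sample:
        print(to_toc_relative(asm_line))
```

Regenerating from the script that emits the assembly is probably the more durable fix, since a text rewrite like this has to be repeated every time the .S file is regenerated.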
9cbaa49 to 235d344
|
Pushed a fix for POWER8. |
235d344 to 3fc2145
|
Hi @Scottcjn, were you able to confirm that the fixes @SparkiDev pushed resolved the POWER8 issues? We are getting close to merging this work. Thanks, David Garske, wolfSSL |
Description
To turn on assembly:
--enable-ppc64-asm
To build C code:
--enable-ppc64-asm=inline
To disable hardening (when physical access to device is not possible):
WOLFSSL_PPC64_ASM_AES_NO_HARDEN
AES-GCM works with either 4-bit (default) or table:
--enable-aesgcm=table
Using 'table' is faster for encryption/decryption.
Testing
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr CFLAGS=-DWOLFSSL_PPC64_ASM_AES_NO_HARDEN --enable-ppc64-asm
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr CFLAGS=-DWOLFSSL_PPC64_ASM_AES_NO_HARDEN --enable-ppc64-asm=inline
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr --enable-ppc64-asm
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr --enable-ppc64-asm=inline
./configure --disable-shared LDFLAGS=--static --host=powerpc64 CC=powerpc64-linux-gnu-gcc --enable-aesecb --enable-aescbc --enable-aesgcm=table --enable-aesctr