Testing SIMD instructions on ARM with Rust on Android
This blog post is the last one of a series exploring SIMD support with Rust on Android. In the previous two posts, I introduced how to compile Rust libraries for Android and detect SIMD instructions supported by the CPU at runtime.
Today, we’ll see how to effectively use the SIMD instructions themselves, and get the most performance out of them. After an introduction on running Rust benchmarks (and unit tests) on Android devices, we’ll measure the performance in various scenarios offered by Rust, and see that the overhead of CPU feature detection can be non-trivial. I’ll then describe various ways to reduce this overhead.
Lastly, I’ll present updated benchmarks on ARM of Horcrux, my Rust implementation of Shamir’s Secret Sharing, and see how they compare to Intel.
As always, my code is available on GitHub.
- Benchmarking setup on Android
- Overhead of CPU feature detection
- Reducing the overhead of dynamic detection
- Real-world benchmarks with Horcrux
- Conclusion
Benchmarking setup on Android
In my previous post, I described the traditional way of embedding native Rust code in end-user Android apps.
However, this method involved many steps to compile a .so library from Rust and invoke it from Java.
Besides, we couldn’t use println!, but instead had to write any console output via android.util.Log.
This whole setup was cumbersome, and didn’t seem to play well with Rust’s built-in testing framework, which works via a simple #[test] attribute (and likewise micro-benchmarks via #[bench]).
However, I eventually realized that the built-in tests work on Android as well (thanks StackOverflow!), once we add a few lines of configuration.
No need to write a custom test framework or anything complicated.
So how do Rust’s unit tests work?
When running a Rust test suite with cargo test, a separate Rust binary containing all the tests is compiled and then invoked.
Of course, in the case of cross-compiling (to Android), this binary cannot be invoked on the host platform (where we run cargo).
But for Android, we can upload this native binary to the device via the Android Debug Bridge and invoke it directly!
To customize the invocation step, Cargo accepts a target.<triple>.runner parameter, containing the path of a script to run (on the host) instead of the test binary.
You’ll typically specify the runner script in ${HOME}/.cargo/config (where we’ve already configured the linker parameter for Android in this previous post).
With that, Cargo will invoke our runner script, passing the test binary as the first parameter.
[target.aarch64-linux-android]
linker = "/path/to/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang"
runner = "/home/dev/android-runner.sh"
...
We now need to create the runner script.
The typical method (followed by flapigen-rs or dinghy) is to push the executable to Android’s /data/local/tmp/ directory via adb, and invoke it from there via adb shell, which will forward the standard output back to the host.
An important thing is to forward any remaining parameters to the binary ($@ in Bash), for example to be able to run benchmarks (which are compiled in the same test binary, but invoked with an additional --bench flag).
#!/bin/bash
set -eu
echo "####################"
echo "# android-runner.sh invoked with: $@"
echo "####################"
# The binary to upload and run is the first argument.
BINARY_PATH="$1"
BINARY=`basename ${BINARY_PATH}`
# Remove the first parameter.
shift
# Push the test binary on the device via ADB.
adb push "${BINARY_PATH}" "/data/local/tmp/$BINARY"
adb shell "chmod 755 /data/local/tmp/$BINARY"
# Run the test binary, forwarding the remaining parameters, so that benchmarks,
# test filtering, etc. work.
adb shell "/data/local/tmp/$BINARY $@"
# Cleanup.
adb shell "rm /data/local/tmp/$BINARY"
With this setup, you can now run your tests/benchmarks directly on an Android device.
cargo +nightly bench --target aarch64-linux-android
# Running binaries in 32-bit mode also works on a 64-bit ARM CPU.
cargo +nightly bench --target armv7-linux-androideabi
# Specify compile-time CPU features with RUSTFLAGS to quickly prototype them.
RUSTFLAGS='-C target-feature=+aes' cargo +nightly bench --target aarch64-linux-android
Uploading and running a binary directly rather than going through a traditional user application may seem surprising, but Android is fundamentally based on Linux, and this flow shouldn’t bypass the Android security model. Indeed, the user must have manually enabled ADB, and the binary will be sandboxed to run under the shell user’s privileges (e.g. without access to other apps’ data).
Overhead of CPU feature detection
Now that we can easily run Rust benchmarks on Android devices, let’s measure the runtime overhead of various CPU feature detection methods.
Implicit feature detection: beware of target-feature!
The first scenario is to simply use the SIMD intrinsics from std::arch without doing any feature detection.
This method bypasses the unsafe contract of intrinsics, as the compiled code will crash (with a SIGILL signal on Linux) on CPUs that don’t support the relevant feature.
However, it is still interesting to benchmark as a baseline.
As an example, I’ve implemented this strategy in my haraka-rs repository.
At a high level, this library exposes a haraka256 hash function, which recursively calls smaller functions, until wrapper functions invoke the correct SIMD intrinsics depending on the architecture, e.g. vaeseq_u8 on ARM.
pub fn haraka256<const N_ROUNDS: usize>(dst: &mut [u8; 32], src: &[u8; 32]) {
// ...
for i in 0..N_ROUNDS {
aes_mix2(&mut s0, &mut s1, 4 * i);
}
// ...
}
#[inline(always)]
fn aes_mix2(s0: &mut Simd128, s1: &mut Simd128, rci: usize) {
aes2(s0, s1, rci);
mix2(s0, s1);
}
#[inline(always)]
fn aes2(s0: &mut Simd128, s1: &mut Simd128, rci: usize) {
Simd128::aesenc(s0, &HARAKA_CONSTANTS[rci]);
Simd128::aesenc(s1, &HARAKA_CONSTANTS[rci + 1]);
Simd128::aesenc(s0, &HARAKA_CONSTANTS[rci + 2]);
Simd128::aesenc(s1, &HARAKA_CONSTANTS[rci + 3]);
}
// No feature detection here, other than the CPU architecture.
#[cfg(any(target_arch = "arm", target_arch = "aarch64"))]
impl Simd128 {
// Function that invokes the ARM intrinsics.
#[inline(always)]
pub(crate) fn aesenc(block: &mut Self, key: &Self) {
unsafe {
let zero = vdupq_n_u8(0);
let x = vaeseq_u8(block.0, zero);
let y = vaesmcq_u8(x);
block.0 = veorq_u8(y, key.0);
}
}
}
The corresponding benchmark function looks like the following – see this blog post about why you should use the black_box function in benchmarks.
#[bench]
fn bench_haraka256_5round(b: &mut Bencher) {
let src: [u8; 32] = /* ... */;
b.iter(|| {
let mut dst = [0; 32];
haraka256::<5>(&mut dst, black_box(&src));
dst
})
}
Let’s run this benchmark on an Android device that supports the aes feature (needed for the vaeseq_u8 intrinsic).
Everything is #[inline(always)], so after compiler optimizations, we’d expect that the haraka256 function is just one linear block, with the loops unrolled and everything inlined down to the SIMD instructions.
Below are the benchmark results for two cases: either with the default compilation flags, or with RUSTFLAGS='-C target-feature=+aes' to explicitly specify that the aes feature is supported.
Function | Target features | x86_64 (laptop) | aarch64 (Samsung S7) | armv7 (Samsung S7)
---|---|---|---|---
Haraka-256 | default | 46 ns | 395 ns | 2.2 μs
Haraka-256 | +aes | 6 ns | 52 ns | 572 ns
Haraka-512 | default | 72 ns | 405 ns | 6.6 μs
Haraka-512 | +aes | 16 ns | 88 ns | 1.1 μs
It turns out that running this code with target-feature=+aes yields a large improvement over the default!
In this example, the performance gain of specifying the correct flags is around 5x, across all CPU architectures.
To take another data point, the same kind of performance improvements (up to 5x) were observed in RustCrypto/block-ciphers/165.
To explain this, let’s look at the assembly output with the Godbolt Compiler Explorer tool.
With the default features (-C opt-level=3 --target=aarch64-linux-android), we can see that the intrinsics have not been inlined, but appear as separate functions like core...vaesmcq_u8 (Compiler Explorer).
This is a massive performance overhead, because instead of simply translating to a single CPU instruction (e.g. aesmc), each intrinsic also adds instructions to load/store values, and a control-flow indirection towards the intrinsic function (bl on ARM).
; Intrinsic appears as a standalone function.
core::core_arch::arm_shared::crypto::vaesmcq_u8:
ldr q0, [x0] ; Load
aesmc v0.16b, v0.16b ; Actual instruction
str q0, [x8] ; Store
ret
core::core_arch::arm_shared::crypto::vaeseq_u8:
ldr q0, [x0]
ldr q1, [x1]
aese v0.16b, v1.16b
str q0, [x8]
ret
haraka256_5:
sub sp, sp, #336
str x29, [sp, #288]
...
stp q0, q1, [sp, #48]
bl core::core_arch::arm_shared::crypto::vaeseq_u8
add x0, sp, #256
add x1, sp, #112
bl core::core_arch::arm_shared::crypto::vaesmcq_u8
adrp x8, .LCPI2_0
ldr q0, [sp, #256]
...
On the other hand, with explicit features (-C opt-level=3 -C target-feature=+aes --target=aarch64-linux-android), the code is compiled into a nice and clean unrolled loop with everything inlined (Compiler Explorer).
haraka256_5:
ldp q0, q2, [x1]
movi v1.2d, #0000000000000000
adrp x8, .LCPI0_0
adrp x9, .LCPI0_1
mov v3.16b, v0.16b
aese v3.16b, v1.16b ; vaeseq_u8
aesmc v3.16b, v3.16b ; vaesmcq_u8
ldr q4, [x8, :lo12:.LCPI0_0]
mov v6.16b, v2.16b
ldr q5, [x9, :lo12:.LCPI0_1]
aese v6.16b, v1.16b
...
So, what prevents the intrinsics from being inlined without explicit compilation flags?
Under the hood, the vaeseq_u8 intrinsic is implemented as follows.
The important part for aarch64 is the target_feature(enable = "aes") annotation, which means that the intrinsic requires the aes feature.
/// AES single round encryption.
///
/// [Arm's documentation](https://developer.arm.com/architectures/instruction-sets/intrinsics/vaeseq_u8)
#[inline]
#[cfg_attr(not(target_arch = "arm"), target_feature(enable = "aes"))]
#[cfg_attr(target_arch = "arm", target_feature(enable = "crypto,v8"))]
#[cfg_attr(test, assert_instr(aese))]
pub unsafe fn vaeseq_u8(data: uint8x16_t, key: uint8x16_t) -> uint8x16_t {
vaeseq_u8_(data, key)
}
For completeness, the underlying conversion from the intrinsic to a CPU instruction is deferred to LLVM, via link_name = "llvm.aarch64.crypto.aese".
#[allow(improper_ctypes)]
extern "unadjusted" {
#[cfg_attr(target_arch = "aarch64", link_name = "llvm.aarch64.crypto.aese")]
#[cfg_attr(target_arch = "arm", link_name = "llvm.arm.neon.aese")]
fn vaeseq_u8_(data: uint8x16_t, key: uint8x16_t) -> uint8x16_t;
}
The real problem is that the target_feature annotation on vaeseq_u8() prevents inlining this function into other functions that don’t have the same target_features, such as my aesenc() function.
This was discussed in more detail on StackOverflow, and brought up to the Rust compiler project (rust-lang/rust/54353, rust-lang/rust/53069).
Arguably, this “inlining barrier” between functions with different features is an implementation detail (bug or feature) of LLVM, as the RFC for target_feature doesn’t specify any inlining constraints.
To complete this section, what does the -C target-feature=+aes command-line flag do?
My understanding is that it automatically annotates all functions as target_feature(enable = "aes"), so everything is allowed to be inlined again.
If you’re using static CPU feature detection to gate your code (rather than using the SIMD intrinsics without any check), you’ll need to pass the corresponding -C target-feature=... flags for that code to be compiled at all, and the result will be nice and optimized.
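To make this concrete, here is a minimal sketch of such static gating (my own example, with a made-up pmul_static name, not code from the repositories mentioned in this post): the optimized path only exists in builds where the aes feature was passed on the command line, and calling the intrinsic is sound precisely because of that.

// A minimal sketch (mine): with static gating, the optimized function only
// exists when the crate is built with `-C target-feature=+aes`, so no runtime
// check is needed.
#[cfg(all(target_arch = "aarch64", target_feature = "aes"))]
pub fn pmul_static(a: u64, b: u64) -> u128 {
    // Safety: the cfg above guarantees that the `aes` feature (and `neon`,
    // implied by aarch64) is enabled for the whole compilation.
    unsafe { std::arch::aarch64::vmull_p64(a, b) }
}

// Portable fallback, compiled when the feature flag was not passed.
#[cfg(not(all(target_arch = "aarch64", target_feature = "aes")))]
pub fn pmul_static(a: u64, b: u64) -> u128 {
    let mut result: u128 = 0;
    for i in 0..64 {
        if a & (1 << i) != 0 {
            result ^= (b as u128) << i;
        }
    }
    result
}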
Dynamic feature detection
Let’s now discuss dynamic feature detection, which is ultimately what you’ll want if you distribute your application to a wide variety of device models.
To measure the overhead of dynamic detection, we’ll consider – like in the previous post – the following pmul function, which simply translates to the vmull_p64 intrinsic.
// SIMD: dynamic detection.
pub fn pmul(a: u64, b: u64) -> u128 {
#[cfg(target_arch = "aarch64")]
{
use std::arch::is_aarch64_feature_detected;
if is_aarch64_feature_detected!("neon") && is_aarch64_feature_detected!("aes") {
// Safety: target_features "neon" and "aes" are available in this block.
return unsafe { pmul_aarch64_neon(a, b) };
}
}
pmul_nosimd(a, b)
}
// SIMD implementation.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon", enable = "aes")]
unsafe fn pmul_aarch64_neon(a: u64, b: u64) -> u128 {
use std::arch::aarch64::vmull_p64;
// Safety: target_features "neon" and "aes" are available in this function.
vmull_p64(a, b)
}
// Fallback implementation
fn pmul_nosimd(a: u64, b: u64) -> u128 {
let mut tmp: u128 = b as u128;
let mut result: u128 = 0;
for i in 0..64 {
if a & (1 << i) != 0 {
result ^= tmp;
}
tmp <<= 1;
}
result
}
With the standard optimizations (-C opt-level=3 --target=aarch64-linux-android), the code compiles as follows (Compiler Explorer).
- Firstly, the detection macro will load the static detection cache, and directly dispatch to the optimized or fallback path if the cache was already initialized. If we’re trying to detect CPU features for the first time, the detect_and_initialize function will be invoked first.
- Next, if we’re in the optimized path, we have an indirect call to the actual optimized function. As explained in the previous section, this is due to the target_feature(enable = ...) annotation that prevents inlining.
pmul:
str x30, [sp, #-32]!
stp x20, x19, [sp, #16]
; Load detected features from the cache.
adrp x8, :got:_ZN10std_detect6detect5cache5CACHE17hf93aadb57bf0323bE
mov x19, x1
mov x20, x0
ldr x8, [x8, :got_lo12:_ZN10std_detect6detect5cache5CACHE17hf93aadb57bf0323bE]
ldr x8, [x8]
; If the cache is empty, jump to initialize it.
cbz x8, .LBB0_5
; If the feature is detected, jump to the optimized implementation.
tbnz x8, #37, .LBB0_6
.LBB0_2:
; Fallback implementation.
mov x9, xzr
; ...
ret
.LBB0_5:
; Call the detection function and initialize the cache.
bl std_detect::detect::cache::detect_and_initialize
mov x8, x0
tbz x0, #37, .LBB0_2
.LBB0_6:
; Optimized implementation: call the optimized function.
mov x0, x20
mov x1, x19
ldp x20, x19, [sp, #16]
ldr x30, [sp], #32
b pmul_aarch64_neon
pmul_aarch64_neon:
fmov d0, x0
fmov d1, x1
pmull v0.1q, v0.1d, v1.1d
mov x1, v0.d[1]
fmov x0, d0
ret
The assembly contains no instructions to check the neon feature: as explained in the previous post, this is because the aarch64-linux-android target already assumes that neon is enabled.
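As an aside (not from the original post), you can check which target features a given target enables by default by asking rustc for its cfg values; for aarch64-linux-android the output should include target_feature="neon".

# Print the cfg values (including default target features) for a target.
rustc --print cfg --target aarch64-linux-android
# Adding -C target-feature=+aes shows the extra features it turns on.
rustc --print cfg --target aarch64-linux-android -C target-feature=+aes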
With detection enabled at compile time (-C opt-level=3 -C target-feature=+aes --target=aarch64-linux-android), the dynamic checks are removed by the compiler, as I’ve explained in the previous post.
So we end up with a nice and clean function wrapping the pmull instruction that we’re benchmarking (Compiler Explorer).
pmul:
fmov d0, x0
fmov d1, x1
pmull v0.1q, v0.1d, v1.1d
mov x1, v0.d[1]
fmov x0, d0
ret
Running a micro-benchmark shows that dynamic detection is 2x slower for this pmul function: 21 nanoseconds instead of 11 nanoseconds.
Although the SIMD version is much faster than the pmul_nosimd fallback implementation (ca. 340 ns), a 2x overhead for dynamic detection is definitely non-trivial.
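For reference, the benchmark driving these numbers is essentially of the following shape (a sketch of mine rather than the exact code from the repository; it assumes the pmul function above and the nightly test crate, with the feature attribute at the crate root).

#![feature(test)]
extern crate test;
use test::{black_box, Bencher};

#[bench]
fn bench_pmul(b: &mut Bencher) {
    // black_box prevents the compiler from constant-folding the inputs.
    b.iter(|| pmul(black_box(0x1234_5678_9abc_def0), black_box(0x0fed_cba9_8765_4321)));
}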
You may wonder whether this micro-benchmark isn’t too artificial, as the tested function essentially only contains the pmull instruction.
I’ll discuss larger benchmarks further below, which confirm a ~1.5x overhead even on bigger functions.
Reducing the overhead of dynamic detection
As we’ve seen, dynamic detection can add a non-negligible overhead, especially when applied to small functions. If you’re going all the way to use SIMD intrinsics in Rust, you probably don’t want to pay a 2x performance penalty on top. In this section, we’ll study how to integrate dynamic detection into a top-level function that invokes other functions, which should hopefully amortize the cost.
The recursion problem
So far, we’ve applied dynamic detection in a rather straightforward scenario, following the example provided in the official documentation: a single function foo() exists in optimized and fallback forms, and we’re dynamically dispatching it.
Let’s consider the following more practical scenario: a function foo() calls a function bar() (potentially many times), which itself has two possible implementations: optimized and fallback.
To minimize the overhead of feature detection, we want to only do it once, at the top-level function foo.
Firstly, let’s try to follow the official documentation’s example, by dynamically dispatching foo to an “optimized” implementation, that itself just calls the fallback function but is annotated with the optimized target_feature.
Then, the “fallback” implementation of foo() will in fact try to dispatch to the optimized or fallback inner function bar(), based on static cfg(target_feature = ...).
I’m illustrating this section with examples on Intel CPUs rather than ARM64, so that we can directly run them in the Rust Playground.
pub fn main() {
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
if is_x86_feature_detected!("avx2") {
return unsafe { foo_avx2() };
}
foo();
}
// An optimized foo implementation, which just calls the fallback foo (hopefully inline).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
println!("foo_avx2()");
foo();
}
// Foo is invoking either variant of bar, based on static dispatch.
#[inline(always)]
fn foo() {
println!("foo()");
#[cfg(all(
any(target_arch = "x86", target_arch = "x86_64"),
target_feature = "avx2"
))]
bar_avx2();
#[cfg(not(all(
any(target_arch = "x86", target_arch = "x86_64"),
target_feature = "avx2"
)))]
bar();
}
// Bar has two possible implementations, depending on available features.
#[cfg(all(
any(target_arch = "x86", target_arch = "x86_64"),
target_feature = "avx2"
))]
#[inline(always)]
fn bar_avx2() {
println!("bar_avx2()");
/* Some intrinsics here */
}
#[inline(always)]
fn bar() {
println!("bar()");
/* Fallback implementation here */
}
Inlining foo_avx2 with #[inline(always)] is not possible:
error: cannot use `#[inline(always)]` with `#[target_feature]`
Unfortunately, this code returns the following output (Rust Playground) – i.e. the optimized foo_avx2() is called at the top level, but the non-optimized bar() is called under the hood…
foo_avx2()
foo()
bar()
One might think that foo() didn’t get properly inlined within foo_avx2(), which would explain why static detection misses the avx2 feature.
So let’s try to put everything in one function.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
println!("foo_avx2()");
#[cfg(target_feature = "avx2")]
println!("cfg(target_feature = \"avx2\") is detected here");
#[cfg(not(target_feature = "avx2"))]
println!("cfg(target_feature = \"avx2\") is NOT detected here");
}
Despite the function being annotated with target_feature(enable = ...), its body disappointingly cannot detect anything statically with cfg(target_feature = ...) (Rust Playground).
foo_avx2()
cfg(target_feature = "avx2") is NOT detected here
In other words, static feature detection with cfg(target_feature = ...) is only affected by features passed on the command line, not by features dynamically added with target_feature(enable = ...).
So our only hope to fully benefit from dynamic detection is to duplicate all the code with and without target features, and sprinkle unsafe everywhere (Rust Playground)… or is it?
pub fn main() {
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
if is_x86_feature_detected!("avx2") {
return unsafe { foo_avx2() };
}
foo();
}
fn foo() {
println!("foo()");
bar();
}
fn bar() {
println!("bar()");
}
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
println!("foo_avx2()");
bar_avx2();
}
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn bar_avx2() {
println!("bar_avx2()");
}
Mixing dynamic and static detection: re-linking a dependency
So far, dynamic feature detection makes us face the following dilemma:
- either we apply dynamic detection only to the “leaf” functions, which incurs a non-trivial performance overhead,
- or we have to duplicate all of our code into two flavors, “optimized” and “fallback”.
In this section, I want to propose a third – very hacky – approach, which doesn’t cause any performance penalty nor require code duplication.
The idea is to isolate the part of the code that we want to optimize into a separate library that we’ll compile twice, with and without optimizations: libsimd.a and libfallback.a.
Critically, this library will internally be using static feature detection.
Then, the main code will perform dynamic detection, and dispatch to either of the two variants based on the supported features.
At a high level, this “re-linked” library exports either of two symbols, depending on the compile-time features.
#[cfg(not(target_feature = "aes"))]
mod fallback {
// Export the fallback implementation.
#[no_mangle]
pub unsafe extern "C" fn foo_fallback() {
// ...
}
}
#[cfg(target_feature = "aes")]
mod simd {
// Export the optimized implementation.
#[no_mangle]
pub unsafe extern "C" fn foo_simd() {
// ...
}
}
In the corresponding Cargo.toml, we’ll specify the staticlib crate type, to obtain a .a file as output.
[lib]
crate-type = ["staticlib"]
[profile.release]
codegen-units = 1
panic = "abort"
We can then compile this library twice, with and without -C target-feature=+aes, renaming the outputs for each case (e.g. libfallback.a and libsimd.a).
cargo +nightly build --target aarch64-linux-android --release
cp target/aarch64-linux-android/release/librelinked.a libfallback.a
RUSTFLAGS='-C target-feature=+aes' cargo +nightly build --target aarch64-linux-android --release
cp target/aarch64-linux-android/release/librelinked.a libsimd.a
To link this code into the main crate, we’re not using a normal Rust dependency.
Instead, we’ll declare the symbols of the “re-linked” library as extern "C".
pub fn foo() {
// Dynamic dispatch.
if is_aarch64_feature_detected!("aes") {
unsafe { foo_simd() }
} else {
unsafe { foo_fallback() }
}
}
// Symbol(s) from libfallback.a.
#[link(name = "fallback")]
extern "C" {
fn foo_fallback();
}
// Symbol(s) from libsimd.a.
#[link(name = "simd")]
extern "C" {
fn foo_simd();
}
Lastly, we compile the main crate with the -L linker flag, indicating where to find the “re-linked” library.
RUSTFLAGS='-L /home/dev/build/relinked' cargo +nightly build --target aarch64-linux-android --release
This is a quick-and-dirty hack – in practice one should probably use a Cargo build script.
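For illustration, such a build script could look like the following sketch (my own assumption of what it would contain; the /home/dev/build/relinked path simply mirrors the -L flag used above).

// build.rs (sketch): add the directory containing libfallback.a and libsimd.a
// to the linker search path, instead of passing RUSTFLAGS='-L ...' by hand.
fn main() {
    // Hypothetical location of the pre-built "re-linked" archives.
    let dir = "/home/dev/build/relinked";
    println!("cargo:rustc-link-search=native={dir}");
    // Rebuild the crate if either archive changes.
    println!("cargo:rerun-if-changed={dir}/libfallback.a");
    println!("cargo:rerun-if-changed={dir}/libsimd.a");
}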
This approach turned out to work in my example, but because it’s quite hacky you may hit some linker-related issues.
Notably, I had to disable link-time optimization (the lto = true setting) in the inner library, due to the rust_eh_personality symbol being duplicated.
error: linking with `/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang` failed: exit status: 1
|
= note: "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang" "-Wl,--version-script=/tmp/rustc9MJIHe/list" "/tmp/rustc9MJIHe/symbols.o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.simd.c6d283a5-cgu.0.rcgu.o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.k1r633uohw9806f.rcgu.rmeta" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.d566y75et37yt2d.rcgu.o" "-Wl,--as-needed" "-L" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps" "-L" "/home/dev/build/android-simd/target/release/deps" "-L" "/home/dev/build/relinked" "-L" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib" "-lfallback" "-lsimd" "-Wl,-Bstatic" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/liblibc-426c9fcff770cf85.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libjni-6582db323d2c51dc.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcesu8-aaf793d20b0e1ac1.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/liblog-c6207fb600ea3a38.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcfg_if-13b938592ba33583.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcombine-a6914966783fd2ca.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libmemchr-6e9017bfcbdcd015.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libbytes-ba7d281cd663b299.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libthiserror-17e5199790bbd2d6.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libjni_sys-442f3833be347c4c.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd-ca201f8924e1a745.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libpanic_abort-5eecab1447f44e6f.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libobject-9edec975292b096b.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libmemchr-7101bcf92ac73e01.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libaddr2line-133819781a63c739.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libgimli-47df885212c9ec97.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_demangle-7b8caa98eca7572d.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd_detect-9a1b49175d4e38cb.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libhashbrown-0ffd5b9fedd3b1ae.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libminiz_oxide-995414520fa49f31.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libadler-3c86b51ab749f965.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_std_workspace_alloc-28b6d4e7d7c1f355.rlib" 
"/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libunwind-394e28c2f903c2e9.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcfg_if-3ed771790aba2d34.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/liblibc-f211a911193b255a.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/liballoc-11386607a3accfa5.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_std_workspace_core-001d5bd9a65e4337.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcore-088bc0b43b3ec677.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcompiler_builtins-78ef6e03d835568c.rlib" "-Wl,-Bdynamic" "-ldl" "-llog" "-lunwind" "-ldl" "-lm" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib" "-o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libsimd.so" "-shared" "-Wl,-zrelro,-znow" "-Wl,-O1" "-nodefaultlibs"
= note: ld: error: duplicate symbol: rust_eh_personality
>>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
>>> relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(rust_eh_personality) in archive /home/dev/build/relinked/libfallback.a
>>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
>>> relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(.text.rust_eh_personality+0x0) in archive /home/dev/build/relinked/libsimd.a
ld: error: duplicate symbol: rust_eh_personality
>>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
>>> relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(rust_eh_personality) in archive /home/dev/build/relinked/libfallback.a
>>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
>>> std-ca201f8924e1a745.std.add6f040-cgu.0.rcgu.o:(.text.rust_eh_personality+0x0) in archive /home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd-ca201f8924e1a745.rlib
clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
error: could not compile `simd` due to previous error
To enable/disable LTO in certain contexts, you can define a custom profile in your Cargo.toml, and use it by passing --profile <custom name> to Cargo.

[profile.release-nolto]
inherits = "release"
lto = false
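With such a profile in place, the inner library can then be built as follows (custom profiles require a reasonably recent Cargo; the output lands in target/aarch64-linux-android/release-nolto/).

cargo +nightly build --profile release-nolto --target aarch64-linux-android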
Towards better language support?
As we’ve seen, dynamic feature detection is not quite a zero-cost abstraction of the Rust language yet. Either we apply it to small functions, which has a performance cost, or we need to duplicate something (the source code or the compiled library), which isn’t an ideal abstraction.
The general problem of duplicating similar functions with different annotations comes up pretty often in Rust.
A typical example where code duplication is needed today is the async keyword, because async functions cannot be mixed with non-async ones (unless we manually .await them, which may not be optimal).
An example where code duplication is not needed today is generics: we can write some common logic abstracted over a trait once, and defer only the implementation-specific parts of the logic to instances of the trait.
A general way of solving this code duplication problem is to add an effect system to the programming language, as pointed out in the original RFC for target_feature.
Closer to us, an initiative towards generics over effects was announced this summer for Rust, taking async as a motivating example.
Somewhat relatedly, a post on Tyler Mandry’s blog discussed how to add contexts and capabilities to Rust, notably for allocators.
Having an effect system to automatically color functions with or without various target_features would certainly help.
On the other hand, the use case for SIMD might be quite niche compared to other examples like async, although the performance improvements can be quite significant when applicable.
Real-world benchmarks with Horcrux
In previous posts, I’ve presented benchmarks on Intel of my Horcrux implementation of Shamir’s Secret Sharing, as well as an optimized multiplication algorithm relying on dedicated instructions (clmul on Intel, pmull on ARM).
To better illustrate the overhead of CPU feature detection, let’s revisit these benchmarks on an Android phone with ARM64.
I’ll consider three scenarios:
- fallback implementation, not using any SIMD code,
- SIMD implementation with static detection, i.e. with CPU features enabled at compile-time,
- SIMD implementation with dynamic detection, applied at the level of the multiplication function.
In the last scenario, this means that we’ll have to pay the cost of dynamic detection each time the algorithm performs a multiplication, which happens many times.
Firstly, the arithmetic operations show a similar pattern to Intel, where using the dedicated instructions is about 10x faster than the fallback! But we also see that the overhead of dynamic detection over static detection can be up to 1.5x for some values (mind the log scale), even for the inversion routine that consists of hundreds of multiplications.
We also notice that for larger fields the overhead is not noticeable. This can be explained by the fact that the multiplication routine becomes bigger (quadratically), so the relative overhead of dynamic detection becomes smaller. It could also be that the optimized version needs to use the stack, because all the intermediate values no longer fit in registers. At that point, dynamic detection just becomes another access to the CPU cache among many.
For the full benchmarks of the API-level Shamir operations (below), using dedicated instructions also yields a tremendous performance gain – up to 100x for some parameters! And here as well, the overhead of dynamic detection can be up to 1.5x.
Conclusion
To conclude, is any of this useful, beyond the toy examples that I presented – the cryptic Haraka function and Shamir’s Secret Sharing?
Firstly, there are in fact plenty of practical algorithms that strongly benefit from SIMD instructions. Here are a few examples.
- Rust’s standard HashMap implementation, using the Swiss Table algorithm (published by Google in 2018).
- Searching a sub-string within a string, notably implemented in the memchr crate (although the ARM implementation does not yet use SIMD). This is also now part of the standard library (on Intel CPUs supporting SSE2).
- Parsing JSON, as ported in the simd-json crate. More broadly, Daniel Lemire’s work: UTF-8 validation, Unicode processing, base-64 codec, etc.
- Parsing numbers, on Intel and ARM.
- Sorting numbers (a blog post described porting it to Rust, but I couldn’t find the code to reproduce the results).
There’s a caveat though: not all algorithms easily translate from Intel to ARM, and in some cases SIMD doesn’t bring any performance gain.
So for example, Rust’s HashMap doesn’t use any SIMD on ARM (see rust-lang/hashbrown/269).
Additionally, on a given CPU architecture the performance can vary from one CPU model to the next.
So in any case: benchmark, measure and profile your code!
The second aspect is whether dynamic feature detection (and its overhead) matters in practice.
As we’ve learned, all Android devices running on ARM64 support NEON, with the feature enabled at compile time. This NEON baseline already covers most “general purpose” SIMD instructions on ARM. This means that dynamic CPU feature detection on (ARM-based) Android will mostly be relevant for the “niche” non-NEON instructions such as cryptographic primitives, which you’ll likely leave to dedicated libraries (e.g. RustCrypto).
However, features are definitely relevant on Intel for “general purpose” SIMD, as there are several generations supporting wider and wider vectors: 128-bit (SSE and its variants), 256-bit (AVX2), 512-bit (AVX-512 and its variants). Without dynamic detection, your performance will stay stuck at a fairly low baseline (e.g. the 20+ year-old SSE2 on x86_64). But as we’ve learned in this post, you’ll need to be mindful of the cost of feature detection until Rust has first-class language support for it, not only because of detection itself but also in terms of missed optimizations.
Comments
To react to this blog post please check the Mastodon thread and the Reddit thread.