This blog post is the last one of a series exploring SIMD support with Rust on Android. In the previous two posts, I introduced how to compile Rust libraries for Android and detect SIMD instructions supported by the CPU at runtime.

Today, we’ll see how to effectively use the SIMD instructions themselves, and get the most performance out of them. After an introduction on running Rust benchmarks (and unit tests) on Android devices, we’ll measure the performance in various scenarios offered by Rust, and see that the overhead of CPU feature detection can be non-trivial. I’ll then describe various ways to reduce this overhead.

Lastly, I’ll present updated benchmarks on ARM of Horcrux, my Rust implementation of Shamir’s Secret Sharing, and see how they compare to Intel.

As always, my code is available on GitHub.


Benchmarking setup on Android

In my previous post, I described the traditional way of embedding native Rust code in end-user Android apps. However, this method involved a lot of steps to compile a .so library from Rust and invoke it from Java. Besides, we couldn’t use println!, but instead had to write any console output via android.util.Log.

This whole setup was cumbersome, and didn’t seem to play well with Rust’s built-in testing framework, which works via a simple #[test] attribute (and likewise micro-benchmarks via #[bench]). However, I eventually realized that the built-in tests work on Android as well (thanks StackOverflow!), once we add a few lines of configuration. No need to write a custom test framework or anything complicated.

So how do Rust’s unit tests work?

When running a Rust test suite with cargo test, a separate Rust binary containing all the tests is compiled and then invoked. Of course, in the case of cross-compiling (to Android), this binary cannot be invoked on the host platform (where we run cargo). But for Android, we can upload this native binary to the device via the Android Debug Bridge and invoke it directly!

To customize the invocation step, Cargo accepts a target.<triple>.runner parameter, containing the path of a script to run (on the host) instead of the test binary. You’ll typically specify the runner script in ${HOME}/.cargo/config (where we’ve already configured the linker parameter for Android in this previous post). With that, Cargo will invoke our runner script, passing the path of the test binary as the first parameter.

[target.aarch64-linux-android]
linker = "/path/to/ndk/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang"
runner = "/home/dev/android-runner.sh"
...

We now need to create the runner script. The typical method (followed by flapigen-rs or dinghy) is to push the executable to Android’s /data/local/tmp/ directory via adb, and invoke it from there via adb shell, which will forward the standard output back to the host.

An important thing is to forward any remaining parameters to the binary ($@ in Bash), for example to be able to run benchmarks (which are compiled in the same test binary, but invoked with an additional --bench flag).

#!/bin/bash

set -eu

echo "####################"
echo "# android-runner.sh invoked with: $@"
echo "####################"

# The binary to upload and run is the first argument.
BINARY_PATH="$1"
BINARY=`basename ${BINARY_PATH}`
# Remove the first parameter.
shift

# Push the test binary on the device via ADB.
adb push "${BINARY_PATH}" "/data/local/tmp/$BINARY"
adb shell "chmod 755 /data/local/tmp/$BINARY"

# Run the test binary, forwarding the remaining parameters, so that benchmarks,
# test filtering, etc. work.
adb shell "/data/local/tmp/$BINARY $@"

# Cleanup.
adb shell "rm /data/local/tmp/$BINARY"

With this setup, you can now run your tests/benchmarks directly on an Android device.

cargo +nightly bench --target aarch64-linux-android

# Running binaries in 32-bit mode also works on a 64-bit ARM CPU.
cargo +nightly bench --target armv7-linux-androideabi

# Specify compile-time CPU features with RUSTFLAGS to quickly prototype them.
RUSTFLAGS='-C target-feature=+aes' cargo +nightly bench --target aarch64-linux-android

Uploading and running a binary directly, rather than going through a traditional user application, may seem surprising, but Android is fundamentally based on Linux, and this flow shouldn’t bypass the Android security model. Indeed, the user must have manually enabled ADB, and the binary is sandboxed under the shell user’s privileges (e.g. without access to other apps’ data).

Overhead of CPU feature detection

Now that we can easily run Rust benchmarks on Android devices, let’s measure the runtime overhead of various CPU feature detection methods.

Implicit feature detection: beware of target-feature!

The first scenario is to simply use the SIMD intrinsics from std::arch without doing any feature detection. This approach bypasses the safety contract of the intrinsics: the compiled code will crash (with a SIGILL signal on Linux) on CPUs that don’t support the relevant feature. However, it is still interesting to benchmark as a baseline.

As an example, I’ve implemented this strategy in my haraka-rs repository. At a high level, this library exposes a haraka256 hash function, which recursively calls smaller functions, until wrapper functions invoke the correct SIMD intrinsics depending on the architecture, e.g. vaeseq_u8 on ARM.

pub fn haraka256<const N_ROUNDS: usize>(dst: &mut [u8; 32], src: &[u8; 32]) {
    // ...
    for i in 0..N_ROUNDS {
        aes_mix2(&mut s0, &mut s1, 4 * i);
    }
    // ...
}

#[inline(always)]
fn aes_mix2(s0: &mut Simd128, s1: &mut Simd128, rci: usize) {
    aes2(s0, s1, rci);
    mix2(s0, s1);
}

#[inline(always)]
fn aes2(s0: &mut Simd128, s1: &mut Simd128, rci: usize) {
    Simd128::aesenc(s0, &HARAKA_CONSTANTS[rci]);
    Simd128::aesenc(s1, &HARAKA_CONSTANTS[rci + 1]);
    Simd128::aesenc(s0, &HARAKA_CONSTANTS[rci + 2]);
    Simd128::aesenc(s1, &HARAKA_CONSTANTS[rci + 3]);
}

// No feature detection here, other than the CPU architecture.
#[cfg(any(target_arch = "arm", target_arch = "aarch64"))]
impl Simd128 {
    // Function that invokes the ARM intrinsics.
    #[inline(always)]
    pub(crate) fn aesenc(block: &mut Self, key: &Self) {
        unsafe {
            let zero = vdupq_n_u8(0);
            let x = vaeseq_u8(block.0, zero);
            let y = vaesmcq_u8(x);
            block.0 = veorq_u8(y, key.0);
        }
    }
}

The corresponding benchmark function looks like the following – see this blog post about why you should use the black_box function in benchmarks.

#[bench]
fn bench_haraka256_5round(b: &mut Bencher) {
    let src: [u8; 32] = /* ... */;
    b.iter(|| {
        let mut dst = [0; 32];
        haraka256::<5>(&mut dst, black_box(&src));
        dst
    });
}

Let’s run this benchmark on an Android device that supports the aes feature (needed for the vaeseq_u8 intrinsic). Everything is #[inline(always)], so after compiler optimizations, we’d expect that the haraka256 function is just one linear block with the loops unrolled and everything inlined down to the SIMD instructions. Below are the benchmark results for two cases: either with the default compilation flags, or with RUSTFLAGS='-C target-feature=+aes' to explicitly specify that the aes feature is supported.

Function   | Target features | x86_64 (laptop) | aarch64 (Samsung S7) | armv7 (Samsung S7)
-----------|-----------------|-----------------|----------------------|-------------------
Haraka-256 | default         | 46 ns           | 395 ns               | 2.2 μs
Haraka-256 | +aes            | 6 ns            | 52 ns                | 572 ns
Haraka-512 | default         | 72 ns           | 405 ns               | 6.6 μs
Haraka-512 | +aes            | 16 ns           | 88 ns                | 1.1 μs

It turns out that running this code with target-feature=+aes yields a large improvement over the default! In this example, the performance gain of specifying the correct flags is around 5x, across all CPU architectures. To take another data point, the same kind of performance improvements (up to 5x) were observed in RustCrypto/block-ciphers/165.

To explain this, let’s look at the assembly output with the Godbolt Compiler Explorer tool.

With the default features (-C opt-level=3 --target=aarch64-linux-android), we can see that the intrinsics have not been inlined, but appear as separate functions like core...vaesmcq_u8 (Compiler Explorer). This is a massive performance overhead, because instead of simply translating to a single CPU instruction (e.g. aesmc), each intrinsic also adds instructions to load/store values, and a control-flow indirection towards the intrinsic function (bl on ARM).

; Intrinsic appears as a standalone function.
core::core_arch::arm_shared::crypto::vaesmcq_u8:
        ldr     q0, [x0]        ; Load
        aesmc   v0.16b, v0.16b  ; Actual instruction
        str     q0, [x8]        ; Store
        ret

core::core_arch::arm_shared::crypto::vaeseq_u8:
        ldr     q0, [x0]
        ldr     q1, [x1]
        aese    v0.16b, v1.16b
        str     q0, [x8]
        ret

haraka256_5:
        sub     sp, sp, #336
        str     x29, [sp, #288]
...
        stp     q0, q1, [sp, #48]
        bl      core::core_arch::arm_shared::crypto::vaeseq_u8
        add     x0, sp, #256
        add     x1, sp, #112
        bl      core::core_arch::arm_shared::crypto::vaesmcq_u8
        adrp    x8, .LCPI2_0
        ldr     q0, [sp, #256]
...

On the other hand, with explicit features (-C opt-level=3 -C target-feature=+aes --target=aarch64-linux-android), the code is compiled into a nice and clean unrolled loop with everything inlined (Compiler Explorer).

haraka256_5:
        ldp     q0, q2, [x1]
        movi    v1.2d, #0000000000000000
        adrp    x8, .LCPI0_0
        adrp    x9, .LCPI0_1
        mov     v3.16b, v0.16b
        aese    v3.16b, v1.16b  ; vaeseq_u8
        aesmc   v3.16b, v3.16b  ; vaesmcq_u8
        ldr     q4, [x8, :lo12:.LCPI0_0]
        mov     v6.16b, v2.16b
        ldr     q5, [x9, :lo12:.LCPI0_1]
        aese    v6.16b, v1.16b
...

So, what prevents the intrinsics from being inlined without explicit compilation flags? Under the hood, the vaeseq_u8 intrinsic is implemented as follows. The important part for aarch64 is the target_feature(enable = "aes") annotation, which means that the intrinsic requires the aes feature.

/// AES single round encryption.
///
/// [Arm's documentation](https://developer.arm.com/architectures/instruction-sets/intrinsics/vaeseq_u8)
#[inline]
#[cfg_attr(not(target_arch = "arm"), target_feature(enable = "aes"))]
#[cfg_attr(target_arch = "arm", target_feature(enable = "crypto,v8"))]
#[cfg_attr(test, assert_instr(aese))]
pub unsafe fn vaeseq_u8(data: uint8x16_t, key: uint8x16_t) -> uint8x16_t {
    vaeseq_u8_(data, key)
}

For completeness, the underlying conversion from the intrinsic to a CPU instruction is deferred to LLVM, via link_name = "llvm.aarch64.crypto.aese".

#[allow(improper_ctypes)]
extern "unadjusted" {
    #[cfg_attr(target_arch = "aarch64", link_name = "llvm.aarch64.crypto.aese")]
    #[cfg_attr(target_arch = "arm", link_name = "llvm.arm.neon.aese")]
    fn vaeseq_u8_(data: uint8x16_t, key: uint8x16_t) -> uint8x16_t;
}

The real problem is that the target_feature annotation on vaeseq_u8() prevents inlining this function into other functions that don’t have the same target_features, such as my aesenc() function. This was discussed in more detail on StackOverflow, and brought up to the Rust compiler project (rust-lang/rust/54353, rust-lang/rust/53069).

Arguably, this “inlining barrier” between functions with different features is an implementation detail (bug or feature) of LLVM, as the RFC for target_feature doesn’t specify any inlining constraints.

To complete this section, what does the -C target-feature=+aes command-line flag do? My understanding is that it automatically annotates all functions as target_feature(enable = "aes"), so everything is allowed to be inlined again.

If you’re using static CPU feature detection to gate your code (rather than calling the SIMD intrinsics without any check), you’ll need to pass the corresponding -C target-feature=... flags for the SIMD path to be compiled in at all, and the result will then be nicely inlined and optimized.
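For illustration, here is a minimal sketch of such static gating (the function name pmul_static is mine, not from the repositories discussed in this post): the optimized branch only exists when the binary is built with -C target-feature=+aes, so no runtime check or dynamic dispatch is left at the call site.

// Sketch: statically-gated SIMD, only compiled with RUSTFLAGS='-C target-feature=+aes'.
#[cfg(all(target_arch = "aarch64", target_feature = "aes"))]
pub fn pmul_static(a: u64, b: u64) -> u128 {
    use std::arch::aarch64::vmull_p64;
    // Safety: the aes feature is guaranteed at compile time by the cfg above.
    unsafe { vmull_p64(a, b) }
}

// Portable fallback, used when the aes feature is not enabled at compile time.
#[cfg(not(all(target_arch = "aarch64", target_feature = "aes")))]
pub fn pmul_static(a: u64, b: u64) -> u128 {
    let mut result: u128 = 0;
    for i in 0..64 {
        if a & (1 << i) != 0 {
            result ^= (b as u128) << i;
        }
    }
    result
}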

Dynamic feature detection

Let’s now discuss dynamic feature detection, which is ultimately what you’ll want if you distribute your application to a wide variety of device models. To measure the overhead of dynamic detection, we’ll consider – like in the previous post – the following pmul function, which simply translates to the vmull_p64 intrinsic.

// SIMD: dynamic detection.
pub fn pmul(a: u64, b: u64) -> u128 {
    #[cfg(target_arch = "aarch64")]
    {
        use std::arch::is_aarch64_feature_detected;
        if is_aarch64_feature_detected!("neon") && is_aarch64_feature_detected!("aes") {
            // Safety: target_features "neon" and "aes" are available in this block.
            return unsafe { pmul_aarch64_neon(a, b) };
        }
    }
    pmul_nosimd(a, b)
}

// SIMD implementation.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon", enable = "aes")]
unsafe fn pmul_aarch64_neon(a: u64, b: u64) -> u128 {
    use std::arch::aarch64::vmull_p64;

    // Safety: target_features "neon" and "aes" are available in this function.
    vmull_p64(a, b)
}

// Fallback implementation
fn pmul_nosimd(a: u64, b: u64) -> u128 {
    let mut tmp: u128 = b as u128;
    let mut result: u128 = 0;
    for i in 0..64 {
        if a & (1 << i) != 0 {
            result ^= tmp;
        }
        tmp <<= 1;
    }
    result
}

With the standard optimizations (-C opt-level=3 --target=aarch64-linux-android), the code compiles as follows (Compiler Explorer).

  • Firstly, the detection macro will load the static detection cache, and directly dispatch to the optimized or fallback path if the cache was already initialized. If we’re trying to detect CPU features for the first time, the detect_and_initialize function will be invoked first.
  • Next, if we’re in the optimized path, there is an additional call to the separate optimized function, which isn’t inlined. As explained in the previous section, this is due to the target_feature(enable = ...) annotation that prevents inlining.
pmul:
        str     x30, [sp, #-32]!
        stp     x20, x19, [sp, #16]
        ; Load detected features from the cache.
        adrp    x8, :got:_ZN10std_detect6detect5cache5CACHE17hf93aadb57bf0323bE
        mov     x19, x1
        mov     x20, x0
        ldr     x8, [x8, :got_lo12:_ZN10std_detect6detect5cache5CACHE17hf93aadb57bf0323bE]
        ldr     x8, [x8]
        ; If the cache is empty, jump to initialize it.
        cbz     x8, .LBB0_5
        ; If the feature is detected, jump to the optimized implementation.
        tbnz    x8, #37, .LBB0_6
.LBB0_2:
        ; Fallback implementation.
        mov     x9, xzr
        ; ...
        ret
.LBB0_5:
        ; Call the detection function and initialize the cache.
        bl      std_detect::detect::cache::detect_and_initialize
        mov     x8, x0
        tbz     x0, #37, .LBB0_2
.LBB0_6:
        ; Optimized implementation: call the optimized function.
        mov     x0, x20
        mov     x1, x19
        ldp     x20, x19, [sp, #16]
        ldr     x30, [sp], #32
        b       pmul_aarch64_neon

pmul_aarch64_neon:
        fmov    d0, x0
        fmov    d1, x1
        pmull   v0.1q, v0.1d, v1.1d
        mov     x1, v0.d[1]
        fmov    x0, d0
        ret

The assembly contains no instructions to check the neon feature: as explained in the previous post, this is because the aarch64-linux-android target already assumes that neon is enabled.

With the relevant features enabled at compile time (-C opt-level=3 -C target-feature=+aes --target=aarch64-linux-android), the dynamic checks are removed by the compiler, as I explained in the previous post. So we end up with a nice and clean function wrapping the pmull instruction that we’re benchmarking (Compiler Explorer).

pmul:
        fmov    d0, x0
        fmov    d1, x1
        pmull   v0.1q, v0.1d, v1.1d
        mov     x1, v0.d[1]
        fmov    x0, d0
        ret

Running a micro-benchmark shows that dynamic detection is 2x slower for this pmul function: 21 nanoseconds instead of 11 nanoseconds. Although the SIMD version is much faster than the pmul_nosimd fallback implementation (ca. 340 ns), a 2x overhead for dynamic detection is definitely non-trivial.
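For reference, the micro-benchmark itself is just a thin wrapper around pmul. A minimal sketch – the benchmark in my repository may be organized slightly differently – looks like this, with black_box preventing the compiler from constant-folding the inputs.

#![feature(test)]
extern crate test;
use test::{black_box, Bencher};

// Micro-benchmark calling the pmul() function defined above.
#[bench]
fn bench_pmul(b: &mut Bencher) {
    let x: u64 = 0x1234_5678_9abc_def0;
    let y: u64 = 0x0fed_cba9_8765_4321;
    b.iter(|| pmul(black_box(x), black_box(y)));
}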

You may wonder whether this micro-benchmark isn’t too artificial, as the tested function essentially only contains the pmull instruction. I’ll discuss larger benchmarks further below, which confirm a ~1.5x overhead even on bigger functions.

Reducing the overhead of dynamic detection

As we’ve seen, dynamic detection can add a non-negligible overhead, especially when applied to small functions. If you’re going all the way to use SIMD intrinsics in Rust, you probably don’t want a 2x performance penalty on top of them. In this section, we’ll study how to integrate dynamic detection into a top-level function that invokes other functions, which should hopefully amortize the cost.

The recursion problem

So far, we’ve applied dynamic detection in a rather straightforward scenario, following the example provided in the official documentation: a single function foo() exists in optimized and fallback forms, and we’re dynamically dispatching it. Let’s consider the following more practical scenario: a function foo() calls a function bar() (potentially many times), which itself has two possible implementations: optimized and fallback. To minimize the overhead of feature detection, we want to only do it once at the top-level function foo.

Firstly, let’s try to follow the official documentation’s example, by dynamically dispatching foo to an “optimized” implementation that simply calls the regular foo() but is annotated with the optimized target_feature. The “fallback” implementation of foo() will then dispatch to either the optimized or the fallback inner function bar(), based on static cfg(target_feature = ...).

I’m illustrating this section with examples on Intel CPUs rather than ARM64, so that we can directly run them in the Rust Playground.

pub fn main() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        return unsafe { foo_avx2() };
    }
    foo();
}

// An optimized foo implementation, which just calls the fallback foo (hopefully inline).
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
    println!("foo_avx2()");
    foo();
}

// Foo is invoking either variant of bar, based on static dispatch.
#[inline(always)]
fn foo() {
    println!("foo()");

    #[cfg(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    ))]
    bar_avx2();

    #[cfg(not(all(
        any(target_arch = "x86", target_arch = "x86_64"),
        target_feature = "avx2"
    )))]
    bar();
}

// Bar has two possible implementations, depending on available features.
#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "avx2"
))]
#[inline(always)]
fn bar_avx2() {
    println!("bar_avx2()");
    /* Some intrinsics here */
}

#[inline(always)]
fn bar() {
    println!("bar()");
    /* Fallback implementation here */
}

Note that forcing foo_avx2 to be inlined with #[inline(always)] is not possible:

error: cannot use `#[inline(always)]` with `#[target_feature]`

Unfortunately, this code returns the following output (Rust Playground) – i.e. the optimized foo_avx2() is called at the top level, but the non-optimized bar() is called under the hood…

foo_avx2()
foo()
bar()

One might think that foo() simply didn’t get inlined into foo_avx2(), which is why static detection missed the feature. So let’s try to put everything into one function.

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
    println!("foo_avx2()");

    #[cfg(target_feature = "avx2")]
    println!("cfg(target_feature = \"avx2\") is detected here");
    #[cfg(not(target_feature = "avx2"))]
    println!("cfg(target_feature = \"avx2\") is NOT detected here");
}

Despite the function being annotated with target_feature(enable = ...), its body disappointingly cannot detect anything statically with cfg(target_feature = ...) (Rust Playground).

foo_avx2()
cfg(target_feature = "avx2") is NOT detected here

In other words, static feature detection with cfg(target_feature = ...) is only affected by features passed in the command line, not by features dynamically added with target_feature(enable = ...).

So our only hope to fully benefit from dynamic detection is to duplicate all the code with and without target features, and sprinkle unsafe everywhere (Rust playground)… or is it?

pub fn main() {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    if is_x86_feature_detected!("avx2") {
        return unsafe { foo_avx2() };
    }
    foo();
}

fn foo() {
    println!("foo()");
    bar();
}

fn bar() {
    println!("bar()");
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn foo_avx2() {
    println!("foo_avx2()");
    bar_avx2();
}

#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "avx2")]
unsafe fn bar_avx2() {
    println!("bar_avx2()");
}

Mixing dynamic and static detection: re-linking a dependency

So far, dynamic feature detection makes us face the following dilemma:

  • either we apply dynamic detection only to the “leaf” functions, which incurs a non-trivial performance overhead,
  • or we have to duplicate all of our code into two flavors, “optimized” and “fallback”.

In this section, I want to propose a third – very hacky – approach, which causes no performance penalty and requires no code duplication. The idea is to isolate the part of the code that we want to optimize into a separate library that we’ll compile twice, with and without optimizations: libsimd.a and libfallback.a. Critically, this library will internally use static feature detection. Then, the main code will perform dynamic detection, and dispatch to either of the two variants based on the supported features.

At a high level, this “re-linked” library exports either of two symbols, depending on the compile-time features.

#[cfg(not(target_feature = "aes"))]
mod fallback {
    // Export the fallback implementation.
    #[no_mangle]
    pub unsafe extern "C" fn foo_fallback() {
        // ...
    }
}

#[cfg(target_feature = "aes")]
mod simd {
    // Export the optimized implementation.
    #[no_mangle]
    pub unsafe extern "C" fn foo_simd() {
        // ...
    }
}

In the corresponding Cargo.toml, we’ll specify the staticlib crate type, to obtain a .a file as output.

[lib]
crate-type = ["staticlib"]

[profile.release]
codegen-units = 1
panic = "abort"

We can then compile this library twice, with and without -C target-feature=+aes, renaming the outputs for each case (e.g. libfallback.a and libsimd.a).

cargo +nightly build --target aarch64-linux-android --release
cp target/aarch64-linux-android/release/librelinked.a libfallback.a

RUSTFLAGS='-C target-feature=+aes' cargo +nightly build --target aarch64-linux-android --release
cp target/aarch64-linux-android/release/librelinked.a libsimd.a

To link this code into the main crate, we’re not using a normal Rust dependency. Instead, we’ll declare the symbols of the “re-linked” library as extern "C".

pub fn foo() {
    // Dynamic dispatch.
    if is_aarch64_feature_detected!("aes") {
        unsafe { foo_simd() }
    } else {
        unsafe { foo_fallback() }
    }
}

// Symbol(s) from libfallback.a.
#[link(name = "fallback")]
extern "C" {
    fn foo_fallback();
}

// Symbol(s) from libsimd.a.
#[link(name = "simd")]
extern "C" {
    fn foo_simd();
}

Lastly, we compile the main crate with the -L flag, telling rustc where to find the two builds of the “re-linked” library.

RUSTFLAGS='-L /home/dev/build/relinked' cargo +nightly build --target aarch64-linux-android --release

This is a quick-and-dirty hack – in practice one should probably use a Cargo build script.
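For example, a hedged sketch of such a build script (assuming the two archives live in a relinked/ directory next to the main crate – the directory name is mine) could look like this.

// build.rs (sketch): replace the manual RUSTFLAGS='-L ...' by adding the library
// search path from a Cargo build script. The #[link(name = ...)] attributes in the
// crate then resolve libfallback.a and libsimd.a from this directory.
use std::env;
use std::path::PathBuf;

fn main() {
    // Assumed layout: libfallback.a and libsimd.a in <crate root>/relinked/.
    let dir = PathBuf::from(env::var("CARGO_MANIFEST_DIR").unwrap()).join("relinked");
    println!("cargo:rustc-link-search=native={}", dir.display());
    // Rebuild when either archive changes.
    println!("cargo:rerun-if-changed=relinked/libfallback.a");
    println!("cargo:rerun-if-changed=relinked/libsimd.a");
}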

This approach turned out to work in my example, but because it’s quite hacky you may hit some linker-related issues. Notably, I had to disable link-time optimization (i.e. not set lto = true) in the inner library, because the rust_eh_personality symbol ended up being duplicated.

error: linking with `/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang` failed: exit status: 1
  |
  = note: "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64/bin/aarch64-linux-android30-clang" "-Wl,--version-script=/tmp/rustc9MJIHe/list" "/tmp/rustc9MJIHe/symbols.o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.simd.c6d283a5-cgu.0.rcgu.o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.k1r633uohw9806f.rcgu.rmeta" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/simd.d566y75et37yt2d.rcgu.o" "-Wl,--as-needed" "-L" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps" "-L" "/home/dev/build/android-simd/target/release/deps" "-L" "/home/dev/build/relinked" "-L" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib" "-lfallback" "-lsimd" "-Wl,-Bstatic" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/liblibc-426c9fcff770cf85.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libjni-6582db323d2c51dc.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcesu8-aaf793d20b0e1ac1.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/liblog-c6207fb600ea3a38.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcfg_if-13b938592ba33583.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libcombine-a6914966783fd2ca.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libmemchr-6e9017bfcbdcd015.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libbytes-ba7d281cd663b299.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libthiserror-17e5199790bbd2d6.rlib" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libjni_sys-442f3833be347c4c.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd-ca201f8924e1a745.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libpanic_abort-5eecab1447f44e6f.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libobject-9edec975292b096b.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libmemchr-7101bcf92ac73e01.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libaddr2line-133819781a63c739.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libgimli-47df885212c9ec97.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_demangle-7b8caa98eca7572d.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd_detect-9a1b49175d4e38cb.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libhashbrown-0ffd5b9fedd3b1ae.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libminiz_oxide-995414520fa49f31.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libadler-3c86b51ab749f965.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_std_workspace_alloc-28b6d4e7d7c1f355.rlib" 
"/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libunwind-394e28c2f903c2e9.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcfg_if-3ed771790aba2d34.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/liblibc-f211a911193b255a.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/liballoc-11386607a3accfa5.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/librustc_std_workspace_core-001d5bd9a65e4337.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcore-088bc0b43b3ec677.rlib" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libcompiler_builtins-78ef6e03d835568c.rlib" "-Wl,-Bdynamic" "-ldl" "-llog" "-lunwind" "-ldl" "-lm" "-lc" "-Wl,--eh-frame-hdr" "-Wl,-znoexecstack" "-L" "/home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib" "-o" "/home/dev/build/android-simd/target/aarch64-linux-android/release/deps/libsimd.so" "-shared" "-Wl,-zrelro,-znow" "-Wl,-O1" "-nodefaultlibs"
  = note: ld: error: duplicate symbol: rust_eh_personality
          >>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
          >>>            relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(rust_eh_personality) in archive /home/dev/build/relinked/libfallback.a
          >>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
          >>>            relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(.text.rust_eh_personality+0x0) in archive /home/dev/build/relinked/libsimd.a
          
          ld: error: duplicate symbol: rust_eh_personality
          >>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
          >>>            relinked-1bf68705c0def4c6.relinked.2bfeaf53-cgu.0.rcgu.o:(rust_eh_personality) in archive /home/dev/build/relinked/libfallback.a
          >>> defined at gcc.rs:244 (library/std/src/personality/gcc.rs:244)
          >>>            std-ca201f8924e1a745.std.add6f040-cgu.0.rcgu.o:(.text.rust_eh_personality+0x0) in archive /home/dev/rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/aarch64-linux-android/lib/libstd-ca201f8924e1a745.rlib
          clang-14: error: linker command failed with exit code 1 (use -v to see invocation)
          

error: could not compile `simd` due to previous error

To enable/disable LTO in certain contexts, you can define a custom profile in your Cargo.toml, and use it by passing --profile <custom name> to Cargo.

[profile.release-nolto]
inherits = "release"
lto = false
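
The build command of the inner library then becomes, for instance (note that the compiled library now ends up in target/aarch64-linux-android/release-nolto/ instead of release/):

cargo +nightly build --profile release-nolto --target aarch64-linux-android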

Towards better language support?

As we’ve seen, dynamic feature detection is not quite a zero-cost abstraction in Rust yet. Either we apply it to small functions, which has a performance cost, or we need to duplicate something (the source code or the compiled library), which isn’t an ideal abstraction.

The general problem of duplicating similar functions with different annotations comes up pretty often in Rust. A typical example where code duplication is needed today is the async keyword, because async functions cannot be mixed with non-async ones (unless we manually .await them, which may not be optimal). An example where code duplication is not needed today is generics: we can write some common logic abstracted over a trait once, and defer only the implementation-specific parts of the logic to instances of the trait.

A general way of solving this code duplication problem is to add an effect system to the programming language, as pointed out in the original RFC for target_feature. Closer to home, an initiative towards generics over effects was announced this summer for Rust, taking async as a motivating example. Somewhat relatedly, a post on Tyler Mandry’s blog discussed how to add contexts and capabilities to Rust, notably for allocators.

Having an effect system to automatically color functions with or without various target_features would certainly help. On the other hand, the use case for SIMD might be quite niche compared to other examples like async, although the performance improvements can be quite significant when applicable.

Real-world benchmarks with Horcrux

In previous posts, I’ve presented benchmarks on Intel of my Horcrux implementation of Shamir’s Secret Sharing, as well as an optimized multiplication algorithm relying on dedicated instructions (clmul on Intel, pmull on ARM). To better illustrate the overhead of CPU feature detection, let’s revisit these benchmarks on an Android phone with ARM64.

I’ll consider three scenarios:

  • fallback implementation, not using any SIMD code,
  • SIMD implementation with static detection, i.e. with CPU features enabled at compile-time,
  • SIMD implementation with dynamic detection, applied at the level of the multiplication function.

In the last scenario, this means that we’ll have to pay the cost of dynamic detection each time the algorithm performs a multiplication – and it performs many of them.

Benchmarks for field operations

Firstly, the arithmetic operations show a similar pattern to Intel, where using the dedicated instructions is about 10x faster than the fallback! But we also see that the overhead of dynamic detection over static detection can be up to 1.5x for some values (mind the log scale), even for the inversion routine that consists of hundreds of multiplications.

We also notice that for larger fields – $\mathrm{GF}(2^{512})$ and above – the overhead is not noticeable. This can be explained by the fact that the multiplication routine grows bigger (quadratically in the field size), so the relative overhead of dynamic detection becomes smaller. It could also be that the optimized version needs to use the stack because the intermediate values cannot all fit in registers anymore; at that point, dynamic detection just becomes another access to the CPU cache among many.

For the full benchmarks of the API-level Shamir operations (below), using dedicated instructions also yields a tremendous performance gain – up to 100x for some parameters! And here as well, the overhead of dynamic detection can be up to 1.5x.

Benchmarks for Shamir operations (compact shares)

Conclusion

To conclude, is any of this useful, beyond the toy examples that I presented – the cryptic Haraka function and Shamir’s Secret Sharing?

Firstly, there are in fact plenty of practical algorithms that strongly benefit from SIMD instructions.

There’s a caveat though: not all algorithms easily translate from Intel to ARM, and in some cases SIMD doesn’t bring any performance gain. So for example, Rust’s HashMap doesn’t use any SIMD on ARM (see rust-lang/hashbrown/269). Additionally, on a given CPU architecture the performance can vary from one CPU model to the next. So in any case: benchmark, measure and profile your code!

The second aspect is whether dynamic feature detection (and its overhead) matters in practice.

As we’ve learned, all Android devices running on ARM64 support NEON, with the feature enabled at compile time. This NEON baseline already covers most “general purpose” SIMD instructions on ARM. This means that dynamic CPU feature detection on (ARM-based) Android will mostly be relevant for the more “niche” non-NEON instructions such as cryptographic primitives, which you’ll likely leave to dedicated libraries (e.g. RustCrypto).

However, features are definitely relevant on Intel for “general purpose” SIMD, as there are several generations supporting wider and wider vectors: 128-bit (SSE and its variants), 256-bit (AVX2), 512-bit (AVX-512 and its variants). Without dynamic detection, your performance will stay stuck at a fairly low baseline (e.g. the 20+ year-old SSE2 on x86_64). But as we’ve learned in this post, you’ll need to be mindful of the cost of feature detection until Rust has first-class language support for it, not only because of detection itself but also in terms of missed optimizations.


Comments

To react to this blog post please check the Mastodon thread and the Reddit thread.

