This post is the second of a series on testing Rust’s support of SIMD instructions on ARM with Android. In the first post, we’ve seen how to compile Rust libraries for Android with the command-line tools, and tested that we could reliably detect the CPU architecture.

In this post, we’ll see how to detect if specific SIMD instructions are available on the exact CPU model we’re running on. I’d expected to simply use the std::arch::is_aarch64_feature_detected! macro – in which case this post would have been a 1-liner. However, as you may have guessed from the title, this didn’t work out-of-the-box for ARM on Android.

In the first section, I’ll present my journey down to the details of detecting CPU features. We’ll see that this isn’t trivial on ARM, and will review the existing detection methods, until we identify the issue in the Rust compiler.

In the second part, I’ll walk through how I compiled and tested my first patched Rust compiler, and how it worked out for Android. Patching the compiler seemed daunting, but fortunately it ended up easier than I expected.


Checking available SIMD instructions at runtime

In a previous post, I described how to speed up Shamir’s Secret Sharing algorithm on Intel CPUs, by detecting available SIMD instructions at compile time. Likewise, for ARM on Android, we’ll need to confirm, beyond the CPU architecture, that each SIMD instruction is available on the specific CPU model we’re running on.

In the case of Shamir’s Secret Sharing, I wanted to test ARM’s vmull_p64 instruction for polynomial multiplication, which Rust labels as requiring the neon and aes target features.

Dynamic CPU feature detection in Rust

In the previous post, I described how to use Rust’s static CPU feature detection. However, this requires knowing the available CPU features at compile time, which doesn’t fit the model of publishing an Android application that will run on a wide variety of device models.

For this purpose, Rust also supports dynamic CPU feature detection, checking features available on the current CPU at runtime. This is done via the per-architecture std::arch::is_..._feature_detected! macros (e.g. is_aarch64_feature_detected! on 64-bit ARM). Each CPU has its own feature labels, so I’ve written a utility to automate the process and output the features available on the current CPU.
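To illustrate the idea, here is a minimal sketch of such a probe. It is not the actual utility, which covers every feature label: the function names (probe_features, report) are hypothetical, and only a handful of features are checked per architecture, since the macros require each feature name to be a string literal.

```rust
/// Returns (feature name, detected at runtime) pairs.
/// Only a small illustrative subset of features is probed.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn probe_features() -> Vec<(&'static str, bool)> {
    vec![
        ("sse2", std::arch::is_x86_feature_detected!("sse2")),
        ("aes", std::arch::is_x86_feature_detected!("aes")),
        ("pclmulqdq", std::arch::is_x86_feature_detected!("pclmulqdq")),
        ("avx2", std::arch::is_x86_feature_detected!("avx2")),
    ]
}

#[cfg(target_arch = "aarch64")]
pub fn probe_features() -> Vec<(&'static str, bool)> {
    vec![
        ("neon", std::arch::is_aarch64_feature_detected!("neon")),
        ("aes", std::arch::is_aarch64_feature_detected!("aes")),
        ("pmull", std::arch::is_aarch64_feature_detected!("pmull")),
        ("sha2", std::arch::is_aarch64_feature_detected!("sha2")),
    ]
}

/// Other architectures: nothing is probed in this sketch.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
pub fn probe_features() -> Vec<(&'static str, bool)> {
    Vec::new()
}

/// Joins the feature names of a list with ", ".
fn names(features: &[(&'static str, bool)]) -> String {
    features.iter().map(|&(name, _)| name).collect::<Vec<_>>().join(", ")
}

/// Formats the probe results as enabled and disabled lists.
pub fn report() -> String {
    let (on, off): (Vec<_>, Vec<_>) = probe_features().into_iter().partition(|&(_, d)| d);
    format!(
        "Detected {} enabled features:\n    [{}]\nDetected {} disabled features:\n    [{}]",
        on.len(),
        names(&on),
        off.len(),
        names(&off)
    )
}
```

Calling report() from the library’s entry point and logging the returned string is enough to compare devices.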

I’ve tested my utility in the following scenarios (see my previous post about what happens if you use the wrong jniLibs/ sub-folder).

Device                   | Library
64-bit ARM Android phone | Correct 64-bit library (jniLibs/arm64-v8a/)
64-bit ARM Android phone | Fallback 32-bit library (due to incorrect folder jniLibs/arm64/)
64-bit Intel emulator    | Correct 64-bit library (jniLibs/x86_64/)
64-bit Intel emulator    | Fallback 32-bit library (due to incorrect folder jniLibs/x64/)

I obtained the following results.

  • Android phone on ARM CPU, 64-bit library.
SUPPORTED_ABIS: [arm64-v8a, armeabi-v7a, armeabi]
OS.ARCH: aarch64
Your CPU architecture is aarch64
Detected 3 enabled features:
    [asimd, fp, neon]
Detected 39 disabled features:
    [aes, bf16, bti, crc, dit, dotprod, dpb, dpb2, f32mm, f64mm, fcma, fhm, flagm, fp16, frintts, i8mm, jsconv, lse, lse2, mte, paca, pacg, pmull, rand, rcpc, rcpc2, rdm, sb, sha2, sha3, sm4, ssbs, sve, sve2, sve2-aes, sve2-bitperm, sve2-sha3, sve2-sm4, tme]
  • Android phone on ARM CPU, 32-bit library.
SUPPORTED_ABIS: [arm64-v8a, armeabi-v7a, armeabi]
OS.ARCH: armv8l
Your CPU architecture is arm
Detected 0 enabled features:
    []
Detected 7 disabled features:
    [aes, crc, crypto, i8mm, neon, pmull, sha2]
  • Android emulator on Intel CPU, 64-bit library.
SUPPORTED_ABIS: [x86_64, x86]
OS.ARCH: x86_64
Your CPU architecture is x86_64
Detected 18 enabled features:
    [abm, aes, avx, cmpxchg16b, f16c, fxsr, lzcnt, mmx, pclmulqdq, popcnt, sse, sse2, sse3, sse4.1, sse4.2, ssse3, tsc, xsave]
Detected 32 disabled features:
    [adx, avx2, avx512bf16, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512er, avx512f, avx512gfni, avx512ifma, avx512pf, avx512vaes, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vp2intersect, avx512vpclmulqdq, avx512vpopcntdq, bmi1, bmi2, fma, rdrand, rdseed, rtm, sha, sse4a, tbm, xsavec, xsaveopt, xsaves]
  • Android emulator on Intel CPU, 32-bit library.
SUPPORTED_ABIS: [x86_64, x86]
OS.ARCH: i686
Your CPU architecture is x86
Detected 18 enabled features:
    [abm, aes, avx, cmpxchg16b, f16c, fxsr, lzcnt, mmx, pclmulqdq, popcnt, sse, sse2, sse3, sse4.1, sse4.2, ssse3, tsc, xsave]
Detected 32 disabled features:
    [adx, avx2, avx512bf16, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512er, avx512f, avx512gfni, avx512ifma, avx512pf, avx512vaes, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vp2intersect, avx512vpclmulqdq, avx512vpopcntdq, bmi1, bmi2, fma, rdrand, rdseed, rtm, sha, sse4a, tbm, xsavec, xsaveopt, xsaves]

As an additional check, I looked at the CPU features reported as enabled on the same Intel CPU when running on the host rather than within the Android emulator. This shows a few more features that are not forwarded to the emulator!

Your CPU architecture is x86_64
Detected 28 enabled features:
    [abm, adx, aes, avx, avx2, bmi1, bmi2, cmpxchg16b, f16c, fma, fxsr, lzcnt, mmx, pclmulqdq, popcnt, rdrand, rdseed, sse, sse2, sse3, sse4.1, sse4.2, ssse3, tsc, xsave, xsavec, xsaveopt, xsaves]
Detected 22 disabled features:
    [avx512bf16, avx512bitalg, avx512bw, avx512cd, avx512dq, avx512er, avx512f, avx512gfni, avx512ifma, avx512pf, avx512vaes, avx512vbmi, avx512vbmi2, avx512vl, avx512vnni, avx512vp2intersect, avx512vpclmulqdq, avx512vpopcntdq, rtm, sha, sse4a, tbm]

Overall, these results were quite disappointing. Although the neon feature (base feature for SIMD on ARM) was detected on AArch64, the required features for the vmull_p64 instruction were not. Neither aes (the label Rust uses for vmull_p64) nor pmull (which looks more suitable for this instruction) was detected.

So I decided to cheat a bit and try running the instruction anyway.

// SIMD detection using the official methods.
pub fn pmul(a: u64, b: u64) -> (u128, &'static str) {
    #[cfg(target_arch = "aarch64")]
    {
        use std::arch::is_aarch64_feature_detected;
        if is_aarch64_feature_detected!("neon") && is_aarch64_feature_detected!("aes") {
            // Safety: target_features "neon" and "aes" are available in this block.
            return unsafe { pmul_aarch64_neon(a, b) };
        }
    }
    pmul_nosimd(a, b)
}

// SIMD detection with a bit of cheating.
pub fn pmul_cheat(a: u64, b: u64) -> (u128, &'static str) {
    #[cfg(target_arch = "aarch64")]
    {
        use std::arch::is_aarch64_feature_detected;
        // FIXME: Here we cheat and omit to detect the "aes" feature.
        if is_aarch64_feature_detected!("neon") {
            return unsafe { pmul_aarch64_neon(a, b) };
        }
    }
    pmul_nosimd(a, b)
}

// SIMD implementation.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon", enable = "aes")]
unsafe fn pmul_aarch64_neon(a: u64, b: u64) -> (u128, &'static str) {
    use std::arch::aarch64::vmull_p64;

    // Safety: target_features "neon" and "aes" are available in this function.
    let result: u128 = vmull_p64(a, b);
    (result, "aarch64_neon")
}

// Fallback implementation
pub fn pmul_nosimd(a: u64, b: u64) -> (u128, &'static str) {
    let mut tmp: u128 = b as u128;
    let mut result: u128 = 0;
    for i in 0..64 {
        if a & (1 << i) != 0 {
            result ^= tmp;
        }
        tmp <<= 1;
    }
    (result, "nosimd")
}

Running this code on the real device seemed to work. This code didn’t trigger any crash (the system didn’t send a SIGILL signal), and it also returned the expected result.

Testing polynomial multiplication instructions
pmul_nosimd(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
pmul_cheat(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = aarch64_neon]
pmul(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
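To convince myself that the fallback really computes a carry-less (polynomial) product rather than an ordinary integer product, a few small cases can be worked out by hand. The sketch below re-implements the same shift-and-XOR loop under a hypothetical name (clmul) so that it stands alone. For instance, (x + 1)·(x + 1) = x² + 1 over GF(2), i.e. 0b11 times 0b11 gives 0b101, not 0b1001.

```rust
/// Carry-less multiplication of two 64-bit polynomials over GF(2).
/// Same shift-and-XOR approach as the fallback implementation.
pub fn clmul(a: u64, b: u64) -> u128 {
    let mut shifted = b as u128;
    let mut acc = 0u128;
    for i in 0..64 {
        // If bit i of `a` is set, add (XOR) `b << i` into the result.
        if (a >> i) & 1 == 1 {
            acc ^= shifted;
        }
        shifted <<= 1;
    }
    acc
}

/// A few hand-checked products.
pub fn hand_checked_cases() -> [(u64, u64, u128); 3] {
    [
        (0b11, 0b11, 0b101),   // (x + 1)^2 = x^2 + 1
        (0b101, 0b11, 0b1111), // (x^2 + 1)(x + 1) = x^3 + x^2 + x + 1
        (0xef, 0x21, 0x1d0f),  // 0xef ^ (0xef << 5)
    ]
}
```

The last case matches the low byte (0f) of the result printed above, since the inputs end in ef and 21.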

So it seemed that something was off with the way Rust detected the CPU features at runtime. Either my device supported the aes feature, or the vmull_p64 instruction was mis-labeled as requiring this feature when it didn’t require it.

Review of CPU feature detection methods

Let’s try to understand in more detail how the CPU features can be obtained. On Intel, there is a specific cpuid instruction which returns all sorts of information about the current CPU, including the supported features. This instruction can be called from any application, making dynamic feature detection straightforward and fast.

On ARM, things are more complicated. There are some CPUID registers, which can be read via a dedicated mrs instruction, but these registers can only be accessed in privileged mode (i.e. in the operating system), not in userspace applications. Therefore, multiple workarounds have been developed to make the features accessible.

To better understand, I’ve dug into the Rust standard library source code, in particular std_detect/src/detect/os/aarch64.rs.

The first option is to still call the mrs instruction, on operating systems that trap it. The idea is that calling mrs from userspace will trigger a fault, but the OS can catch it, call mrs itself (from privileged mode), and return the results to the userspace application.

#[cfg(target_arch = "aarch64")]
fn parse_mrs() -> u64 {
    // The asm! macro must be imported explicitly.
    use std::arch::asm;

    // ID_AA64ISAR0_EL1 - Instruction Set Attribute Register 0
    let aa64isar0: u64;
    unsafe {
        asm!(
            "mrs {}, ID_AA64ISAR0_EL1",
            out(reg) aa64isar0,
            options(pure, nomem, preserves_flags, nostack)
        );
    }
    aa64isar0
}

The Rust code mentions that this is implemented on Linux >= 4.11 (see ARM64 CPU Feature Registers). However, calling it from an Android application triggered a SIGILL signal (i.e. we’re trying to call an illegal instruction), so my test phone didn’t support that.

5892 D SimdApplication: Your CPU architecture is aarch64
--------- beginning of crash
5892 F libc    : Fatal signal 4 (SIGILL), code 1, fault addr 0x7ceeb080b0 in tid 5892 (rustapplication)
5976 I crash_dump64: obtaining output fd from tombstoned
4782 I /system/bin/tombstoned: received crash request for pid 5892
5976 I crash_dump64: performing dump of process 5892 (target tid = 5892)
5976 F DEBUG   : *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
5976 F DEBUG   : Build fingerprint: 'samsung/heroltexx/herolte:8.0.0/R16NW/G930FXXU8ETI2:user/release-keys'
5976 F DEBUG   : Revision: '8'
5976 F DEBUG   : ABI: 'arm64'
5976 F DEBUG   : pid: 5892, tid: 5892, name: rustapplication  >>> com.example.myrustapplication <<<
5976 F DEBUG   : signal 4 (SIGILL), code 1 (ILL_ILLOPC), fault addr 0x7ceeb080b0
5976 F DEBUG   :     x0   0000000000000080  x1   0000007d114e8d00  x2   0000007d11400000  x3   0000000000000008
5976 F DEBUG   :     x4   00000000000000e8  x5   0000007d0f81892d  x6   0000000000000000  x7   0000007fd566c888
5976 F DEBUG   :     x8   000000000000000f  x9   ebd10a228ed9f2f7  x10  00000000000000e8  x11  0000000000000000
5976 F DEBUG   :     x12  000000000000000b  x13  0000000000000001  x14  ffffffffffffffff  x15  0000000000000000
5976 F DEBUG   :     x16  0000007d122f0cc0  x17  0000007d1228e4fc  x18  0000000000000020  x19  0000007fd566cff0
5976 F DEBUG   :     x20  0000000000000027  x21  0000007ceec76c00  x22  0000007ceebc73b4  x23  000000000000000f
5976 F DEBUG   :     x24  0000007ceeb02278  x25  0000007fd566d1d8  x26  0000000000000000  x27  0000000000000000
5976 F DEBUG   :     x28  0000000000000000  x29  0000007fd566d010  x30  0000007ceeb080a4
5976 F DEBUG   :     sp   0000007fd566ce20  pc   0000007ceeb080b0  pstate 0000000060000000
5976 F DEBUG   : 
5976 F DEBUG   : backtrace:
5976 F DEBUG   :     #00 pc 000000000000f0b0  /data/app/com.example.myrustapplication-mSiOxJoO5o8TP6sJWzy6jQ==/lib/arm64/libsimd.so (offset 0xdd000)
5976 F DEBUG   :     #01 pc 000000000000f0a0  /data/app/com.example.myrustapplication-mSiOxJoO5o8TP6sJWzy6jQ==/lib/arm64/libsimd.so (offset 0xdd000)
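
Had the mrs read succeeded, the raw register value would still need decoding. Here is a minimal sketch of that step, assuming the ID_AA64ISAR0_EL1 field layout as I understand it from the Arm architecture reference manual: 4-bit fields with AES in bits [7:4] (1 = AES only, 2 = AES plus PMULL), SHA1 in [11:8], SHA2 in [15:12] and CRC32 in [19:16]. The struct and function names are hypothetical.

```rust
/// Subset of the features encoded in ID_AA64ISAR0_EL1.
#[derive(Debug, PartialEq, Eq)]
pub struct CryptoSupport {
    pub aes: bool,
    pub pmull: bool,
    pub sha1: bool,
    pub sha2: bool,
    pub crc32: bool,
}

/// Extracts the 4-bit field whose least significant bit is `lsb`.
fn field(reg: u64, lsb: u32) -> u64 {
    (reg >> lsb) & 0xf
}

/// Decodes a raw ID_AA64ISAR0_EL1 value (assumed field layout, see above).
pub fn decode_isar0(reg: u64) -> CryptoSupport {
    let aes = field(reg, 4);
    CryptoSupport {
        aes: aes >= 1,
        // PMULL is reported through the same field as AES, with value 2.
        pmull: aes >= 2,
        sha1: field(reg, 8) >= 1,
        sha2: field(reg, 12) >= 1,
        crc32: field(reg, 16) >= 1,
    }
}
```

With a made-up register value of 0x11120, all five flags come out as true; with 0x10 only aes is set.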

Taking a closer look, some more methods are implemented specifically for Linux in std_detect/src/detect/os/linux/aarch64.rs. One way is to call the getauxval function, which retrieves a bitmask of hardware capabilities from the AT_HWCAP entry (and also from AT_HWCAP2 on 32-bit ARM).

HWCAP features found in getauxval: 00000000000000ff

As a fallback, the same information is available in the /proc/self/auxv file (entry 16). Note however that Android applications in release mode won’t be able to read this file.

Contents of /proc/self/auxv:
     0 = 0000000000000000 / 0000000000000000000000000000000000000000000000000000000000000000
     3 = 0000007d166cd040 / 0000000000000000000000000111110100010110011011001101000001000000
     4 = 0000000000000038 / 0000000000000000000000000000000000000000000000000000000000111000
...
    16 = 00000000000000ff / 0000000000000000000000000000000000000000000000000000000011111111
...
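
For reference, the file is a flat sequence of (key, value) pairs terminated by a zero key. The following is a minimal sketch of a parser, assuming a 64-bit process where both key and value are native-endian 8-byte integers; the function names are hypothetical and this is not the std_detect code.

```rust
/// Key terminating the auxiliary vector.
pub const AT_NULL: u64 = 0;
/// Key of the hardware capabilities bitmask (entry 16).
pub const AT_HWCAP: u64 = 16;

/// Reads a native-endian u64 from an 8-byte slice.
fn read_u64(bytes: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(bytes);
    u64::from_ne_bytes(buf)
}

/// Parses the raw contents of an auxv file into (key, value) pairs,
/// stopping at the AT_NULL terminator. Assumes 64-bit entries.
pub fn parse_auxv(bytes: &[u8]) -> Vec<(u64, u64)> {
    let mut entries = Vec::new();
    for pair in bytes.chunks_exact(16) {
        let key = read_u64(&pair[..8]);
        let value = read_u64(&pair[8..]);
        if key == AT_NULL {
            break;
        }
        entries.push((key, value));
    }
    entries
}

/// Looks up the AT_HWCAP entry, if present.
pub fn hwcap_from_auxv(bytes: &[u8]) -> Option<u64> {
    parse_auxv(bytes)
        .into_iter()
        .find(|&(key, _)| key == AT_HWCAP)
        .map(|(_, value)| value)
}

/// Reads the current process's own auxiliary vector.
/// Returns None if the file is missing or unreadable.
pub fn read_own_hwcap() -> Option<u64> {
    let bytes = std::fs::read("/proc/self/auxv").ok()?;
    hwcap_from_auxv(&bytes)
}
```

Keeping the parsing separate from the file access makes it possible to test the logic on any host, with bytes built by hand.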

In both cases, on my device the output is 0x00000000000000ff, whose bitmask format is defined in Linux in arch/arm64/include/uapi/asm/hwcap.h. In my example the lowest 8 bits are all set, so it looks like the aes and pmull features should be supported (bits 3 and 4).

#define HWCAP_FP		(1 << 0)
#define HWCAP_ASIMD		(1 << 1)
#define HWCAP_EVTSTRM		(1 << 2)
#define HWCAP_AES		(1 << 3)
#define HWCAP_PMULL		(1 << 4)
#define HWCAP_SHA1		(1 << 5)
#define HWCAP_SHA2		(1 << 6)
#define HWCAP_CRC32		(1 << 7)
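
Decoding the bitmask by hand is error-prone, so here is a minimal sketch that maps the eight bits listed above to lowercase labels. The function name and the label spelling are my own choices; only the bit positions come from the header.

```rust
/// Labels for bits 0 to 7 of AT_HWCAP on arm64, in bit order.
const HWCAP_NAMES: [&str; 8] = [
    "fp", "asimd", "evtstrm", "aes", "pmull", "sha1", "sha2", "crc32",
];

/// Returns the labels of all bits set in the low byte of `hwcap`.
pub fn decode_hwcap(hwcap: u64) -> Vec<&'static str> {
    HWCAP_NAMES
        .iter()
        .enumerate()
        .filter(|&(bit, _)| hwcap & (1u64 << bit) != 0)
        .map(|(_, &name)| name)
        .collect()
}
```

For the value 0xff reported on my device, this returns all eight labels, including aes and pmull.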

As a last resort to detect CPU features, we can parse the /proc/cpuinfo file (std_detect/src/detect/os/linux/cpuinfo.rs), which should contain the same information as getauxval but in human-readable format.

processor	: 0
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x41
CPU architecture: 8
CPU variant	: 0x0
CPU part	: 0xd03
CPU revision	: 4

...

processor	: 5
Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer	: 0x53
CPU architecture: 8
CPU variant	: 0x1
CPU part	: 0x001
CPU revision	: 1
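
Parsing this format is straightforward. Below is a minimal sketch, not the std_detect implementation: the function name is hypothetical, and intersecting the feature sets of all cores is my own conservative choice, given that the file lists cores from two different implementers on my device.

```rust
use std::collections::BTreeSet;

/// Returns the features listed for every processor in a /proc/cpuinfo dump.
/// A feature missing from any core's "Features" line is excluded.
pub fn common_cpu_features(cpuinfo: &str) -> BTreeSet<String> {
    let mut common: Option<BTreeSet<String>> = None;
    for line in cpuinfo.lines() {
        // Each line has the shape "key<tab>: value".
        let (key, value) = match line.split_once(':') {
            Some(kv) => kv,
            None => continue,
        };
        if key.trim() != "Features" {
            continue;
        }
        let features: BTreeSet<String> =
            value.split_whitespace().map(str::to_owned).collect();
        common = Some(match common {
            None => features,
            Some(prev) => prev.intersection(&features).cloned().collect(),
        });
    }
    common.unwrap_or_default()
}
```

On the excerpt above, both cores list the same eight features, so the result contains pmull.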

Looking at the outputs, my Android device indeed supports the pmull feature needed for the vmull_p64 instruction, which made me think that there is something missing in the std_detect crate.

Taking a step back, the root of the std_detect crate dispatches detection code for the various CPU architectures and operating systems. In particular, the branch for Linux is the following.

    } else if #[cfg(all(target_os = "linux", feature = "libc"))] {
        #[path = "os/linux/mod.rs"]
        mod os;
    } else if ...
...
    } else {
        #[path = "os/other.rs"]
        mod os;
    }

However, while Android is based on Linux, I remembered that Rust has a specific target_os = "android", that we used in the previous post. The target_os for Android is defined here in the Rust compiler’s source code. On Android, the target_os = "linux" branch therefore doesn’t apply in the std_detect crate, and a fallback is used instead, which doesn’t detect anything!

Now, you may be wondering why a few features are detected on aarch64, if the fallback detects nothing?

Detected 3 enabled features:
    [asimd, fp, neon]

Digging deeper, it turns out that these features are defined at compile time for the whole aarch64-linux-android target, the rationale being that they are implemented on all arm64-v8a Android devices. For each such feature (e.g. neon), this enables cfg(target_feature = "neon") at compile time, and the std::arch::is_..._feature_detected! macros have a shortcut that returns true whenever a feature is enabled at compile time.
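
Conceptually, the shortcut behaves like the sketch below. This is a simplified model with hypothetical function names, not the macro’s actual expansion: the compile-time flag is checked first, and the runtime probe only runs when the flag is off.

```rust
/// Simplified model of the detection macros: a feature enabled at
/// compile time short-circuits the runtime probe entirely.
pub fn detected(enabled_at_compile_time: bool, runtime_probe: impl FnOnce() -> bool) -> bool {
    enabled_at_compile_time || runtime_probe()
}

/// Models what happens for neon when the runtime backend detects nothing:
/// the result is decided by cfg!(target_feature = "neon") alone.
pub fn neon_with_empty_fallback() -> bool {
    detected(cfg!(target_feature = "neon"), || false)
}
```

This is consistent with the output above: a backend that detects nothing still reports the features that are baked into the target.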

To summarize, we’ll need to fix the Rust compiler’s std_detect crate to also run the Linux code for Android targets, for example with the following patch.

-    } else if #[cfg(all(target_os = "linux", feature = "libc"))] {
+    } else if #[cfg(all(any(target_os = "linux", target_os = "android"), feature = "libc"))] {
         #[path = "os/linux/mod.rs"]
         mod os;

I originally thought of the following patch, which drops the libc requirement on Android (is there any Android system without libc?). However, as pointed out in the review of my pull request, the libc feature check may still be needed when building a no_std program on Android.

-    } else if #[cfg(all(target_os = "linux", feature = "libc"))] {
+    } else if #[cfg(any(all(target_os = "linux", feature = "libc"), target_os = "android"))] {
         #[path = "os/linux/mod.rs"]
         mod os;
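
To make the difference between the two conditions concrete, here is a minimal sketch that models both cfg predicates as plain boolean functions (hypothetical names, with the target OS passed as a string) and lists the configurations where they disagree.

```rust
/// Model of the first patch: Linux or Android, and only with the libc feature.
pub fn accepted_patch(target_os: &str, libc_feature: bool) -> bool {
    (target_os == "linux" || target_os == "android") && libc_feature
}

/// Model of my original idea: Android unconditionally.
pub fn original_idea(target_os: &str, libc_feature: bool) -> bool {
    (target_os == "linux" && libc_feature) || target_os == "android"
}

/// Enumerates the (target_os, libc feature) pairs on which the two differ.
pub fn disagreements() -> Vec<(&'static str, bool)> {
    let mut out = Vec::new();
    for &os in ["linux", "android", "other"].iter() {
        for &libc in [true, false].iter() {
            if accepted_patch(os, libc) != original_idea(os, libc) {
                out.push((os, libc));
            }
        }
    }
    out
}
```

The only disagreement is Android without the libc feature, which is exactly the no_std case raised in the review.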

Patching the Rust compiler for Android targets

Now that we’ve found the likely cause of mis-detection of CPU features, let’s build a patched version of the Rust compiler to test our changes. While this seemed like a daunting process at first, it turned out to be easier than I expected.

The overall documentation to build and run the compiler can be found in the rustc development guide. I’ll detail here the specific steps we need to patch our compiler for the Android targets.

Before getting started, you may need to install the following packages. The full list is documented here.

apt-get install -y \
    python3 \
    curl \
    cmake \
    git

The x.py script

The first step is to clone the source code. I’m using --depth=1 to avoid downloading the whole history (already more than 200,000 commits), which would take up more than 1 GB in the .git/ folder on my machine.

$ git clone --depth=1 https://github.com/rust-lang/rust
Cloning into 'rust'...
remote: Enumerating objects: 40781, done.
remote: Counting objects: 100% (40781/40781), done.
remote: Compressing objects: 100% (34891/34891), done.
remote: Total 40781 (delta 5433), reused 17959 (delta 4551), pack-reused 0
Receiving objects: 100% (40781/40781), 25.09 MiB | 9.15 MiB/s, done.
Resolving deltas: 100% (5433/5433), done.
Updating files: 100% (38652/38652), done.

For reproducibility, all the outputs shown in this section are pinned to Rust version 1.67.0-nightly (1286ee23e 2022-11-05). As discussed on rustup/issues/817, you can obtain the commit corresponding to the latest nightly by running:

wget https://static.rust-lang.org/dist/channel-rust-nightly.toml -O - 2> /dev/null | grep -E "(commit|version)"

You can then fetch the Rust repository at a specific commit with the following commands (see this StackOverflow answer).

mkdir rust
cd rust/
git init
git remote add origin https://github.com/rust-lang/rust
git fetch --depth 1 origin 1286ee23e4e2dec8c1696d3d76c6b26d97bbcf82
git checkout FETCH_HEAD

Now that we have the source code, we’ll build the compiler itself. This requires going through multiple stages, in a process called bootstrapping and documented in the rustc development guide (I also recommend the RustConf 2022 talk on the topic). Everything goes via the x.py script at the root of the compiler’s source code.

This x.py Python script is actually a thin wrapper around a Rust binary (rustbuild, found in the src/bootstrap/ folder). Invoking it will already execute a few steps.

$ ./x.py
  • Giving some information about the bootstrapping process.
info: Downloading and building bootstrap before processing --help
      command. See src/bootstrap/README.md for help with common
      commands.
  • Downloading the latest beta compiler.
downloading https://static.rust-lang.org/dist/2022-09-20/rust-std-beta-x86_64-unknown-linux-gnu.tar.xz
############################################################################################################################################################################################################# 100.0%
extracting /home/dev/rustc-build/rust/build/cache/2022-09-20/rust-std-beta-x86_64-unknown-linux-gnu.tar.xz
downloading https://static.rust-lang.org/dist/2022-09-20/rustc-beta-x86_64-unknown-linux-gnu.tar.xz
############################################################################################################################################################################################################# 100.0%
extracting /home/dev/rustc-build/rust/build/cache/2022-09-20/rustc-beta-x86_64-unknown-linux-gnu.tar.xz
downloading https://static.rust-lang.org/dist/2022-09-20/cargo-beta-x86_64-unknown-linux-gnu.tar.xz
############################################################################################################################################################################################################# 100.0%
extracting /home/dev/rustc-build/rust/build/cache/2022-09-20/cargo-beta-x86_64-unknown-linux-gnu.tar.xz
  • Compiling the rustbuild binary.
Building rustbuild
    Updating crates.io index
  Downloaded memchr v2.5.0
...
  Downloaded 53 crates (5.1 MB) in 1.13s
   Compiling memchr v2.5.0
...
    Finished dev [unoptimized] target(s) in 1m 37s

Lastly the help is printed.

Usage: x.py <subcommand> [options] [<paths>...]

Subcommands:
    build, b    Compile either the compiler or libraries
    check, c    Compile either the compiler or libraries, using cargo check
    clippy      Run clippy (uses rustup/cargo-installed clippy binary)
    fix         Run cargo fix
    fmt         Run rustfmt
    test, t     Build and run some test suites
    bench       Build and run some benchmarks
    doc, d      Build documentation
    clean       Clean out build directories
    dist        Build distribution artifacts
    install     Install distribution artifacts
    run, r      Run tools contained in this repository
    setup       Create a config.toml (making it easier to use `x.py` itself)

To learn more about a subcommand, run `./x.py <subcommand> -h`

Configuring the build process with config.toml

Before proceeding, we need to create a config.toml file that the build system will be able to use. One way is to invoke the setup command to do it interactively.

$ ./x.py setup

This executes the following steps.

  • Downloading the latest beta compiler and compiling rustbuild – skipped if you’ve already invoked x.py before.
Building rustbuild
    Finished dev [unoptimized] target(s) in 0.05s
  • An interactive selection of what to configure for. I’ve chosen the library option, as we’re just patching the std_detect crate, part of the Rust standard library.
Welcome to the Rust project! What do you want to do with x.py?
a) library: Contribute to the standard library
b) compiler: Contribute to the compiler itself
c) codegen: Contribute to the compiler, and also modify LLVM or codegen
d) tools: Contribute to tools which depend on the compiler, but do not modify it directly (e.g. rustdoc, clippy, miri)
e) user: Install Rust from source
Please choose one (a/b/c/d/e): a
  • Downloading a bunch of git submodules.
Updating submodule src/tools/rust-installer
...
Submodule 'src/rust-installer' (https://github.com/rust-lang/rust-installer.git) registered for path 'src/tools/rust-installer'
Submodule path 'src/tools/rust-installer': checked out '300b5ec61ef38855a07e6bb4955a37aa1c414c00'
Submodule 'src/tools/cargo' (https://github.com/rust-lang/cargo.git) registered for path 'src/tools/cargo'
Submodule path 'src/tools/cargo': checked out '9286a1beba5b28b115bad67de2ae91fb1c61eb0b'
Submodule 'library/backtrace' (https://github.com/rust-lang/backtrace-rs.git) registered for path 'library/backtrace'
Submodule path 'library/backtrace': checked out '07872f28cd8a65c3c7428811548dc85f1f2fb05b'
Submodule 'library/stdarch' (https://github.com/rust-lang/stdarch.git) registered for path 'library/stdarch'
Submodule path 'library/stdarch': checked out '790411f93c4b5eada3c23abb4c9a063fb0b24d99'
Submodule 'crates/intrinsic-test/acle' (https://github.com/ARM-software/acle.git) registered for path 'library/stdarch/crates/intrinsic-test/acle'
Submodule path 'library/stdarch/crates/intrinsic-test/acle': checked out '5626f85f469f419db16f20b1614863aeb377c22b'
  • An interactive configuration for adding a git hook for the tidy check. I’ve just skipped this part.
`x.py` will now use the configuration at /.../rust/src/bootstrap/defaults/config.library.toml

Added `stage1` rustup toolchain; try `cargo +stage1 build` on a separate rust project to run a newly-built toolchain

Rust's CI will automatically fail if it doesn't pass `tidy`, the internal tool for ensuring code quality.
If you'd like, x.py can install a git hook for you that will automatically run `tidy --bless` before
pushing your code to ensure your code is up to par. If you decide later that this behavior is
undesirable, simply delete the `pre-push` file from .git/hooks.
Would you like to install the git hook?: [y/N] 
Ok, skipping installation!

To get started, try one of the following commands:
- `x.py check`
- `x.py build`
- `x.py test library/std`
- `x.py doc`
For more suggestions, see https://rustc-dev-guide.rust-lang.org/building/suggested.html

Instead of running ./x.py setup, you can also directly create the following config.toml file.

# Includes one of the default files in src/bootstrap/defaults
profile = "library"
changelog-seen = 2

Building the compiler

Once the build system is configured, we first build a stage 0 compiler.

$ ./x.py build --stage 0

As with the previous x.py invocations, this step isn’t strictly necessary: you can directly use the x.py build --stage 1 command shown below. I’m simply breaking down the steps here for clarity.

Under the hood, this stage 0 build consists of the following steps.

  • If you’re running x.py for the first time (you directly provided the config.toml file), the steps mentioned in the previous section are executed to initialize the build system: downloading the latest beta compiler, compiling the rustbuild binary, and downloading the git submodules. If you’ve run x.py before, they are skipped.
  • Building the stage 0 std artifacts for the host architecture (in my case x86_64-unknown-linux-gnu).
Building stage0 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
  Downloaded hashbrown v0.12.3
...
  Downloaded 11 crates (2.0 MB) in 0.48s
   Compiling compiler_builtins v0.1.82
...
    Finished release [optimized] target(s) in 32.25s
Copying stage0 std from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)

You can then add this stage 0 compiler to rustup, to be able to invoke it via cargo +stage0 build.

$ rustup toolchain link stage0 build/x86_64-unknown-linux-gnu/stage0

Once we have this stage 0 compiler, we can apply our patch (to a file in the library/stdarch submodule downloaded in a previous step), and build a stage 1 compiler for the Android targets.

A pre-requisite will be to configure the path to the Android NDK in the compiler’s config.toml.

[target.aarch64-linux-android]
android-ndk = "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64"

[target.armv7-linux-androideabi]
android-ndk = "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64"

[target.i686-linux-android]
android-ndk = "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64"

[target.x86_64-linux-android]
android-ndk = "/home/dev/opt/android-sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/linux-x86_64"

If these entries are missing in the config.toml, the build script will fail with the following error.

thread 'main' panicked at '

couldn't find required command: "aarch64-linux-android-clang"

', sanity.rs:59:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

We can now build the stage 1 compiler, specifying the relevant targets.

$ ./x.py build --stage 1 \
    --target x86_64-unknown-linux-gnu \
    --target aarch64-linux-android \
    --target armv7-linux-androideabi \
    --target i686-linux-android \
    --target x86_64-linux-android

Although we’re only using this compiler for the Android targets, it is necessary to compile for the host target as well (--target x86_64-unknown-linux-gnu in my case). Otherwise, when you use this compiler to build your Android libraries, you’ll encounter errors where the std crate cannot be found.

$ cargo +stage1 build --target aarch64-linux-android --release
    Updating crates.io index
...
  Downloaded 18 crates (1.4 MB) in 0.55s
...
   Compiling libc v0.2.132
error[E0463]: can't find crate for `std`
...
error: could not compile `libc` due to 55 previous errors

The x.py build --stage 1 command runs the following steps.

  • Re-building the stage0 standard library, because we’ve just applied our patch to std_detect.
Building rustbuild
    Finished dev [unoptimized] target(s) in 0.05s
Building stage0 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
   Compiling std_detect v0.1.5 (/home/dev/rustc-build/rust/library/stdarch/crates/std_detect)
...
    Finished release [optimized] target(s) in 3.32s
Copying stage0 std from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
  • Downloading some pre-built LLVM.
downloading https://ci-artifacts.rust-lang.org/rustc-builds/1286ee23e4e2dec8c1696d3d76c6b26d97bbcf82/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz
############################################################################################################################################################################################################# 100.0%
extracting /home/dev/rustc-build/rust/build/cache/llvm-1286ee23e4e2dec8c1696d3d76c6b26d97bbcf82-false/rust-dev-nightly-x86_64-unknown-linux-gnu.tar.xz to /home/dev/rustc-build/rust/build/x86_64-unknown-linux-gnu/ci-llvm
  • Building a proper stage0 compiler. It seems that the previous step cheated a bit: even though the stage0 standard library was built, the pre-built beta compiler was used. This step can be manually built with x.py build --stage 0 compiler.
Building stage0 compiler artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
  Downloaded cpufeatures v0.2.1
...
  Downloaded 112 crates (5.2 MB) in 1.36s (largest was `snap` at 1.1 MB)
   Compiling proc-macro2 v1.0.46
...
    Finished release [optimized] target(s) in 7m 40s
Copying stage0 rustc from stage0 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
  • Building the stage1 standard library, for all the targets that we specified. This step can be manually built with x.py build --stage 1 library/std (providing the relevant --target flags).
Assembling stage1 compiler (x86_64-unknown-linux-gnu)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu)
...
    Finished release [optimized] target(s) in 46.78s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-unknown-linux-gnu)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> aarch64-linux-android)
...
    Finished release [optimized] target(s) in 46.63s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / aarch64-linux-android)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> armv7-linux-androideabi)
...
    Finished release [optimized] target(s) in 44.43s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / armv7-linux-androideabi)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> i686-linux-android)
...
    Finished release [optimized] target(s) in 45.59s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / i686-linux-android)
Building stage1 std artifacts (x86_64-unknown-linux-gnu -> x86_64-linux-android)
...
    Finished release [optimized] target(s) in 45.88s
Copying stage1 std from stage1 (x86_64-unknown-linux-gnu -> x86_64-unknown-linux-gnu / x86_64-linux-android)
  • Building rustdoc. This can be done directly with x.py build --stage 1 src/tools/rustdoc.
Building rustdoc for stage1 (x86_64-unknown-linux-gnu)
  Downloaded unicase v2.6.0
...
  Downloaded 10 crates (531.5 KB) in 0.78s
   Compiling proc-macro2 v1.0.46
...
    Finished release [optimized] target(s) in 1m 27s

Even though we could in principle skip building rustdoc by specifying --stage 1 library/std only, in practice this prevents compiling crates you may be depending on such as proc-macro2.

error[E0463]: can't find crate for `proc_macro`
   --> /home/dev/.cargo/registry/src/index.crates.io-93a9e06d8945f2c0/proc-macro2-1.0.47/src/lib.rs:122:1
    |
122 | extern crate proc_macro;
    | ^^^^^^^^^^^^^^^^^^^^^^^^ can't find crate

error[E0635]: unknown feature `proc_macro_span`
  --> /home/dev/.cargo/registry/src/index.crates.io-93a9e06d8945f2c0/proc-macro2-1.0.47/src/lib.rs:92:13
   |
92 |     feature(proc_macro_span, proc_macro_span_shrink)
   |             ^^^^^^^^^^^^^^^

error[E0635]: unknown feature `proc_macro_span_shrink`
  --> /home/dev/.cargo/registry/src/index.crates.io-93a9e06d8945f2c0/proc-macro2-1.0.47/src/lib.rs:92:30
   |
92 |     feature(proc_macro_span, proc_macro_span_shrink)
   |                              ^^^^^^^^^^^^^^^^^^^^^^

Some errors have detailed explanations: E0463, E0635.
For more information about an error, try `rustc --explain E0463`.
error: could not compile `proc-macro2` due to 3 previous errors

Lastly, let’s add our patched compiler to rustup, so that we can use it with cargo +stage1 build.

$ rustup toolchain link stage1 build/x86_64-unknown-linux-gnu/stage1

Application to CPU feature detection

After re-compiling my Android Rust libraries with the patched compiler (cargo +stage1 build ...), I verified that the correct features were detected on my test phone.

  • Android phone on ARM-64 CPU, 64-bit library. Newly detected features: aes, crc, pmull, sha2.
SUPPORTED_ABIS: [arm64-v8a, armeabi-v7a, armeabi]
OS.ARCH: aarch64
Your CPU architecture is aarch64
Detected 7 enabled features:
    [aes, asimd, crc, fp, neon, pmull, sha2]
Detected 35 disabled features:
    [bf16, bti, dit, dotprod, dpb, dpb2, f32mm, f64mm, fcma, fhm, flagm, fp16, frintts, i8mm, jsconv, lse, lse2, mte, paca, pacg, rand, rcpc, rcpc2, rdm, sb, sha3, sm4, ssbs, sve, sve2, sve2-aes, sve2-bitperm, sve2-sha3, sve2-sm4, tme]
...
HWCAP features found in getauxval: 00000000000000ff
Found 8 features in /proc/cpuinfo:
    [aes, asimd, crc32, evtstrm, fp, pmull, sha1, sha2]
...
Testing polynomial multiplication instructions
pmul_nosimd(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
pmul_cheat(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = aarch64_neon]
pmul(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = aarch64_neon]
  • Android phone on ARM-64 CPU, 32-bit library. Newly detected features: aes, crc, crypto, neon, pmull, sha2.
SUPPORTED_ABIS: [arm64-v8a, armeabi-v7a, armeabi]
OS.ARCH: armv8l
Your CPU architecture is arm
Detected 6 enabled features:
    [aes, crc, crypto, neon, pmull, sha2]
Detected 1 disabled features:
    [i8mm]
...
HWCAP features found in /proc/self/auxv (13 bits are set): 0037b0d6 / 00000000001101111011000011010110
Found 18 features in /proc/cpuinfo:
    [aes, crc32, edsp, evtstrm, fastmult, half, idiva, idivt, lpae, neon, pmull, sha1, sha2, thumb, tls, vfp, vfpv3, vfpv4]
...
Testing polynomial multiplication instructions
pmul_nosimd(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
pmul_cheat(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
pmul(1234567890abcdef, fedcba0987654321) = 0e038d8eab3af47a1f31f87ebb8c810f [strategy = nosimd]
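
To make the strategy labels in the output above more concrete, here is a minimal sketch of runtime dispatch between a portable fallback and the NEON intrinsic on 64-bit ARM. The helper pmul_neon and the returned strategy strings are illustrative assumptions that mirror the output, not necessarily the code used for these tests, and the 32-bit ARM case is not covered.

```rust
/// Portable carry-less (polynomial) multiplication of two 64-bit values.
fn pmul_nosimd(a: u64, b: u64) -> u128 {
    let mut result = 0u128;
    for i in 0..64 {
        // For each set bit of `b`, XOR in a shifted copy of `a`.
        if (b >> i) & 1 == 1 {
            result ^= (a as u128) << i;
        }
    }
    result
}

/// Hypothetical wrapper around the NEON intrinsic (64-bit ARM only).
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon,aes")]
unsafe fn pmul_neon(a: u64, b: u64) -> u128 {
    std::arch::aarch64::vmull_p64(a, b)
}

/// Picks an implementation at runtime and reports which one was used.
fn pmul(a: u64, b: u64) -> (u128, &'static str) {
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon")
            && std::arch::is_aarch64_feature_detected!("aes")
        {
            // SAFETY: both required features were just detected at runtime.
            return (unsafe { pmul_neon(a, b) }, "aarch64_neon");
        }
    }
    (pmul_nosimd(a, b), "nosimd")
}

fn main() {
    let (a, b) = (0x1234567890abcdef_u64, 0xfedcba0987654321_u64);
    let (product, strategy) = pmul(a, b);
    println!("pmul({a:016x}, {b:016x}) = {product:032x} [strategy = {strategy}]");
}
```

On other architectures the aarch64 branch is compiled out, so pmul always falls back to the portable loop.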

Resources needed to build rustc

Overall, I was pleasantly surprised by how few resources were needed to build rustc. I was expecting it to use a lot of disk space, CPU time and RAM, but it turned out to be quite manageable on a laptop.

Here is a summary of the resources that were needed for Rust version 1.67.0-nightly (1286ee23e 2022-11-05).

Disk space

Let’s start with disk space. I was originally planning to compile everything in RAM if possible, via a tmpfs, but the disk usage turned out to be around 10 GB, more than my laptop’s RAM. Still, this means that with at least 16 GB of RAM, you should be able to compile everything in RAM, potentially making the build process a bit faster.

On this front, I had filed an issue 3 years ago to see if some low-hanging fruit could be optimized, in particular the cloning of git submodules. Although there was no immediate improvement at that time (and I had more or less given up on building my own rustc), I was pleasantly surprised to see that my issue was acted upon in the meantime (rust-lang/rust/76653, rust-lang/rust/89757).

To give you an idea, I’ve made the following flame graph of the disk space used by my build. The contents of these folders are documented here.

[Figure: disk usage (flame graph) of building rustc]

A few insights from this.

  • The compiler itself (stage0-rustc/) uses more than 4 GB.
  • Each instance of the standard library (stage0-std/ and each flavor of stage1-std/) takes about 500 MB.
  • Some files are copied in multiple places, but using hard links so that the data is only stored once on disk (in that case my flame graph only shows one of the copies). This is notably the case of the stage0-sysroot/ and stage1/ folders. This saves about 2 GB.
  • Among all the compilation artifacts, most of the incremental folders are filled with files named query-cache.bin and dep-graph.bin, in total 45% of the space (the graph is searchable).
  • The source code and .git/ folders use a negligible space (notably thanks to rust-lang/rust/89757).

If one wants to optimize for disk space, there would certainly be room for improvement by removing the intermediate compilation artifacts after each step of the build process.

Creating this graph from disk usage was an interesting side project in itself. As it turns out, Brendan Gregg’s flame graph tool takes input in a very simple format, which is not restricted to visualizing stack traces of profiled software.

One could almost create a suitable input with a bunch of Unix tools, but in this form, directories would be counted twice, completely messing up the graph.

# List the size of all files and directories.
du -a -l --bytes > du.txt
# Convert the results into a format suitable for flamegraph.pl.
cat du.txt | sed 's/\//;/g' | sed 's/\t.;/\t/g' | awk '{print $2,$1}' | head -n -1 > du.samples
# Generate the flame graph.
flamegraph.pl --title "Disk usage" --countname "bytes" --nametype "File:" du.samples > du.svg

Instead, I ended up writing a small Rust program to traverse the filesystem and print file sizes as “stack traces” in the correct format.
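
A minimal sketch of such a program could look as follows. This is not the actual tool, only an illustration of the idea: the function name collect_samples is made up, it reports apparent file sizes rather than allocated disk blocks, and it does not de-duplicate hard links. Each regular file becomes one line in the "folded stack" format accepted by flamegraph.pl, with path components separated by semicolons followed by the size, so that directories are never counted on their own.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively appends one line per regular file to `out`:
/// path components joined by ';', then a space and the size in bytes.
fn collect_samples(dir: &Path, prefix: &str, out: &mut Vec<String>) -> io::Result<()> {
    let mut entries = fs::read_dir(dir)?.collect::<io::Result<Vec<_>>>()?;
    // Sort by name so that the output is deterministic.
    entries.sort_by_key(|e| e.file_name());
    for entry in entries {
        // ';' is the frame separator, so it must not appear in a name.
        let name = entry.file_name().to_string_lossy().replace(';', "_");
        let frame = if prefix.is_empty() {
            name
        } else {
            format!("{prefix};{name}")
        };
        // Use symlink_metadata so that symbolic links are not followed.
        let meta = fs::symlink_metadata(entry.path())?;
        if meta.is_dir() {
            collect_samples(&entry.path(), &frame, out)?;
        } else if meta.is_file() {
            out.push(format!("{frame} {}", meta.len()));
        }
    }
    Ok(())
}

fn main() -> io::Result<()> {
    // Directory to scan: first command-line argument, or the current directory.
    let root = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    let mut samples = Vec::new();
    collect_samples(Path::new(&root), "", &mut samples)?;
    for line in &samples {
        println!("{line}");
    }
    Ok(())
}
```

The output can be redirected to a file and passed to flamegraph.pl in place of du.samples in the commands above.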

Runtime based on CPUs and RAM

The next aspect is CPU and RAM usage. As it turns out, both are related: the build script lets you configure the number of jobs running in parallel, and the more jobs there are, the higher the peak memory usage. Interestingly, if jobs run out of memory, the build process doesn’t necessarily crash but can seem to hang forever, with a ridiculous disk I/O speed in the background (on my machine, htop reported ~100 MB/s for each thread).

I’ve simulated a few scenarios of CPUs/RAM combinations with Docker’s --cpus and --memory parameters, to see how it affected the overall build time. A nice recent feature is that Rust will automatically adjust the number of jobs to the number of CPUs available to Docker, rather than the total physical CPUs of the machine (rust-lang/rust/97925, rust-lang/rust/100586).
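
A minimal sketch of how a program can query this CPU count, assuming that the job count comes from the standard library's std::thread::available_parallelism function (the helper name default_jobs is made up for illustration):

```rust
use std::thread;

/// Hypothetical helper: picks a default number of parallel jobs.
/// Falls back to a single job if the parallelism cannot be determined.
fn default_jobs() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
}

fn main() {
    println!("default number of jobs: {}", default_jobs());
}
```

Running this inside a container started with a given --cpus value is a quick way to check what the build would see on a given setup.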

Another nice feature is that you can profile the compiler, and in particular obtain a time-based break-down of the dependency graph as an HTML page with CARGOFLAGS="--timings" ./x.py build. The tables below contain links to these profiling reports.

Firstly, here is an overview of the timings for each phase (--stage 0 and --stage 1), depending on the number of CPUs. For each case, I’m reporting the minimum required RAM that I measured: providing less than that would either cause an out-of-memory crash, or some hanging with very high disk I/O.

CPUs   Minimum RAM   stage 0 (total)   stage 1 (total)
1      2 GB          3:46              31:23
2      2 GB          3:12              20:44
4      3 GB          2:45              15:09
8      5 GB          2:44              13:08

In detail, the break-down per build step is as follows (times in minutes:seconds).

CPUs / RAM   rustbuild   0-std   0-rustc
1 / 2 GB     2:00        1:06    21:03
2 / 2 GB     1:43        0:46    13:17
4 / 3 GB     1:34        0:36    9:07
8 / 5 GB     1:37        0:32    7:40

CPUs / RAM   1-std host   1-std aarch64   1-std armv7   1-std i686   1-std x86_64   rustdoc
1 / 2 GB     1:18         1:17            1:13          1:14         1:13           3:46
2 / 2 GB     0:57         1:01            0:54          1:00         0:56           2:24
4 / 3 GB     0:49         0:49            0:47          0:50         0:51           1:40
8 / 5 GB     0:46         0:46            0:44          0:45         0:45           1:27

Overall, with 4 CPUs or more, you can now expect to build a stage 1 Rust compiler in under 20 minutes. This seems faster than what the official documentation suggests.
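
As a quick sanity check of that figure, the following sketch adds up the two "total" columns of the first table for each configuration. It assumes that the stage 0 and stage 1 totals together account for the whole build; the helper names parse_mmss and format_mmss are made up for illustration.

```rust
/// Parses a "minutes:seconds" duration into a number of seconds.
fn parse_mmss(s: &str) -> Option<u32> {
    let (minutes, seconds) = s.split_once(':')?;
    Some(minutes.parse::<u32>().ok()? * 60 + seconds.parse::<u32>().ok()?)
}

/// Formats a number of seconds back into "minutes:seconds".
fn format_mmss(total: u32) -> String {
    format!("{}:{:02}", total / 60, total % 60)
}

fn main() {
    // (CPUs, stage 0 total, stage 1 total), copied from the first table.
    let rows = [
        (1, "3:46", "31:23"),
        (2, "3:12", "20:44"),
        (4, "2:45", "15:09"),
        (8, "2:44", "13:08"),
    ];
    for (cpus, stage0, stage1) in rows {
        let total = parse_mmss(stage0).unwrap() + parse_mmss(stage1).unwrap();
        println!("{cpus} CPUs: {} in total", format_mmss(total));
    }
}
```

With these numbers, the 4-CPU configuration adds up to 17:54 and the 8-CPU one to 15:52, while 1 and 2 CPUs end up above the 20-minute mark (35:09 and 23:56).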

Conclusion

We’ve seen that detecting CPU features is more complex on ARM than on Intel. Indeed, contrary to Intel’s cpuid instruction, ARM’s mrs instruction for reading the feature registers is only available in privileged mode. This means that specific support must be provided by the operating system.

On Android, we can rely on the same mechanisms as Linux, but Rust was missing support for it because Android has its own target_os definition (despite being arguably a specialization of Linux).
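
As a reminder of what this Linux mechanism exposes, here is a minimal sketch that decodes an AArch64 HWCAP bitmask, such as the 00000000000000ff value reported by getauxval in the 64-bit output earlier, into feature names. The bit order is written from my recollection of the Linux kernel's arm64 hwcap definitions and should be treated as an assumption to double-check; decode_hwcap is a made-up helper, and real code would obtain the mask from getauxval(AT_HWCAP) rather than hard-coding it.

```rust
/// Names of the 8 lowest AArch64 HWCAP bits, from bit 0 upwards.
/// Assumption: this order matches the Linux kernel's arm64 hwcap definitions.
const AARCH64_HWCAP_NAMES: [&str; 8] = [
    "fp", "asimd", "evtstrm", "aes", "pmull", "sha1", "sha2", "crc32",
];

/// Hypothetical helper: lists the names of the bits set in `hwcap`.
/// Bits above the 8 known ones are ignored in this sketch.
fn decode_hwcap(hwcap: u64) -> Vec<&'static str> {
    AARCH64_HWCAP_NAMES
        .iter()
        .enumerate()
        .filter(|&(bit, _)| hwcap & (1u64 << bit) != 0)
        .map(|(_, name)| *name)
        .collect()
}

fn main() {
    // Value reported by getauxval on the 64-bit test run.
    let hwcap = 0x0000_0000_0000_00ff_u64;
    println!("{:?}", decode_hwcap(hwcap));
}
```

With all 8 low bits set, the decoded names are the same 8 features that /proc/cpuinfo listed on that device.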

Building a patched version of the Rust compiler has proven easier than I expected, taking between 15 and 20 minutes on a laptop with reasonable resources. The important part was correctly configuring the Android NDK path for each target.

You can follow progress on my pull request to the Rust compiler. In the meantime, my code is reproducible on GitHub.

In the next blog post, we’ll see how to use in practice the SIMD instructions that we’ve detected.




