In a previous blog post, I described some benchmarks I wrote for a program written in Rust. While presenting the results, I mentioned a strange behavior: things that should have been very fast (a few nanoseconds) were reported as instantaneous. I wrote that it was probably fine given that the bigger benchmarks (above 10 nanoseconds) were seemingly working well, following the expected asymptotic behavior. However, this prompted me to dig deeper to understand what was going on under the hood.

In this blog post, I’ll present my findings, including some misconceptions I had about the std::hint::black_box function, and how to use this hint function properly to benchmark your code as faithfully as possible.


What’s behind the #[bench] interface?

The first tool to get started with benchmarking in Rust is the #[bench] attribute, built into the language and documented under the test unstable feature. From this documentation, we learn that writing a Rust benchmark should be as simple as the following.

#![feature(test)]

extern crate test;

pub fn add_two(a: i32) -> i32 {
    a + 2
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;

    #[bench]
    fn bench_add_two(b: &mut Bencher) {
        b.iter(|| add_two(2));
    }
}

Besides the #[bench] attribute, the most important part is the Bencher API, and notably its iter() function, which takes as input the computation that we want to benchmark. The underlying implementation is a bit complicated, notably with code to aim for accurate clock measurements, statistics to keep running the benchmark until convergence is reached, etc. However, the gist of it is contained in the (private) ns_iter_inner function.

fn ns_iter_inner<T, F>(inner: &mut F, k: u64) -> u64
where
    F: FnMut() -> T,
{
    let start = Instant::now();
    for _ in 0..k {
        black_box(inner());
    }
    start.elapsed().as_nanos() as u64
}

Essentially, we run k iterations of the function being benchmarked, and measure the time taken by this loop.

But there is another detail: the black_box function is applied to the inner function being benchmarked. Or more precisely, it is applied to the result of the benchmarked function.

So what is there behind this black_box function? It is defined as std::hint::black_box, with the following description.

An identity function that hints to the compiler to be maximally pessimistic about what black_box could do.

Unlike std::convert::identity, a Rust compiler is encouraged to assume that black_box can use dummy in any possible valid way that Rust code is allowed to without introducing undefined behavior in the calling code. This property makes black_box useful for writing code in which certain optimizations are not desired, such as benchmarks.

Note however, that black_box is only (and can only be) provided on a “best-effort” basis. The extent to which it can block optimizations may vary depending upon the platform and code-gen backend used. Programs cannot rely on black_box for correctness in any way.

So this black_box function has been designed specifically for benchmarks, with a best-effort caveat. It is implemented as follows.

#[inline]
#[unstable(feature = "bench_black_box", issue = "64102")]
#[cfg_attr(not(bootstrap), allow(unused_mut))]
#[cfg_attr(bootstrap, allow(deprecated))]
pub fn black_box<T>(mut dummy: T) -> T {
    #[cfg(bootstrap)]
    // SAFETY: the inline assembly is a no-op.
    unsafe {
        llvm_asm!("" : : "r"(&mut dummy) : "memory" : "volatile");
        dummy
    }

    #[cfg(not(bootstrap))]
    {
        crate::intrinsics::black_box(dummy)
    }
}

We can see that in “bootstrapping” mode, there is some assembly that takes the dummy object passed to the black box, and marks the memory for this object as “volatile”, to prevent the compiler from optimizing based on its value. In “real” mode, it defers to the black_box intrinsic, defined directly in the compiler (and looks similar to the bootstrapping mode as far as I can tell).

The black_box function actually has its own GitHub issue. An interesting comment from this issue is an alternate implementation of a “better” black box.

pub fn better_black_box(mut x: u64) -> u64 {
    unsafe { asm!("/* {x} */", x = inout(reg) x); }
    x
}

There is also an “even better” proposal.

    asm!("/* {x} */", x = inout(reg) int_x, options(nostack));

In the latest Rust version, the bootstrapping part has been removed, so std::hint::black_box is just calling the intrinsic.

#[inline]
#[unstable(feature = "bench_black_box", issue = "64102")]
pub fn black_box<T>(dummy: T) -> T {
    crate::intrinsics::black_box(dummy)
}

Alternate benchmarking frameworks

One thing of interest is that there are alternate benchmarking frameworks for Rust, besides the standard Bencher implementation.

The most popular one is the criterion library. Its API is very similar to the standard library’s, with its own Bencher struct. The Bencher::iter function is essentially the base ns_iter_inner function mentioned above (without all the statistics around it).

#[inline(never)]
pub fn iter<O, R>(&mut self, mut routine: R)
where
    R: FnMut() -> O,
{
    self.iterated = true;
    let time_start = Instant::now();
    let start = self.measurement.start();
    for _ in 0..self.iters {
        black_box(routine());
    }
    self.value = self.measurement.end(start);
    self.elapsed_time = time_start.elapsed();
}

Its black_box implementation is however slightly different than the standard library’s (no intrinsic here).

pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = std::ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

Effect of std::hint::black_box on a benchmarking loop

Now that we’ve seen that benchmark measurements boil down to running a loop of the benchmarked function with the black_box applied to it, let’s take a step back and check the assembly output of various loops.

For this, I’ve used the awesome Godbolt Compiler Explorer tool, setting the programming language to “Rust”, the compiler to “rustc nightly” and compiler options to -C opt-level=3 to enable optimizations (otherwise, non-optimized code is not a faithful benchmark). I’ll also report the assembly outputs (for x86_64) here. I wouldn’t consider myself an assembly expert, but this tool is an amazing way to learn some bits of assembly by example. Notably, it has documentation tooltips that appear when hovering over assembly instructions!

So, let’s begin with a simple snippet of code, trying to benchmark the “addition” function, but without the black_box.

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        let x = a + b;
    }
}

As you may imagine, given that none of the variables in the function are used anywhere, the compiler can optimize everything as “dead code”, and reduce the function to a return instruction (Compiler Explorer).

bench_add:
        ; Just return, nothing else!
        ret

Let’s now try to wrap the addition in the black_box function, as the benchmarking frameworks do.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        let x = black_box(a + b);
    }
}

Here we can see that the function is not completely optimized anymore, so the black_box does have an effect (Compiler Explorer)!

bench_add:
        sub     rsp, 4              ; Allocate 4 bytes on the stack.
        mov     eax, 1024           ; Set the loop counter.
        mov     rcx, rsp            ; No idea what this is for??
; Start of the loop.
.LBB0_1:
        mov     dword ptr [rsp], 3  ; Write the result on the stack.
        add     eax, -1             ; Decrement the loop counter.
        jne     .LBB0_1             ; Back to the start of the loop.
        add     rsp, 4              ; Deallocate 4 bytes on the stack.
        ret                         ; Return.

Let’s try to decompose the assembly.

  • The first and last instructions respectively increment and decrement the stack pointer rsp by 4 bytes.
  • The second instruction sets a loop counter to 1024 in the eax register. You can see that later on, another instruction decrements eax by 1, followed by a conditional jump to the .LBB0_1 label. Together, this part represents our loop of 1024 iterations.
  • I have no idea what the mov rcx, rsp instruction is for. The rcx register is not read anywhere after that.
  • Last, the body of the loop is a single instruction, mov dword ptr [rsp], 3, which moves the value 3 into the address contained in the rsp register, i.e. in the memory slot that was created on the stack.

As you can see, there is no addition being benchmarked here! The result of a + b is directly optimized by the compiler as being equal to 3, and all that black_box does is to force it to be written on the stack.

So what can we do? If we go back to the #[bench] documentation, some optimization gotchas are mentioned, along with recommendation.

The first remark in the gotchas is that the function being benchmarked should actually return a useful result (which will be passed to the black_box function under the hood). So far, we’ve already done that, but let’s see what happens if we simulate benchmarking a function that returns nothing (i.e. ()), for example the function || { let x = a + b; } (with the semi-colon ; at the end).

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        let x = a + b;
        black_box(());
    }
}

In this case, we can see that indeed the compiler didn’t even bother writing the result of the addition on the stack (Compiler Explorer). However, the loop structure is still present, so the black_box acts as a “barrier” preventing to optimize the code “surrounding” it.

; Just an empty loop with 1024 iterations.
bench_add:
        mov     eax, 1024
.LBB0_1:
        add     eax, -1
        jne     .LBB0_1
        ret

Interestingly, a loop that does nothing but invoking black_box(()) is optimized down to the same assembly (Compiler Explorer).

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_nothing() {
    for _ in 0..1024 {
        black_box(());
    }
}
bench_nothing:
        mov     eax, 1024
.LBB0_1:
        add     eax, -1
        jne     .LBB0_1
        ret

Another possibility mentioned in the optimization gotchas is to put a black_box function around the input(s). Let’s try to do that.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        let x = black_box(a) + black_box(b);
    }
}

Here we can see that the inputs 1 and 2 are written on the stack on every iteration, however no addition is computed from them (Compiler Explorer)!

bench_add:
        ; Allocate 8 bytes on the stack, and prepare the loop.
        push    rax
        mov     eax, 1024
        ; No idea what these are for??
        mov     rcx, rsp
        lea     rdx, [rsp + 4]
.LBB0_1:
        ; Write the inputs on the stack.
        mov     dword ptr [rsp], 1
        mov     dword ptr [rsp + 4], 2
        ; Loop iteration.
        add     eax, -1
        jne     .LBB0_1
        ; Deallocate 8 bytes on the stack.
        pop     rax
        ret

Therefore, a crucial point is to put a black_box both on the inputs and on the outputs.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        black_box(black_box(a) + black_box(b));
    }
}

In this case, we finally see an addition being computed on each iteration (Compiler Explorer).

bench_add:
        ; Prepare stack and loop.
        sub     rsp, 12
        mov     eax, 1024
        ; Still don't know what's the purpose of these.
        mov     rcx, rsp
        lea     rdx, [rsp + 4]
        lea     rsi, [rsp + 8]
.LBB0_1:
        ; Write the inputs on the stack.
        mov     dword ptr [rsp], 1
        mov     edi, dword ptr [rsp]
        mov     dword ptr [rsp + 4], 2
        ; Benchmark the function!
        add     edi, dword ptr [rsp + 4]
        ; Write the result on the stack.
        mov     dword ptr [rsp + 8], edi
        ; Loop iteration.
        add     eax, -1
        jne     .LBB0_1
        ; Deallocate the stack.
        add     rsp, 12
        ret

There are also overhead instructions to write the inputs and outputs on the stack, but that can be mitigated by using the “better” black box implementation for example.

use std::arch::asm;

fn better_black_box(mut x: u64) -> u64 {
    unsafe { asm!("/* {x} */", x = inout(reg) x, options(nostack)); }
    x
}

pub fn bench_add() {
    let a = 1;
    let b = 2;
    for _ in 0..1024 {
        let x = better_black_box(
            better_black_box(a) + better_black_box(b)
        );
    }
}

At this point, there isn’t a lot left to optimize or de-optimize (Compiler Explorer). Each iteration of the loop sets the inputs into registers, computes the addition, and updates the loop counter.

bench_add:
        ; No stack allocation, just prepare the loop.
        mov     eax, 1024
.LBB0_1:
        ; Write the inputs into registers.
        mov     ecx, 1
        mov     edx, 2
        ; Benchmark the function!
        add     rdx, rcx
        ; Loop iteration.
        add     eax, -1
        jne     .LBB0_1
        ; Just return after the loop.
        ret

Application to real benchmarks

I hope that exploring this example of benchmarking a simple addition convinced you of the effect of using the black_box both on inputs and outputs. However, you may wonder whether this simplified example really generalized to real-world benchmarks. So let’s now explore some real benchmarks found in the standard library. I’ll then finish with my original benchmarks that prompted me to write this blog post.

Example 1: char::to_digit

The first example I want to consider is the char::to_digit method. This converts a character into a number in the given base (a.k.a. radix), for example '5' is converted to Some(5) in base 10, 'b' is converted to Some(11) in base 16. However, 'b' is converted to None in base 10, because this hexadecimal character is not a valid digit in base 10.

pub fn to_digit(self, radix: u32) -> Option<u32>

Interestingly, there are some benchmarks written for this method, which boil down to the following loops.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_to_digit_radix_2() {
    let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
    for _ in 0..1024 {
        black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(2)).min());
    }
}

pub fn bench_to_digit_radix_16() {
    let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
    for _ in 0..1024 {
        black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(16)).min());
    }
}

pub fn bench_to_digit_radix_36() {
    let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
    for _ in 0..1024 {
        black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(36)).min());
    }
}

The intent of these benchmarks was seemingly to call the function enough times so that the benchmarks don’t show a result that’s too fast. However, when compiling them with -C opt-level=3, we obtain the following assembly (Compiler Explorer).

bench_to_digit_radix_2:
        ; Prepare the stack.
        push    rax
        xor     eax, eax
        ; When interpreted as [u32; 2] (in little endian), this is equal to [0, 0x48].
        ; This is a representation of None as an Option<u32>.
        movabs  rcx, 309237645312 ; 0x4800000000
        mov     rdx, rsp
; Main loop.
.LBB0_4:
        ; Increment main loop counter.
        add     eax, 1
        ; Inner loop counter.
        mov     esi, 9981
; Inner loop.
.LBB0_2:
        lea     rdi, [rsi - 9]
        cmp     rsi, 8
        mov     rsi, rdi
        ja      .LBB0_2
        ; Write the result on the stack.
        mov     qword ptr [rsp], rcx
        ; Main loop iteration.
        cmp     eax, 1024
        jne     .LBB0_4
        ; Deallocate the stack.
        pop     rax
        ret

As you can see, the benchmark doesn’t run the to_digit function 10’000 times per iteration at all! The result (None) is pre-computed in the beginning, and the black_box only makes sure to copy this result on the stack, like we saw with the addition benchmark.

In the end, this benchmark really measures approximately 10’000 iterations of an empty loop (it’s unclear to me why the loop counter isn’t initialized to exactly 10’000 but to 9’981 instead), which the compiler somehow didn’t manage to optimize away.

The benchmarks for other radices are very similar, they only differ by the result itself.

bench_to_digit_radix_16:
        push    rax
        xor     eax, eax
        ; [0, 0x21] = None
        movabs  rcx, 141733920768 ; 0x2100000000
        mov     rdx, rsp
.LBB1_4:
        add     eax, 1
        mov     esi, 9981
.LBB1_2:
        lea     rdi, [rsi - 9]
        cmp     rsi, 8
        mov     rsi, rdi
        ja      .LBB1_2
        ; Copy the result on the stack.
        mov     qword ptr [rsp], rcx
        cmp     eax, 1024
        jne     .LBB1_4
        pop     rax
        ret

bench_to_digit_radix_36:
        push    rax
        xor     eax, eax
        mov     rcx, rsp
.LBB2_4:
        add     eax, 1
        mov     edx, 9981
.LBB2_2:
        lea     rsi, [rdx - 9]
        cmp     rdx, 8
        mov     rdx, rsi
        ja      .LBB2_2
        ; Write the result directly on the stack.
        ; [1, 0] = Some(0)
        mov     qword ptr [rsp], 1
        cmp     eax, 1024
        jne     .LBB2_4
        pop     rax
        ret

On the contrary, adding just a black_box for the input character of to_digit produces more than 1000 lines of assembly!

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_to_digit_radix_2() {
    let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
    for _ in 0..1024 {
        black_box(chars.iter().cycle().take(10_000).map(|c| black_box(c).to_digit(2)).min());
    }
}

Loop over array indices or direct iteration?

Another curiosity I found while playing with this example is how the following two variants of the same function were compiled.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_to_digits_indexed() {
    let chars = ['0', '1', '2', '4'];
    for _ in 0..1024 {
        for i in 0..4 {
            black_box(chars[i].to_digit(8));
        }
    }
}

pub fn bench_to_digits_forin() {
    let chars = ['0', '1', '2', '4'];
    for _ in 0..1024 {
        for c in chars {
            black_box(c.to_digit(8));
        }
    }
}

The bench_to_digits_indexed function uses an index-based for loop to go through the chars array, and the resulting assembly is as optimized as one would expect (Compiler Explorer).

bench_to_digits_indexed:
        push    rax
        mov     eax, 1024
        mov     rcx, rsp
        movabs  rdx, 4294967297  ; 0x100000001
        movabs  rsi, 8589934593  ; 0x200000001
        movabs  rdi, 17179869185 ; 0x400000001
.LBB0_1:
        mov     qword ptr [rsp], 1   ; Some(0)
        mov     qword ptr [rsp], rdx ; Some(1)
        mov     qword ptr [rsp], rsi ; Some(2)
        mov     qword ptr [rsp], rdi ; Some(4)
        add     eax, -1
        jne     .LBB0_1
        pop     rax
        ret

On the other hand, the bench_to_digits_forin function with a loop iterating directly into the array appears much less optimized, despite using the highest optimization level -C opt-level=3 (Compiler Explorer).

.LCPI0_0:
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   4
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
        .byte   0
bench_to_digits_forin:
        sub     rsp, 56
        mov     eax, 1024
        movaps  xmm0, xmmword ptr [rip + .LCPI0_0]
        movabs  r9, 223338299442
        movabs  r10, 210453397552
        mov     esi, 48
        mov     cl, 1
        add     esi, -48
        lea     r8, [rsp + 8]
        jmp     .LBB0_1
.LBB0_2:
        add     eax, -1
        je      .LBB0_3
.LBB0_1:
        movaps  xmmword ptr [rsp + 16], xmm0
        mov     qword ptr [rsp + 40], r9
        mov     qword ptr [rsp + 32], r10
        test    cl, cl
        je      .LBB0_2
        xor     edi, edi
        cmp     esi, 8
        setb    dil
        mov     dword ptr [rsp + 8], edi
        mov     dword ptr [rsp + 12], esi
        mov     edi, dword ptr [rsp + 36]
        cmp     edi, 1114112
        je      .LBB0_2
        add     edi, -48
        xor     edx, edx
        cmp     edi, 8
        setb    dl
        mov     dword ptr [rsp + 8], edx
        mov     dword ptr [rsp + 12], edi
        mov     edi, dword ptr [rsp + 40]
        cmp     edi, 1114112
        je      .LBB0_2
        add     edi, -48
        xor     edx, edx
        cmp     edi, 8
        setb    dl
        mov     dword ptr [rsp + 8], edx
        mov     dword ptr [rsp + 12], edi
        mov     edi, dword ptr [rsp + 44]
        cmp     edi, 1114112
        je      .LBB0_2
        add     edi, -48
        xor     edx, edx
        cmp     edi, 8
        setb    dl
        mov     dword ptr [rsp + 8], edx
        mov     dword ptr [rsp + 12], edi
        jmp     .LBB0_2
.LBB0_3:
        add     rsp, 56
        ret

This “less optimized” assembly appears very similar to adding an input black_box to the “indexed” function (Compiler Explorer).

pub fn bench_to_digits_indexed() {
    let chars = ['0', '1', '2', '4'];
    for _ in 0..1024 {
        for i in 0..4 {
            black_box(black_box(chars[i]).to_digit(8));
        }
    }
}

Example 2: str::parse::<f64>

My second example is the str::parse::<f64> method, implemented by f64::from_str to convert a string into a floating-point number. This function is benchmarked here in the standard library, again without any black_box on the input.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_parse_f64() {
    for _ in 0..1024 {
        black_box("0.0".parse::<f64>());
    }
}

A smart enough compiler could directly optimize the expression "0.0".parse::<f64>() into its result Ok(0f64). However, the actual assembly is more than 400 lines, so this time I’d say that the benchmarks are “lucky” that the compiler didn’t optimize everything.

I don’t have much more to comment on this example, other than the fact that until such expressions are fully optimized by the compiler, there will be work left to improve compilers :) Of course, there’s also a trade-off to make between how far optimizations go and how long the code takes to compile.

Application to my benchmarks: arithmetic in GF(2n)\mathrm{GF}(2^n)

Let’s now go back to the original benchmarks from my previous post, which showed “zero nanoseconds” results. In these benchmarks, I tried to measure multiplication algorithms in finite fields of the form GF(2n)\mathrm{GF}(2^n) for various values of nn. I’ll leave the mathematical details to the previous post, but let’s now focus on the assembly produced by the two “instantaneous” benchmarks, which were for n=8n = 8 and n=128n = 128.

Arithmetic in GF(28)\mathrm{GF}(2^8)

After simplifying the code by removing all the const generic magic from my original code, the first case (n=8n = 8) looks like the following.

#[derive(Clone, Copy)]
struct GF8(u8);

impl GF8 {
    fn mul(mut self, other: &Self) -> Self {
        let mut result = Self(0);
        for i in 0..8 {
            if other.0 & (1 << i) != 0 {
                result.add_assign(&self);
            }
            self.shl1();
        }
        result
    }

    fn shl1(&mut self) {
        let d = self.0;
        self.0 = d << 1;
        let carry = d >> 7;
        if carry != 0 {
            self.0 ^= 1 ^ (1 << 4) ^ (1 << 3) ^ (1 << 1);
        }
    }

    fn add_assign(&mut self, other: &Self) {
        self.0 ^= other.0;
    }
}

Let’s consider the following two simplified benchmarking scenarios: the first one with a simple black_box surrounding the multiplication’s result, and the second one with black_box applied both on inputs and outputs.

#![feature(bench_black_box)]

use std::hint::black_box;

// Gist of my original benchmark.
pub fn bench_gf8_mul_box() {
    let test_value = GF8(!0);
    let x = test_value;
    let y = test_value;
    for _ in 0..1024 {
        black_box(x.mul(&y));
    }
}

// Improved benchmark.
pub fn bench_gf8_mul_2box() {
    let test_value = GF8(!0);
    let x = test_value;
    let y = test_value;
    for _ in 0..1024 {
        black_box(black_box(x).mul(&black_box(y)));
    }
}

As you may now expect, the first one is simply reduced to a loop that writes the multiplication result directly to the stack (Compiler Explorer). It’s then unsurprising that this loop of 3 instructions takes (on average) less than 1 nanosecond per iteration to execute, hence the benchmark result I obtained in my previous post!

bench_gf8_mul_box:
        sub     rsp, 1
        mov     eax, 1024
        mov     rcx, rsp
.LBB0_1:
        ; The compiler simplified everything to the result!
        mov     byte ptr [rsp], 19
        add     eax, -1
        jne     .LBB0_1
        add     rsp, 1
        ret

On the other hand, the benchmarking loop with a black_box on both sides expands to the full multiplication algorithm (Compiler Explorer).

bench_gf8_mul_2box:
        ; Prepare the loop and the stack.
        push    rbx
        sub     rsp, 3
        mov     r11d, 1024
        mov     r8, rsp
        lea     r9, [rsp + 1]
        lea     r10, [rsp + 2]
.LBB1_1:
        ; Write the inputs on the stack.
        mov     byte ptr [rsp], -1
        mov     byte ptr [rsp + 1], -1
        ; Load the inputs from the stack.
        movzx   ecx, byte ptr [rsp]
        movzx   edi, byte ptr [rsp + 1]
        ; Compute the multiplication.
        mov     edx, edi
        and     dl, 1
; Many more instructions
; ...
        and     dil, al
        xor     dil, dl
        ; Write the result on the stack.
        mov     byte ptr [rsp + 2], dil
        ; Loop iteration.
        add     r11d, -1
        jne     .LBB1_1
        ; Deallocate the stack.
        add     rsp, 3
        pop     rbx
        ret

Arithmetic in GF(2128)\mathrm{GF}(2^{128})

The second case was for n=128n = 128, with a substantially more complex code using specialized SIMD instructions. Therefore, I compiled it with -C opt-level=3 -C target-cpu=native to enable the relevant CPU features.

#![feature(bench_black_box)]

use std::hint::black_box;

pub fn bench_gf128_mul_box() {
    let test_value = GF128([!0; 2]);
    let x = test_value;
    let y = test_value;
    for _ in 0..1024 {
        black_box(GF128::mul(&x, &y));
    }
}

pub fn bench_gf128_mul_2box() {
    let test_value = GF128([!0; 2]);
    let x = test_value;
    let y = test_value;
    for _ in 0..1024 {
        black_box(GF128::mul(&black_box(x), &black_box(y)));
    }
}

#[derive(Clone, Copy)]
struct GF128([u64; 2]);

const A: usize = 7;
const B: usize = 2;
const C: usize = 1;

impl GF128 {
    #[cfg(all(
        target_arch = "x86_64",
        target_feature = "sse2",
        target_feature = "pclmulqdq"
    ))]
    fn mul(x: &Self, y: &Self) -> Self {
        use std::arch::x86_64::{__m128i, _mm_clmulepi64_si128, _mm_set_epi64x, _mm_storeu_si128};

        let mut words = [0u64; 2];
        let mut carry = [0u64; 2];

        for i in 0..2 {
            // Safety: target_feature "sse2" is available in this function.
            let xi: __m128i = unsafe { _mm_set_epi64x(0, x.0[i] as i64) };
            for j in 0..2 {
                // Safety: target_feature "sse2" is available in this function.
                let yj: __m128i = unsafe { _mm_set_epi64x(0, y.0[j] as i64) };
                // Safety: target_feature "pclmulqdq" is available in this function.
                let clmul: __m128i = unsafe { _mm_clmulepi64_si128(xi, yj, 0) };
                let mut cc: [u64; 2] = [0u64, 0u64];
                // Safety:
                // - target_feature "sse2" is available in this function,
                // - cc points to 128 bits (no alignment required by this function).
                unsafe { _mm_storeu_si128(&mut cc as *mut _ as *mut __m128i, clmul) };

                let ij = i + j;
                if ij < 2 {
                    words[ij] ^= cc[0];
                } else {
                    carry[ij - 2] ^= cc[0];
                }

                let ij1 = ij + 1;
                if ij1 < 2 {
                    words[ij1] ^= cc[1];
                } else {
                    carry[ij1 - 2] ^= cc[1];
                }
            }
        }

        Self::propagate_carries(words, carry)
    }

    fn propagate_carries(mut words: [u64; 2], carry: [u64; 2]) -> Self {
        for i in 0..2 {
            let c = carry[i];
            words[i] ^= c ^ (c << A) ^ (c << B) ^ (c << C);
            if i + 1 < 2 {
                words[i + 1] ^=
                    (c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
            } else {
                let c = (c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
                words[0] ^= c ^ (c << A) ^ (c << B) ^ (c << C);
                words[1] ^=
                    (c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
            }
        }

        Self(words)
    }

    fn add_assign(&mut self, other: &Self) {
        for i in 0..2 {
            self.0[i] ^= other.0[i];
        }
    }
}

Here, the case of a single black_box applied to the output is interesting (Compiler Explorer). Although the compiler didn’t manage to optimize away all the math, it still recognized that the same value is computed over and over again, and therefore extracted out the computation before the loop! The loop itself is only copying the previously computed result to the stack, and is therefore very fast.

bench_gf128_mul_box:
        ; Allocate 16 bytes (128 bits) on the stack.
        sub     rsp, 16
        ; Compute the multiplication, storing the result in rdx & rax.
        vpcmpeqd        xmm0, xmm0, xmm0
        vpclmulqdq      xmm0, xmm0, xmm0, 0
        vmovq   rcx, xmm0
        vpextrq rax, xmm0, 1
        mov     rdx, rcx
        shl     rdx, 7
; Many more instructions
; ...
        xor     rax, rdi
        xor     rax, rcx
        ; Prepare the loop.
        mov     ecx, 1024
        mov     rsi, rsp
.LBB0_1:
        ; Write the multiplication result on the stack.
        mov     qword ptr [rsp], rdx
        mov     qword ptr [rsp + 8], rax
        ; Loop iteration.
        dec     ecx
        jne     .LBB0_1
        ; Deallocate the stack.
        add     rsp, 16
        ret

The full benchmark with black_boxes on both sides keeps the full computation inside the loop.

bench_gf128_mul_2box:
        ; Prepare the stack.
        push    r15
        push    r14
        push    rbx
        sub     rsp, 16
        ; Prepare the loop.
        mov     r11d, 1024
        mov     r8, rsp
.LBB1_1:
        ; Load the first input.
        mov     qword ptr [rsp + 8], -1
        mov     qword ptr [rsp], -1
        vmovdqu xmm0, xmmword ptr [rsp]
        vmovq   xmm3, qword ptr [rsp + 8]
        ; Load the second input.
        mov     qword ptr [rsp + 8], -1
        mov     qword ptr [rsp], -1
        vmovdqu xmm1, xmmword ptr [rsp]
        vmovq   xmm4, qword ptr [rsp + 8]
        ; Compute the multiplication.
        vpclmulqdq      xmm2, xmm0, xmm4, 0
        vpclmulqdq      xmm4, xmm3, xmm4, 0
        vpclmulqdq      xmm3, xmm3, xmm1, 0
        vpxor   xmm5, xmm3, xmm2
        vpshufd xmm5, xmm5, 238
; Many more instructions
; ...
        xor     rbx, r14
        xor     rbx, rdx
        ; Write the result on the stack.
        mov     qword ptr [rsp + 8], rbx
        mov     qword ptr [rsp], rax
        ; Loop iteration.
        dec     r11d
        jne     .LBB1_1
        ; Deallocate the stack.
        add     rsp, 16
        pop     rbx
        pop     r14
        pop     r15
        ret

Updated benchmarks

I have now updated my code to include a black_box both on inputs and outputs, and recomputed my benchmarks. Here are the updated results and a comparison with my previous blog post.

The first graph was showing micro-benchmarks of the field operations (multiplication and inversion). Below is the updated graph with the black_box applied on inputs. I’ve overlaid some dashed lines that represent the previous benchmarks.

Benchmarks for field operations

We can see that fixing the black_box affects mostly the fastest benchmarks, those running under 20 nanoseconds. The effects described in the previous section for GF(28)\mathrm{GF}(2^8) and GF(2128)\mathrm{GF}(2^{128}) are clearly visible.

However, I haven’t put the dashed lines for the slower benchmarks, as there was no noticeable effect beyond noise. My hypothesis is that the compiler is able to fully optimize only “simple enough” functions, up to a certain number of instructions which happen to run under 20ns in practice. More complex functions – like we’ve seen with std::f64::parse – are too complex for the current compiler to optimize, so the back_box has no effect on their benchmarks in practice.

This effect is visible on the following benchmark, regarding the full secret sharing operations, which all take more than a microsecond to run. This graph is essentially the same as in the previous post, the small differences can likely be attributed to noise.

Benchmarks for Shamir operations (compact shares)

My last benchmark was focused on the clmul optimizations. Here again, apart from the multiplication operation for the smallest fields, the missing black_box didn’t have much effect on the results.

Benchmarks for Shamir operations (clmul algorithm)

Conclusion

Rust’s black_box function is still not stabilized. One reason has been some discussion around its naming, and my opinion is that the “box” part is potentially misleading, because a “box” implies that it encapsulates a computation from its start to its end. The following code can easily be interpreted as putting a box around the function f being benchmarked:

loop {
    black_box(f());
}

when in reality the “box” is only around the output of f, so that code is equivalent to the following.

loop {
    let x = f();
    black_box(x);
}

This makes it clearer that f() can be optimized by the compiler, such as being pre-computed before the loop as we’ve seen in practical examples.

Other names were proposed for black_box, such as using the term “fence”, which I think is a better representation of what this function really does. In order to properly benchmark a function, we need fences for each of its input(s) and output(s), which allows to faithfully benchmark the computation between the input(s) and output(s), i.e. the function.

On the other hand, putting only one fence on the output(s) doesn’t allow to benchmark the computation that leads to it: as we’ve seen, the compiler is free to optimize away all or part of the code until the output.

Another thing to keep in mind is that benchmarking is inherently imperfect, because any measurement mechanism adds overhead, due to the measurement itself (e.g. writing on the stack for the current default black_box implementation). Even calling a function on a clock to get the current time adds overhead.

Now, does this all mean that benchmarking is pointless? I don’t think so, a (micro-)benchmark is a tool among others to help optimizing your code. Another very useful tool is profiling, to understand which part of your code is worth optimizing in practice. Profiling tools range from running perf on a single program on a Linux machine (see this post on my blog), to larger-scale continuous profiling in production with tools like Prodfiler.

Another advice is to try Compiler Explorer to learn assembly by example. By testing various optimization levels you’ll learn a lot about what the compiler does under the hood.

One last thing we’ve seen is that even if Rust is already quite a fast programming language, there is still room for even more optimizations in the compiler, such as being able to fully reduce expressions like "0.0".parse::<f64>() to 0f64 at compile time.


Comments

To react to this blog post please check the Twitter thread and the Reddit thread.


RSS | Mastodon | GitHub | Reddit | Twitter


You may also like

Horcrux: Implementing Shamir's Secret Sharing in Rust (part 2)
STV-rs: Single Transferable Vote implementation in Rust
Testing SIMD instructions on ARM with Rust on Android
Asynchronous streams in Rust (part 1) - Futures, buffering and mysterious compilation error messages
And 29 more posts on this blog!