Why my Rust benchmarks were wrong, or how to correctly use std::hint::black_box?
In a previous blog post, I described some benchmarks I wrote for a program written in Rust. While presenting the results, I mentioned a strange behavior: things that should have been very fast (a few nanoseconds) were reported as instantaneous. I wrote that it was probably fine given that the bigger benchmarks (above 10 nanoseconds) were seemingly working well, following the expected asymptotic behavior. However, this prompted me to dig deeper to understand what was going on under the hood.
In this blog post, I’ll present my findings, including some misconceptions I had about the std::hint::black_box
function, and how to use this hint function properly to benchmark your code as faithfully as possible.
- What’s behind the #[bench] interface?
- Effect of std::hint::black_box on a benchmarking loop
- Application to real benchmarks
- Application to my benchmarks: arithmetic in GF(2^n)
- Conclusion
What’s behind the #[bench] interface?
The first tool to get started with benchmarking in Rust is the #[bench]
attribute, built into the language and documented under the test
unstable feature.
From this documentation, we learn that writing a Rust benchmark should be as simple as the following.
#![feature(test)]
extern crate test;
pub fn add_two(a: i32) -> i32 {
a + 2
}
#[cfg(test)]
mod tests {
use super::*;
use test::Bencher;
#[bench]
fn bench_add_two(b: &mut Bencher) {
b.iter(|| add_two(2));
}
}
Besides the #[bench]
attribute, the most important part is the Bencher
API, and notably its iter()
function, which takes as input the computation that we want to benchmark.
The underlying implementation is a bit complicated, notably with code to aim for accurate clock measurements, statistics to keep running the benchmark until convergence is reached, etc.
However, the gist of it is contained in the (private) ns_iter_inner
function.
fn ns_iter_inner<T, F>(inner: &mut F, k: u64) -> u64
where
F: FnMut() -> T,
{
let start = Instant::now();
for _ in 0..k {
black_box(inner());
}
start.elapsed().as_nanos() as u64
}
Essentially, we run k
iterations of the function being benchmarked, and measure the time taken by this loop.
But there is another detail: the black_box
function is applied to the inner
function being benchmarked.
Or more precisely, it is applied to the result of the benchmarked function.
So what is there behind this black_box
function?
It is defined as std::hint::black_box
, with the following description.
An identity function that hints to the compiler to be maximally pessimistic about what black_box could do.

Unlike std::convert::identity, a Rust compiler is encouraged to assume that black_box can use dummy in any possible valid way that Rust code is allowed to without introducing undefined behavior in the calling code. This property makes black_box useful for writing code in which certain optimizations are not desired, such as benchmarks.

Note however, that black_box is only (and can only be) provided on a “best-effort” basis. The extent to which it can block optimizations may vary depending upon the platform and code-gen backend used. Programs cannot rely on black_box for correctness in any way.
So this black_box
function has been designed specifically for benchmarks, with a best-effort caveat.
It is implemented as follows.
#[inline]
#[unstable(feature = "bench_black_box", issue = "64102")]
#[cfg_attr(not(bootstrap), allow(unused_mut))]
#[cfg_attr(bootstrap, allow(deprecated))]
pub fn black_box<T>(mut dummy: T) -> T {
#[cfg(bootstrap)]
// SAFETY: the inline assembly is a no-op.
unsafe {
llvm_asm!("" : : "r"(&mut dummy) : "memory" : "volatile");
dummy
}
#[cfg(not(bootstrap))]
{
crate::intrinsics::black_box(dummy)
}
}
We can see that in “bootstrapping” mode, there is some assembly that takes the dummy
object passed to the black box, and marks the memory for this object as “volatile”, to prevent the compiler from optimizing based on its value.
In “real” mode, it defers to the black_box intrinsic, defined directly in the compiler (which looks similar to the bootstrapping mode as far as I can tell).
The black_box
function actually has its own GitHub issue.
An interesting comment from this issue is an alternate implementation of a “better” black box.
pub fn better_black_box(mut x: u64) -> u64 {
unsafe { asm!("/* {x} */", x = inout(reg) x); }
x
}
There is also an “even better” proposal.
asm!("/* {x} */", x = inout(reg) int_x, options(nostack));
In the latest Rust version, the bootstrapping part has been removed, so std::hint::black_box is just calling the intrinsic.
#[inline]
#[unstable(feature = "bench_black_box", issue = "64102")]
pub fn black_box<T>(dummy: T) -> T {
    crate::intrinsics::black_box(dummy)
}
Alternate benchmarking frameworks
One thing of interest is that there are alternate benchmarking frameworks for Rust, besides the standard Bencher
implementation.
The most popular one is the criterion
library.
Its API is very similar to the standard library’s, with its own Bencher
struct.
The Bencher::iter
function is essentially the base ns_iter_inner
function mentioned above (without all the statistics around it).
#[inline(never)]
pub fn iter<O, R>(&mut self, mut routine: R)
where
R: FnMut() -> O,
{
self.iterated = true;
let time_start = Instant::now();
let start = self.measurement.start();
for _ in 0..self.iters {
black_box(routine());
}
self.value = self.measurement.end(start);
self.elapsed_time = time_start.elapsed();
}
Its black_box implementation is, however, slightly different from the standard library’s (no intrinsic here).
pub fn black_box<T>(dummy: T) -> T {
unsafe {
let ret = std::ptr::read_volatile(&dummy);
std::mem::forget(dummy);
ret
}
}
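As a side note, this volatile-read trick can be reproduced with the standard library alone. The following standalone copy (my own sketch, not criterion's public API) checks that it still behaves as an identity function:

```rust
use std::ptr;

// Standalone copy of criterion's volatile-read black box. The volatile read
// prevents the compiler from reasoning about the returned value, and
// mem::forget avoids running the destructor of `dummy` a second time.
pub fn black_box<T>(dummy: T) -> T {
    unsafe {
        let ret = ptr::read_volatile(&dummy);
        std::mem::forget(dummy);
        ret
    }
}

fn main() {
    // Whatever optimizations are blocked, the observable value is unchanged.
    assert_eq!(black_box(2 + 2), 4);
    assert_eq!(black_box(String::from("abc")), "abc");
}
```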
Effect of std::hint::black_box on a benchmarking loop
Now that we’ve seen that benchmark measurements boil down to running a loop of the benchmarked function with the black_box
applied to it, let’s take a step back and check the assembly output of various loops.
For this, I’ve used the awesome Godbolt Compiler Explorer tool, setting the programming language to “Rust”, the compiler to “rustc nightly” and compiler options to -C opt-level=3
to enable optimizations (otherwise, non-optimized code is not a faithful benchmark).
I’ll also report the assembly outputs (for x86_64) here.
I wouldn’t consider myself an assembly expert, but this tool is an amazing way to learn some bits of assembly by example.
Notably, it has documentation tooltips that appear when hovering over assembly instructions!
So, let’s begin with a simple snippet of code, trying to benchmark the “addition” function, but without the black_box
.
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
let x = a + b;
}
}
As you may imagine, given that none of the variables in the function are used anywhere, the compiler can optimize everything as “dead code”, and reduce the function to a return instruction (Compiler Explorer).
bench_add:
; Just return, nothing else!
ret
Let’s now try to wrap the addition in the black_box
function, as the benchmarking frameworks do.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
let x = black_box(a + b);
}
}
Here we can see that the function is not completely optimized anymore, so the black_box
does have an effect (Compiler Explorer)!
bench_add:
sub rsp, 4 ; Allocate 4 bytes on the stack.
mov eax, 1024 ; Set the loop counter.
mov rcx, rsp ; No idea what this is for??
; Start of the loop.
.LBB0_1:
mov dword ptr [rsp], 3 ; Write the result on the stack.
add eax, -1 ; Decrement the loop counter.
jne .LBB0_1 ; Back to the start of the loop.
add rsp, 4 ; Deallocate 4 bytes on the stack.
ret ; Return.
Let’s try to decompose the assembly.
- The first and last instructions respectively increment and decrement the stack pointer
rsp
by 4 bytes. - The second instruction sets a loop counter to 1024 in the
eax
register. You can see that later on, another instruction decrementseax
by 1, followed by a conditional jump to the.LBB0_1
label. Together, this part represents our loop of 1024 iterations. - I have no idea what the
mov rcx, rsp
instruction is for. Thercx
register is not read anywhere after that. - Last, the body of the loop is a single instruction,
mov dword ptr [rsp], 3
, which moves the value 3 into the address contained in thersp
register, i.e. in the memory slot that was created on the stack.
As you can see, there is no addition being benchmarked here!
The result of a + b
is directly optimized by the compiler as being equal to 3, and all that black_box
does is to force it to be written on the stack.
So what can we do?
If we go back to the #[bench]
documentation, some optimization gotchas are mentioned, along with recommendations.
The first remark in the gotchas is that the function being benchmarked should actually return a useful result (which will be passed to the black_box
function under the hood).
So far, we’ve already done that, but let’s see what happens if we simulate benchmarking a function that returns nothing (i.e. ()
), for example the function || { let x = a + b; }
(with the semi-colon ;
at the end).
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
let x = a + b;
black_box(());
}
}
In this case, we can see that indeed the compiler didn’t even bother writing the result of the addition on the stack (Compiler Explorer).
However, the loop structure is still present, so the black_box
acts as a “barrier” preventing the compiler from optimizing the code “surrounding” it.
; Just an empty loop with 1024 iterations.
bench_add:
mov eax, 1024
.LBB0_1:
add eax, -1
jne .LBB0_1
ret
Interestingly, a loop that does nothing but invoking black_box(())
is optimized down to the same assembly (Compiler Explorer).
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_nothing() {
for _ in 0..1024 {
black_box(());
}
}
bench_nothing:
mov eax, 1024
.LBB0_1:
add eax, -1
jne .LBB0_1
ret
Another possibility mentioned in the optimization gotchas is to put a black_box
function around the input(s).
Let’s try to do that.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
let x = black_box(a) + black_box(b);
}
}
Here we can see that the inputs 1 and 2 are written on the stack on every iteration; however, no addition is computed from them (Compiler Explorer)!
bench_add:
; Allocate 8 bytes on the stack, and prepare the loop.
push rax
mov eax, 1024
; No idea what these are for??
mov rcx, rsp
lea rdx, [rsp + 4]
.LBB0_1:
; Write the inputs on the stack.
mov dword ptr [rsp], 1
mov dword ptr [rsp + 4], 2
; Loop iteration.
add eax, -1
jne .LBB0_1
; Deallocate 8 bytes on the stack.
pop rax
ret
Therefore, a crucial point is to put a black_box
both on the inputs and on the outputs.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
black_box(black_box(a) + black_box(b));
}
}
In this case, we finally see an addition being computed on each iteration (Compiler Explorer).
bench_add:
; Prepare stack and loop.
sub rsp, 12
mov eax, 1024
; Still don't know what's the purpose of these.
mov rcx, rsp
lea rdx, [rsp + 4]
lea rsi, [rsp + 8]
.LBB0_1:
; Write the inputs on the stack.
mov dword ptr [rsp], 1
mov edi, dword ptr [rsp]
mov dword ptr [rsp + 4], 2
; Benchmark the function!
add edi, dword ptr [rsp + 4]
; Write the result on the stack.
mov dword ptr [rsp + 8], edi
; Loop iteration.
add eax, -1
jne .LBB0_1
; Deallocate the stack.
add rsp, 12
ret
There are also overhead instructions to write the inputs and outputs on the stack, but that can be mitigated by using the “better” black box implementation for example.
use std::arch::asm;
fn better_black_box(mut x: u64) -> u64 {
unsafe { asm!("/* {x} */", x = inout(reg) x, options(nostack)); }
x
}
pub fn bench_add() {
let a = 1;
let b = 2;
for _ in 0..1024 {
let x = better_black_box(
better_black_box(a) + better_black_box(b)
);
}
}
At this point, there isn’t a lot left to optimize or de-optimize (Compiler Explorer). Each iteration of the loop sets the inputs into registers, computes the addition, and updates the loop counter.
bench_add:
; No stack allocation, just prepare the loop.
mov eax, 1024
.LBB0_1:
; Write the inputs into registers.
mov ecx, 1
mov edx, 2
; Benchmark the function!
add rdx, rcx
; Loop iteration.
add eax, -1
jne .LBB0_1
; Just return after the loop.
ret
Application to real benchmarks
I hope that exploring this example of benchmarking a simple addition convinced you of the effect of using the black_box
both on inputs and outputs.
However, you may wonder whether this simplified example really generalizes to real-world benchmarks.
So let’s now explore some real benchmarks found in the standard library.
I’ll then finish with my original benchmarks that prompted me to write this blog post.
Example 1: char::to_digit
The first example I want to consider is the char::to_digit
method.
This converts a character into a number in the given base (a.k.a. radix), for example '5'
is converted to Some(5)
in base 10, 'b'
is converted to Some(11)
in base 16.
However, 'b'
is converted to None
in base 10, because this hexadecimal character is not a valid digit in base 10.
pub fn to_digit(self, radix: u32) -> Option<u32>
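The conversions described above are easy to check directly (a small sanity sketch, not part of the standard library's benchmarks):

```rust
fn main() {
    // Valid digits for the given radix are converted...
    assert_eq!('5'.to_digit(10), Some(5));
    assert_eq!('b'.to_digit(16), Some(11));
    assert_eq!('A'.to_digit(16), Some(10)); // uppercase works too
    // ...while characters out of range for the radix yield None.
    assert_eq!('b'.to_digit(10), None);
    assert_eq!('2'.to_digit(2), None);
}
```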
Interestingly, there are some benchmarks written for this method, which boil down to the following loops.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_to_digit_radix_2() {
let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
for _ in 0..1024 {
black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(2)).min());
}
}
pub fn bench_to_digit_radix_16() {
let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
for _ in 0..1024 {
black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(16)).min());
}
}
pub fn bench_to_digit_radix_36() {
let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
for _ in 0..1024 {
black_box(chars.iter().cycle().take(10_000).map(|c| c.to_digit(36)).min());
}
}
The intent of these benchmarks was seemingly to call the function enough times so that the benchmarks don’t show a result that’s too fast.
However, when compiling them with -C opt-level=3
, we obtain the following assembly (Compiler Explorer).
bench_to_digit_radix_2:
; Prepare the stack.
push rax
xor eax, eax
; When interpreted as [u32; 2] (in little endian), this is equal to [0, 0x48].
; This is a representation of None as an Option<u32>.
movabs rcx, 309237645312 ; 0x4800000000
mov rdx, rsp
; Main loop.
.LBB0_4:
; Increment main loop counter.
add eax, 1
; Inner loop counter.
mov esi, 9981
; Inner loop.
.LBB0_2:
lea rdi, [rsi - 9]
cmp rsi, 8
mov rsi, rdi
ja .LBB0_2
; Write the result on the stack.
mov qword ptr [rsp], rcx
; Main loop iteration.
cmp eax, 1024
jne .LBB0_4
; Deallocate the stack.
pop rax
ret
As you can see, the benchmark doesn’t run the to_digit
function 10’000 times per iteration at all!
The result (None
) is pre-computed in the beginning, and the black_box
only makes sure to copy this result on the stack, like we saw with the addition benchmark.
In the end, this benchmark really measures approximately 10’000 iterations of an empty loop (it’s unclear to me why the loop counter isn’t initialized to exactly 10’000 but to 9’981 instead), which the compiler somehow didn’t manage to optimize away.
The benchmarks for the other radices are very similar; they only differ in the result itself.
bench_to_digit_radix_16:
push rax
xor eax, eax
; [0, 0x21] = None
movabs rcx, 141733920768 ; 0x2100000000
mov rdx, rsp
.LBB1_4:
add eax, 1
mov esi, 9981
.LBB1_2:
lea rdi, [rsi - 9]
cmp rsi, 8
mov rsi, rdi
ja .LBB1_2
; Copy the result on the stack.
mov qword ptr [rsp], rcx
cmp eax, 1024
jne .LBB1_4
pop rax
ret
bench_to_digit_radix_36:
push rax
xor eax, eax
mov rcx, rsp
.LBB2_4:
add eax, 1
mov edx, 9981
.LBB2_2:
lea rsi, [rdx - 9]
cmp rdx, 8
mov rdx, rsi
ja .LBB2_2
; Write the result directly on the stack.
; [1, 0] = Some(0)
mov qword ptr [rsp], 1
cmp eax, 1024
jne .LBB2_4
pop rax
ret
On the contrary, adding just a black_box
for the input character of to_digit
produces more than 1000 lines of assembly!
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_to_digit_radix_2() {
let chars = ['0', 'x', '2', '5', 'A', 'f', '7', '8', '9'];
for _ in 0..1024 {
black_box(chars.iter().cycle().take(10_000).map(|c| black_box(c).to_digit(2)).min());
}
}
Loop over array indices or direct iteration?
Another curiosity I found while playing with this example is how the following two variants of the same function were compiled.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_to_digits_indexed() {
let chars = ['0', '1', '2', '4'];
for _ in 0..1024 {
for i in 0..4 {
black_box(chars[i].to_digit(8));
}
}
}
pub fn bench_to_digits_forin() {
let chars = ['0', '1', '2', '4'];
for _ in 0..1024 {
for c in chars {
black_box(c.to_digit(8));
}
}
}
The bench_to_digits_indexed
function uses an index-based for loop to go through the chars
array, and the resulting assembly is as optimized as one would expect (Compiler Explorer).
bench_to_digits_indexed:
push rax
mov eax, 1024
mov rcx, rsp
movabs rdx, 4294967297 ; 0x100000001
movabs rsi, 8589934593 ; 0x200000001
movabs rdi, 17179869185 ; 0x400000001
.LBB0_1:
mov qword ptr [rsp], 1 ; Some(0)
mov qword ptr [rsp], rdx ; Some(1)
mov qword ptr [rsp], rsi ; Some(2)
mov qword ptr [rsp], rdi ; Some(4)
add eax, -1
jne .LBB0_1
pop rax
ret
On the other hand, the bench_to_digits_forin
function, with a loop iterating directly over the array, appears much less optimized, despite using the highest optimization level -C opt-level=3
(Compiler Explorer).
.LCPI0_0:
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 4
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
.byte 0
bench_to_digits_forin:
sub rsp, 56
mov eax, 1024
movaps xmm0, xmmword ptr [rip + .LCPI0_0]
movabs r9, 223338299442
movabs r10, 210453397552
mov esi, 48
mov cl, 1
add esi, -48
lea r8, [rsp + 8]
jmp .LBB0_1
.LBB0_2:
add eax, -1
je .LBB0_3
.LBB0_1:
movaps xmmword ptr [rsp + 16], xmm0
mov qword ptr [rsp + 40], r9
mov qword ptr [rsp + 32], r10
test cl, cl
je .LBB0_2
xor edi, edi
cmp esi, 8
setb dil
mov dword ptr [rsp + 8], edi
mov dword ptr [rsp + 12], esi
mov edi, dword ptr [rsp + 36]
cmp edi, 1114112
je .LBB0_2
add edi, -48
xor edx, edx
cmp edi, 8
setb dl
mov dword ptr [rsp + 8], edx
mov dword ptr [rsp + 12], edi
mov edi, dword ptr [rsp + 40]
cmp edi, 1114112
je .LBB0_2
add edi, -48
xor edx, edx
cmp edi, 8
setb dl
mov dword ptr [rsp + 8], edx
mov dword ptr [rsp + 12], edi
mov edi, dword ptr [rsp + 44]
cmp edi, 1114112
je .LBB0_2
add edi, -48
xor edx, edx
cmp edi, 8
setb dl
mov dword ptr [rsp + 8], edx
mov dword ptr [rsp + 12], edi
jmp .LBB0_2
.LBB0_3:
add rsp, 56
ret
This “less optimized” assembly appears very similar to adding an input black_box
to the “indexed” function (Compiler Explorer).
pub fn bench_to_digits_indexed() {
let chars = ['0', '1', '2', '4'];
for _ in 0..1024 {
for i in 0..4 {
black_box(black_box(chars[i]).to_digit(8));
}
}
}
Example 2: str::parse::<f64>
My second example is the str::parse::<f64>
method, implemented by f64::from_str
to convert a string into a floating-point number.
This function is benchmarked here in the standard library, again without any black_box
on the input.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_parse_f64() {
for _ in 0..1024 {
black_box("0.0".parse::<f64>());
}
}
A smart enough compiler could directly optimize the expression "0.0".parse::<f64>()
into its result Ok(0f64)
.
However, the actual assembly is more than 400 lines, so this time I’d say that the benchmarks are “lucky” that the compiler didn’t optimize everything.
I don’t have much more to comment on this example, other than the fact that until such expressions are fully optimized by the compiler, there will be work left to improve compilers :) Of course, there’s also a trade-off to make between how far optimizations go and how long the code takes to compile.
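For reference, the value that a perfectly optimizing compiler would pre-compute here is simple to state, even though the parsing code is not (a small sketch of the expected results):

```rust
fn main() {
    // The benchmarked expression always evaluates to the same value...
    assert_eq!("0.0".parse::<f64>().unwrap(), 0.0);
    // ...but the parser must handle exponents, rounding and errors in general.
    assert_eq!("1e3".parse::<f64>().unwrap(), 1000.0);
    assert!("not a number".parse::<f64>().is_err());
}
```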
Application to my benchmarks: arithmetic in GF(2^n)
Let’s now go back to the original benchmarks from my previous post, which showed “zero nanoseconds” results. In these benchmarks, I tried to measure multiplication algorithms in finite fields of the form GF(2^n) for various values of n. I’ll leave the mathematical details to the previous post, but let’s now focus on the assembly produced by the two “instantaneous” benchmarks, which were for n = 8 and n = 128.
Arithmetic in GF(2^8)
After simplifying the code by removing all the const generic magic from my original code, the first case (n = 8) looks like the following.
#[derive(Clone, Copy)]
struct GF8(u8);
impl GF8 {
fn mul(mut self, other: &Self) -> Self {
let mut result = Self(0);
for i in 0..8 {
if other.0 & (1 << i) != 0 {
result.add_assign(&self);
}
self.shl1();
}
result
}
fn shl1(&mut self) {
let d = self.0;
self.0 = d << 1;
let carry = d >> 7;
if carry != 0 {
self.0 ^= 1 ^ (1 << 4) ^ (1 << 3) ^ (1 << 1);
}
}
fn add_assign(&mut self, other: &Self) {
self.0 ^= other.0;
}
}
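To sanity-check this implementation (and the constant 19 that shows up in the assembly further below), here is a self-contained copy with Debug and PartialEq derived (my addition) so that results can be compared in assertions:

```rust
// Self-contained copy of GF8 from above, with Debug and PartialEq added
// so that results can be asserted on.
#[derive(Clone, Copy, Debug, PartialEq)]
struct GF8(u8);

impl GF8 {
    fn mul(mut self, other: &Self) -> Self {
        let mut result = Self(0);
        for i in 0..8 {
            if other.0 & (1 << i) != 0 {
                result.add_assign(&self);
            }
            self.shl1();
        }
        result
    }
    fn shl1(&mut self) {
        let d = self.0;
        self.0 = d << 1;
        let carry = d >> 7;
        if carry != 0 {
            // Reduce modulo x^8 + x^4 + x^3 + x + 1 (0x11B, the AES polynomial).
            self.0 ^= 1 ^ (1 << 4) ^ (1 << 3) ^ (1 << 1);
        }
    }
    fn add_assign(&mut self, other: &Self) {
        self.0 ^= other.0;
    }
}

fn main() {
    // GF8(1) is the multiplicative identity.
    assert_eq!(GF8(1).mul(&GF8(0xAB)), GF8(0xAB));
    // This is the constant 19 that the compiler pre-computes in the
    // single-black_box assembly shown below.
    assert_eq!(GF8(!0).mul(&GF8(!0)), GF8(19));
}
```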
Let’s consider the following two simplified benchmarking scenarios: the first one with a simple black_box
surrounding the multiplication’s result, and the second one with black_box
applied both on inputs and outputs.
#![feature(bench_black_box)]
use std::hint::black_box;
// Gist of my original benchmark.
pub fn bench_gf8_mul_box() {
let test_value = GF8(!0);
let x = test_value;
let y = test_value;
for _ in 0..1024 {
black_box(x.mul(&y));
}
}
// Improved benchmark.
pub fn bench_gf8_mul_2box() {
let test_value = GF8(!0);
let x = test_value;
let y = test_value;
for _ in 0..1024 {
black_box(black_box(x).mul(&black_box(y)));
}
}
As you may now expect, the first one is simply reduced to a loop that writes the multiplication result directly to the stack (Compiler Explorer). It’s then unsurprising that this loop of 3 instructions takes (on average) less than 1 nanosecond per iteration to execute, hence the benchmark result I obtained in my previous post!
bench_gf8_mul_box:
sub rsp, 1
mov eax, 1024
mov rcx, rsp
.LBB0_1:
; The compiler simplified everything to the result!
mov byte ptr [rsp], 19
add eax, -1
jne .LBB0_1
add rsp, 1
ret
On the other hand, the benchmarking loop with a black_box
on both sides expands to the full multiplication algorithm (Compiler Explorer).
bench_gf8_mul_2box:
; Prepare the loop and the stack.
push rbx
sub rsp, 3
mov r11d, 1024
mov r8, rsp
lea r9, [rsp + 1]
lea r10, [rsp + 2]
.LBB1_1:
; Write the inputs on the stack.
mov byte ptr [rsp], -1
mov byte ptr [rsp + 1], -1
; Load the inputs from the stack.
movzx ecx, byte ptr [rsp]
movzx edi, byte ptr [rsp + 1]
; Compute the multiplication.
mov edx, edi
and dl, 1
; Many more instructions
; ...
and dil, al
xor dil, dl
; Write the result on the stack.
mov byte ptr [rsp + 2], dil
; Loop iteration.
add r11d, -1
jne .LBB1_1
; Deallocate the stack.
add rsp, 3
pop rbx
ret
Arithmetic in GF(2^128)
The second case was for n = 128, with substantially more complex code using specialized SIMD instructions.
Therefore, I compiled it with -C opt-level=3 -C target-cpu=native
to enable the relevant CPU features.
#![feature(bench_black_box)]
use std::hint::black_box;
pub fn bench_gf128_mul_box() {
let test_value = GF128([!0; 2]);
let x = test_value;
let y = test_value;
for _ in 0..1024 {
black_box(GF128::mul(&x, &y));
}
}
pub fn bench_gf128_mul_2box() {
let test_value = GF128([!0; 2]);
let x = test_value;
let y = test_value;
for _ in 0..1024 {
black_box(GF128::mul(&black_box(x), &black_box(y)));
}
}
#[derive(Clone, Copy)]
struct GF128([u64; 2]);
const A: usize = 7;
const B: usize = 2;
const C: usize = 1;
impl GF128 {
#[cfg(all(
target_arch = "x86_64",
target_feature = "sse2",
target_feature = "pclmulqdq"
))]
fn mul(x: &Self, y: &Self) -> Self {
use std::arch::x86_64::{__m128i, _mm_clmulepi64_si128, _mm_set_epi64x, _mm_storeu_si128};
let mut words = [0u64; 2];
let mut carry = [0u64; 2];
for i in 0..2 {
// Safety: target_feature "sse2" is available in this function.
let xi: __m128i = unsafe { _mm_set_epi64x(0, x.0[i] as i64) };
for j in 0..2 {
// Safety: target_feature "sse2" is available in this function.
let yj: __m128i = unsafe { _mm_set_epi64x(0, y.0[j] as i64) };
// Safety: target_feature "pclmulqdq" is available in this function.
let clmul: __m128i = unsafe { _mm_clmulepi64_si128(xi, yj, 0) };
let mut cc: [u64; 2] = [0u64, 0u64];
// Safety:
// - target_feature "sse2" is available in this function,
// - cc points to 128 bits (no alignment required by this function).
unsafe { _mm_storeu_si128(&mut cc as *mut _ as *mut __m128i, clmul) };
let ij = i + j;
if ij < 2 {
words[ij] ^= cc[0];
} else {
carry[ij - 2] ^= cc[0];
}
let ij1 = ij + 1;
if ij1 < 2 {
words[ij1] ^= cc[1];
} else {
carry[ij1 - 2] ^= cc[1];
}
}
}
Self::propagate_carries(words, carry)
}
fn propagate_carries(mut words: [u64; 2], carry: [u64; 2]) -> Self {
for i in 0..2 {
let c = carry[i];
words[i] ^= c ^ (c << A) ^ (c << B) ^ (c << C);
if i + 1 < 2 {
words[i + 1] ^=
(c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
} else {
let c = (c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
words[0] ^= c ^ (c << A) ^ (c << B) ^ (c << C);
words[1] ^=
(c >> (64 - A)) ^ (c >> (64 - B)) ^ (c >> (64 - C));
}
}
Self(words)
}
fn add_assign(&mut self, other: &Self) {
for i in 0..2 {
self.0[i] ^= other.0[i];
}
}
}
Here, the case of a single black_box
applied to the output is interesting (Compiler Explorer).
Although the compiler didn’t manage to optimize away all the math, it still recognized that the same value is computed over and over again, and therefore hoisted the computation out of the loop!
The loop itself is only copying the previously computed result to the stack, and is therefore very fast.
bench_gf128_mul_box:
; Allocate 16 bytes (128 bits) on the stack.
sub rsp, 16
; Compute the multiplication, storing the result in rdx & rax.
vpcmpeqd xmm0, xmm0, xmm0
vpclmulqdq xmm0, xmm0, xmm0, 0
vmovq rcx, xmm0
vpextrq rax, xmm0, 1
mov rdx, rcx
shl rdx, 7
; Many more instructions
; ...
xor rax, rdi
xor rax, rcx
; Prepare the loop.
mov ecx, 1024
mov rsi, rsp
.LBB0_1:
; Write the multiplication result on the stack.
mov qword ptr [rsp], rdx
mov qword ptr [rsp + 8], rax
; Loop iteration.
dec ecx
jne .LBB0_1
; Deallocate the stack.
add rsp, 16
ret
The full benchmark with black_box
es on both sides keeps the full computation inside the loop.
bench_gf128_mul_2box:
; Prepare the stack.
push r15
push r14
push rbx
sub rsp, 16
; Prepare the loop.
mov r11d, 1024
mov r8, rsp
.LBB1_1:
; Load the first input.
mov qword ptr [rsp + 8], -1
mov qword ptr [rsp], -1
vmovdqu xmm0, xmmword ptr [rsp]
vmovq xmm3, qword ptr [rsp + 8]
; Load the second input.
mov qword ptr [rsp + 8], -1
mov qword ptr [rsp], -1
vmovdqu xmm1, xmmword ptr [rsp]
vmovq xmm4, qword ptr [rsp + 8]
; Compute the multiplication.
vpclmulqdq xmm2, xmm0, xmm4, 0
vpclmulqdq xmm4, xmm3, xmm4, 0
vpclmulqdq xmm3, xmm3, xmm1, 0
vpxor xmm5, xmm3, xmm2
vpshufd xmm5, xmm5, 238
; Many more instructions
; ...
xor rbx, r14
xor rbx, rdx
; Write the result on the stack.
mov qword ptr [rsp + 8], rbx
mov qword ptr [rsp], rax
; Loop iteration.
dec r11d
jne .LBB1_1
; Deallocate the stack.
add rsp, 16
pop rbx
pop r14
pop r15
ret
Updated benchmarks
I have now updated my code to include a black_box
both on inputs and outputs, and recomputed my benchmarks.
Here are the updated results and a comparison with my previous blog post.
The first graph was showing micro-benchmarks of the field operations (multiplication and inversion).
Below is the updated graph with the black_box
applied on inputs.
I’ve overlaid some dashed lines that represent the previous benchmarks.
We can see that fixing the black_box
affects mostly the fastest benchmarks, those running under 20 nanoseconds.
The effects described in the previous section for n = 8 and n = 128 are clearly visible.
However, I haven’t put the dashed lines for the slower benchmarks, as there was no noticeable effect beyond noise.
My hypothesis is that the compiler is able to fully optimize only “simple enough” functions, up to a certain number of instructions which happen to run under 20ns in practice.
More complex functions – like we’ve seen with str::parse::<f64>
– are too complex for the current compiler to optimize, so the black_box
has no effect on their benchmarks in practice.
This effect is visible on the following benchmark, regarding the full secret sharing operations, which all take more than a microsecond to run. This graph is essentially the same as in the previous post; the small differences can likely be attributed to noise.
My last benchmark was focused on the clmul
optimizations.
Here again, apart from the multiplication operation for the smallest fields, the missing black_box
didn’t have much effect on the results.
Conclusion
Rust’s black_box
function is still not stabilized.
One reason has been some discussion around its naming, and my opinion is that the “box” part is potentially misleading, because a “box” implies that it encapsulates a computation from its start to its end.
The following code can easily be interpreted as putting a box around the function f
being benchmarked:
loop {
black_box(f());
}
when in reality the “box” is only around the output of f
, so that code is equivalent to the following.
loop {
let x = f();
black_box(x);
}
This makes it clearer that f()
can be optimized by the compiler, such as being pre-computed before the loop as we’ve seen in practical examples.
Other names were proposed for black_box
, such as using the term “fence”, which I think is a better representation of what this function really does.
In order to properly benchmark a function, we need fences for each of its input(s) and output(s), which allows us to faithfully benchmark the computation between the input(s) and output(s), i.e. the function.
On the other hand, putting only one fence on the output(s) doesn’t allow benchmarking the computation that leads to it: as we’ve seen, the compiler is free to optimize away all or part of the code until the output.
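Putting this into practice, here is a minimal fenced timing loop. The helper name time_fenced and its exact shape are my own sketch, not a standard API; it simply applies black_box (since stabilized in std::hint) to both the input and the output:

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

// Times `iters` runs of `f`, fencing both sides: the input fence keeps the
// compiler from pre-computing the result, and the output fence keeps it
// from dropping the computation entirely.
fn time_fenced<T: Copy, R>(iters: u64, input: T, f: impl Fn(T) -> R) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f(black_box(input)));
    }
    start.elapsed()
}

fn main() {
    let elapsed = time_fenced(1024, (1u32, 2u32), |(a, b)| a + b);
    println!("1024 fenced additions took {:?}", elapsed);
}
```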
Another thing to keep in mind is that benchmarking is inherently imperfect, because any measurement mechanism adds overhead, due to the measurement itself (e.g. writing on the stack for the current default black_box
implementation).
Even calling a function on a clock to get the current time adds overhead.
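This clock overhead can itself be estimated with a quick sketch (illustrative only; the numbers vary by platform and clock source):

```rust
use std::hint::black_box;
use std::time::Instant;

fn main() {
    // Rough estimate of the cost of the clock call itself; black_box keeps
    // the compiler from eliding the timestamps we don't otherwise use.
    const N: u32 = 1_000_000;
    let start = Instant::now();
    for _ in 0..N {
        black_box(Instant::now());
    }
    let total = start.elapsed();
    println!("~{} ns per Instant::now() call", total.as_nanos() / N as u128);
}
```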
Now, does this all mean that benchmarking is pointless?
I don’t think so: a (micro-)benchmark is one tool among others to help optimize your code.
Another very useful tool is profiling, to understand which part of your code is worth optimizing in practice.
Profiling tools range from running perf
on a single program on a Linux machine (see this post on my blog), to larger-scale continuous profiling in production with tools like Prodfiler.
Another piece of advice is to try Compiler Explorer to learn assembly by example. By testing various optimization levels, you’ll learn a lot about what the compiler does under the hood.
One last thing we’ve seen is that even if Rust is already quite a fast programming language, there is still room for even more optimizations in the compiler, such as being able to fully reduce expressions like "0.0".parse::<f64>()
to 0f64
at compile time.
Comments
To react to this blog post please check the Reddit thread and the Twitter thread.