In the previous blog post, I’ve described how Shamir’s Secret Sharing works from a mathematical point of view. In this blog post, I’ll focus on my Rust implementation.

The motto of the Rust programming language is “empowering everyone to build reliable and efficient software”. I’ll therefore discuss how this applies to mathematical algorithms and cryptography, illustrated by my Horcrux implementation, which you can find on GitHub.

In particular, we’ll see how “const generics” allow to instantiate a new finite field with one line of code, how to detect and use CPU-specific instructions reliably without relying on hacky macros nor assembly code, and how to seamlessly test and benchmark our code. Last, I’ll show how to easily build a command-line interface around it.

## Overview

Overall, my implementation fits in less than 2000 lines, half of which are tests, benchmarks and documentation – and I didn’t even discount blank lines. In terms of dependencies, the base implementation only relies on the rand library, which somehow isn’t part of Rust’s standard library, but is still pretty much a “first class” Rust library being part of the Rust “nursery”. On top of that, the CLI tool has a few more dependencies.

My code currently requires a nightly Rust compiler due to the usage of two unstable features: one related to const functions and the other one only for benchmarks. Apart from that, the code could easily be compiled on stable Rust.

Although I’ve done my best to write correct code, my implementation comes with two caveats at this stage: it has not been peer-reviewed, and it has no constant-time guarantees. Don’t use it as is for anything sensitive! However, feedback and pull requests to improve it are welcome.

## (Const) generics

As I’ve mentioned in a previous blog post, Rust’s story around generic types is in my opinion a strong selling point for this language. In the case of Horcrux, splitting the secret into shares and reconstructing it is fully generic over some Field trait, which declares all required arithmetic operations to compute Lagrange polynomial interpolation.

This means that once the splitting and reconstruction parts are implemented, they can be instantiated on any field that implements the suitable operations. In Rust, my Field interface looks like the following, where we retrieve the field arithmetic operations.

/// Trait for types that implement field arithmetic.
pub trait Field: Copy + Eq + Hash + From<u8> + Sub<Output = Self>
where
for<'a> Self: Mul<&'a Self, Output = Self>,
{
/// The neutral element for addition.
const ZERO: Self;
/// The neutral element for multiplication.
const ONE: Self;

/// Inverts an element.
fn invert(self) -> Self;

/// Samples a field element uniformly at random.
fn uniform<R: Rng + CryptoRng + ?Sized>(rng: &mut R) -> Self;

...
}

Now, how do we implement the finite fields themselves? Until recently, I would have needed to implement a separate type for each field $\mathrm{GF}(2^n)$.

However, a new addition to Rust’s generics story was recently stabilized in Rust version 1.51: const generics. This was a long awaited feature which I mentioned in several blog posts over the past few years, also highlighting how this could benefit cryptography.

So here we are, we can now write a type that will implement pretty much any field of the form $\mathrm{GF}(2^n)$.

/// Implementation of a binary field GF(2^n), with W::NBYTES * NWORDS bits, using the
/// irreducible polynomial x^n + x^a + x^b + x^c + 1.
#[derive(Clone, Copy)]
pub struct GF2n<W: Word, const NWORDS: usize, const A: usize, const B: usize, const C: usize> {
words: [W; NWORDS],
}

There are only a few restrictions.

• The value $n$ must be a multiple of a word type, which itself can be any unsigned integer type like u8 or u64.
• The field must have a irreducible polynomial of the form $X^n + X^a + X^b + X^c + 1$, where $a$ is smaller than the word size.

Other than that, you can instantiate it on pretty much any field with only one line of code. Here is for example an instantiation of $\mathrm{GF}(2^{256})$.

/// Finite field GF(2^256) implemented with 64-bit words and using the following irreducible
/// polynomial: x^256 + x^10 + x^5 + x^2 + 1.
pub type GF256 = GF2n<u64, 4, 10, 5, 2>;

As you can see, the field size $n$ as well as the irreducible polynomial are defined as constants of the type. This allows the compiler to optimize each instance as if we expanded everything by hand, but without the burden of manually expanding the code nor copy-pasting it every time. It only takes one line of code to support another field instance – yet everything is fully type-checked and optimized at compile time.

One thing to note is that only a minimal viable product of const generics has been stabilized in Rust so far, with some restrictions. Yet, it’s already quite powerful and I didn’t reach many restrictions for this use case.

## Detecting CPU-specific instructions

As mentioned in the previous post, we can optimize the multiplication algorithm on CPUs that support a suitable carry-less multiplication instruction. However, not all CPUs support such instructions, so how do we make sure to use the correct algorithm, i.e. either the optimized algorithm or a fallback, depending on the target architecture we’re compiling for?

First of all, how do we even invoke such specialized instructions? One approach would be to use the llvm_asm! macro to directly write some assembly. Even if this macro has been fully redesigned recently into a new asm! macro, it is still an unstable feature of Rust, which therefore requires a nightly compiler, and it is inherently unsafe.

A better approach is to use so-called “intrinsics” (see for example Intel’s intrinsics guide), which are functions that expose specialized instructions without the need to write assembly. The good news is that Rust has brought in and stabilized many intrinsics in the core::arch library, and they work on the stable version of the compiler. For example, the clmul instruction of interest is available as the _mm_clmulepi64_si128 intrinsic on Intel CPUs.

The second question is how do we know whether instructions are supported, and tell the compiler to use the optimized or fallback algorithm accordingly? In C, one could use a bunch of macros to hack something together and try to detect which compiler is used and which architecture is the target. But Rust makes this process much more streamlined and safe, thanks to built-in support for CPU feature detection.

For Horcrux, I chose to use static CPU feature detection, which consists of conditionally compiling the code on supported architectures. For example, the function implementing optimized multiplication is annotated as follows, in order to be compiled only on relevant CPUs.

// Optimized multiplication using CLMUL on Intel CPUs that support it.
#[cfg(all(
target_arch = "x86_64",
target_feature = "sse2",
target_feature = "pclmulqdq"
))]
fn mul_clmul_u64<const NWORDS: usize, ...>(
x: &GF2n<u64, NWORDS, ...>,
y: &GF2n<u64, NWORDS, ...>,
) -> GF2n<u64, NWORDS, ...> {
// Code using _mm_clmulepi64_si128.
...
}

In the top-level multiplication function, the dispatch between optimized and fallback algorithms then looks like this, here using conditional compilation of an if statement.

fn mul(self, other: &Self) -> Self {
// Detect and use an optimized algorithm on supported CPUs.
#[cfg(all(
target_arch = "x86_64",
target_feature = "sse2",
target_feature = "pclmulqdq"
))]
// Additionally to the CPU constraint, this optimization is only implemented for 64-bit words.
if W::NBITS == 64 {
return mul_clmul_u64(&self, other);
}

// Fallback to a generic algorithm: multiplication based on additions.
}

One last question, which might seem surprising, is how do we tell the compiler to optimize for the current CPU?

Indeed, when you compile Rust code with cargo build or cargo test, the compiler detects the current architecture, which represents the broad family of your CPU (such as arm or x86_64). However, within a architecture, there are new CPU models being released every few years, often coming with support for new specialized instructions. By default, the compiler will create a “generic” binary which works across all models using your architecture. This can be useful if you compile your code on some computer and distribute it to run on other computers.

One drawback is that the compiler will limit itself to a common denominator, without taking into account possible optimizations of your specific CPU. If you want to unlock all optimizations – in our case try the optimized CLMUL-based algorithm – you can use the following flag to target your current CPU model, a.k.a. “native”.

$RUSTFLAGS='-C target-cpu=native' cargo build You could also choose to target another CPU model, or to enable specific features. With static CPU feature detection, your compiled binary may crash if you distribute and run it on other computers that don’t support the same features. However, it’s totally fine to run tests and benchmarks on your own computer, or if you don’t plan to distribute the binary. Otherwise, it’s also possible to detect CPU features dynamically at runtime, which will be robust to distribute the binary across computers, although with a minor runtime overhead to do the detection. ## Tests As I mentioned in a previous post, Rust makes it very easy to write and run tests in your code. When it comes to cryptography, correctness is quite important, and even though tests don’t catch all the bugs, they definitely raise the bar on code quality and can prevent regressions, so there really is no excuse not to write any. For Horcrux, most of the code was to implement field arithmetic in $\mathrm{GF}(2^n)$. However, testing arithmetic correctness is not easy, notably because some bugs may trigger only in some rare inputs, such as all bits being equal to one in the middle of some complex computation. This means that careful code review and careful checking of the underlying math are ultimately required to complement the tests. Yet, having unit tests for all arithmetic operations can also greatly help during development to discover where in the code a bug may be. For example, finding out that there was a missing minus sign in the multiplication function is easier if we know that all the addition tests pass but some multiplication tests fail. So my strategy was to write tests that will check expected arithmetic properties, such as commutativity and associativity of addition, on a bunch of simple inputs. To get some coverage and increase the chances of exercising code paths affecting carries, I chose the following inputs: • all bits set to zero, • all bits set to one, • one bit set to one, all the others to zero, • one bit set to zero, all the others to one. Another question is on which instances of $\mathrm{GF}(2^n)$ should we run the tests, given that the implementation is generic? The answer is on all instances that we’ll have in the final program! For one, this will allow to check that the irreducible polynomial was indeed irreducible. And more importantly, from a practical point of view, Rust’s macro system makes it really easy to instantiate tests repeatedly for all fields that we support. All I had to do was to implement the following for_field! and for_all! macros, and put all my tests in a for_all! block. #[cfg(test)] mod test { // This macro instantiates a series of tests for a given field, using a // nested Rust module. macro_rules! for_field { ($mod:ident, $field:ident,$($tests:tt)* ) => { mod$mod {
type F = super::super::$field;$($tests)* } } } // In this macro, we list all the fields on which we want to instantiate the // tests. macro_rules! for_all { ($($tests:tt)* ) => { for_field!(gf008, GF8,$($tests)*); for_field!(gf016, GF16,$($tests)*); for_field!(gf032, GF32,$($tests)*); for_field!(gf064, GF64,$($tests)*); ... }; } // And we can instantiate the following tests for all our fields! for_all! { #[test] fn add_is_commutative() { // F is defined and made available here by the macros. let values = F::get_test_values(); for &x in &values { for &y in &values { assert_eq!(x + y, y + x); } } } ... } } Running the tests is very simple with cargo test (plus a few options to use the nightly compiler and optimize the code in release mode). As you can see in the output, the macros have automatically instantiated our tests.$ cargo +nightly test --release
...

## Benchmarks

Like unit tests, running benchmarks is also very simple in Rust. All we have to do to write a benchmark is to tag a function with #[bench], and use the nightly #![feature(test)] on our Rust crate. And of course, we can re-use the for_all! macro to run benchmarks on all the fields.

So at this point you’re probably wondering how fast my implementation of Shamir’s Secret Sharing runs. Let’s dive into the results!

I’ve chosen to present benchmarks of a few essential operations, for various values of the field size $n$:

• arithmetic on the finite field, to get an idea of how the building blocks behave,
• splitting a secret into shares, and reconstructing it.

For finite field arithmetic, I focused on multiplication and inversion, which are the most complex (and also slowest) operations.

Although the following benchmark results can help pin-point good and bad optimization strategies, they are not based on “production-ready” code. In particular, the implementation is currently not aiming to be constant time, and changing the code accordingly could affect the results. On the other hand, I’ve only explored some optimizations, so there can still be some potential for even further optimizations.

Like in a previous blog post, I’m using the plotters library to generate the following graphs. Thanks to its crates repository, Rust makes it really easy to create a simple script that parses the benchmark results and converts them into an SVG image. You can find my script here.

Let’s start with the field operations. I’ve plotted the results in a logarithmic scale for field sizes between 8 and 2048 bits. Also, I’ve considered 3 levels of optimizations:

• compiled in release mode (baseline),
• compiled with the native CPU architecture (using RUSTFLAGS='-C target-cpu=native')
• compiled with native CPU and CLMUL instructions enabled (see the optimized algorithm in the previous post).

Some observations.

• Asymptotic complexity. In theory the multiplication algorithm should have $O(n^2)$ complexity, and the inversion $O(n^3)$. In practice, the trend is similar although we don’t exactly observe these asymptotic complexities. First of all, for the smallest fields ($n \le 64$) we directly use arithmetic on the smaller integer types (u8, u16, u32) rather than arrays of u64 words. Second, small arrays can be optimized by the compiler to use a separate CPU register for each array element rather than having an array at all in RAM (or even CPU cache) - for example the array [u64; 2] can be transformed into a struct { a: u64, b: u64 } where a and b are each allocated a CPU register.
• Native CPU architecture. Compiling for the native CPU architecture can lead to a 2x speed improvement over the baseline (remember that the graph is in log scale), especially for the larger field types.
• CLMUL-based multiplication. Using clmul instructions is really the biggest optimization, improving speed by an order of magnitude! Note that I have only implemented it on fields based on u64 words, so for the smaller fields the “clmul-enabled” benchmark is actually measuring the non-optimized multiplication algorithm.
• Outliers. For the fastest benchmarks (below 10 nanoseconds), there seem to be some really low outlier values. This is likely due to the fact that our benchmarked code was optimized away as “dead code” by the compiler, despite using Rust’s standard Bencher interface, itself using the black_box hint precisely to avoid this optimization. However, as stated in the documentation, this hint function is a “nightly-only experimental API” of Rust, and rightfully so because it only provides a “best-effort” way of avoiding compiler optimizations. Here we can see why benchmarking is a difficult problem, as we are asking two conflicting things to the compiler: on the one hand we want our code to have all the release-level optimizations, and on the other hand we don’t want the benchmarked code to be optimized away as dead code. Does this mean that all these benchmarks are incorrect? They might be, although apart from these outliers there is no reason to believe that the values are completely off. One way to know for sure would be to analyze the assembly code produced by the compiler, but that can end up quite long to read for the largest functions.

Let’s now look at the high-level operations to split and reconstruct shares. Here I considered splitting and reconstructing a secret into $k = m = 10$ shares, for the 3 levels of optimizations, using compact shares.

As before, targeting the native CPU architecture improves speed by 2x for the largest fields, and clmul (implemented for $n \ge 64$) further improves by an order of magnitude. With the highest level of optimization, everything takes at most 100ms for these parameters, which is quite reasonable.

In practice anything larger than 256 bits is not really relevant for secret sharing, because we can always reduce splitting a large secret to the following steps.

1. Generate a random 256-bit AES key $K$.
2. Encrypt our large secret with $K$ into some ciphertext $C$ (using some authenticated encryption mode of operation).
3. Split the key $K$ into shares $K_i$.
4. For each share, distribute the pair $(K_i, C)$.

As long as AES-256 is not broken this shouldn’t cause any issue.

The last benchmark I want to show is a focus on the clmul-optimized version, considering more parameters:

• $k = m = 10$ (as before),
• $k = m = 255$ (which is the maximum number of shares we can use with $\mathrm{GF}(2^8)$).

Note that I only show results for compact shares, because there was no noticeable difference with randomized shares.

We can see that splitting shares aligns with multiplication, and reconstructing aligns with inversion, which goes with the expected asymptotic complexities:

• splitting requires $O(km)$ multiplications: $m$ points, each of which requires a polynomial evaluation with $k$ multiplications,
• reconstructing requires $O(k^2)$ multiplications and $O(k)$ inversions: for the example parameters the inversions dominate the complexity.

## Command-line interface with a Cargo workspace

On top of this algorithmic implementation, I wanted to build a simple command-line interface to demonstrate the splitting and reconstruction of shares, and the good news is that Rust makes it quite easy.

In terms of dependencies, I used clap to parse command-line arguments, plus regex and hex to convert values into convenient hexadecimal strings for the user.

To better separate this example CLI program from the underlying logic, I decided to use a Cargo workspace, another nice feature of Rust’s package manager. This allows to have a top-level directory where the CLI tool and other examples are defined (the scripts to generate all plots in this blog post), and inside it have a self-contained library in the horcrux/ folder.

.
├── Cargo.toml
├── examples
│   ├── lagrange.rs
│   └── plot.rs
├── horcrux
│   ├── Cargo.toml
│   └── src
│       ├── field.rs
│       ├── gf2n.rs
│       ├── lib.rs
│       └── shamir.rs
└── src
└── main.rs

This workspace structure allows to better separate dependencies that are required for the business-logic library from the higher-level dependencies of the CLI tool.

The top-level Cargo.toml defines the workspace as follows.

[workspace]

[dependencies]
# Dependency to the nested horcrux/ library within the workspace.
horcrux = { path = "horcrux", features = ["parse"] }
clap = "2.33.0"
hex = "0.4.3"
rand = "0.8.4"
regex = "1"

# Dependencies for the examples/ only.
[dev-dependencies]
plotters = "0.3.1"
plotters-backend = "0.3.2"

The lower-level horcrux/Cargo.toml only references dependencies needed for the secret sharing arithmetic. However, note that hex and regex are only needed to (de)serialize values for the command-line tool (via the parse feature). Besides, the small_rng feature of the rand dependency is only used for benchmarks, but there is currently no way to write that in the Cargo.toml.

[dependencies]
# Note: the "small_rng" feature is only used for benchmarks.
rand = { version = "0.8.4", features = ["small_rng"] }
hex = { version = "0.4.3", optional = true }
regex = { version = "1", optional = true }

[features]
default = ["clmul"]
clmul = []
parse = ["hex", "regex"]

## Conclusion

I mentioned in a previous blog post that Rust can be a good language to implement cryptographic algorithms. As we’ve seen on the implementation side of things, many language features are helpful for that, in particular const generics to write code generic over arithmetic parameters, and a robust macro system to generate unit tests and micro-benchmarks. It’s also nice to see CPU-specific instructions, commonly used to speed up cryptographic algorithms, directly exposed by the standard library, without the need for writing custom assembly code.

As future work, it would be interesting to implement finite fields over $\mathrm{GF}(p)$, or even $\mathrm{GF}(p^n)$, and compare the benchmarks with binary fields. Thanks to Rust’s traits system, the secret splitting and reconstruction logic is already abstracted away over any field, so only the arithmetic for these specific fields would have to be implemented. Another thing would be to make the implementation constant-time, although as I mentioned in a previous blog post, I’d argue that it’s complicated.

Some parts of the code (in the parsing logic for the command-line tool) are still not using const generics due to their current limitations, so that’ll be something to improve once these limitations are lifted.

Last, I haven’t yet looked at how to port the optimized multiplication algorithm to other CPUs such as ARM, that have their own variant of the clmul instructions.