As I mentioned in a previous blog post, I think that Rust is a good programming language to implement cryptographic algorithms, thanks to its memory safety, strong type system, ease of unit testing, and high performance. I’ve already experimented with that some years ago with the Gravity-SPHINCS algorithm that I designed during my master’s thesis.

I recently wanted to try Rust again to implement a less niche algorithm, and notably see how more recent features of the language such as const generics can help with cryptography. I settled to write an implementation of Shamir’s Secret Sharing, which I you can find on GitHub under the codename Horcrux.

In this first blog post, I’ll go through the mathematics of Shamir’s Secret Sharing, explaining the algorithms and highlighting choices that will affect implementation. In a second blog post, I’ll explain the Rust implementation, and notably how the language allows to easily test and benchmark the code, to write an algorithm generic over mathematical parameters, and to use CPU-specific instructions to make an algorithm 10x faster.

## Shamir’s Secret Sharing

### Threshold secret sharing

Let’s first start with a definition: Shamir’s Secret Sharing is a threshold secret sharing scheme. This means that it allows to take a secret value, typically an $n$-bit string, and to “split” it into $m$ shares. These shares can then be distributed to various participants, which can work together to reconstruct the initial secret. More precisely, the scheme is also parameterized by some threshold $k \le m$, such that the knowledge of any $k$ shares is enough to reconstruct the secret. On the other hand knowing $k-1$ (or fewer) shares gives no information whatsoever about the secret.

Typically, this means that a single share doesn’t reveal anything about the secret (unless $k = 1$). This also means that losing one share doesn’t block reconstructing the secret (unless $k = m$).

One use case could be to split the encryption key of some document (or backup, or hard drive, etc.) and spread the shares in multiple places. With that, there is no single place to steal the key – an attacker has to steal shares in multiple places to recover the encryption key – nor a single point of failure – if one or a few shares are physically destroyed by accident (e.g. due to fire) the encryption key can still be recovered.

These properties may seem like magic, so how does this scheme actually work?

### Polynomial interpolation

The basic principle behind Shamir’s Secret Sharing is polynomial interpolation. Given $k > 0$ points $\{(x_1, y_1), ..., (x_k, y_k)\}$ such that the $x_i$ are all distinct, there exists a unique polynomial $P(X) = a_0 + a_1 X + ... + a_{k-1} X^{k-1}$ of degree $k-1$ that interpolates through these points, i.e. $P(x_i) = y_i$.

Based on this principle, Shamir’s Secret Sharing works as follows. Given a secret $s$ and a threshold $k > 0$, set $a_0 = s$ and pick $a_1, ..., a_{k-1}$ uniformly at random – these coefficients $a_i$ define the polynomial $P$. Then the $m$ shares are $m$ distinct points on the polynomial, i.e. points $(x_1, P(x_1)), ..., (x_m, P(x_m))$ such that all the $x_i$ are distinct and $x_i \ne 0$.

The secret $s = 4$ at $x = 0$ (green point) can be recovered by interpolating a subset of $k$ other points (in red).

Here are some remarks from this construction.

• All points have $x_i \ne 0$, so that no share directly contains the secret.
• Given (at least) $k$ shares, we can reconstruct the polynomial $P$ by interpolation, and therefore the secret by computing $P(0)$.
• Given (at most) $k-1$ shares, we cannot reconstruct any information about the secret. Indeed, any secret value $s'$ could equally work, essentially because one can interpolate the points $\{(0, s'), (x_1, y_1), ..., (x_{k-1}, y_{k-1})\}$ to obtain a plausible polynomial of degree $k-1$.
• If one of the shares gets corrupted, the secret reconstructed from $k$ shares containing it will be corrupted as well, without inherent means of detecting the error.

With less than $k$ shares, we cannot recover the secret because there exist interpolations passing through all possible secret values.

Now, how do we reconstruct $P$ from the points? There is the following explicit construction:

$P(X) = \sum_{i=1}^k y_i L_i(X)$

where $L_i$ is a Lagrange polynomial, equal to:

$L_i(X) = \prod_{j \ne i} \frac{X - x_j}{x_i - x_j}$

A couple remarks on Lagrange polynomials.

• You can see on this formula that the $x_i$ have to be distinct, so that none of the denominators is zero.
• You can verify that $L_i(x_i) = 1$ and $L_i(x_j) = 0$ for $i \ne j$. This explains the formula for $P$, which will evaluate to $P(x_i) = y_i$.

All in all, recovering the secret from $k$ shares can be done by computing:

$s = P(0) = \sum_{i=1}^k y_i L_i(0) = \sum_{i=1}^k y_i \prod_{j \ne i} \frac{x_j}{x_j - x_i}$

### Some optimizations

From this mathematical formulation, let me present a couple optimizations that can be useful in the actual implementation.

The first optimization is how we represent the shares.

• Randomized shares. As we’ve seen, each share is a point $(x_i, y_i = P(x_i))$ where $x_i$ and $y_i$ are elements of our finite field. Typically, if the secret contains $n$ bits, representing each field element takes $n$ bits. Therefore, if we take arbitrary $x_i$, then we need $2n$ bits to represent a share. Let’s call this construction randomized shares, where the $x_i$ are uniformly distributed at random.
• Compact shares. However, the scheme doesn’t rely on the $x_i$ being random: to build $k$ shares we can simply take the first integers: $x_1 = 1, ..., x_k = k$. We can then store each $x_i$ using only $\log_2(k)$ bits, which can be much less than $n$ for a reasonable number of shares $k$. For example, if there are at most 256 shares then we can use only a single byte to represent $x_i$, and this should be enough for practical purposes. Let’s call this construction compact shares, where the $x_i$ are small integers.

The second optimization is purely algorithmic, and is about how we compute the reconstruction using Lagrange polynomials. As we will see in the following section, inversion is usually the most expensive operation to compute. In the formula described above, we are computing the inverse $(x_j - x_i)^{-1}$ for each pair $i \ne j$, which is not very efficient. We can however compute fewer inverses by rewriting the formula.

• We could have a table that maps $x_j - x_i$ to its inverse, in order to cache common values to invert. This caching works only if the $x_i$ are such that the same difference comes for multiple pairs (i.e. $x_j - x_i = x_l - x_k$), which is the case for compact shares. However, in practice I didn’t see this optimization bringing the most benefits, due to the overhead of maintaining the cache table.
• A more general and optimized solution is to compute the product of all denominators before doing the inversion, so that we only have one inverse to compute for each term in the formula, i.e. in total $k$ inversions to compute for each reconstruction. This means rewriting the formula as follows.
$s = \sum_{i=1}^k y_i \prod_{j \ne i} x_j \left(\prod_{j \ne i} (x_j - x_i) \right)^{-1}$

### Fields

The above description works on any polynomials whose coefficients are in a field. Let me recall the definition: a field is a set on which the following arithmetic operations are defined.

• Addition: an associative and commutative operation, with a neutral element (zero).
• Subtraction: every element $x$ has an opposite $-x$, or in other words we can subtract elements.
• Multiplication: an associative and commutative operation, with a neutral element (one), which is distributive over addition.
• Division: every non-zero element $x$ has a multiplicative inverse $x^{-1}$, or in other words there is a division operation by non-zero elements.

If you look at the formulas above, these four operations are exactly what we need to reconstruct a secret by polynomial interpolation.

The next question is which field(s) to choose in order to instantiate Shamir’s Secret Sharing. Common fields one can think of are the rational numbers $\mathbb{Q}$, the real numbers $\mathbb{R}$ or the complex numbers $\mathbb{C}$. However, these fields are infinite and therefore inconvenient to work with on computers. Even though each rational number can be represented by a finite amount of memory – the numerator and denominator are both integers – their representation can be unbounded and exceed a computer’s memory after many operations. Regarding real and complex numbers, only a subset of them can actually be computed, but again even working with this subset is inconvenient.

### Finite fields

Apart from these infinite fields, there exist finite fields, that contain a finite number of elements. The structure of finite fields is actually well understood, with a canonical construction, as we’ll see in the next section. This important theorem is that finite fields are all of the form $\mathrm{GF}(p^n)$ where $p$ is a prime number and $n$ a positive integer. The field $\mathrm{GF}(p^n)$ contains exactly $p^n$ elements.

Shamir’s Secret Sharing algorithm is the same regardless of the underlying field, but implementing arithmetic operations is more or less complex depending on the choice of the field. One thing to have in mind is that we want to be able to split secrets of 256 bits (a common size for symmetric encryption keys such as AES), so the finite field should contain at least $2^{256}$ elements, and ideally the conversion between a 256-bit secret and a field element should be simple.

There are two interesting forms of finite fields, that are used in various cryptographic algorithms. As an example, elliptic curves based on either of these two categories have been standardized.

• Prime fields. These fields are of the form $\mathrm{GF}(p)$ for a prime $p$, i.e. we set $n = 1$. The most common elliptic curve used on the Internet today1 - secp256r1 a.k.a. P-256 - is defined over a finite field of this form. Field operations are relatively simple to describe: they are arithmetic modulo $p$. Despite this simple description, arithmetic in $\mathrm{GF}(p)$ is not trivial to implement, precisely because of the modulo operation. It’s even more complex to implement in constant time. Another problem is that $p$ cannot be equal to $2^{256}$, so we’d have to do some conversion between 256-bit secrets and elements of $\mathrm{GF}(p)$.
• Binary fields. These fields are of the form $\mathrm{GF}(2^n)$, i.e. we set $p = 2$. Binary fields are used in some block ciphers: for example the “mix columns” step of AES relies on arithmetic in $\mathrm{GF}(2^8)$. Field operations are a bit more complicated to describe than “modulo $p$”, but using 2 as the prime allows to implement them as a combination of bitwise operations, such as XOR and shifts. As we’ll see later, recent CPUs also have dedicated instructions to help with multiplication in $\mathrm{GF}(2^n)$. The nice thing is that there is a natural one-to-one mapping between $n$-bit secrets and elements of $\mathrm{GF}(2^n)$.

Given this constraints, for Horcrux I’ve decided to implement arithmetic in binary fields $\mathrm{GF}(2^n)$.

## Arithmetic in $\mathrm{GF}(2^n)$

As I briefly mentioned, there exists an explicit construction of $\mathrm{GF}(p^n)$, which will be the basis for implementing arithmetic in $\mathrm{GF}(2^n)$. This construction is the set of polynomials with coefficients in $\mathrm{GF}(p)$, modulo some irreducible polynomial $P$ of degree $n$.

This reduction modulo $P$ is determined by the Euclidean division of polynomials, which like for integers states that each polynomial $T$ can be decomposed into a quotient $Q$ and a remainder $R$ such that $T = P Q + R$ where the polynomial $R$ has a smaller degree than $P$. The remainder $R$ is the reduction modulo $P$ of $T$.

In more concrete terms, after applying the reduction each field element can be represented by a polynomial $A = a_{n-1} X^{n-1} + ... + a_1 X + a_0$ with $n$ coefficients. In our case, each coefficient $a_i$ is simply a bit because $p = 2$, and operations between coefficients follow the arithmetic in $\mathrm{GF}(2)$, i.e. addition is a logical XOR and multiplication is a logical AND.

### Representation of elements

A naive way to represent an element $A$ would be to use an array of $n$ booleans. However, we can do better by packing the bits together in some unsigned integer type. For example, if $n = 64$, we can use a single 64-bit integer to store all the coefficients. In the more general sense, given a word size $w$ (typically $w = 64$) and a degree $n$ that is a multiple of $w$, we represent each polynomial by an array of $n/w$ words, each being a $w$-bit integer.

For example, with $n = 256$ we can use 4 words of 64 bits, which gives the following type in Rust.

struct GF256 {
words: [u64; 4],
}


Given two polynomials $A = a_{n-1} X^{n-1} + ... + a_0$ and $B = b_{n-1} X^{n-1} + ... + b_0$, their sum is defined by doing an addition in $\mathrm{GF}(2)$ coefficient by coefficient. And because addition in $\mathrm{GF}(2)$ is a XOR operation (denoted by $\oplus$), we have:

$A + B = (a_{n-1} \oplus b_{n-1}) X^{n-1} + ... + (a_0 \oplus b_0)$

In other words, we just need to XOR the bits of A with the bits of B, and this is where the representation as an array of words shines: we can XOR $w$ bits at once with a single XOR instruction.

impl Add for GF256 {
type Output = Self;

fn add(self, other: Self) -> Self {
let mut words = [0u64; 4];
for i in 0..4 {
words[i] = self.words[i] ^ other.words[i];
}
Self { words }
}
}


For subtraction, we also need to subtract coefficient by coefficient, but because $\mathrm{GF}(2)$ only has 2 elements, subtraction is equal to XOR as well.

$A - B = A \oplus B = A + B$

### Basic multiplication algorithm

Multiplication is certainly the most complex operation that we have to implement, because we have to apply the reduction modulo $P$. This is where there are various opportunities for optimizations, and where the choice of $P$ will be important.

One way to look at multiplication is as a series of additions, generalizing over the “long multiplication” taught in school to multiply integers.

$AB = (a_{n-1} X^{n-1} B \mod P) + ... + (a_1 X B \mod P) + (a_0 B \mod P)$

With this approach, we compute each term by taking the polynomial $B$, multiplying it by $X^i$ and reducing modulo $P$. We then add the $i$-th term if $a_i = 1$, and ignore it if $a_i = 0$.

Because we’ll compute iteratively $X B$, $X^2 B$, …, until $X^{n-1} B$, we only need a “multiply by $X$ modulo $P$” operation. This multiplication by $X$ is actually relatively straightforward. We first note that multiplying by $X$ shifts all the coefficients by one place:

$X B = b_{n-1} X^n + ... + b_0 X$

Then, we can write down the polynomial $P$ as $P = X^n + \widetilde{P}$ with:

$\widetilde{P} = p_{n-1} X^{n-1} + ... + p_0$

Because $P$ is equal to zero modulo $P$, and subtraction is equal to addition, we have the equality $X^n = -\widetilde{P} = \widetilde{P}$. This means that we can replace the first term $b_{n-1} X^n$ by $b_{n-1} \widetilde{P}$, or in other words:

$XB \mod P = b_{n-1} \widetilde{P} + b_{n-2} X^{n-1} + ... + b_0 X$

Overall, our algorithm to multiply by $X$ is the following:

• shift each word to the left by 1 bit,
• propagate a carry bit from each word into the next word,
• apply the reduction modulo $P$, by adding $\widetilde{P}$ if the last carry bit $b_{n-1}$ is set.
impl Mul<&Self> for GF256 {
type Output = Self;

fn mul(self, other: &Self) -> Self {
let mut result = Self {
words: [0u64; 4],
};
for &word in &other.words {
for i in 0..64 {
if word & (1 << i) != 0 {
result += &self;
}
self.shl1();
}
}
result
}
}

impl GF256 {
// Based on the irreducible polynomial X^256 + X^10 + X^5 + X^2 + 1.
const P: u64 = 1 ^ (1 << 2) ^ (1 << 5) ^ (1 << 10);

// Shift left by 1 place, a.k.a. multiplication by X.
fn shl1(&mut self) {
let mut carry = 0u64;
for i in 0..4 {
let d = self.words[i];
self.words[i] = (d << 1) ^ carry;
carry = d >> 63;
}
if carry != 0 {
self.words[0] ^= Self::P;
}
}
}


### Optimized multiplication algorithm

The above algorithm works, but it has some drawbacks.

• It’s rather slow because we have a loop with $n$ iterations, each of which does a few shifts and XORs on all of the $n/w$ words.
• It’s not constant-time as is, because the reduction modulo $P$ uses an “if” statement.

Modern CPUs have some instructions to directly interpret $w$-bit words as binary polynomials and multiply them as such. This is actually quite similar to regular multiplication, except that there is no carry to propagate, and therefore these instructions are called “carry-less multiplication” or CLMUL on Intel processors.

Given two polynomials $A$ and $B$ of degree (at most) $w - 1$, their product $AB$ is a polynomial $C$ of degree (at most) $2w - 2$, that therefore fits into a “double word” of $2w$ bits. Intel’s clmul instructions take as input two 64-bit words and return their product as a 128-bit double word.

If we come back to the multiplication of polynomials of degree $n$ that each consist of $n/w$ words, we can decompose multiplication into a series of clmul instructions, followed by some reduction of the carries modulo $P$. In this case though, the reduction will be a bit more complex than “adding $\widetilde{P}$ if the last carry bit is set”, because the carries to propagate are now entire words rather than bits.

I’ll discuss in the next blog post how to detect support for the relevant clmul instructions in Rust, and select between the basic and optimized algorithms accordingly. We’ll also see in the benchmarks that this optimization yields a 10x performance improvement!

### Inversion

Once we have a multiplication operation, we can define inversion from it. Indeed, the $2^n - 1$ non-zero elements of $\mathrm{GF}(2^n)$ form the group of invertible elements with respect to multiplication. Therefore, Lagrange’s theorem shows that for every element $a \ne 0$, $a^{2^n - 1} = 1$, from which we can conclude that:

$a^{-1} = a^{2^n - 2}$

So we can compute the inverse of $a$ by computing its $(2^n-2)$-th power via square-and-multiply exponentiation, using up to $2n$ multiplications.

impl GF256 {
const ONE: Self = Self {
words: [1u64, 0u64, 0u64, 0u64],
};

fn invert(mut self) -> Self {
let mut result = Self::ONE;
for _ in 1..256 {
self = self * &self;
result *= &self;
}
result
}
}


### Choice of the irreducible polynomial

As we have seen, multiplication in $\mathrm{GF}(2^n)$ relies on an irreducible polynomial, which is XOR-ed to the result when a shift produces a carry bit. It is essential that this polynomial is indeed irreducible, otherwise the arithmetic doesn’t define a field (some non-zero elements won’t be invertible, and the above formula likely won’t work for those which are invertible).

If we pick a simple case like $n = 8$, we can reuse a well-known irreducible polynomial such as the one used by AES, i.e. $X^8 + X^4 + X^3 + X + 1$. However, what about the general case for any $n$?

Here are a few remarks about an irreducible polynomial $P$ of degree $n$ over $\mathrm{GF}(2)$.

• The first term is $X^n$, because we look for a polynomial of degree $n$.
• The last term must be $1$, otherwise $X$ divides the polynomial $P$, which is therefore not irreducible.
• There must be an odd number of terms. Indeed, because we work with coefficients in $\mathrm{GF}(2)$, any polynomial $P$ with an even number of terms will evaluate to $P(1) = 0$, that is to say that 1 is a root of $P$, therefore $X - 1$ divides $P$ and $P$ is not irreducible.

The choice of $P$ will be quite important for performance, because it will affect the multiplication and inversion operations.

• If $P$ is sparse, i.e. has a small number of terms, arithmetic will be easier because we have to XOR fewer bits. So we are looking for polynomials of the form $X^n + X^a + 1$ (trinomials) or $X^n + X^a + X^b + X^c + 1$ (pentanomials).
• If all the terms of $\widetilde{P}$ fit in the last word (i.e. $a$, $b$ and $c$ are at most $w-1$), then XOR-ing $\widetilde{P}$ requires only XOR-ing this last word, instead of $n/w$ words.

Generating irreducible polynomials of arbitrary degree over $\mathrm{GF}(2)$ is not a trivial problem. I’ve looked a bit into the literature and found sub-categories of irreducible polynomials such as primitive polynomials and Conway polynomials. There were also some papers giving tables of such polynomials (low-weight polynomials, primitive polynomials). Some other paper discussed methods to choose “optimal” polynomials, but none of the examples where for polynomials with a power-of-2 degree.

In the end, one approach is simply a brute force algorithm, generating polynomials of the required form until we find one that is irreducible. Deciding whether a given $P$ is irreducible over $\mathrm{GF}(2)$ is not a trivial problem either. There are various methods for factorization of polynomials over finite fields, but we can more simply defer that part to a mathematics software like Sage.

I also wanted a bit more flexibility in generating polynomials of any degree, without being limited by those already listed in some table. So I ended up writing a Sage script that:

For Horcrux, I mostly chose to consider power-of-two number of bits, and the script found the following polynomials. Interestingly, there were no trinomials for these values of $n$, only pentanomials.

X^8 + X^4 + X^3 + X + 1
X^16 + X^5 + X^3 + X + 1
X^32 + X^7 + X^3 + X^2 + 1
X^64 + X^4 + X^3 + X + 1
X^128 + X^7 + X^2 + X + 1
X^256 + X^10 + X^5 + X^2 + 1


In the end, these polynomials match the table of low-weight polynomials mentioned above. We notably retrieve the polynomial for $\mathrm{GF}(2^8)$ used in AES’s “mix columns”, and the polynomial for $\mathrm{GF}(2^{128})$ used in GCM.

## Conclusion

We’re now done with the mathematical description of Shamir’s Secret Sharing. The next blog post will discuss how to implement that in Rust.

1. According to section 9.1 of RFC 8446, secp256r1 is the only elliptic curve mandatory to implement for TLS 1.3, both in terms of ECDSA signatures and ECDH key exchange.