As I mentioned in a previous blog post, I think that Rust is a good programming language to implement cryptographic algorithms, thanks to its memory safety, strong type system, ease of unit testing, and high performance. I’ve already experimented with that some years ago with the Gravity-SPHINCS algorithm that I designed during my master’s thesis.

I recently wanted to try Rust again to implement a less niche algorithm, and notably see how more recent features of the language such as const generics can help with cryptography. I settled to write an implementation of Shamir’s Secret Sharing, which I you can find on GitHub under the codename Horcrux.

In this first blog post, I’ll go through the mathematics of Shamir’s Secret Sharing, explaining the algorithms and highlighting choices that will affect implementation. In a second blog post, I’ll explain the Rust implementation, and notably how the language allows to easily test and benchmark the code, to write an algorithm generic over mathematical parameters, and to use CPU-specific instructions to make an algorithm 10x faster.


Shamir’s Secret Sharing

Threshold secret sharing

Let’s first start with a definition: Shamir’s Secret Sharing is a threshold secret sharing scheme. This means that it allows to take a secret value, typically an nn-bit string, and to “split” it into mm shares. These shares can then be distributed to various participants, which can work together to reconstruct the initial secret. More precisely, the scheme is also parameterized by some threshold kmk \le m, such that the knowledge of any kk shares is enough to reconstruct the secret. On the other hand knowing k1k-1 (or fewer) shares gives no information whatsoever about the secret.

Typically, this means that a single share doesn’t reveal anything about the secret (unless k=1k = 1). This also means that losing one share doesn’t block reconstructing the secret (unless k=mk = m).

One use case could be to split the encryption key of some document (or backup, or hard drive, etc.) and spread the shares in multiple places. With that, there is no single place to steal the key – an attacker has to steal shares in multiple places to recover the encryption key – nor a single point of failure – if one or a few shares are physically destroyed by accident (e.g. due to fire) the encryption key can still be recovered.

These properties may seem like magic, so how does this scheme actually work?

Polynomial interpolation

The basic principle behind Shamir’s Secret Sharing is polynomial interpolation. Given k>0k > 0 points {(x1,y1),...,(xk,yk)}\{(x_1, y_1), ..., (x_k, y_k)\} such that the xix_i are all distinct, there exists a unique polynomial P(X)=a0+a1X+...+ak1Xk1P(X) = a_0 + a_1 X + ... + a_{k-1} X^{k-1} of degree k1k-1 that interpolates through these points, i.e. P(xi)=yiP(x_i) = y_i.

Based on this principle, Shamir’s Secret Sharing works as follows. Given a secret ss and a threshold k>0k > 0, set a0=sa_0 = s and pick a1,...,ak1a_1, ..., a_{k-1} uniformly at random – these coefficients aia_i define the polynomial PP. Then the mm shares are mm distinct points on the polynomial, i.e. points (x1,P(x1)),...,(xm,P(xm))(x_1, P(x_1)), ..., (x_m, P(x_m)) such that all the xix_i are distinct and xi0x_i \ne 0.

Polynomial interpolation The secret s=4s = 4 at x=0x = 0 (green point) can be recovered by interpolating a subset of kk other points (in red).

Here are some remarks from this construction.

  • All points have xi0x_i \ne 0, so that no share directly contains the secret.
  • Given (at least) kk shares, we can reconstruct the polynomial PP by interpolation, and therefore the secret by computing P(0)P(0).
  • Given (at most) k1k-1 shares, we cannot reconstruct any information about the secret. Indeed, any secret value ss' could equally work, essentially because one can interpolate the points {(0,s),(x1,y1),...,(xk1,yk1)}\{(0, s'), (x_1, y_1), ..., (x_{k-1}, y_{k-1})\} to obtain a plausible polynomial of degree k1k-1.
  • If one of the shares gets corrupted, the secret reconstructed from kk shares containing it will be corrupted as well, without inherent means of detecting the error.

Ambiguous interpolation with missing points With less than kk shares, we cannot recover the secret because there exist interpolations passing through all possible secret values.

Now, how do we reconstruct PP from the points? There is the following explicit construction:

P(X)=i=1kyiLi(X)P(X) = \sum_{i=1}^k y_i L_i(X)

where LiL_i is a Lagrange polynomial, equal to:

Li(X)=jiXxjxixjL_i(X) = \prod_{j \ne i} \frac{X - x_j}{x_i - x_j}

A couple remarks on Lagrange polynomials.

  • You can see on this formula that the xix_i have to be distinct, so that none of the denominators is zero.
  • You can verify that Li(xi)=1L_i(x_i) = 1 and Li(xj)=0L_i(x_j) = 0 for iji \ne j. This explains the formula for PP, which will evaluate to P(xi)=yiP(x_i) = y_i.

All in all, recovering the secret from kk shares can be done by computing:

s=P(0)=i=1kyiLi(0)=i=1kyijixjxjxis = P(0) = \sum_{i=1}^k y_i L_i(0) = \sum_{i=1}^k y_i \prod_{j \ne i} \frac{x_j}{x_j - x_i}

Some optimizations

From this mathematical formulation, let me present a couple optimizations that can be useful in the actual implementation.

The first optimization is how we represent the shares.

  • Randomized shares. As we’ve seen, each share is a point (xi,yi=P(xi))(x_i, y_i = P(x_i)) where xix_i and yiy_i are elements of our finite field. Typically, if the secret contains nn bits, representing each field element takes nn bits. Therefore, if we take arbitrary xix_i, then we need 2n2n bits to represent a share. Let’s call this construction randomized shares, where the xix_i are uniformly distributed at random.
  • Compact shares. However, the scheme doesn’t rely on the xix_i being random: to build kk shares we can simply take the first integers: x1=1,...,xk=kx_1 = 1, ..., x_k = k. We can then store each xix_i using only log2(k)\log_2(k) bits, which can be much less than nn for a reasonable number of shares kk. For example, if there are at most 256 shares then we can use only a single byte to represent xix_i, and this should be enough for practical purposes. Let’s call this construction compact shares, where the xix_i are small integers.

The second optimization is purely algorithmic, and is about how we compute the reconstruction using Lagrange polynomials. As we will see in the following section, inversion is usually the most expensive operation to compute. In the formula described above, we are computing the inverse (xjxi)1(x_j - x_i)^{-1} for each pair iji \ne j, which is not very efficient. We can however compute fewer inverses by rewriting the formula.

  • We could have a table that maps xjxix_j - x_i to its inverse, in order to cache common values to invert. This caching works only if the xix_i are such that the same difference comes for multiple pairs (i.e. xjxi=xlxkx_j - x_i = x_l - x_k), which is the case for compact shares. However, in practice I didn’t see this optimization bringing the most benefits, due to the overhead of maintaining the cache table.
  • A more general and optimized solution is to compute the product of all denominators before doing the inversion, so that we only have one inverse to compute for each term in the formula, i.e. in total kk inversions to compute for each reconstruction. This means rewriting the formula as follows.
s=i=1kyijixj(ji(xjxi))1s = \sum_{i=1}^k y_i \prod_{j \ne i} x_j \left(\prod_{j \ne i} (x_j - x_i) \right)^{-1}

Fields

The above description works on any polynomials whose coefficients are in a field. Let me recall the definition: a field is a set on which the following arithmetic operations are defined.

  • Addition: an associative and commutative operation, with a neutral element (zero).
  • Subtraction: every element xx has an opposite x-x, or in other words we can subtract elements.
  • Multiplication: an associative and commutative operation, with a neutral element (one), which is distributive over addition.
  • Division: every non-zero element xx has a multiplicative inverse x1x^{-1}, or in other words there is a division operation by non-zero elements.

If you look at the formulas above, these four operations are exactly what we need to reconstruct a secret by polynomial interpolation.

The next question is which field(s) to choose in order to instantiate Shamir’s Secret Sharing. Common fields one can think of are the rational numbers Q\mathbb{Q}, the real numbers R\mathbb{R} or the complex numbers C\mathbb{C}. However, these fields are infinite and therefore inconvenient to work with on computers. Even though each rational number can be represented by a finite amount of memory – the numerator and denominator are both integers – their representation can be unbounded and exceed a computer’s memory after many operations. Regarding real and complex numbers, only a subset of them can actually be computed, but again even working with this subset is inconvenient.

Finite fields

Apart from these infinite fields, there exist finite fields, that contain a finite number of elements. The structure of finite fields is actually well understood, with a canonical construction, as we’ll see in the next section. This important theorem is that finite fields are all of the form GF(pn)\mathrm{GF}(p^n) where pp is a prime number and nn a positive integer. The field GF(pn)\mathrm{GF}(p^n) contains exactly pnp^n elements.

Shamir’s Secret Sharing algorithm is the same regardless of the underlying field, but implementing arithmetic operations is more or less complex depending on the choice of the field. One thing to have in mind is that we want to be able to split secrets of 256 bits (a common size for symmetric encryption keys such as AES), so the finite field should contain at least 22562^{256} elements, and ideally the conversion between a 256-bit secret and a field element should be simple.

There are two interesting forms of finite fields, that are used in various cryptographic algorithms. As an example, elliptic curves based on either of these two categories have been standardized.

  • Prime fields. These fields are of the form GF(p)\mathrm{GF}(p) for a prime pp, i.e. we set n=1n = 1. The most common elliptic curve used on the Internet today1 - secp256r1 a.k.a. P-256 - is defined over a finite field of this form. Field operations are relatively simple to describe: they are arithmetic modulo pp. Despite this simple description, arithmetic in GF(p)\mathrm{GF}(p) is not trivial to implement, precisely because of the modulo operation. It’s even more complex to implement in constant time. Another problem is that pp cannot be equal to 22562^{256}, so we’d have to do some conversion between 256-bit secrets and elements of GF(p)\mathrm{GF}(p).
  • Binary fields. These fields are of the form GF(2n)\mathrm{GF}(2^n), i.e. we set p=2p = 2. Binary fields are used in some block ciphers: for example the “mix columns” step of AES relies on arithmetic in GF(28)\mathrm{GF}(2^8). Field operations are a bit more complicated to describe than “modulo pp”, but using 2 as the prime allows to implement them as a combination of bitwise operations, such as XOR and shifts. As we’ll see later, recent CPUs also have dedicated instructions to help with multiplication in GF(2n)\mathrm{GF}(2^n). The nice thing is that there is a natural one-to-one mapping between nn-bit secrets and elements of GF(2n)\mathrm{GF}(2^n).

Given this constraints, for Horcrux I’ve decided to implement arithmetic in binary fields GF(2n)\mathrm{GF}(2^n).

Arithmetic in GF(2n)\mathrm{GF}(2^n)

As I briefly mentioned, there exists an explicit construction of GF(pn)\mathrm{GF}(p^n), which will be the basis for implementing arithmetic in GF(2n)\mathrm{GF}(2^n). This construction is the set of polynomials with coefficients in GF(p)\mathrm{GF}(p), modulo some irreducible polynomial PP of degree nn.

This reduction modulo PP is determined by the Euclidean division of polynomials, which like for integers states that each polynomial TT can be decomposed into a quotient QQ and a remainder RR such that T=PQ+RT = P Q + R where the polynomial RR has a smaller degree than PP. The remainder RR is the reduction modulo PP of TT.

In more concrete terms, after applying the reduction each field element can be represented by a polynomial A=an1Xn1+...+a1X+a0A = a_{n-1} X^{n-1} + ... + a_1 X + a_0 with nn coefficients. In our case, each coefficient aia_i is simply a bit because p=2p = 2, and operations between coefficients follow the arithmetic in GF(2)\mathrm{GF}(2), i.e. addition is a logical XOR and multiplication is a logical AND.

Representation of elements

A naive way to represent an element AA would be to use an array of nn booleans. However, we can do better by packing the bits together in some unsigned integer type. For example, if n=64n = 64, we can use a single 64-bit integer to store all the coefficients. In the more general sense, given a word size ww (typically w=64w = 64) and a degree nn that is a multiple of ww, we represent each polynomial by an array of n/wn/w words, each being a ww-bit integer.

For example, with n=256n = 256 we can use 4 words of 64 bits, which gives the following type in Rust.

struct GF256 {
    words: [u64; 4],
}

Addition and subtraction

Given two polynomials A=an1Xn1+...+a0A = a_{n-1} X^{n-1} + ... + a_0 and B=bn1Xn1+...+b0B = b_{n-1} X^{n-1} + ... + b_0, their sum is defined by doing an addition in GF(2)\mathrm{GF}(2) coefficient by coefficient. And because addition in GF(2)\mathrm{GF}(2) is a XOR operation (denoted by \oplus), we have:

A+B=(an1bn1)Xn1+...+(a0b0)A + B = (a_{n-1} \oplus b_{n-1}) X^{n-1} + ... + (a_0 \oplus b_0)

In other words, we just need to XOR the bits of A with the bits of B, and this is where the representation as an array of words shines: we can XOR ww bits at once with a single XOR instruction.

impl Add for GF256 {
    type Output = Self;

    fn add(self, other: Self) -> Self {
        let mut words = [0u64; 4];
        for i in 0..4 {
            words[i] = self.words[i] ^ other.words[i];
        }
        Self { words }
    }
}

For subtraction, we also need to subtract coefficient by coefficient, but because GF(2)\mathrm{GF}(2) only has 2 elements, subtraction is equal to XOR as well.

AB=AB=A+BA - B = A \oplus B = A + B

Basic multiplication algorithm

Multiplication is certainly the most complex operation that we have to implement, because we have to apply the reduction modulo PP. This is where there are various opportunities for optimizations, and where the choice of PP will be important.

One way to look at multiplication is as a series of additions, generalizing over the “long multiplication” taught in school to multiply integers.

AB=(an1Xn1BmodP)+...+(a1XBmodP)+(a0BmodP)AB = (a_{n-1} X^{n-1} B \mod P) + ... + (a_1 X B \mod P) + (a_0 B \mod P)

With this approach, we compute each term by taking the polynomial BB, multiplying it by XiX^i and reducing modulo PP. We then add the ii-th term if ai=1a_i = 1, and ignore it if ai=0a_i = 0.

Because we’ll compute iteratively XBX B, X2BX^2 B, …, until Xn1BX^{n-1} B, we only need a “multiply by XX modulo PP” operation. This multiplication by XX is actually relatively straightforward. We first note that multiplying by XX shifts all the coefficients by one place:

XB=bn1Xn+...+b0XX B = b_{n-1} X^n + ... + b_0 X

Then, we can write down the polynomial PP as P=Xn+P~P = X^n + \widetilde{P} with:

P~=pn1Xn1+...+p0\widetilde{P} = p_{n-1} X^{n-1} + ... + p_0

Because PP is equal to zero modulo PP, and subtraction is equal to addition, we have the equality Xn=P~=P~X^n = -\widetilde{P} = \widetilde{P}. This means that we can replace the first term bn1Xnb_{n-1} X^n by bn1P~b_{n-1} \widetilde{P}, or in other words:

XBmodP=bn1P~+bn2Xn1+...+b0XXB \mod P = b_{n-1} \widetilde{P} + b_{n-2} X^{n-1} + ... + b_0 X

Overall, our algorithm to multiply by XX is the following:

  • shift each word to the left by 1 bit,
  • propagate a carry bit from each word into the next word,
  • apply the reduction modulo PP, by adding P~\widetilde{P} if the last carry bit bn1b_{n-1} is set.
impl Mul<&Self> for GF256 {
    type Output = Self;

    fn mul(self, other: &Self) -> Self {
        let mut result = Self {
            words: [0u64; 4],
        };
        for &word in &other.words {
            for i in 0..64 {
                if word & (1 << i) != 0 {
                    result += &self;
                }
                self.shl1();
            }
        }
        result
    }
}

impl GF256 {
    // Based on the irreducible polynomial X^256 + X^10 + X^5 + X^2 + 1.
    const P: u64 = 1 ^ (1 << 2) ^ (1 << 5) ^ (1 << 10);

    // Shift left by 1 place, a.k.a. multiplication by X.
    fn shl1(&mut self) {
        let mut carry = 0u64;
        for i in 0..4 {
            let d = self.words[i];
            self.words[i] = (d << 1) ^ carry;
            carry = d >> 63;
        }
        if carry != 0 {
            self.words[0] ^= Self::P;
        }
    }
}

Optimized multiplication algorithm

The above algorithm works, but it has some drawbacks.

  • It’s rather slow because we have a loop with nn iterations, each of which does a few shifts and XORs on all of the n/wn/w words.
  • It’s not constant-time as is, because the reduction modulo PP uses an “if” statement.

Modern CPUs have some instructions to directly interpret ww-bit words as binary polynomials and multiply them as such. This is actually quite similar to regular multiplication, except that there is no carry to propagate, and therefore these instructions are called “carry-less multiplication” or CLMUL on Intel processors.

Given two polynomials AA and BB of degree (at most) w1w - 1, their product ABAB is a polynomial CC of degree (at most) 2w22w - 2, that therefore fits into a “double word” of 2w2w bits. Intel’s clmul instructions take as input two 64-bit words and return their product as a 128-bit double word.

If we come back to the multiplication of polynomials of degree nn that each consist of n/wn/w words, we can decompose multiplication into a series of clmul instructions, followed by some reduction of the carries modulo PP. In this case though, the reduction will be a bit more complex than “adding P~\widetilde{P} if the last carry bit is set”, because the carries to propagate are now entire words rather than bits.

I’ll discuss in the next blog post how to detect support for the relevant clmul instructions in Rust, and select between the basic and optimized algorithms accordingly. We’ll also see in the benchmarks that this optimization yields a 10x performance improvement!

Inversion

Once we have a multiplication operation, we can define inversion from it. Indeed, the 2n12^n - 1 non-zero elements of GF(2n)\mathrm{GF}(2^n) form the group of invertible elements with respect to multiplication. Therefore, Lagrange’s theorem shows that for every element a0a \ne 0, a2n1=1a^{2^n - 1} = 1, from which we can conclude that:

a1=a2n2a^{-1} = a^{2^n - 2}

So we can compute the inverse of aa by computing its (2n2)(2^n-2)-th power via square-and-multiply exponentiation, using up to 2n2n multiplications.

impl GF256 {
    const ONE: Self = Self {
        words: [1u64, 0u64, 0u64, 0u64],
    };

    fn invert(mut self) -> Self {
        let mut result = Self::ONE;
        for _ in 1..256 {
            self = self * &self;
            result *= &self;
        }
        result
    }
}

Choice of the irreducible polynomial

As we have seen, multiplication in GF(2n)\mathrm{GF}(2^n) relies on an irreducible polynomial, which is XOR-ed to the result when a shift produces a carry bit. It is essential that this polynomial is indeed irreducible, otherwise the arithmetic doesn’t define a field (some non-zero elements won’t be invertible, and the above formula likely won’t work for those which are invertible).

If we pick a simple case like n=8n = 8, we can reuse a well-known irreducible polynomial such as the one used by AES, i.e. X8+X4+X3+X+1X^8 + X^4 + X^3 + X + 1. However, what about the general case for any nn?

Here are a few remarks about an irreducible polynomial PP of degree nn over GF(2)\mathrm{GF}(2).

  • The first term is XnX^n, because we look for a polynomial of degree nn.
  • The last term must be 11, otherwise XX divides the polynomial PP, which is therefore not irreducible.
  • There must be an odd number of terms. Indeed, because we work with coefficients in GF(2)\mathrm{GF}(2), any polynomial PP with an even number of terms will evaluate to P(1)=0P(1) = 0, that is to say that 1 is a root of PP, therefore X1X - 1 divides PP and PP is not irreducible.

The choice of PP will be quite important for performance, because it will affect the multiplication and inversion operations.

  • If PP is sparse, i.e. has a small number of terms, arithmetic will be easier because we have to XOR fewer bits. So we are looking for polynomials of the form Xn+Xa+1X^n + X^a + 1 (trinomials) or Xn+Xa+Xb+Xc+1X^n + X^a + X^b + X^c + 1 (pentanomials).
  • If all the terms of P~\widetilde{P} fit in the last word (i.e. aa, bb and cc are at most w1w-1), then XOR-ing P~\widetilde{P} requires only XOR-ing this last word, instead of n/wn/w words.

Generating irreducible polynomials of arbitrary degree over GF(2)\mathrm{GF}(2) is not a trivial problem. I’ve looked a bit into the literature and found sub-categories of irreducible polynomials such as primitive polynomials and Conway polynomials. There were also some papers giving tables of such polynomials (low-weight polynomials, primitive polynomials). Some other paper discussed methods to choose “optimal” polynomials, but none of the examples where for polynomials with a power-of-2 degree.

In the end, one approach is simply a brute force algorithm, generating polynomials of the required form until we find one that is irreducible. Deciding whether a given PP is irreducible over GF(2)\mathrm{GF}(2) is not a trivial problem either. There are various methods for factorization of polynomials over finite fields, but we can more simply defer that part to a mathematics software like Sage.

I also wanted a bit more flexibility in generating polynomials of any degree, without being limited by those already listed in some table. So I ended up writing a Sage script that:

For Horcrux, I mostly chose to consider power-of-two number of bits, and the script found the following polynomials. Interestingly, there were no trinomials for these values of nn, only pentanomials.

X^8 + X^4 + X^3 + X + 1
X^16 + X^5 + X^3 + X + 1
X^32 + X^7 + X^3 + X^2 + 1
X^64 + X^4 + X^3 + X + 1
X^128 + X^7 + X^2 + X + 1
X^256 + X^10 + X^5 + X^2 + 1

In the end, these polynomials match the table of low-weight polynomials mentioned above. We notably retrieve the polynomial for GF(28)\mathrm{GF}(2^8) used in AES’s “mix columns”, and the polynomial for GF(2128)\mathrm{GF}(2^{128}) used in GCM.

Conclusion

We’re now done with the mathematical description of Shamir’s Secret Sharing. The next blog post will discuss how to implement that in Rust.


  1. According to section 9.1 of RFC 8446, secp256r1 is the only elliptic curve mandatory to implement for TLS 1.3, both in terms of ECDSA signatures and ECDH key exchange. 


Comments

To react to this blog post please check the Reddit thread and the Twitter thread.


RSS | Mastodon | GitHub


You may also like

Horcrux: Implementing Shamir's Secret Sharing in Rust (part 2)
STV-rs: Single Transferable Vote implementation in Rust
Optimization adventures: making a parallel Rust workload even faster with data-oriented design (and other tricks)
Why my Rust benchmarks were wrong, or how to correctly use std::hint::black_box?
And 32 more posts on this blog!