The last month has seen many discussions about open-source dependencies and maintenance. Two posts caught my eye in the Rust ecosystem: Sudo-rs dependencies: when less is better, about the Rust rewrite of sudo trimming its dependency graph, and On Tech Debt: My Rust Library is now a CDO, about a Rust package being flagged as unmaintained, triggering complaints from downstream projects whose CI started failing. And by now, you’ve likely heard about the backdoor in the xz-utils compression project.

As the author of a pure-Rust implementation of the XZ compression format, I have a few thoughts on the topic.

Would a Rust rewrite have helped compared to the existing C implementation? What maintenance model can work for critical dependencies? And is compression considered critical enough in the computing stack?


Before we dive in, here’s a reminder to check if your systems contain the malicious package. On Debian, use dpkg -l to check the version of installed packages.

$ dpkg -l xz-utils liblzma5
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
+++-==============-============-============-=================================
ii  liblzma5:amd64 5.6.0-0.2    amd64        XZ-format compression library
ii  xz-utils       5.6.0-0.2    amd64        XZ-format compression utilities

After apt update && apt upgrade, you should have the xz packages in version 5.6.1+really5.4.5-1.

$ dpkg -l xz-utils liblzma5
...
||/ Name           Version             Architecture Description
+++-==============-===================-============-=================================
ii  liblzma5:amd64 5.6.1+really5.4.5-1 amd64        XZ-format compression library
ii  xz-utils       5.6.1+really5.4.5-1 amd64        XZ-format compression utilities

On macOS with Homebrew, you can check installed packages with brew list.

$ brew list xz
/usr/local/Cellar/xz/5.6.1/bin/lzcat
/usr/local/Cellar/xz/5.6.1/bin/lzcmp
...

If you see any 5.6.x version here, it’s high time to “upgrade” back to a safe previous version with brew update && brew upgrade.

$ brew update
...
==> Outdated Formulae
xz ...
$ brew upgrade
...
==> Upgrading xz
  5.6.1 -> 5.4.6

Rewrite it in Rust?

It’s a common meme that enthusiastic Rust programmers just rewrite software in Rust, because it will be “blazingly fast”, memory safe, or just for the sake of it. Beyond the meme, would this have helped?

Rewritten in Rust: lzma-rs

I’m probably one of the best (or worst) placed people to talk about this, having written an LZMA decompressor purely in Rust. Armin Ronacher’s post on tech debt and CDOs hit close to home, as I literally “learned Rust this way” by writing this LZMA library between jobs back in 2017. I published it on crates.io just before starting the next job, and since then I’ve mostly moved on to other things. Over the years, a few dozen packages started depending on it.

Releases have been occasional, with some bug fixes and community-contributed features like a streaming API. I wouldn’t call it unmaintained; there just aren’t new features being actively developed. Notably, the decoder is complete for the LZMA and LZMA2 formats, and should also work for baseline XZ files that don’t contain filters (although there’s a pull request to add delta filtering, which I should go back to when I get the time). The encoder doesn’t compress much: as a rule of thumb for compression formats, writing a decompressor is 10 times simpler than writing an efficient compressor, so I focused my time on the former.
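For the curious, here is a rough usage sketch (the exact entry points may differ slightly, and the input file name is hypothetical): decompressing an XZ file happens entirely in safe Rust, with no C code involved.

// A rough sketch, not authoritative: decompressing an XZ file with lzma-rs.
fn main() {
    // Hypothetical input file, for illustration only.
    let file = std::fs::File::open("archive.tar.xz").expect("cannot open file");
    let mut input = std::io::BufReader::new(file);
    let mut decompressed: Vec<u8> = Vec::new();
    // Decode the XZ container and its LZMA2 payload into the output buffer.
    lzma_rs::xz_decompress(&mut input, &mut decompressed).expect("invalid XZ stream");
    println!("{} bytes decompressed", decompressed.len());
}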

How would I evaluate the backdooring risk compared with xz-utils?

On the technical side, similarly to the backdoored xz-utils, I once accepted a large unaudited test file (597 KB), but applied basic due diligence to understand where the file came from. This wouldn’t be enough to exclude a backdoor, but importantly the published package excludes artifacts like test files – which keeps the size small, but is also a good defense-in-depth measure and helps auditing. Additionally, my code doesn’t use any custom build script, only the standard Cargo.toml manifest, which is likewise easy to audit. Using a modern language with a standard package manager makes auditing easier!

On the human side, I fortunately don’t have to triage too many requests, as my package isn’t that popular and works for what it does. I haven’t been very quick at replying to contributions (sorry for that, although most pull requests end up merged eventually), which perhaps helps set expectations that future contributions won’t be merged within a day. That said, if someone wanted to be adversarial, they could try to file a RustSec advisory and leverage sock puppet accounts to flag my package as “unmaintained” before I notice – but the RustSec policy provides a good defense in depth. Transferring ownership of the package without my knowledge or consent would be very unlikely given crates.io’s policy – the best an attacker could do would be to create a backdoored fork and convince others to depend on it, which would be an expensive attack.

State-of-the-art: xz2 bindings

A much more commonly used XZ library in Rust is xz2, which provides bindings over xz-utils via the lzma-sys crate. Does that make it vulnerable to the backdoor? That would be quite serious, given that it’s used at the heart of the Rust compiler’s bootstrapping process, to download and unpack the previous version of the compiler.

Looking at the source code on GitHub, lzma-sys contains an archived mirror of xz-utils version 5.2.5, well before the backdoor was injected. Of course, that’s a lie: there is no guarantee that these two GitHub repositories match the actual uploaded lzma-sys crate, which you can browse on docs.rs instead (assuming docs.rs isn’t itself compromised).

Of course, that’s also a lie: the devil is in the details! The included C code is only used if you opt into the “static” feature, or as a fallback if liblzma isn’t found on your system. So chances are that if your system contains the backdoored liblzma, then xz2 will pick it up too.
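To make that concrete, here is a minimal sketch of how a downstream crate typically uses these bindings (assuming the usual flate2-style reader API): nothing in the Rust-facing code reveals which liblzma ends up doing the work.

// A minimal sketch: the actual decompression happens in whichever liblzma
// (system or vendored) the lzma-sys build script linked against.
use std::io::Read;
use xz2::read::XzDecoder;

fn decompress(compressed: &[u8]) -> std::io::Result<Vec<u8>> {
    // The decoder wraps any reader and streams decompressed bytes out of it.
    let mut decoder = XzDecoder::new(compressed);
    let mut output = Vec::new();
    decoder.read_to_end(&mut output)?;
    Ok(output)
}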

There are pros and cons to vendoring a C library in a Rust crate: if upstream pushes a rogue update you don’t pull it, but if upstream pushes a security fix you don’t get it either. One reassuring thing is that the build.rs file to compile the C code is only a hundred lines1 of readable Rust code, in contrast with the complex build files in the upstream xz-utils project.

Build systems and portability

One thing that indeed struck me is how complicated xz-utils’ build system is. The CMakeLists.txt that contained a sneaky dot is more than 2000 lines long, in a domain-specific scripting format that few people are proficient in. Add to that 1400 lines in the configure.ac file and more build files in the cmake/ and m4/ folders – not to mention the malicious build-to-host.m4 script that only appears in the Debian repository.

In comparison, the whole lzma-rs project only contains about 4k lines of code! One clear benefit of modern languages with a built-in package manager is avoiding a lot of boilerplate.

But why do C projects even need all these build scripts? I think this stems from the myth that C is a portable language: in reality, build systems have to resort to terrible hacks like check_c_source_compiles to work around inconsistencies between C compilers. Some detractors would point out that Rust isn’t portable enough for exotic platforms – as happened when Python’s cryptography package started depending on Rust in 2021. My take is that Rust works on a growing list of hundreds of platforms, and if an exotic platform isn’t on that list, then its C compiler is likely outdated and buggy anyway.

Having a single supported implementation of the compiler really simplifies the build system: there is no need to manage unspecified behavior across compilers. Rust does support build scripts, which could be used by an attacker to sneakily modify the code, but most Rust packages don’t need them. When they do, the scripts are usually small, as we saw with the lzma-sys example – a common use case is in fact to compile C code and generate bindings over it.
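As an illustration of how small such a script can be, here is a sketch of a build.rs compiling a hypothetical vendored C file with the commonly used cc crate (the file name and library name are made up for the example):

// build.rs – sketch of a typical "compile vendored C code" build script.
fn main() {
    // Re-run the script only when the vendored source changes.
    println!("cargo:rerun-if-changed=vendor/foo.c");
    // Compile the C file into a static library and tell Cargo to link it.
    cc::Build::new()
        .file("vendor/foo.c")
        .compile("foo");
}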

Additionally, these build scripts are also written in Rust. This makes them easier to audit, as a reviewer doesn’t need to be proficient in another language like CMake or m4. This also makes them more robust because type errors won’t compile, whereas text-based scripts may gladly expand unescaped variables into unintended commands.

Lastly, reactive-style checks like check_c_source_compiles are unthinkable in Rust: you declaratively state your dependencies in the Cargo.toml manifest and the language provides conditional compilation to directly check platform specifics within your code.
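For instance, here is a sketch of what declarative platform handling looks like in Rust (the values are made up purely for illustration): cfg attributes and the cfg! macro replace compile-time probing entirely.

// A sketch of conditional compilation replacing check_c_source_compiles-style probing.
// The constants are hypothetical, for illustration only.
#[cfg(unix)]
fn temp_dir() -> &'static str {
    "/tmp"
}

#[cfg(not(unix))]
fn temp_dir() -> &'static str {
    "C:\\Temp"
}

fn main() {
    // cfg! is evaluated at compile time; no compiler probing involved.
    if cfg!(target_pointer_width = "64") {
        println!("64-bit target, temp dir: {}", temp_dir());
    } else {
        println!("other target, temp dir: {}", temp_dir());
    }
}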

A maintenance model for critical dependencies?

The xz-utils backdoor was made possible by an attacker taking over maintenance of the package. An important question is therefore how to ensure sustainable maintenance of long-term critical dependencies, especially when new features are not actively developed anymore and the original author is moving on.

One observation is that modern package managers make it easy for anyone to create and publish a new library, which is a double-edged sword. On the one hand, anyone can prototype something new, or extend an existing project beyond its original scope by simply creating a new package. On the other hand, this fragments the ecosystem: someone might create a gzip package, someone else an lzma package, and another an xz package. Concretely, the gzip, bzip2 and xz-utils Debian packages are independently maintained by different people, despite all being similar compression tools.

Unfortunately, with fragmentation comes more opportunities for attackers to insert a backdoor, by targeting the weakest package.

One model that I think could work is to bundle similar libraries or tools together once they reach enough maturity. For example, the GNU coreutils project maintains common Unix command-line tools like ls, cat and mkdir. Having all of these tools be part of a common project strengthens and de-risks their maintenance. It’s also probably easier for a wider project to obtain funding or contributions from those who depend on it. As an example, I learned that Germany’s Sovereign Tech Fund is investing in various critical open-source projects, such as systemd and the Rust rewrite of coreutils.

An example from the Rust ecosystem is the RustCrypto organization, which develops cryptographic libraries. Importantly, a unified project doesn’t mean a monolithic library: RustCrypto precisely releases a small package for each algorithm. This means that when one of these packages has a vulnerability, only those who actually depend on it have to remediate the issue, rather than anyone depending on anything managed by the project. The recently merged RFC on package namespaces will undoubtedly help promote this model of organizations owning related packages within the Rust ecosystem.

Following this model, it would be natural to have a “compression tools” organization that manages implementations of widespread compression algorithms in a sustainable way. It turns out that such an organization already exists: the libarchive project – which was also potentially targeted by the malicious actor.

Another model is for larger projects to absorb their small critical dependencies. For example, the Hashbrown library is an efficient hash table implementation that got integrated into the Rust standard library. It started as a one-person project, but got moved into the rust-lang organization, which means that its long-term maintenance is de-risked – even if the original author is still the main contributor today.

This model is quite natural: if a large project or company depends on open-source packages, it’s up to this large project to make sure their dependencies are well-maintained and well-funded, not the other way around. If not, the large project should step up, audit its dependencies, fork, or trim its dependency graph.

In fact, a project like the Rust compiler has a large dependency graph. When I took a look a few years ago, rustc required no fewer than 265 packages, most of which are external dependencies: a regular expression engine, data structures, graph manipulation and of course compression algorithms, to name a few. That’s only counting the core Rust compiler; the toolchain as a whole, including the package manager, pulls in more dependencies. All of these are certainly critical to the entire Rust ecosystem, and it may be time to make sure they are in a good state.

Maintained or not maintained?

One question comes up more and more in the Rust ecosystem: is it worth labeling packages as “unmaintained”? This label is supported by the RustSec advisory database with clear guidelines. Unfortunately, despite “unmaintained” being informational, it is reported by default by tools like cargo audit (which many downstream projects use in CI), triggering chaos across the ecosystem whenever a commonly-used package receives the label.

Although some projects may find it useful to know the maintenance status of their dependencies, realistically most packages are not highly maintained anyway. The majority are single-person projects developed years ago. At the other end of the spectrum, some widely-used crates were developed by prolific contributors who own dozens (if not hundreds) of crates. These contributors deserve a lot of respect for carrying the ecosystem on their shoulders, but one person cannot actively maintain that many packages, and may one day move on.

It would be more useful to shift the mindset and not expect any particular maintenance or support from packages, unless they are explicitly labeled otherwise. After all, open-source licenses state that software is provided “as is”. If you expect all your dependencies to be well maintained, then it’s on you to fund these projects, bear the burden of auditing them, trim your dependency graph, and be ready to maintain a fork if the upstream project doesn’t satisfy your needs.

A wake-up call: compression is security-critical

A lot has been said about the technical (build systems) and social (maintenance model) aspects of the xz backdoor, but I think it’s also a reminder that compression is critical to modern computing.

When we think of secure communication between computers, the first thing that comes to mind is cryptography. It provides confidentiality and authentication, and is ubiquitous via protocols like HTTPS – built upon TLS. Cryptography being security-critical has been well understood for a while in research and industry. New cryptographic algorithms are standardized after careful multi-year processes. Protocols like TLS are formally verified. Software and hardware implementations are expected to resist side-channel attacks, otherwise they make the news with fancy vulnerability websites.

At first glance, compression isn’t security-critical: it only turns some bytes into fewer bytes. In reality, compression is everywhere, as most applications wouldn’t be practical without it. As such, the compression algorithm is part of the secure channel. Having perfectly secure cryptography isn’t useful if every decrypted byte is then decompressed by an implementation that is backdoored, vulnerable to buffer overflows, or simply incorrect. In fact, the combined compression-encryption channel has been exploited by several attacks, such as CRIME and BREACH against compressed HTTPS traffic.

By extension, file formats are also part of the communication channels between programs. What good are bytes if they are interpreted incorrectly or trigger buffer overflows in C parsers?

Compressed code archives

A particular case where compression is critical is the distribution of code archives (in source or binary form), which are essential to the software supply chain. A few examples:

  • Debian packages contain .tar archives that can be compressed with common algorithms such as gzip, bzip2, xz or zstd (Wikipedia).
  • GitHub creates releases with the source code packaged as .zip and .tar.gz archives.
  • Every Rust crate is packaged as a .tar.gz archive of the source code.
  • The Rust compiler is distributed as .tar.xz binaries, as can be seen with rustup update in verbose mode:
$ rustup -v update
...
verbose: downloading file from: 'https://static.rust-lang.org/dist/2024-03-28/rustc-1.77.1-x86_64-apple-darwin.tar.xz'

These archives are manipulated twice by compression programs: first to create an archive from source code (typically on some server infrastructure), and second to unpack the code (on developer or user machines). At each step, a compromised compressor or decompressor could tamper with the archive to inject malicious code, or directly run other malicious operations on the system. This puts compression tools in a privileged position to insert persistent backdoors of the kind described in Ken Thompson’s Reflections on Trusting Trust (see also Russ Cox’s article 40 years later).

GitHub archives are interesting, because the generated .tar.gz archives are not stable. They are created on demand and the compression parameters can change, which can lead to different compressed archives for the same decompressed code. This means that you can’t rely on a known cryptographic hash of the archive to avoid tampering, which has caused real issues – another subtle way in which cryptography and compression interact. Additionally, GitHub allows uploading release assets which may not match the code in the git repository.

Don’t roll your own

Unfortunately, I think that compression and file formats don’t receive the same amount of security attention, scrutiny and formalism as cryptography, and the xz-utils backdoor is only one example of that. Compression algorithms are not always standardized: Deflate and Brotli have RFCs, LZMA and XZ don’t. Most projects are decades old and don’t receive enough maintenance. In my previous blog post, I complained about the poor support for Brotli in Nginx.

As for file formats, the mantra “don’t roll your own crypto” unfortunately doesn’t translate to “don’t roll your own encoding format”. The theory of parsing has been well researched for decades, but it isn’t systematically applied in industry – a notable exception being Protocol Buffers and other IDLs, which abstract away the serialization details. As a result, parsing JSON is still a minefield, with each implementation behaving differently on edge cases. It’s only in the last 10 years that the LangSec approach has attempted to bridge the gap between parsing theory and security. There is of course a recent push towards memory-safe languages – for example with a successful rewrite of an SVG rendering library in Rust as early as 2019 – but a lot of legacy code in compression and encoding formats is still written in C/C++.

But I digress: how does this relate to the xz-utils backdoor, which exploited test files enabled by malicious build scripts, and maintainer burnout? One observation is that because cryptography is seen as security-critical, large organizations have dedicated teams developing in-house cryptography libraries, which prevents an external actor from taking over maintenance, or a maintainer from pushing malicious code without review.

Compression and parsing deserve more libraries like that.

Further reading


  1. Not exactly though: there are also a thousand lines in the pkg-config support crate, which pretty much any Rust crate linking to a C library via a build.rs script depends on. 


Comments

To react to this blog post please check the Mastodon thread and the Reddit thread.

