Lessons learned from stracing a password manager in Docker

I recently tried to run my favorite password manager within a Docker container. I aimed to keep it as “contained” as possible, while still being usable. Finding the right trade-off turned out to be a longer journey than I expected, involving the strace tool, and learning about system calls like prctl. This blog post is here to share this experiment and knowledge!

Why run a password manager in Docker?
Setting up a read-only container
Let’s add new passwords!
Conclusion: How to block strace?

Why run a password manager in Docker?

Before we start, you may wonder why I wanted to run a password manager in a Docker container.

First of all, I like the idea of running a local password manager – as opposed to one running on a website – so that I understand where the passwords are stored, and where I back them up. Of course, that’s a personal choice, and I find it great that users have many options to choose from to set up a password manager.

The second part is running it in a container. I was quite inspired when I first read Jessie Frazelle’s blog post about running desktop applications in Docker containers a few years ago. I think it’s a great idea, for the following reasons.

From a security point of view, it makes it relatively easy to restrict which resources each application can access. My password manager doesn’t need access to the Internet, my sound card, or my documents – it only needs to access a password file and the clipboard. Without a container, all applications basically have access to all of the resources on my desktop.
It’s easy to set up, update and remove applications independently. You don’t have to worry about conflicting dependencies between applications, or about your favorite application not running on the latest Debian version. Each application is installed into its own “virtual” system – only the Linux kernel is shared between all of them. You don’t have to ask yourself where each application added configuration files on your host system.

The first downside is that each application takes a bit more space on disk (due to embedding all of its dependencies) but that’s not much of a concern today with 100+ GB of space on (SSD) disks.

The second downside is that restricting resources too much will break applications. So you have to tinker a bit to make things work again, but for me, that’s the fun part where I learn a lot about the Linux system. You may have already seen another of my Docker experiments in a previous blog post, otherwise I encourage you to take a look!

Setting up a read-only container

For this experiment, I decided to use KeePassXC, a cross-platform and open-source password manager. There is already a Dockerfile for it on Jessie’s GitHub, but I wanted to restrict the resources available to the password manager a bit more.

My first setup was to run the password manager in read-only mode. This is actually a common use case in practice, just reading passwords to authenticate without signing up to anything new.

Dockerfile and `docker run` basics

The first step is to create a Dockerfile with KeePassXC. Like in my previous blog post, I suggest using the slim Debian testing base image (currently debian:bullseye-slim). We’ll also create an unprivileged user – let’s name it x11user – with a unique user ID such as 6000, without creating a home directory or giving it a login shell. Last, we install the KeePassXC Debian package with apt-get.

Here is the resulting configuration: Dockerfile.

FROM debian:bullseye-slim
RUN useradd --uid 6000 --no-create-home --home-dir /nonexistent --shell /usr/sbin/nologin x11user

RUN apt-get update \
    && apt-get install -y \
        keepassxc \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

USER x11user

ENTRYPOINT [ "/usr/bin/keepassxc" ]

When building this image, I’ll assume for the rest of this blog post that we tag it as keepassxc, so that we run it with docker run [DOCKER PARAMETERS] keepassxc [KEEPASSXC PARAMETERS].

The base of my docker run setup somewhat overlaps with my previous blog post, passing the following flags.

sudo docker run \
    --rm \
    -it \
    --cap-drop=all \
    --security-opt no-new-privileges \
    --read-only \
    --network=none \
    --cpus 1 \
    --memory=256m \
    --memory-swap=256m \
    --memory-swappiness=0 \
    ...

We only provided 256 MB of RAM. While this is enough for the current version of KeePassXC, it may become tight in the future. I actually used to provide only 64 MB, but my setup broke when I updated KeePassXC to a completely revamped version that suddenly used more RAM.

These OOM errors can be quite confusing, as generally the Linux kernel just kills programs upon OOM without much notice. However, in that case a report is left in the kernel logs, visible by running dmesg.

If a program running in Docker doesn’t work or crashes without any other explanation, here is an example of what to look for in dmesg output to check if it was an OOM error. In particular, you can see mention of oom-killer, and in this example the program name “keepassxc” and the memory limit of 64 MB = 65536 kB. The user ID 6000 also matches.
$ sudo dmesg
...
[ 2239.091381] keepassxc invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
...
[ 2239.091434] memory: usage 65536kB, limit 65536kB, failcnt 57
...
[ 2239.091435] Memory cgroup stats for /docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4:
...
[ 2239.091448] Tasks state (memory values in pages):
[ 2239.091448] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2239.091451] [   8050]  6000  8050    80644    24437   348160        0             0 keepassxc
[ 2239.091452] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,mems_allowed=0,oom_memcg=/docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,task_memcg=/docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,task=keepassxc,pid=8050,uid=6000
[ 2239.091467] Memory cgroup out of memory: Killed process 8050 (keepassxc) total-vm:322576kB, anon-rss:62924kB, file-rss:34824kB, shmem-rss:0kB, UID:6000 pgtables:340kB oom_score_adj:0
[ 2239.095133] oom_reaper: reaped process 8050 (keepassxc), now anon-rss:32kB, file-rss:0kB, shmem-rss:0kB
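To avoid scrolling through the whole kernel log, the check can be narrowed down with grep. This is only a sketch: the filter_oom_lines helper name is made up for this example, and the patterns are simply taken from the log lines shown above, so they may need adjusting on other kernel versions.

```shell
# Keep only the kernel log lines that accompanied the cgroup OOM kill above.
filter_oom_lines() {
    grep -E 'invoked oom-killer|Memory cgroup out of memory|oom_reaper'
}

# On a real host: sudo dmesg | filter_oom_lines
# Here the filter is fed three lines copied from the log above.
sample='[ 2239.091381] keepassxc invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
[ 2239.091434] memory: usage 65536kB, limit 65536kB, failcnt 57
[ 2239.095133] oom_reaper: reaped process 8050 (keepassxc), now anon-rss:32kB, file-rss:0kB, shmem-rss:0kB'
oom_lines=$(printf '%s\n' "$sample" | filter_oom_lines)
printf '%s\n' "$oom_lines"

# The limit in the log is expressed in kB: 64 MB * 1024 = 65536 kB.
echo "64 MB = $((64 * 1024)) kB"
```

The "memory: usage" line is filtered out here, but it is worth reading in the full log, since it shows the limit that was hit.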

Giving access to the X11 server (GUI environment)

The next step is to set up Docker to run a GUI application.

First of all, we’ll run the password manager as our Docker-only x11user, which we chose not to map to any user ID on the host.

    ...
    -u x11user \
    ...

Then, because the password manager is a GUI application, on Linux we need to allow it to communicate with an X11 server, which manages all the windows in the GUI environment.

This takes the following steps.

Mount the /tmp/.X11-unix/X0 path, a Unix domain socket that allows GUI applications to communicate with the X11 server. Note that although X0 is the typical path, one can run multiple X11 servers on a single machine, in which case there will be multiple sockets (X1, X2, etc.).
Export the $DISPLAY environment variable, to tell our GUI applications which X11 server to contact. Here again, we assume a single server running on the system, so we give it the value unix:0, but this can be generalized (DISPLAY=unix:1 for the /tmp/.X11-unix/X1 socket).
Mount an Xauthority file. This file contains a secret cookie, which is used to authenticate to the X11 server and spawn GUI applications. This authentication is necessary, because otherwise any user on the system could interact with the GUI by using the /tmp/.X11-unix/X0 socket.
Export the $XAUTHORITY environment variable, to tell our GUI applications where to find the Xauthority file.

    ...
    --env=DISPLAY=unix:0 \
    --env=XAUTHORITY=/Xauthority \
    --volume=/tmp/.X11-unix/X0:/tmp/.X11-unix/X0 \
    --volume=$HOME/Xauthority.docker:/Xauthority:ro \
    ...

An important thing to know is that with access to the socket and the authorization file, applications have broad access to anything happening on your display, including keyboard/mouse inputs (no need to be the window in focus to collect them), clipboard contents, etc. See The Linux Security Circus: On GUI isolation by Joanna Rutkowska.

Now, you’ve probably noticed that I used $HOME/Xauthority.docker for the Xauthority file, but where does it come from? It actually doesn’t exist yet, but let’s now discuss how to create it. Note that you’ll have to create one before launching docker run.

Creating a suitable Xauthority file

When you’re logged in with a GUI environment, there’s typically a file at $HOME/.Xauthority, with read/write permissions only for your user – so that only your user can interact with the GUI. You can look at your cookie for the current display by running xauth list $DISPLAY. So the first solution would be to launch Docker with your user ID (typically -u 1000 if your user ID is 1000), and to mount the $HOME/.Xauthority file.

However, there is a caveat to that: the authentication is specific to a host name. Indeed, the output of xauth list typically starts with hostname/unix:0 – to indicate that the cookie is for the Unix socket 0 on hostname. The problem is that within the Docker container, the $HOSTNAME is different than on the rest of your machine – inside the container it is typically the container’s ID, randomly generated by Docker. This means that inside the container, our GUI applications won’t recognize any cookie in the original Xauthority file, and won’t try to authenticate to the X11 server.

A solution to that is to use the so-called “FamilyWild” connection family of Xauthority, which matches cookies on any host. More precisely, the following command creates a new file (let’s put it in $HOME/Xauthority.docker), by replacing the family field of each entry from your own Xauthority file with ffff (the “FamilyWild” value, 0xffff).

# Create our modified Xauthority file.
xauth nlist "$DISPLAY" | sed -e 's/^..../ffff/' | xauth -f "$HOME/Xauthority.docker" nmerge -
# Last, make x11user the owner of this file. Because this user only exists
# inside the container, we use its numeric user (and group) ID.
sudo chown 6000:6000 "$HOME/Xauthority.docker"
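To see what the sed step does in isolation, here is a small sketch on a made-up entry. The entry only imitates the rough shape of what xauth nlist prints (a 4-digit family, then hex-encoded host name, display number, protocol name and cookie); the host name and cookie are placeholders, not real values.

```shell
# Made-up entry in roughly the shape printed by `xauth nlist`. The first field
# is the family (0100 here); the cookie at the end is a placeholder.
entry='0100 0008 686f73746e616d65 0001 30 0012 4d49542d4d414749432d434f4f4b49452d31 0010 00112233445566778899aabbccddeeff'

# Same substitution as in the command above: overwrite the first four
# characters (the family) with ffff and leave the rest of the entry untouched.
wild_entry=$(printf '%s\n' "$entry" | sed -e 's/^..../ffff/')
printf '%s\n' "$wild_entry"
```

Only the family changes, so the cookie itself stays valid for the X11 server.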

You may wonder which permissions the new Xauthority.docker file has. Indeed, it shouldn’t be readable by any other user than its owner - otherwise the secret cookie would not be a secret anymore. It turns out that the xauth program takes care of this when creating the file. The interesting part (as you can see in the source code) is that it invokes the umask system call before creating the file.

The umask system call may not be well known, but it’s very important for security. By invoking it with a 0077 mask (all bits set for the group and other), we drop all permissions for the group and other users, i.e. files created afterwards will not have any readable/writable/executable permissions for group/other.

Calling umask before creating a file that contains secrets (i.e. one that shouldn’t be readable by any other user on the system) is a secure method. In contrast, creating the file first and then calling chmod opens a race condition, where another user may be able to read the file and extract the secret before chmod has been applied.
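The effect is easy to reproduce from a shell, which has a umask builtin wrapping the same system call. A minimal sketch, assuming a Linux system with GNU stat (the directory and file names are arbitrary):

```shell
dir=$(mktemp -d)

# Run in a subshell so the stricter umask doesn't leak into the rest of the
# script. The redirection asks for mode 0666; the 0077 mask strips every
# group/other bit at creation time, so the file is never world-readable.
(
    umask 0077
    printf 'not-a-real-secret\n' > "$dir/secret"
)

mode=$(stat -c '%a' "$dir/secret")
echo "mode: $mode"
rm -rf "$dir"
```

This should print mode 600, without any window during which the file is readable by other users.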

Last steps

Last, we give read-only access to the password file, and invoke KeePassXC on it. To simplify things, let’s assume that the password file is located at $HOME/file.kdbx on your host (i.e. your user’s home). Inside the container, this password file will be located at /file.kdbx.

    ...
    --volume=$HOME/file.kdbx:/file.kdbx:ro \
    keepassxc \
    /file.kdbx

Here is a summary of the docker run invocation.

sudo docker run \
    --rm \
    -it \
    --cap-drop=all \
    --security-opt no-new-privileges \
    --read-only \
    --network=none \
    --cpus 1 \
    --memory=256m \
    --memory-swap=256m \
    --memory-swappiness=0 \
    -u x11user \
    --env=DISPLAY=unix:0 \
    --env=XAUTHORITY=/Xauthority \
    --volume=/tmp/.X11-unix/X0:/tmp/.X11-unix/X0 \
    --volume=$HOME/Xauthority.docker:/Xauthority:ro \
    --volume=$HOME/file.kdbx:/file.kdbx:ro \
    keepassxc \
    /file.kdbx

When we launch the container, it complains a bit in the terminal output.

QStandardPaths: error creating runtime directory /tmp/runtime-x11user: Read-only file system
The lock file could not be created. Single-instance mode disabled.
libGL error: MESA-LOADER: failed to retrieve device information
libGL error: Version 4 or later of flush extension not found
libGL error: failed to load driver: i915
libGL error: failed to open /dev/dri/card0: No such file or directory
libGL error: failed to load driver: i965

However, the application works and we can copy passwords from the manager and paste them into login forms in the browser, so there is no need to install more things or give more permissions to the container.

Let’s add new passwords!

Now that we can read passwords in a contained environment, the next step is to be able to edit the password file, for example to add new passwords. For that, we have to mount the password file as writable.

    ...
    --volume=$HOME/file.kdbx:/file.kdbx \
    ...

However, when I simply changed that part, trying to edit and save the file from the password manager within the container didn’t work…

I thought that maybe the password manager was trying to save the file in a /tmp directory, and then overwrite the original file with that temporary file. Indeed, given that we run the container with --read-only and didn’t do anything special for temporary files, the /tmp directory is read-only inside the container.

As in my previous blog post, we can mount a tmpfs to allow writing temporary files – while keeping the rest of the container’s filesystem read-only.

    ...
    --tmpfs=/tmp:size=1m \
    ...

Unfortunately, this wasn’t enough to allow saving password files…

`strace` to the rescue!

When we’re out of simple ideas, a powerful troubleshooting tool on Linux is strace. It intercepts all the system calls between the application and the kernel, and prints them in the terminal to help you understand what’s going on. For example, the application may try to open a file that doesn’t exist, without surfacing a relevant error message in the user interface or the terminal.

So I first added strace to the Docker image, by installing the corresponding package with apt-get in the Dockerfile. I also changed the ENTRYPOINT to point to /bin/bash, in order to have an interactive shell inside the container – rather than running the password manager directly.

Here is the result: Dockerfile.

FROM debian:bullseye-slim
RUN useradd --uid 6000 --no-create-home --home-dir /nonexistent --shell /bin/bash x11user

RUN apt-get update \
    && apt-get install -y \
        keepassxc \
        strace \
        --no-install-recommends \
    && rm -rf /var/lib/apt/lists/*

USER x11user

ENTRYPOINT [ "/bin/bash" ]

You may need to add the SYS_PTRACE capability by passing the --cap-add=SYS_PTRACE flag, but as we will see later this is misleading, as the capability itself is not needed to trace a child process. See the CAP_SYS_PTRACE rules in the ptrace manual.

With that, I could run strace keepassxc /file.kdbx within the container to see what was happening.

In this first attempt, many system calls appeared in the output, which makes sense for an interactive GUI application. However, I noticed that after some time, most of the strace output was unusable: instead of pretty printing system call arguments like file paths, it was just showing raw pointers.

I found this quite confusing. I searched for “strace raw pointers” on the Web, but this didn’t yield anything interesting. It seemed that there was some permission problem, but it’s hard to find the needle among so many system calls, with multiple threads and a lot of communication between the GUI and the X11 socket.

I added the -f flag to strace, to show which thread invoked which system calls (maybe some threads were restricted from strace?), but this didn’t surface anything interesting either.

My next attempt was to look into KeePassXC’s source code. In particular, looking for ptrace (the system call used by strace under the hood) gave an interesting result. Although the ptrace line was macOS-specific, a few lines above there was a call to prctl(PR_SET_DUMPABLE, 0). Looking back at my strace output, this call to prctl was indeed the point after which strace stopped decoding the arguments!

As you can see below, the statx system call on the first line shows a path in the clear, but the statx system call on the last line (just below prctl) only shows a pointer.

$ strace -f keepassxc /file.kdbx
...
[pid     9] statx(AT_FDCWD, "/nonexistent/.cache/keepassxc/keepassxc.ini", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffda940eb50) = -1 ENOENT (No such file or directory)
[pid     9] brk(0x55c23eac8000)         = 0x55c23eac8000
[pid     9] brk(0x55c23eac4000)         = 0x55c23eac4000
[pid     9] prlimit64(0, RLIMIT_CORE, {rlim_cur=0, rlim_max=0}, NULL) = 0
[pid     9] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
[pid     9] statx(AT_FDCWD, 0x55c23ea9b648, AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffda940ed60) = -1 ENOENT (No such file or directory)

Interestingly, this system call doesn’t really block strace – we still see the list of system calls and their return values. It just blocks access to the process data, so strace cannot dereference pointers to read file paths and memory buffers. Yet, at this point it was clear that the password manager was actively trying to block my attempts at tracing system calls.

To try to understand a bit better, I added the --trace=%file argument to strace, in order to focus on system calls affecting files. This filtered out all the frequent system calls related to the GUI (such as futex, poll, recvmsg and writev). I also had a closer look at KeePassXC’s source code, in particular the file saving function, but it wasn’t clear what was happening (at least not without digging further into Qt’s source code).

One thing was clear though: there were two saving modes, a “normal” mode, and an “atomic” mode, that KeePassXC suggested falling back to after a few failed saving attempts in normal mode. Running strace -f --trace=%file keepassxc /file.kdbx, I obtained the following traces for these two modes. I trimmed the traces by focusing on the system calls happening just after clicking on the save button.

“Normal” saving mode. The openat(... O_TMPFILE ...) = -1 ENOENT system call is a likely suspect (trying to create and open a temporary file), but it’s unclear why it fails without knowing the path – I had already mounted a tmpfs at /tmp.

$ strace -f --trace=%file keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid     9] access(0x55a8868bfa68, F_OK) = 0
[pid     9] lstat(0x55a88697c2c0, 0x7fff62bbe390) = 0
[pid     9] access(0x55a886435308, F_OK) = 0
[pid     9] openat(AT_FDCWD, 0x7f87169cb328, O_RDONLY|O_CLOEXEC) = 18
[pid    30] access(0x7f870c00e508, F_OK) = 0
[pid    30] access(0x7f870c00e508, W_OK) = 0
[pid    30] statx(AT_FDCWD, 0x7f870c00e508, AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7f86cf7fd830) = 0
[pid    30] statx(AT_FDCWD, 0x7f870c00e508, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86cf7fd820) = 0
[pid    30] openat(AT_FDCWD, 0x7f86b4003018, O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = -1 ENOENT (No such file or directory)

“Atomic” saving mode. The unlink(0x7f86a80026a8) = -1 EROFS system call is a likely suspect, but again, without knowing the path it could be any file, given that we mounted the whole container’s root filesystem as --read-only.

$ strace -f --trace=%file keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid     9] access(0x55a886be4188, F_OK) = 0
[pid     9] lstat(0x55a886c14760, 0x7fff62bbe390) = 0
[pid     9] access(0x55a886ad0438, F_OK) = 0
[pid     9] openat(AT_FDCWD, 0x7f87169cb328, O_RDONLY|O_CLOEXEC) = 18
[pid    33] lstat(0x7f86a8005560, 0x7f86ceffc7b0) = 0
[pid    33] openat(AT_FDCWD, 0x7f86a8002538, O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = 18
[pid    33] statx(18, 0x7f87173cf360, AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, 0x7f86ceffc550) = 0
[pid    33] statx(AT_FDCWD, 0x7f86a80026a8, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86ceffc840) = 0
[pid    33] access(0x7f86a80026a8, R_OK) = 0
[pid    33] access(0x7f86a80026a8, W_OK) = 0
[pid    33] access(0x7f86a80026a8, X_OK) = -1 EACCES (Permission denied)
[pid    33] unlink(0x7f86a80026a8)      = -1 EROFS (Read-only file system)
[pid    33] linkat(AT_FDCWD, 0x7f86a8002948, AT_FDCWD, 0x7f86a805dd58, AT_SYMLINK_FOLLOW) = 0
[pid    33] statx(18, 0x7f87173cf360, AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, 0x7f86ceffb6f0) = 0
[pid    33] statx(AT_FDCWD, 0x7f86a805dd58, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86ceffb6d0) = 0
[pid    33] stat(0x7f86a805c9b8, 0x7f86ceffb7a0) = 0

The following error message was also displayed in KeePassXC’s window in “atomic” mode, but it wasn’t really helpful.

Writing the database failed: Destination file exists
Backup database located at /tmp/KeePassXC.eOsoyi

At this point, I decided to call it a day…

Manipulating system calls

When I got back to it, I wondered if I could manipulate or block system calls with strace itself, to block the annoying prctl(PR_SET_DUMPABLE, 0). And indeed, strace supports commands to inject and fault system calls!

I tried to simply fault the prctl system call in general (without filtering for a PR_SET_DUMPABLE argument), and it worked! I could now see in the clear all the paths and buffers passed to system calls.

$ strace -f --fault=prctl keepassxc /file.kdbx
...
[pid     9] statx(AT_FDCWD, "/nonexistent/.cache/keepassxc/keepassxc.ini", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffebc697250) = -1 ENOENT (No such file or directory)
[pid     9] brk(0x55bc6b6a4000)         = 0x55bc6b6a4000
[pid     9] brk(0x55bc6b6a0000)         = 0x55bc6b6a0000
[pid     9] prlimit64(0, RLIMIT_CORE, {rlim_cur=0, rlim_max=0}, NULL) = 0
[pid     9] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid     9] write(2, "Unable to disable core dumps.\n", 30Unable to disable core dumps.
) = 30
[pid     9] statx(AT_FDCWD, "share/keepassxc", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffebc697460) = -1 ENOENT (No such file or directory)

Of course, this faulting was quite basic, and I wanted to check if there were other legitimate use cases for prctl. The only other cases that I found in this application were to give a name to threads.

$ strace -f keepassxc /file.kdbx 2>&1 | grep prctl
[pid    11] prctl(PR_SET_NAME, "QXcbEventQueue") = 0
[pid    10] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
[pid    12] prctl(PR_SET_NAME, 0x7fddf0046bc8) = 0
[pid    15] prctl(PR_SET_NAME, 0x7fdde8f0bb90 <unfinished ...>
[pid    16] prctl(PR_SET_NAME, 0x7fdde116cb90 <unfinished ...>
[pid    15] <... prctl resumed>)        = 0
[pid    16] <... prctl resumed>)        = 0
[pid    17] prctl(PR_SET_NAME, 0x7fdde096bb90 <unfinished ...>
[pid    17] <... prctl resumed>)        = 0
[pid    18] prctl(PR_SET_NAME, 0x7fddd74a5b90) = 0
[pid    19] prctl(PR_SET_NAME, 0x7fddd6ca4b90) = 0
[pid    20] prctl(PR_SET_NAME, 0x7fddd64a3b90) = 0
[pid    21] prctl(PR_SET_NAME, 0x7fddd5ca2b90) = 0
[pid    22] prctl(PR_SET_NAME, 0x7fddd54a1b90 <unfinished ...>
[pid    22] <... prctl resumed>)        = 0

Well, the output is more obvious if we fault prctl to see the names rather than their pointers ;)

$ strace -f --fault=prctl keepassxc /file.kdbx 2>&1 | grep prctl
[pid    11] prctl(PR_SET_NAME, "QXcbEventQueue" <unfinished ...>
[pid    11] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    10] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    12] prctl(PR_SET_NAME, "QDBusConnection"...) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    15] prctl(PR_SET_NAME, "llvmpipe-0" <unfinished ...>
[pid    15] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    16] prctl(PR_SET_NAME, "llvmpipe-1" <unfinished ...>
[pid    16] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    17] prctl(PR_SET_NAME, "llvmpipe-2") = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    18] prctl(PR_SET_NAME, "llvmpipe-3") = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    19] prctl(PR_SET_NAME, "llvmpipe-4" <unfinished ...>
[pid    19] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    20] prctl(PR_SET_NAME, "llvmpipe-5" <unfinished ...>
[pid    20] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    21] prctl(PR_SET_NAME, "llvmpipe-6" <unfinished ...>
[pid    21] <... prctl resumed>)        = -1 ENOSYS (Function not implemented) (INJECTED)
[pid    22] prctl(PR_SET_NAME, "llvmpipe-7") = -1 ENOSYS (Function not implemented) (INJECTED)

Thread names can show up in process viewers such as htop, but they are not really critical to the function of a password manager, so faulting these system calls is fine.

Also, by default strace will return ENOSYS for system calls that fault. As you can see in the above output, this error is caught by the password manager, but it just prints an error message in the terminal – “Unable to disable core dumps.” – and continues running anyway!

Finally saving new passwords

I could now resume my tracing of file-related system calls while faulting prctl, and observe what is really happening when trying to save a file.

We have to trace prctl to be able to fault it. That is, --trace=%file --fault=prctl won’t fault prctl, because this system call doesn’t belong to the %file group. We have to use --trace=%file,prctl --fault=prctl instead.

I obtained the following traces, which confirmed my hypotheses about the problematic system calls, but more importantly explained why they failed.

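“Normal” saving mode. The openat(AT_FDCWD, "", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) system call fails because it is passed an empty path, for which the kernel returns ENOENT: no directory is given in which to create the temporary file. Note that an attempt to create the file on the read-only filesystem would have failed with EROFS instead.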

$ strace -f --trace=%file,prctl --fault=prctl keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid     9] access("/file.kdbx", F_OK)  = 0
[pid     9] lstat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0
[pid     9] access("/file.kdbx", F_OK)  = 0
[pid     9] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid    30] access("/file.kdbx", F_OK <unfinished ...>
[pid     9] <... openat resumed>)       = 18
[pid    30] <... access resumed>)       = 0
[pid    30] access("/file.kdbx", W_OK)  = 0
[pid    30] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid    30] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid    30] openat(AT_FDCWD, "", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = -1 ENOENT (No such file or directory)

“Atomic” saving mode. The failing system call is unlink("/file.kdbx") = -1 EROFS. We didn’t mount this file as a read-only filesystem, but because we mounted it as a single-file volume, I assume that Docker forbids removing the file. This makes sense because that would remove the volume. In other words it’s possible to overwrite the file, but not to entirely remove it.

QFSFileEngine::open: No file name specified
[pid     9] access("/file.kdbx", F_OK)  = 0
[pid     9] lstat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0
[pid     9] access("/file.kdbx", F_OK)  = 0
[pid     9] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 18
[pid    33] lstat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=140, ...}) = 0
[pid    33] openat(AT_FDCWD, "/tmp", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = 18
[pid    33] statx(18, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=0, ...}) = 0
[pid    33] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid    33] access("/file.kdbx", R_OK)  = 0
[pid    33] access("/file.kdbx", W_OK)  = 0
[pid    33] access("/file.kdbx", X_OK)  = -1 EACCES (Permission denied)
[pid    33] unlink("/file.kdbx")        = -1 EROFS (Read-only file system)
[pid    33] linkat(AT_FDCWD, "/proc/self/fd/18", AT_FDCWD, "/tmp/KeePassXC.wpCgbn", AT_SYMLINK_FOLLOW) = 0
[pid    33] statx(18, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=34142, ...}) = 0
[pid    33] statx(AT_FDCWD, "/tmp/KeePassXC.wpCgbn", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=34142, ...}) = 0
[pid    33] stat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0

In the end, a simple solution is to copy the password file to the /tmp directory, call keepassxc on this temporary file, and afterwards copy the temporary file back onto our single-file volume /file.kdbx with cp. This works because:

KeePassXC is allowed to create and remove files inside of the /tmp directory,
it seems that cp doesn’t remove the target file if it already exists, but simply overwrites it.
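This workaround can be wrapped in a small shell function. The following is only a sketch: the edit_via_tmp name is made up for this example, and it assumes that /tmp is writable and that the command exits with status 0 when the edits should be kept.

```shell
# Hypothetical helper: run a command on a temporary copy of a file, then copy
# the result back over the original (overwriting it rather than replacing it).
edit_via_tmp() {
    original=$1
    shift
    # mktemp -d creates a directory that only the current user can access.
    workdir=$(mktemp -d "${TMPDIR:-/tmp}/edit.XXXXXX") || return 1
    copy="$workdir/$(basename "$original")"
    cp "$original" "$copy" || { rm -rf "$workdir"; return 1; }

    # Run the given command with the temporary copy as its last argument.
    "$@" "$copy"
    status=$?

    # Only copy back if the command succeeded.
    if [ "$status" -eq 0 ]; then
        cp "$copy" "$original"
        status=$?
    fi
    rm -rf "$workdir"
    return "$status"
}

# Inside the container, this could be invoked as:
#   edit_via_tmp /file.kdbx keepassxc
```

Since the temporary copy lives on the tmpfs, its size limit (1 MB in the flags above) must be large enough to hold the password file, plus any backup that KeePassXC writes next to it.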

Conclusion: How to block `strace`?

A question that remains is what we can do to prevent a process from being traced. This is indeed a valid concern for a password manager, to prevent any other process from tracing it and reading the precious passwords.

So far, we’ve learned the following.

Although prctl(PR_SET_DUMPABLE, 0) can hide the contents of buffers from the memory of a process, it doesn’t block an existing tracing (we still see the list of system calls).
More importantly, if the process is already being traced, then this prctl system call can be intercepted and blocked by the tracing process, making it totally ineffective. This is the approach I’ve taken with strace --fault=prctl.

Note that by default, the --fault parameter of strace will inject a recognizable error code (ENOSYS), and as we’ve seen the password manager noticed it (yet didn’t stop running after that). However, the tracing process can also return a valid result, making the traced process believe that this blocking worked! With strace, one can use --inject instead of --fault, and customize the error and retval parameters to achieve that.
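As a rough sketch of that variant, the wrapper below (the function name is mine, not part of strace) uses the inject syntax with retval=0, so that every prctl call appears to succeed without actually being executed. Like --fault=prctl, it is blunt: the PR_SET_NAME calls are skipped too.

```shell
# Hypothetical wrapper: trace file-related system calls and prctl, and make
# every prctl call return 0 to the traced program instead of executing it.
trace_with_fake_prctl() {
    strace -f --trace=%file,prctl --inject=prctl:retval=0 "$@"
}

# Inside the container, this could be run as:
#   trace_with_fake_prctl keepassxc /file.kdbx
```

With this, I would expect the “Unable to disable core dumps.” message to disappear, since the program no longer sees an error.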

More generally, it’s possible to write one’s own tracing program, to fully customize what is being intercepted and injected, for example to only target the prctl(PR_SET_DUMPABLE, 0) call without faulting other calls to prctl. Here are some relevant resources.

Intercepting and Emulating Linux System Calls with Ptrace by Chris Wellons, 2018. The corresponding source code is available on GitHub.
Write yourself an strace in 70 lines of code by Nelson Elhage, 2010.
If you are a Rust enthusiast like me, you can read Loading and ptrace’ing a process in Rust by Joseph Kain, 2015, have a look at rustrace on GitHub, or use the ptrace module of the nix crate.

In short, attempting to block tracing from within a program won’t work if that program is already being traced, because the tracer can intercept and tamper with anything. Mitigations like prctl(PR_SET_DUMPABLE, 0) can only be effective if the program is not already being traced.

Dropping capabilities?

I originally thought that giving the CAP_SYS_PTRACE capability to the Docker container was necessary to run strace within it. I was somewhat right, but for the wrong reason! Indeed, Docker’s seccomp-bpf profile would enable the ptrace system call whenever this capability is provided. But this default profile also enables the ptrace system call if the kernel is recent enough (Linux >= 4.8), regardless of the capabilities!

The same observation was made by Julia Evans a few months ago in the Why strace doesn’t work in Docker blog post.

Aside from Docker’s rules, it turns out that CAP_SYS_PTRACE is only really required to trace a process from a different user. And even if it was required, we would need to run as root within the container to get this capability¹.

Yama rules

The next thing I learned was about Yama, a so-called Linux Security Module introduced in Linux 3.4. In particular, it lets you control the scope of ptrace at the kernel level, via the /proc/sys/kernel/yama/ptrace_scope file. Yama provides 4 levels.

0 - Classic ptrace permissions. Any process running with the same uid can trace a dumpable process.
1 - Restricted. A process can only trace its descendants.
2 - Admin-only. Only a process with CAP_SYS_PTRACE can trace other processes.
3 - Disabled. No process can use ptrace. Re-enabling to another level requires reboot.

You can change the setting by writing the corresponding number (0-3) to /proc/sys/kernel/yama/ptrace_scope. Another (equivalent) method is to run sysctl kernel.yama.ptrace_scope=N. You can read back the current level by reading the /proc/sys/kernel/yama/ptrace_scope file, or running sysctl kernel.yama.ptrace_scope.
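Reading the level back can be scripted, for example to check a machine’s configuration. A small sketch, where the describe_ptrace_scope helper is made up for this example and its descriptions simply restate the list above:

```shell
# Hypothetical helper: map a Yama ptrace_scope level to a short description.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic: any process with the same uid can trace a dumpable process" ;;
        1) echo "restricted: a process can only trace its descendants" ;;
        2) echo "admin-only: CAP_SYS_PTRACE is required to trace" ;;
        3) echo "disabled: no process can use ptrace until reboot" ;;
        *) echo "unknown level" ;;
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    level=$(cat "$scope_file")
    echo "ptrace_scope=$level ($(describe_ptrace_scope "$level"))"
else
    # The file is absent when the kernel was built or booted without Yama.
    echo "Yama does not appear to be enabled on this kernel"
fi
```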

As a simple sanity check, when ptrace is disabled (level 2 or 3) you should obtain the following result when strace-ing a familiar program.

$ strace ls
strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Operation not permitted
strace: ptrace(PTRACE_TRACEME, ...): Operation not permitted
+++ exited with 1 +++

You can also check that level 3 is indeed irreversible at runtime.

$ sudo sysctl kernel.yama.ptrace_scope=3
kernel.yama.ptrace_scope = 3
$ sudo sysctl kernel.yama.ptrace_scope=0
sysctl: setting key "kernel.yama.ptrace_scope": Invalid argument

As mentioned here, Yama rules don’t only block the ptrace system call, but also all kernel features related to program tracing. This includes the process_vm_readv and process_vm_writev system calls – which are also blocked by Docker’s default seccomp-bpf profile when CAP_SYS_PTRACE is dropped – as well as access to some files in /proc/PID/.

Persistent configuration

By default, the ptrace_scope is reset upon reboot – on Debian distributions this is currently reset to level 0 (classic permissions). But you can change that via the /etc/sysctl.d/ configuration folder. To always disable ptrace upon reboot, create a /etc/sysctl.d/local.conf file and write the following line.

kernel.yama.ptrace_scope = 3

I’d really recommend this setting in a production environment where you don’t expect to use ptrace.

If you don’t want to restrict ptrace for the whole kernel, using a more restricted seccomp-bpf profile to disable the ptrace system call in a container is also a good way to block it.

See “Docker does not yet support adding capabilities to non-root users” from this page. ↩
