Lessons learned from stracing a password manager in Docker
I recently tried to run my favorite password manager within a Docker container. I aimed to keep it as “contained” as possible, while still being usable. Finding the right trade-off turned out to be a longer journey than I expected, involving the strace tool, and learning about system calls like prctl. This blog post is here to share this experiment and knowledge!
- Why running a password manager in Docker?
- Setting up a read-only container
- Let’s add new passwords!
- Conclusion: How to block strace?
Why running a password manager in Docker?
Before we start, you may wonder why I wanted to run a password manager in a Docker container.
First of all, I like the idea of running a local password manager – as opposed to one running in a website – so that I understand where the passwords are stored, and where I back them up. Of course, that’s a personal choice, and I find it great that users have many options to choose from to set up a password manager.
The second part is running it in a container. I was quite inspired when I first read Jessie Frazelle’s blog post about running desktop applications in Docker containers a few years ago. I think it’s a great idea, for the following reasons.
- From a security point of view, it makes it relatively easy to restrict which resources each application can access. My password manager doesn’t need access to the Internet, my sound card, or my documents – it only needs to access a password file and the clipboard. Without a container, all applications basically have access to all of the resources on my desktop.
- It’s easy to set up, update and remove applications independently. You don’t have to worry about conflicting dependencies between applications, or about your favorite application not running on the latest Debian version. Each application is installed into its own “virtual” system – only the Linux kernel is shared between all of them. You don’t have to ask yourself where each application added configuration files on your host system.
The first downside is that each application takes a bit more space on disk (due to embedding all of its dependencies) but that’s not much of a concern today with 100+ GB of space on (SSD) disks.
The second downside is that too many restrictions of resources will break applications. So you have to tinker a bit to make things work again, but for me, that’s the fun part where I learn a lot about the Linux system. You may have already seen another of my Docker experiments in a previous blog post, otherwise I encourage you to take a look!
Setting up a read-only container
For this experiment, I decided to use KeePassXC, a cross-platform and open-source password manager. There is already a Dockerfile for it on Jessie’s GitHub, but I wanted to restrict a bit more the resources available to the password manager.
My first setup was to run the password manager in read-only mode. This is actually a common use case in practice, just reading passwords to authenticate without signing up to anything new.
Dockerfile and docker run basics
The first step is to create a Dockerfile with KeePassXC.
Like in my previous blog post, I suggest using the slim Debian testing base image (currently debian:bullseye-slim).
We’ll also create an unprivileged user - let’s name it x11user - with a unique user ID such as 6000, without creating a home directory or giving it a shell.
Last, we install the KeePassXC Debian package with apt-get.
- Here is the resulting configuration: Dockerfile.
FROM debian:bullseye-slim
RUN useradd --uid 6000 --no-create-home --home-dir /nonexistent --shell /usr/sbin/nologin x11user
RUN apt-get update \
&& apt-get install -y \
keepassxc \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
USER x11user
ENTRYPOINT [ "/usr/bin/keepassxc" ]
When building this image, I’ll assume for the rest of this blog post that we tag it as keepassxc, so that we run it with docker run [DOCKER PARAMETERS] keepassxc [KEEPASSXC PARAMETERS].
The base of my docker run setup somewhat overlaps with my previous blog post, passing the following flags.
sudo docker run \
--rm \
-it \
--cap-drop=all \
--security-opt no-new-privileges \
--read-only \
--network=none \
--cpus 1 \
--memory=256m \
--memory-swap=256m \
--memory-swappiness=0 \
...
We only provide 256 MB of RAM. While this is enough for the current version of KeePassXC, it may become tight in the future. I actually used to provide only 64 MB, but my setup broke when I updated KeePassXC to a completely revamped version that suddenly used more RAM.
These OOM errors can be quite confusing, as generally the Linux kernel just kills programs upon OOM without much notice. However, in that case a report is left in the kernel logs, visible by running dmesg.
If a program running in Docker doesn’t work or crashes without any other explanation, here is an example of what to look for in dmesg output to check if it was an OOM error. In particular, you can see mentions of oom-killer, and in this example the program name “keepassxc” and the memory limit of 64 MB = 65536 kB. The user ID 6000 also matches.
$ sudo dmesg
...
[ 2239.091381] keepassxc invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
...
[ 2239.091434] memory: usage 65536kB, limit 65536kB, failcnt 57
...
[ 2239.091435] Memory cgroup stats for /docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4:
...
[ 2239.091448] Tasks state (memory values in pages):
[ 2239.091448] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 2239.091451] [ 8050] 6000 8050 80644 24437 348160 0 0 keepassxc
[ 2239.091452] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,mems_allowed=0,oom_memcg=/docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,task_memcg=/docker/83ce909bfae7dfe2ad5b5c4f519767157b9132436f7138c1b19f4e4ebcf5a1c4,task=keepassxc,pid=8050,uid=6000
[ 2239.091467] Memory cgroup out of memory: Killed process 8050 (keepassxc) total-vm:322576kB, anon-rss:62924kB, file-rss:34824kB, shmem-rss:0kB, UID:6000 pgtables:340kB oom_score_adj:0
[ 2239.095133] oom_reaper: reaped process 8050 (keepassxc), now anon-rss:32kB, file-rss:0kB, shmem-rss:0kB
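To avoid scanning the whole kernel log by eye, a small filter can help. This is only a sketch: the helper name oom_lines is made up for illustration, and the patterns are simply the phrases visible in the dmesg excerpt above, so other kernel versions may word things differently.

```shell
# Hypothetical helper: keep only the kernel log lines that point to an OOM kill.
# It reads a kernel log on stdin, so it works with live dmesg output or a saved log.
oom_lines() {
    grep -Ei 'invoked oom-killer|out of memory: Killed process|oom-kill:'
}

# Typical use on the host (reading the kernel log usually needs privileges):
#   sudo dmesg | oom_lines
```

If nothing is printed, the crash probably has another cause than the memory limit.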
Giving access to the X11 server (GUI environment)
The next step is to set up Docker to run a GUI application.
First of all, we’ll run the password manager as our Docker-only x11user, which we chose not to map to any user ID on the host.
...
-u x11user \
...
Then, because the password manager is a GUI application, on Linux we need to allow it to communicate with an X11 server, which manages all the windows in the GUI environment.
For this, we need the following steps.
- Mount the /tmp/.X11-unix/X0 path, a Unix domain socket that allows GUI applications to communicate with the X11 server. Note that although X0 is the typical path, one can run multiple X11 servers on a single machine, in which case there will be multiple sockets (X1, X2, etc.).
- Export the $DISPLAY environment variable, to tell our GUI applications which X11 server to contact. Here again, we assume a single server running on the system, so we give it the value unix:0, but this can be generalized (DISPLAY=unix:1 for the /tmp/.X11-unix/X1 socket).
- Mount an Xauthority file. This file contains a secret cookie that is used to authenticate to the X11 server and spawn GUI applications. This authentication is necessary, because otherwise any user on the system could interact with the GUI by using the /tmp/.X11-unix/X0 socket.
- Export the $XAUTHORITY environment variable, to tell our GUI applications where to find the Xauthority file.
...
--env=DISPLAY=unix:0 \
--env=XAUTHORITY=/Xauthority \
--volume=/tmp/.X11-unix/X0:/tmp/.X11-unix/X0 \
--volume=$HOME/Xauthority.docker:/Xauthority:ro \
...
An important thing to know is that with access to the socket and the authorization file, applications have broad access to anything happening on your display, including keyboard/mouse inputs (a window doesn’t need to be in focus to collect them), clipboard contents, etc. See The Linux Security Circus: On GUI isolation by Joanna Rutkowska.
Now, you’ve probably noticed that I used $HOME/Xauthority.docker for the Xauthority file, but where does it come from?
It actually doesn’t exist yet, but let’s now discuss how to create it.
Note that you’ll have to create one before launching docker run.
Creating a suitable Xauthority file
When you’re logged in with a GUI environment, there’s typically a file at $HOME/.Xauthority, with read/write permissions only for your user – so that only your user can interact with the GUI.
You can look at your cookie for the current display by running xauth list $DISPLAY.
So the first solution would be to launch Docker with your user ID (typically -u 1000 if your user ID is 1000), and to mount the $HOME/.Xauthority file.
However, there is a caveat to that: the authentication is specific to a host name.
Indeed, the output of xauth list typically starts with hostname/unix:0 – to indicate that the cookie is for the Unix socket 0 on hostname.
The problem is that within the Docker container, the $HOSTNAME is different from the one on the rest of your machine – inside the container it is typically the container’s ID, randomly generated by Docker.
This means that inside the container, our GUI applications won’t recognize any cookie in the original Xauthority file, and won’t try to authenticate to the X11 server.
A solution to that is to use the so-called “FamilyWild” address family of Xauthority, to match cookies on any host.
More precisely, the following command creates a new file (let’s put it in $HOME/Xauthority.docker), by replacing the family field of the entry from your own Xauthority file by 0xffff (the “FamilyWild” value).
# Create our modified Xauthority file.
xauth nlist $DISPLAY | sed -e 's/^..../ffff/' | xauth -f $HOME/Xauthority.docker nmerge -
# Last, make x11user the owner of this file. Because this user only exists
# inside the container, we use its numeric user (and group) ID.
sudo chown 6000:6000 $HOME/Xauthority.docker
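To see what the sed substitution does without touching a real cookie, here is a small sketch on a synthetic entry. The sample line is made up (host name “myhost”, a fake cookie), and its layout reflects my understanding of the numeric format printed by xauth nlist: a 4-digit hexadecimal family followed by length/value pairs.

```shell
# Synthetic `xauth nlist` entry (NOT a real cookie). Fields are hex-encoded:
# family, then length/value pairs for the host name, the display number,
# the protocol name (MIT-MAGIC-COOKIE-1) and the cookie data.
sample='0100 0006 6d79686f7374 0001 30 0012 4d49542d4d414749432d434f4f4b49452d31 0010 00112233445566778899aabbccddeeff'

# Same substitution as above: overwrite the first four characters (the family).
wild=$(printf '%s\n' "$sample" | sed -e 's/^..../ffff/')

printf '%s\n' "$sample" | cut -c1-4   # 0100
printf '%s\n' "$wild"   | cut -c1-4   # ffff
```

Everything after the family is left untouched, so the cookie itself is the same in both files.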
You may wonder which permissions the new Xauthority.docker file has. Indeed, it shouldn’t be readable by any other user than its owner - otherwise the secret cookie would not be a secret anymore. It turns out that the xauth program takes care of this when creating the file. The interesting part (as you can see in the source code) is that it invokes the umask system call before creating the file.
The umask system call may not be well known, but it’s very important for security. By invoking it with a 0077 mask (all bits set for the group and other), we drop all permissions for the group and other users, i.e. files created afterwards will not have any read/write/execute permissions for group/other.
Calling umask before creating a file containing secrets (i.e. one that shouldn’t be readable by any other user on the system) is a secure method. On the contrary, creating the file first and then calling chmod exposes you to race condition attacks, where another user is able to read the file and extract the secret before chmod has been applied.
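The effect of umask is easy to observe from a shell. The following sketch is mine rather than anything taken from xauth: it creates a throwaway file under a 0077 mask and prints the resulting permission bits, assuming GNU stat for the -c option.

```shell
# Work in a throwaway directory so nothing else on the system is touched.
dir=$(mktemp -d)

# Subshell: the umask change stays local to these two commands.
(
    umask 0077
    echo "not-a-real-cookie" > "$dir/secret"
)

# GNU stat: print the permission bits in octal.
mode=$(stat -c '%a' "$dir/secret")
echo "$mode"   # 600

rm -rf "$dir"
```

The file is never world-readable, not even for an instant, which is the point of setting the mask before creating it.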
Last steps
Last, we give read-only access to the password file, and invoke KeePassXC on it.
To simplify things, let’s assume that the password file is located at $HOME/file.kdbx on your host (i.e. your user’s home).
Inside the container, this password file will be located at /file.kdbx.
...
--volume=$HOME/file.kdbx:/file.kdbx:ro \
keepassxc \
/file.kdbx
Here is a summary of the docker run invocation.
sudo docker run \
--rm \
-it \
--cap-drop=all \
--security-opt no-new-privileges \
--read-only \
--network=none \
--cpus 1 \
--memory=256m \
--memory-swap=256m \
--memory-swappiness=0 \
-u x11user \
--env=DISPLAY=unix:0 \
--env=XAUTHORITY=/Xauthority \
--volume=/tmp/.X11-unix/X0:/tmp/.X11-unix/X0 \
--volume=$HOME/Xauthority.docker:/Xauthority:ro \
--volume=$HOME/file.kdbx:/file.kdbx:ro \
keepassxc \
/file.kdbx
When we launch the container, it complains a bit in the terminal output.
QStandardPaths: error creating runtime directory /tmp/runtime-x11user: Read-only file system
The lock file could not be created. Single-instance mode disabled.
libGL error: MESA-LOADER: failed to retrieve device information
libGL error: Version 4 or later of flush extension not found
libGL error: failed to load driver: i915
libGL error: failed to open /dev/dri/card0: No such file or directory
libGL error: failed to load driver: i965
However, the application works and we can copy passwords from the manager and paste them into login forms in the browser, so there is no need to install more things or give more permissions to the container.
Let’s add new passwords!
Now that we can read passwords in a contained environment, the next step is to be able to edit the password file, for example to add new passwords. For that, we have to mount the password file as writable.
...
--volume=$HOME/file.kdbx:/file.kdbx \
...
However, when I simply changed that part, trying to edit and save the file from the password manager within the container didn’t work…
I thought that maybe the password manager was trying to save the file in a /tmp directory, and then overwrite the original file with that temporary file.
Indeed, given that we run the container with --read-only and didn’t do anything special for temporary files, the /tmp directory is read-only inside the container.
As in my previous blog post, we can mount a tmpfs to allow writing temporary files – while keeping the rest of the container’s filesystem read-only.
...
--tmpfs=/tmp:size=1m \
...
Unfortunately, this wasn’t enough to allow saving password files…
strace to the rescue!
When we’re out of simple ideas, a powerful troubleshooting tool on Linux is strace. It lets you intercept all the system calls between the application and the kernel, and print them in the terminal to help you understand what’s going on. For example, the application may try to open a file that doesn’t exist, without surfacing a relevant error message in the user interface or the terminal.
So I first added strace to the Docker image, by installing the corresponding package with apt-get in the Dockerfile.
I also changed the ENTRYPOINT to point to /bin/bash, in order to have an interactive shell inside the container – rather than running the password manager directly.
- Here is the result: Dockerfile.
FROM debian:bullseye-slim
RUN useradd --uid 6000 --no-create-home --home-dir /nonexistent --shell /bin/bash x11user
RUN apt-get update \
&& apt-get install -y \
keepassxc \
strace \
--no-install-recommends \
&& rm -rf /var/lib/apt/lists/*
USER x11user
ENTRYPOINT [ "/bin/bash" ]
You may need to add the SYS_PTRACE capability by passing the --cap-add=SYS_PTRACE flag, but as we will see later this is misleading, as the capability itself is not needed to trace a child process. See the CAP_SYS_PTRACE rules in the ptrace manual.
With that, I could run strace keepassxc /file.kdbx within the container to see what was happening.
In this first attempt, many system calls appeared in the output, which makes sense for an interactive GUI application. However, I noticed that after some time, most of the strace output was unusable: instead of pretty printing system call arguments like file paths, it was just showing raw pointers.
I found this quite confusing. I searched for “strace raw pointers” on the Web, but this didn’t yield anything interesting. It seemed that there was some permission problem, but it’s hard to find the needle among so many system calls, with multiple threads and a lot of communication between the GUI and the X11 socket.
I added the -f flag to strace, to show which thread invoked which system calls (maybe some threads were restricted from strace?), but this didn’t surface anything interesting either.
My next attempt was to look into KeePassXC’s source code.
In particular, looking for ptrace (the system call used by strace under the hood) gave an interesting result.
Although the ptrace line was macOS-specific, a few lines above there was a call to prctl(PR_SET_DUMPABLE, 0).
Looking back at my strace output, this call to prctl was indeed the point where the arguments were not parsed by strace anymore!
As you can see below, the statx system call on the first line shows a path in clear, but the statx system call on the last line (just below prctl) only shows a pointer.
$ strace -f keepassxc /file.kdbx
...
[pid 9] statx(AT_FDCWD, "/nonexistent/.cache/keepassxc/keepassxc.ini", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffda940eb50) = -1 ENOENT (No such file or directory)
[pid 9] brk(0x55c23eac8000) = 0x55c23eac8000
[pid 9] brk(0x55c23eac4000) = 0x55c23eac4000
[pid 9] prlimit64(0, RLIMIT_CORE, {rlim_cur=0, rlim_max=0}, NULL) = 0
[pid 9] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
[pid 9] statx(AT_FDCWD, 0x55c23ea9b648, AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffda940ed60) = -1 ENOENT (No such file or directory)
Interestingly, this system call doesn’t really block strace – we still see the list of system calls and their return values. It just blocks access to the process data, so strace cannot dereference pointers to read file paths and memory buffers. Yet, at this point it was clear that the password manager was actively trying to block my attempts at tracing system calls.
To try to understand a bit better, I added the --trace=%file argument to strace, in order to focus on system calls affecting files.
This filtered out all the frequent system calls related to the GUI (such as futex, poll, recvmsg and writev).
I also had a closer look at KeePassXC’s source code, in particular the file saving function, but it wasn’t clear what was happening (at least not without digging further into Qt’s source code).
One thing was clear though: there were two saving modes, a “normal” mode, and an “atomic” mode, that KeePassXC suggested falling back to after a few failed saving attempts in normal mode.
Running strace -f --trace=%file keepassxc /file.kdbx, I obtained the following traces for these two modes.
I trimmed the traces by focusing on the system calls happening just after clicking on the save button.
“Normal” saving mode.
The openat(... O_TMPFILE ...) = -1 ENOENT system call is a likely suspect (trying to create and open a temporary file), but it’s unclear why it fails without knowing the path – I had already mounted a tmpfs at /tmp.
$ strace -f --trace=%file keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid 9] access(0x55a8868bfa68, F_OK) = 0
[pid 9] lstat(0x55a88697c2c0, 0x7fff62bbe390) = 0
[pid 9] access(0x55a886435308, F_OK) = 0
[pid 9] openat(AT_FDCWD, 0x7f87169cb328, O_RDONLY|O_CLOEXEC) = 18
[pid 30] access(0x7f870c00e508, F_OK) = 0
[pid 30] access(0x7f870c00e508, W_OK) = 0
[pid 30] statx(AT_FDCWD, 0x7f870c00e508, AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7f86cf7fd830) = 0
[pid 30] statx(AT_FDCWD, 0x7f870c00e508, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86cf7fd820) = 0
[pid 30] openat(AT_FDCWD, 0x7f86b4003018, O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = -1 ENOENT (No such file or directory)
“Atomic” saving mode.
The unlink(0x7f86a80026a8) = -1 EROFS system call is a likely suspect, but again, without knowing the path it could be any file, given that we mounted the whole container’s root filesystem with --read-only.
$ strace -f --trace=%file keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid 9] access(0x55a886be4188, F_OK) = 0
[pid 9] lstat(0x55a886c14760, 0x7fff62bbe390) = 0
[pid 9] access(0x55a886ad0438, F_OK) = 0
[pid 9] openat(AT_FDCWD, 0x7f87169cb328, O_RDONLY|O_CLOEXEC) = 18
[pid 33] lstat(0x7f86a8005560, 0x7f86ceffc7b0) = 0
[pid 33] openat(AT_FDCWD, 0x7f86a8002538, O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = 18
[pid 33] statx(18, 0x7f87173cf360, AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, 0x7f86ceffc550) = 0
[pid 33] statx(AT_FDCWD, 0x7f86a80026a8, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86ceffc840) = 0
[pid 33] access(0x7f86a80026a8, R_OK) = 0
[pid 33] access(0x7f86a80026a8, W_OK) = 0
[pid 33] access(0x7f86a80026a8, X_OK) = -1 EACCES (Permission denied)
[pid 33] unlink(0x7f86a80026a8) = -1 EROFS (Read-only file system)
[pid 33] linkat(AT_FDCWD, 0x7f86a8002948, AT_FDCWD, 0x7f86a805dd58, AT_SYMLINK_FOLLOW) = 0
[pid 33] statx(18, 0x7f87173cf360, AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, 0x7f86ceffb6f0) = 0
[pid 33] statx(AT_FDCWD, 0x7f86a805dd58, AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, 0x7f86ceffb6d0) = 0
[pid 33] stat(0x7f86a805c9b8, 0x7f86ceffb7a0) = 0
The following error message was also displayed in KeePassXC’s window in “atomic” mode, but it wasn’t really helpful.
Writing the database failed: Destination file exists
Backup database located at /tmp/KeePassXC.eOsoyi
At this point, I decided to give it a day…
Manipulating system calls
When I got back to it, I wondered if I could manipulate or block system calls with strace itself, to block the annoying prctl(PR_SET_DUMPABLE, 0).
And indeed, strace supports commands to inject and fault system calls!
I tried to simply fault the prctl system call in general (without filtering for a PR_SET_DUMPABLE argument), and it worked!
I could now see in clear all the paths and buffers passed to system calls.
$ strace -f --fault=prctl keepassxc /file.kdbx
...
[pid 9] statx(AT_FDCWD, "/nonexistent/.cache/keepassxc/keepassxc.ini", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffebc697250) = -1 ENOENT (No such file or directory)
[pid 9] brk(0x55bc6b6a4000) = 0x55bc6b6a4000
[pid 9] brk(0x55bc6b6a0000) = 0x55bc6b6a0000
[pid 9] prlimit64(0, RLIMIT_CORE, {rlim_cur=0, rlim_max=0}, NULL) = 0
[pid 9] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 9] write(2, "Unable to disable core dumps.\n", 30Unable to disable core dumps.
) = 30
[pid 9] statx(AT_FDCWD, "share/keepassxc", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7ffebc697460) = -1 ENOENT (No such file or directory)
Of course, this faulting was quite basic, and I wanted to check if there were other legitimate use cases for prctl.
The only other cases that I found in this application were to give a name to threads.
$ strace -f keepassxc /file.kdbx 2>&1 | grep prctl
[pid 11] prctl(PR_SET_NAME, "QXcbEventQueue") = 0
[pid 10] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = 0
[pid 12] prctl(PR_SET_NAME, 0x7fddf0046bc8) = 0
[pid 15] prctl(PR_SET_NAME, 0x7fdde8f0bb90 <unfinished ...>
[pid 16] prctl(PR_SET_NAME, 0x7fdde116cb90 <unfinished ...>
[pid 15] <... prctl resumed>) = 0
[pid 16] <... prctl resumed>) = 0
[pid 17] prctl(PR_SET_NAME, 0x7fdde096bb90 <unfinished ...>
[pid 17] <... prctl resumed>) = 0
[pid 18] prctl(PR_SET_NAME, 0x7fddd74a5b90) = 0
[pid 19] prctl(PR_SET_NAME, 0x7fddd6ca4b90) = 0
[pid 20] prctl(PR_SET_NAME, 0x7fddd64a3b90) = 0
[pid 21] prctl(PR_SET_NAME, 0x7fddd5ca2b90) = 0
[pid 22] prctl(PR_SET_NAME, 0x7fddd54a1b90 <unfinished ...>
[pid 22] <... prctl resumed>) = 0
Well, the output is more obvious if we fault prctl to see the names rather than their pointers ;)
$ strace -f --fault=prctl keepassxc /file.kdbx 2>&1 | grep prctl
[pid 11] prctl(PR_SET_NAME, "QXcbEventQueue" <unfinished ...>
[pid 11] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 10] prctl(PR_SET_DUMPABLE, SUID_DUMP_DISABLE) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 12] prctl(PR_SET_NAME, "QDBusConnection"...) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 15] prctl(PR_SET_NAME, "llvmpipe-0" <unfinished ...>
[pid 15] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 16] prctl(PR_SET_NAME, "llvmpipe-1" <unfinished ...>
[pid 16] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 17] prctl(PR_SET_NAME, "llvmpipe-2") = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 18] prctl(PR_SET_NAME, "llvmpipe-3") = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 19] prctl(PR_SET_NAME, "llvmpipe-4" <unfinished ...>
[pid 19] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 20] prctl(PR_SET_NAME, "llvmpipe-5" <unfinished ...>
[pid 20] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 21] prctl(PR_SET_NAME, "llvmpipe-6" <unfinished ...>
[pid 21] <... prctl resumed>) = -1 ENOSYS (Function not implemented) (INJECTED)
[pid 22] prctl(PR_SET_NAME, "llvmpipe-7") = -1 ENOSYS (Function not implemented) (INJECTED)
Thread names can show up in process viewers such as htop, but they are not really critical to the function of a password manager, so faulting these system calls is fine.
Also, by default strace will return ENOSYS for system calls that fault.
As you can see in the above output, this error is caught by the password manager, but it just prints an error message in the terminal – “Unable to disable core dumps.” – and continues running anyway!
Finally saving new passwords
I could now resume my tracing of file-related system calls while faulting prctl, and observe what is really happening when trying to save a file.
We have to trace prctl to be able to fault it. That is, --trace=%file --fault=prctl won’t fault prctl, because this system call doesn’t belong to the %file group. We have to use --trace=%file,prctl --fault=prctl instead.
I obtained the following traces, which confirmed my hypotheses about the problematic system calls, but more importantly explained why they failed.
“Normal” saving mode.
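The openat(AT_FDCWD, "", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) system call fails, as it attempts to create a temporary file in the current working directory (which is on the read-only filesystem). Note that the path argument is empty, which matches the “QFSFileEngine::open: No file name specified” warning printed just before.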
$ strace -f --trace=%file,prctl --fault=prctl keepassxc /file.kdbx
...
QFSFileEngine::open: No file name specified
[pid 9] access("/file.kdbx", F_OK) = 0
[pid 9] lstat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0
[pid 9] access("/file.kdbx", F_OK) = 0
[pid 9] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC <unfinished ...>
[pid 30] access("/file.kdbx", F_OK <unfinished ...>
[pid 9] <... openat resumed>) = 18
[pid 30] <... access resumed>) = 0
[pid 30] access("/file.kdbx", W_OK) = 0
[pid 30] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid 30] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid 30] openat(AT_FDCWD, "", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = -1 ENOENT (No such file or directory)
“Atomic” saving mode.
The failing system call is unlink("/file.kdbx") = -1 EROFS.
We didn’t mount this file as a read-only filesystem, but because we mounted it as a single-file volume, I assume that Docker forbids removing the file.
This makes sense because that would remove the volume.
In other words it’s possible to overwrite the file, but not to entirely remove it.
QFSFileEngine::open: No file name specified
[pid 9] access("/file.kdbx", F_OK) = 0
[pid 9] lstat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0
[pid 9] access("/file.kdbx", F_OK) = 0
[pid 9] openat(AT_FDCWD, "/sys/devices/system/cpu/online", O_RDONLY|O_CLOEXEC) = 18
[pid 33] lstat("/tmp", {st_mode=S_IFDIR|S_ISVTX|0777, st_size=140, ...}) = 0
[pid 33] openat(AT_FDCWD, "/tmp", O_RDWR|O_CLOEXEC|O_TMPFILE, 0600) = 18
[pid 33] statx(18, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=0, ...}) = 0
[pid 33] statx(AT_FDCWD, "/file.kdbx", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_ALL, stx_attributes=0, stx_mode=S_IFREG|0664, stx_size=33998, ...}) = 0
[pid 33] access("/file.kdbx", R_OK) = 0
[pid 33] access("/file.kdbx", W_OK) = 0
[pid 33] access("/file.kdbx", X_OK) = -1 EACCES (Permission denied)
[pid 33] unlink("/file.kdbx") = -1 EROFS (Read-only file system)
[pid 33] linkat(AT_FDCWD, "/proc/self/fd/18", AT_FDCWD, "/tmp/KeePassXC.wpCgbn", AT_SYMLINK_FOLLOW) = 0
[pid 33] statx(18, "", AT_STATX_SYNC_AS_STAT|AT_EMPTY_PATH, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=34142, ...}) = 0
[pid 33] statx(AT_FDCWD, "/tmp/KeePassXC.wpCgbn", AT_STATX_SYNC_AS_STAT|AT_SYMLINK_NOFOLLOW, STATX_ALL, {stx_mask=STATX_BASIC_STATS, stx_attributes=0, stx_mode=S_IFREG|0600, stx_size=34142, ...}) = 0
[pid 33] stat("/file.kdbx", {st_mode=S_IFREG|0664, st_size=33998, ...}) = 0
In the end, a simple solution is to copy the password file to the /tmp directory, call keepassxc on this temporary file, and afterwards copy the temporary file back onto our single-file volume /file.kdbx with cp.
This works because:
- KeePassXC is allowed to create and remove files inside of the /tmp directory,
- it seems that cp doesn’t remove the target file if it already exists, but simply overwrites it.
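The copy, edit, copy-back workflow can be wrapped in a small shell function. This is a sketch under assumptions rather than my exact setup: the name edit_via_copy and the scratch directory layout are made up, and it relies on the program exiting with status 0 when the session went fine.

```shell
# Hypothetical wrapper: edit a database through a scratch copy.
# $1 = database path (the single-file volume); the remaining arguments are the
# command to run, and the scratch copy's path is appended as its last argument.
edit_via_copy() {
    db=$1; shift
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/kdbx.XXXXXX") || return 1
    cp "$db" "$scratch/db.kdbx" || { rm -rf "$scratch"; return 1; }
    "$@" "$scratch/db.kdbx"
    status=$?
    if [ "$status" -eq 0 ]; then
        # Plain cp rewrites the contents of the existing target; it does not
        # unlink it first, so the bind-mounted file keeps its inode.
        cp "$scratch/db.kdbx" "$db" || status=$?
    fi
    rm -rf "$scratch"
    return "$status"
}

# Inside the container this would be: edit_via_copy /file.kdbx keepassxc
```

The scratch directory created by mktemp -d is only accessible to its owner, which is a reasonable default for a file full of passwords.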
Conclusion: How to block strace?
A question that remains is what can we do to prevent a process from being traced? This is indeed a valid concern for a password manager, to prevent any other process from tracing it and reading the precious passwords.
So far, we’ve learned the following.
- Although prctl(PR_SET_DUMPABLE, 0) can hide the contents of buffers from the memory of a process, it doesn’t block an existing tracing (we still see the list of system calls).
- More importantly, if the process is already being traced, then this prctl system call can be intercepted and blocked by the tracing process, making it totally ineffective. This is the approach I’ve taken with strace --fault=prctl.
Note that by default, the --fault parameter of strace will inject a recognizable error code (ENOSYS), and as we’ve seen the password manager noticed it (yet didn’t stop running after that).
However, the tracing process can also return a valid result, making the traced process believe that this blocking worked!
With strace, one can use --inject instead of --fault, and customize the error and retval parameters to achieve that.
More generally, it’s possible to write one’s own tracing program, to fully customize what is being intercepted and injected, for example to only target the prctl(PR_SET_DUMPABLE, 0) call without faulting other calls to prctl.
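As an illustration of the --inject variant, here is a minimal sketch. The function name trace_with_fake_prctl is made up, and I’m assuming a strace version recent enough to accept the long --inject option. Note that it applies to every prctl call, not only to PR_SET_DUMPABLE, so any prctl call whose return value matters would also see 0.

```shell
# Hypothetical wrapper: trace a program while making every prctl call appear to
# succeed (return value 0) without the kernel actually executing it.
# "$@" is the program to trace, followed by its arguments.
trace_with_fake_prctl() {
    strace -f --trace=%file,prctl --inject=prctl:retval=0 "$@"
}

# Inside the container: trace_with_fake_prctl keepassxc /file.kdbx
```

With this, the “Unable to disable core dumps.” message should no longer appear, since the program is told that its call succeeded.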
Here are some relevant resources.
- Intercepting and Emulating Linux System Calls with Ptrace by Chris Wellons, 2018. The corresponding source code is available on GitHub.
- Write yourself an strace in 70 lines of code by Nelson Elhage, 2010.
- If you are a Rust enthusiast like me, you can read Loading and ptrace’ing a process in Rust by Joseph Kain, 2015, have a look at rustrace on GitHub, or use the ptrace module of the nix crate.
In short, attempting to block tracing from within a program won’t work if that program is already being traced, because the tracer can intercept and tamper with anything.
Mitigations like prctl(PR_SET_DUMPABLE, 0) can only be effective if the program is not already being traced.
Dropping capabilities?
I originally thought that giving the CAP_SYS_PTRACE capability to the Docker container was necessary to run strace within it.
I was somewhat right, but for the wrong reason!
Indeed, Docker’s seccomp-bpf profile would enable the ptrace system call whenever this capability is provided.
But this default profile also enables the ptrace system call if the kernel is recent enough (Linux >= 4.8), regardless of the capabilities!
The same observation was made by Julia Evans a few months ago in the Why strace doesn’t work in Docker blog post.
Aside from Docker’s rules, it turns out that CAP_SYS_PTRACE is only really required to trace a process from a different user.
And even if it were required, we would need to run as root within the container to get this capability.
Yama rules
The next thing I learned was about Yama, a so-called Linux Security Module introduced in Linux 3.4.
In particular, it lets you control the scope of ptrace at the kernel level, via the /proc/sys/kernel/yama/ptrace_scope file.
Yama provides 4 levels.
- 0 - Classic ptrace permissions. Any process running with the same uid can trace a dumpable process.
- 1 - Restricted. A process can only trace its descendants.
- 2 - Admin-only. Only a process with CAP_SYS_PTRACE can trace other processes.
- 3 - Disabled. No process can use ptrace. Re-enabling to another level requires reboot.
You can change the setting by writing the corresponding number (0-3) to /proc/sys/kernel/yama/ptrace_scope.
Another (equivalent) method is to run sysctl kernel.yama.ptrace_scope=N.
You can read back the current level by reading the /proc/sys/kernel/yama/ptrace_scope file, or running sysctl kernel.yama.ptrace_scope.
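To make the raw number a bit more readable, the level can be mapped back to the descriptions listed above. The helper name describe_ptrace_scope is made up for this sketch, which also handles kernels where the Yama file is absent.

```shell
# Hypothetical helper: turn a Yama ptrace_scope level into a short description.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic: same-uid processes can trace dumpable processes" ;;
        1) echo "restricted: a process can only trace its descendants" ;;
        2) echo "admin-only: tracing requires CAP_SYS_PTRACE" ;;
        3) echo "disabled: no process can use ptrace until reboot" ;;
        *) echo "unknown level: $1" ;;
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    describe_ptrace_scope "$(cat "$scope_file")"
else
    echo "Yama ptrace_scope is not available on this kernel"
fi
```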
As a simple sanity check, when ptrace is disabled (level 2 or 3) you should obtain the following result when strace-ing a familiar program.
$ strace ls
strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Operation not permitted
strace: ptrace(PTRACE_TRACEME, ...): Operation not permitted
+++ exited with 1 +++
You can also check that level 3 is indeed irreversible at runtime.
$ sudo sysctl kernel.yama.ptrace_scope=3
kernel.yama.ptrace_scope = 3
$ sudo sysctl kernel.yama.ptrace_scope=0
sysctl: setting key "kernel.yama.ptrace_scope": Invalid argument
As mentioned here, Yama rules don’t only block the ptrace system call, but also all kernel features related to program tracing. This includes the process_vm_readv and process_vm_writev system calls – which are also blocked by Docker’s default seccomp-bpf profile when CAP_SYS_PTRACE is dropped – as well as access to some files in /proc/PID/.
Persistent configuration
By default, the ptrace_scope is reset upon reboot – on Debian distributions this is currently reset to level 0 (classic permissions).
But you can change that via the /etc/sysctl.d/ configuration folder.
To always disable ptrace upon reboot, create a /etc/sysctl.d/local.conf file and write the following line.
kernel.yama.ptrace_scope = 3
I’d really recommend this setting in a production environment where you don’t expect to use ptrace.
If you don’t want to restrict ptrace for the whole kernel, using a more restricted seccomp-bpf profile to disable the ptrace system call in a container is also a good way to block it.
Comments
To react to this blog post please check the Twitter thread.