Entropy changes in Debian or 'why a VM boots in 5 minutes?'

Intro

At Skroutz we operate a wide variety of services comprising the ecosystem behind Skroutz.gr, a comparison shopping engine which evolved to an e-commerce marketplace. We run these services on our own infrastructure, bare metal servers and virtual machines. All hosts are running Debian GNU/Linux, which on July 6th 2019 had its latest stable release, called Buster. Buster came with lots of changes in included packages, as expected in a major release.

We started experimenting with dist-upgraded Buster hosts a couple of months before the official release, as soon as Buster got in “freeze” state. This strategy would give us a taste of what to expect with the new software versions and how to get better prepared to smoothly upgrade the operating system underneath our services with minimum disruption.

The problem

The issue we’re going to discuss in this post manifests pretty simply: after dist-upgrading a virtual machine to Buster and rebooting it, it took a couple of minutes before we could actually regain access via ssh. Virtual machine reboots are part of routine maintenance work to keep our services up-to-date and secure. When orchestrating such works across a fleet of hundred hosts, we certainly would like to avoid spending minutes before verifying that each host did come back up and healthy.

Investigation

It’s widely known that virtual machines do not enjoy the privilege of high quality randomness as the physical hosts do, since a virtual machine’s devices are emulated by design, thus do not feature unpredictable behavior, a useful ingredient for randomness ¹ ² ³.

Various references, e.g. Debian bug reports ⁴ ⁵, suggested that this behavior was to be attributed to OpenSSL and how it gathers entropy via the getrandom() system call. But all these online references were not descriptive enough or conclusive, so we opted for digging deeper and understand the issue.

Kernel ring buffer displays important information coming from the kernelspace and it’s the first place we looked at. Consider this snippet from a Buster VM that just booted:

# journalctl -k | grep random
Apr 17 12:05:06 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with crng_init=0
Apr 17 12:05:06 somevm kernel: random: fast init done
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:05:06 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:06:48 somevm kernel: random: crng init done
Apr 17 12:06:48 somevm kernel: random: 7 urandom warning(s) missed due to ratelimiting

Three important points stand out:

before anything else it’s the kernel entry point which requests randomness with get_random_bytes() kernel function. We will explain its behavior and usage below.
systemd (userspace) is also requesting randomness while bringing up system’s services
crng init (crng stands for cryptographic random number generator) takes almost 2 minutes since boot

kernel’s `get_random_bytes()`

get_random_bytes() is an in-kernel interface to provide random bytes. In our case, it is called from kernel’s entry point ⁶ if CONFIG_STACKPROTECTOR is set, which is true for kernels packaged in Debian. That message is printed if CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set (again true for Debian) to inform us that we don’t have a fully seeded CRNG. In case you’re curious, these numbers are required for GCC’s “stack-protector” feature. When a function gets called, a random number is placed on the stack, just before the return address. This number is called “canary” and is validated by the kernel after returning. If an attacker performs a stack-based buffer overflow, the canary value will be overwritten. The kernel will detect this attack and throw a kernel panic ⁷.

A quick look into the kernel codebase shows us that it is unlikely that the boot process will actually block here, rather we have a clear indication that kernel’s CRNG is not properly initialized and we’ll see how that affects userspace processes that depend on that.

systemd ssh.service

Following lines in dmesg show that systemd has started as well and it actually reads bytes from urandom, albeit uninitialized.

systemd allows us to print a tree of the time-critical chain of systemd units (including services) as well as the time spend for each one to be started. This is done via:

# systemd-analyze critical-chain
The time after the unit is active or started is printed after the "@" character.
The time the unit takes to start is printed after the "+" character.

graphical.target @1min 45.121s
└─multi-user.target @1min 45.121s
  └─ssh.service @1min 34.242s +10.857s
    └─network.target @3.887s
      └─networking.service @1.096s +2.790s
        └─network-pre.target @1.095s
          └─ferm.service @288ms +807ms
            └─systemd-journald.socket @287ms
              └─system.slice @282ms
                └─-.slice @282ms

It’s clear that ssh service takes somewhat longer than usual to get up. Its journal reads:

# journalctl -u ssh.service
-- Logs begin at Wed 2019-04-17 11:53:56 EEST, end at Wed 2019-04-17 11:56:43 EEST. --
Apr 17 11:54:00 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Control process exited, code=killed, status=15/TERM
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Failed with result 'timeout'.
Apr 17 11:55:30 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Service RestartSec=100ms expired, scheduling restart.
Apr 17 11:55:30 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 11:55:30 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 11:55:30 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 11:55:41 somevm sshd[1184]: Server listening on 0.0.0.0 port 22.
Apr 17 11:55:41 somevm sshd[1184]: Server listening on :: port 22.
Apr 17 11:55:41 somevm systemd[1]: Started OpenBSD Secure Shell server.

It seems that ssh.service gets stuck in its ExecStartPre command:

# systemctl cat ssh.service | ag ExecStartPre
ExecStartPre=/usr/sbin/sshd -t

sshd -t just checks the validity of configuration files and sanity of keys. So, why is it blocking? To get an insight on why ExecStartPre times out, we decided to enrich it like this:

#!/bin/sh
strace -f -c -w /usr/sbin/sshd -t > /tmp/sshd_strace_`date +%s` 2>&1

We basically wrap the sshd invocation with strace and instruct it to keep aggregate time statistics about each system call made by the executable. Our intention is to identify the system call sshd is spending most of its time at before finally get killed by systemd.

After rebooting the VM we got our sshd strace logfiles:

# ls -l /tmp/sshd_strace*
-rw-r--r-- 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494448
-rw-r--r-- 1 root root 2152 Apr 17 12:49 /tmp/sshd_strace_1555494538

This is the output of the first attempt (which gets killed by systemd):

# cat sshd_strace_1555494448
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96  101.669156   101669156         1           getrandom
  0.01    0.007609        7609         1           execve
  0.01    0.006644         120        55           read
  0.01    0.006289          49       128           mmap
  0.00    0.004297         104        41           mprotect
[...]
------ ----------- ----------- --------- --------- ----------------
100.00  101.706415                   444         7 total

It’s self-evident that sshd spends the whole time trying to acquire randomness via getrandom() system call.

The second systemd attempt to get sshd up actually succeeds with the strace log reading:

# cat sshd_strace_1555494538
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.94   11.543144    11543143         1           getrandom
  0.02    0.001813          34        52           close
  0.01    0.001594          12       128           mmap
  0.01    0.000753          16        47           openat
  0.01    0.000585          10        55           read
  0.00    0.000564          13        41           mprotect
[...]
------ ----------- ----------- --------- --------- ----------------
100.00   11.549977                   444         7 total

Notice that the second attempt succeeds (12:49:10) exactly at the same time getrandom() returns a result, which coincides exactly with the timestamp the kernel’s entropy pool gets initialized:

# journalctl -k | grep random
Apr 17 12:47:25 somevm kernel: random: get_random_bytes called from start_kernel+0x93/0x531 with crng_init=0
Apr 17 12:47:25 somevm kernel: random: fast init done
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:47:25 somevm kernel: random: systemd: uninitialized urandom read (16 bytes read)
Apr 17 12:49:10 somevm kernel: random: crng init done
Apr 17 12:49:10 somevm kernel: random: 7 urandom warning(s) missed due to ratelimiting

# journalctl -u ssh.service
-- Logs begin at Wed 2019-04-17 12:47:25 EEST, end at Wed 2019-04-17 12:52:23 EEST. --
Apr 17 12:47:28 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Start-pre operation timed out. Terminating.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Control process exited, code=killed, status=15/TERM
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Failed with result 'timeout'.
Apr 17 12:48:58 somevm systemd[1]: Failed to start OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Service RestartSec=100ms expired, scheduling restart.
Apr 17 12:48:58 somevm systemd[1]: ssh.service: Scheduled restart job, restart counter is at 1.
Apr 17 12:48:58 somevm systemd[1]: Stopped OpenBSD Secure Shell server.
Apr 17 12:48:58 somevm systemd[1]: Starting OpenBSD Secure Shell server...
Apr 17 12:49:10 somevm systemd[1]: Started OpenBSD Secure Shell server.

Quick sidenote: We were curious about why sshd is calling getrandom() even if its just validating its configuration. A quick look at sshd’s source code, shows that it seeds its RNG during startup, even if its just validating its configuration:

int
main(int ac, char **av)
{
[...]
    seed_rng();
[...]
    if (test_flag > 1) {
[...]
        parse_server_match_config(&options, connection_info);
        dump_config(&options);
    }
[...]

seed_rng() is invoking RAND_status(), an OpenSSL library function which, finally, executes getrandom().

Changes for `getrandom()` system call

So we’ve identified that ssh.service blocks waiting for getrandom() syscall. Then our focus shifted to understanding why/when getrandom() blocks and how is that related with the kernel’s CRNG.

First, it’s clear that whether getrandom() will read from /dev/urandom or /dev/random and whether will it block or not is controlled by the relevant flags: GRND_RANDOM and GRND_NONBLOCK(check getrandom(2) for more). A quick search showed that neither OpenSSH nor OpenSSL (which OpenSSH relies on for cryptography) do not set any of these flags, meaning getrandom() will have its default behavior: will block until the kernel’s CRNG is ready.

If these flags are not set then either the system call or the CRNG did change in the meantime. And this meant digging into kernel source code and git history… :D Debian Stretch features kernels from the 4.9.x linux-stable tree while Debian Buster features kernels from the 4.19.x series.

Pondering over the output of git log -p v4.9..v4.19 -- drivers/char/random.c is really an enjoyful activity but we’ll spare you the time and directly point you to commit 43838a23a05fbd13e47 by Theodore Ts’o. This commit is entitled random: fix crng_ready() test and was introduced in linux 4.17 as a response to multiple security issues reported by Google’s Project Zero. It basically changes the crng_ready() function to be more strict about when linux’s CRNG is safe for cryptographic use cases:

diff --git a/drivers/char/random.c b/drivers/char/random.c
index e027e7fa1472..c8ec1e70abde 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -427,7 +427,7 @@ struct crng_state primary_crng = {
  * its value (from 0->1->2).
  */
 static int crng_init = 0;
-#define crng_ready() (likely(crng_init > 0))
+#define crng_ready() (likely(crng_init > 1))

But how does this commit affect getrandom() syscall? The following block is getrandom’s definition from linux v4.9.144 (just a kernel version in a Stretch host), ie before random: fix crng_ready() test was applied.

if (!crng_ready()) {
    if (flags & GRND_NONBLOCK)
            return -EAGAIN;
    crng_wait_ready();
    if (signal_pending(current))
                   return -ERESTARTSYS;
}
return urandom_read(NULL, buf, count, NULL);

Upon early boot, getrandom() would treat crng_init == 1 good enough and would just return urandom_read, so would not block. This was was not considered “secure” enough. After applying random: fix crng_ready() test getrandom() behavior changed: it would block (unless called with GRND_NONBLOCK) until CRNG was really cryptographically ready, i.e. crng_init == 2.

Resolution

As soon as we pinpointed the reason ssh (and other userspace software) could block early on boot when calling getrandom() we urged to evaluate possible solutions. Our goal is to assist the virtual machine to get “good enough” entropy early on when booting. Providing QEMU guests with quality entropy is not a novel issue, rather it’s a recurring one when one needs to operate a cryptographically intensive application within a virtual machine.

We discarded the option of running a userspace daemon, such as HAVEGED inside every VM. Currently, as far we are concerned, there are no practical attacks against HAVEGED, but it has received a lot of criticism for low-quality entropy, state leaking, etc ⁸ ⁹. Also, from a infrastructure perspective, we aim to provide anything that’s necessary to VMs, without having to perform modifications inside the guests. Users should be able to use our virtualization infrastructure without having to modify images, due to an “unwanted” side-effect on the host’s kernel.

Instead, we’d prefer a cleaner approach and turned our attention to VirtIO RNG ¹⁰. VirtIO RNG is a paravirtualized device for QEMU, that exposes a hardware RNG inside the guest. Enabling it for QEMU instances basically allows physical hosts to inject randomness in virtual guests by exposing a special purpose device, /dev/hwrng. VirtIO RNG is configurable and can be wired up on the host to retrieve entropy from various sources, such as /dev/{,u}random or even a hardware RNG. The downside of this solution for us was that it was not immediately available in our virtual machines cluster manager, Ganeti. Such a missing feature can also be seen as a contribution opportunity though! So Nikos got down to implement what was missing for the KVM hypervisor in Ganeti ¹¹.

In the meantime another possible solution emerged: RDRAND. This is a x86 CPU instruction, available on modern Intel (Ivy Bridge and later) and AMD processors, that returns random numbers as supplied by the hardware’s cryptographically secure pseudorandom number generator ¹². In other words one may trust the physical CPU to fetch “cryptographically secure” numbers. Using RDRAND is possible under certain conditions, which we luckily met:

physical host’s CPUs has to support this instruction. In our case, all bare metal servers consisting our Ganeti cluster did actually feature modern enough Intel CPUs.
Linux kernel has to use randomness provided by the CPU. Indeed, this functionality has been made available in Linux v4.19 by Theodore Ts’o and has been enabled in Debian since debian/4.19.20-1~9.

Apart from RDRAND, new Intel x86 CPUs expose yet another instruction, called RDSEED. RDSEED returns numbers of “seed-grade entropy”, the output of a true RNG that should be used by software seeding a pseudo-RNG. This would provide even better quality of entropy to our hosts, together with a possible speed gain. Unfortunately, not all hosts in our fleet support this instruction, so we dismissed the idea.

Finally, we were able to expose RDRAND CPU flag to all our guests by simply modifying ganeti cluster’s KVM cpu_type hypervisor parameter like so:

gnt-cluster modify -H kvm:cpu_type=SandyBridge\\,\\+pcid\\,\\+invpcid\\,\\+rdrand

This allowed Buster guests to properly initialize their kernel CRNG and all subsequent calls to getrandom() did not block.

Trusting the CPU to provide “cryptographically secure” random numbers may raise some concerns, given that hardware vendors have been found to compromise their products’ security/integrity when pressured or instructed by high-power, high-influence institutions. ^_^ This is even highlighted by Theodore Ts’o in the respective aforementioned commit. Our decision to use the RDRAND instruction and trust the CPU was preceded by weighing various related parameters: we already trust the CPU for all the things, being the dominant, followed by the fact that Debian has enabled that by default.

Conclusion

RDRAND and RDSEED helps the kernel quickly initialize its CRNG, inducing non-blocking calls to getrandom(), thus no lag during boot. RDRAND provides an acceptable seed for randomness, not necessarily a high quality entropy flow. This should be acceptable for most applications/cases where a pseudo-random generator like urandom is sufficient.
VirtIO RNG also solves the CRNG early boot starvation issue.
VirtIO RNG is the way to go when the guest machines needs high-quality (and probably high volume of) entropy.
VirtIO RNG support was not available for Ganeti at the time of our investigation, but we did work for such a feature. We thereby judged RDRAND was an acceptable short-term solution and went for it.

If you have any questions, ideas, thoughts or considerations, feel free to leave a comment below.

Skroutz Engineering

Skroutz's Engineering and Technology Blog

Entropy changes in Debian or 'why a VM boots in 5 minutes?'

Intro

The problem

Investigation

kernel’s `get_random_bytes()`

systemd ssh.service

Changes for `getrandom()` system call

Resolution

Conclusion

Links

Intro

The problem

Investigation

kernel’s get_random_bytes()

systemd ssh.service

Changes for getrandom() system call

Resolution

Conclusion

Links

kernel’s `get_random_bytes()`

Changes for `getrandom()` system call