At Skroutz we operate a wide variety of services comprising the ecosystem behind Skroutz.gr, a comparison shopping engine which evolved into an e-commerce marketplace. We run these services on our own infrastructure, on bare metal servers and virtual machines. All hosts are running Debian GNU/Linux, which on July 6th 2019 had its latest stable release, called Buster. Buster came with lots of changes in included packages, as expected in a major release.
We started experimenting with dist-upgraded Buster hosts a couple of months before the official release, as soon as Buster entered the “freeze” state. This strategy would give us a taste of what to expect with the new software versions and help us prepare to smoothly upgrade the operating system underneath our services with minimum disruption.
The issue we’re going to discuss in this post manifests pretty simply: after dist-upgrading a virtual machine to Buster and rebooting it, it took a couple of minutes before we could actually regain access via ssh. Virtual machine reboots are part of routine maintenance work to keep our services up-to-date and secure. When orchestrating such work across a fleet of hundreds of hosts, we would certainly like to avoid waiting minutes before verifying that each host came back up healthy.
It’s widely known that virtual machines do not enjoy the privilege of high quality randomness as the physical hosts do, since a virtual machine’s devices are emulated by design, thus do not feature unpredictable behavior, a useful ingredient for randomness 1 2 3.
Various references, e.g. Debian bug reports 4 5, suggested that this behavior was to be attributed to OpenSSL and how it gathers entropy via the getrandom() system call. But none of these online references was descriptive or conclusive enough, so we opted to dig deeper and understand the issue.
The kernel ring buffer displays important information coming from kernelspace, and it’s the first place we looked at. Consider this snippet from a Buster VM that just booted:
Three important points stand out:
Before anything else, it’s the kernel entry point which requests randomness with the get_random_bytes() kernel function. We will explain its behavior and usage below.
systemd (userspace) is also requesting randomness while bringing up the system’s services.
crng init (crng stands for cryptographic random number generator) takes almost 2 minutes since boot.
get_random_bytes() is an in-kernel interface to provide random bytes. In our case, it is called from the kernel’s entry point 6 if the relevant kernel configuration option is set, which is true for kernels packaged in Debian. That message is printed when CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set (again true for Debian) to inform us that we don’t have a fully seeded CRNG. In case you’re curious, these numbers are required for GCC’s “stack-protector” feature. When a function gets called, a random number is placed on the stack, just before the return address. This number is called a “canary” and is validated before the function returns. If an attacker performs a stack-based buffer overflow, the canary value will be overwritten. The kernel will detect this attack and throw a kernel panic 7.
A quick look into the kernel codebase shows that the boot process is unlikely to actually block here. Rather, the message is a clear indication that the kernel’s CRNG is not properly initialized yet, and we’ll see how that affects userspace processes that depend on it.
The following lines in dmesg show that systemd has started as well and actually reads bytes from urandom, even though the CRNG is still uninitialized.
systemd allows us to print a tree of the time-critical chain of systemd units (including services) as well as the time spent starting each one. This is done via:
It’s clear that ssh service takes somewhat longer than usual to get up. Its journal reads:
It seems that ssh.service gets stuck in its ExecStartPre step, which runs sshd -t. sshd -t just checks the validity of configuration files and sanity of keys.
So, why is it blocking? To get an insight on why ExecStartPre times out, we decided to enrich it like this:
We basically wrap the sshd invocation with strace and instruct it to keep aggregate time statistics about each system call made by the executable. Our intention is to identify the system call sshd spends most of its time in before it finally gets killed by systemd.
After rebooting the VM we got our sshd strace logfiles:
This is the output of the first attempt (which gets killed by systemd):
It’s self-evident that sshd spends the whole time trying to acquire randomness via the getrandom() system call.
The second systemd attempt to get sshd up actually succeeds with the strace log reading:
Notice that the second attempt succeeds (12:49:10) at exactly the same time getrandom() returns a result, which coincides exactly with the timestamp at which the kernel’s entropy pool gets initialized:
Quick sidenote: we were curious about why sshd calls getrandom() even if it’s just validating its configuration. A quick look at sshd’s source code shows that it seeds its RNG during startup, even if it’s just validating its configuration. seed_rng() invokes RAND_status(), an OpenSSL library function which, in turn, ends up in the getrandom() system call.
So we’ve identified that ssh.service blocks waiting for getrandom(). Then our focus shifted to understanding why and when getrandom() blocks and how that is related to the kernel’s CRNG.
First, it’s clear that whether getrandom() reads from /dev/random, and whether it blocks or not, is controlled by the relevant flags (see getrandom(2) for more). A quick search showed that neither OpenSSH nor OpenSSL (which OpenSSH relies on for cryptography) sets any of these flags, meaning getrandom() will have its default behavior: it will block until the kernel’s CRNG is ready.
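The difference between the default call and a non-blocking one can be observed from userspace. What follows is a minimal sketch, assuming Python 3.6+ on Linux, where os.getrandom() wraps the system call. The helper name try_getrandom is ours and purely illustrative; it is not something sshd or OpenSSL uses.

```python
import os


def try_getrandom(size=16, nonblocking=True):
    """Request `size` random bytes via the getrandom() system call.

    Returns the bytes, or None when the kernel's CRNG is not ready yet
    and the call was made with GRND_NONBLOCK.
    (Hypothetical helper, for illustration only.)
    """
    # With no flags, getrandom() blocks until the CRNG is initialized.
    # With GRND_NONBLOCK, it fails with EAGAIN instead of blocking,
    # which Python surfaces as BlockingIOError.
    flags = os.GRND_NONBLOCK if nonblocking else 0
    try:
        return os.getrandom(size, flags)
    except BlockingIOError:
        return None


if __name__ == "__main__":
    data = try_getrandom()
    if data is None:
        print("CRNG not ready: a default getrandom() call would block now")
    else:
        print("CRNG ready, got", len(data), "random bytes")
```

Run early enough during boot on an affected guest, the non-blocking variant would be expected to return None, while a call without flags would simply hang, which matches what we saw with sshd.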
Since these flags are not set, either the system call or the CRNG must have changed between the two releases. And this meant digging into kernel source code and git history… :D Debian Stretch features kernels from the 4.9.x linux-stable tree, while Debian Buster features kernels from the 4.19.x series.
Pondering over the output of git log -p v4.9..v4.19 -- drivers/char/random.c is really an enjoyable activity, but we’ll spare you the time and directly point you to the commit by Theodore Ts’o entitled random: fix crng_ready() test. It was introduced in linux 4.17 as a response to issues reported by Google’s Project Zero. It basically changes the crng_ready() function to be more strict about when linux’s CRNG is safe for cryptographic use.
But how does this commit affect the getrandom() syscall? The following block is getrandom’s definition from linux v4.9.144 (just a kernel version in a Stretch host), i.e. before random: fix crng_ready() test was applied.
Upon early boot, getrandom() would treat crng_init == 1 as good enough and would just return urandom_read, so it would not block. This was not considered “secure” enough. After applying random: fix crng_ready() test, getrandom() behavior changed: it would block (unless called with GRND_NONBLOCK) until the CRNG was really cryptographically ready, i.e. crng_init == 2.
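To make the before/after difference concrete, here is a toy model of the decision getrandom() makes when called without the /dev/random flag. It is only a sketch of the behavior described above and not kernel code: the function name, the strict parameter and the returned strings are ours, and the value 0 for “not seeded at all” is an assumption added to complete the picture.

```python
def getrandom_outcome(crng_init, nonblock=False, strict=True):
    """Toy model of getrandom() readiness handling (not kernel code).

    crng_init: 0 (assumed: not seeded), 1 (early seeded), 2 (fully seeded).
    nonblock:  models a call made with GRND_NONBLOCK.
    strict:    True models the behavior after "random: fix crng_ready() test",
               False models the older behavior.
    """
    # The fix raised the bar for "ready" from crng_init == 1 to crng_init == 2.
    ready_threshold = 2 if strict else 1
    if crng_init >= ready_threshold:
        return "returns bytes"
    return "fails with EAGAIN" if nonblock else "blocks"


if __name__ == "__main__":
    for strict in (False, True):
        label = "after fix" if strict else "before fix"
        for state in (0, 1, 2):
            print(label, "crng_init =", state, "->",
                  getrandom_outcome(state, strict=strict))
```

The interesting row is crng_init == 1: the older logic returns bytes, the stricter one blocks, and that window is exactly where our VMs were spending their first couple of minutes.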
As soon as we pinpointed the reason ssh (and other userspace software) could block early on boot when calling getrandom(), we moved on to evaluating possible solutions. Our goal is to help the virtual machine get “good enough” entropy early on when booting. Providing QEMU guests with quality entropy is not a novel issue; rather, it’s a recurring one whenever one needs to operate a cryptographically intensive application within a virtual machine.
We discarded the option of running a userspace daemon, such as HAVEGED, inside every VM. Currently, as far as we know, there are no practical attacks against HAVEGED, but it has received a lot of criticism for low-quality entropy, state leaking, etc 8 9. Also, from an infrastructure perspective, we aim to provide anything that’s necessary to VMs, without having to perform modifications inside the guests. Users should be able to use our virtualization infrastructure without having to modify images, due to an “unwanted” side-effect on the host’s kernel.
Instead, we preferred a cleaner approach and turned our attention to VirtIO RNG 10. VirtIO RNG is a paravirtualized device for QEMU that exposes a hardware RNG inside the guest. Enabling it for QEMU instances basically allows physical hosts to inject randomness into virtual guests by exposing a special purpose device, /dev/hwrng. VirtIO RNG is configurable and can be wired up on the host to retrieve entropy from various sources, even a hardware RNG. The downside of this solution for us was that it was not immediately available in our virtual machine cluster manager, Ganeti. Such a missing feature can also be seen as a contribution opportunity though! So Nikos set out to implement what was missing for the KVM hypervisor in Ganeti 11.
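Whether a guest actually sees a hardware RNG can be checked from inside it. The sketch below assumes the kernel’s hw_random framework exposes its state under /sys/class/misc/hw_random with the rng_current and rng_available attributes; the helper name read_hwrng_info is ours. On a guest with VirtIO RNG enabled, a VirtIO-backed entry would be expected to show up there.

```python
from pathlib import Path

# Assumed sysfs location of the kernel's hw_random framework.
HW_RANDOM_SYSFS = Path("/sys/class/misc/hw_random")


def read_hwrng_info(sysfs_dir=HW_RANDOM_SYSFS):
    """Return (current, available) hardware RNG names, or None if absent.

    (Hypothetical helper, for illustration only.)
    """
    sysfs_dir = Path(sysfs_dir)
    try:
        # rng_current holds the RNG backing /dev/hwrng right now;
        # rng_available is a whitespace-separated list of candidates.
        current = (sysfs_dir / "rng_current").read_text().strip()
        available = (sysfs_dir / "rng_available").read_text().split()
    except OSError:
        # Directory or attributes missing: no hw_random interface here.
        return None
    return current, available


if __name__ == "__main__":
    info = read_hwrng_info()
    if info is None:
        print("no hw_random interface found")
    else:
        current, available = info
        print("current hwrng:", current or "(none)")
        print("available hwrngs:", ", ".join(available) or "(none)")
```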
In the meantime another possible solution emerged: RDRAND. This is an x86 CPU instruction, available on modern Intel (Ivy Bridge and later) and AMD processors, that returns random numbers as supplied by the hardware’s cryptographically secure pseudorandom number generator 12. In other words, one may trust the physical CPU to fetch “cryptographically secure” numbers. Using RDRAND is possible under certain conditions, which we luckily met:
The physical hosts’ CPUs have to support this instruction. In our case, all bare metal servers comprising our Ganeti cluster did actually feature modern enough Intel CPUs.
Apart from RDRAND, newer Intel x86 CPUs expose yet another instruction, called RDSEED. RDSEED returns numbers of “seed-grade entropy”, the output of a true RNG that should be used by software seeding a pseudo-RNG. This would provide even better quality of entropy to our hosts, together with a possible speed gain. Unfortunately, not all hosts in our fleet support this instruction, so we dismissed the idea.
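Checking which of the two instructions a given machine advertises is straightforward. Here is a minimal sketch, assuming an x86 Linux system where /proc/cpuinfo lists the feature flags as rdrand and rdseed; the function names are ours. The same check works on a physical host and inside a guest, where it shows what the hypervisor actually exposes.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # On x86 the feature list lives on the "flags" lines, one per CPU.
        if sep and key.strip() == "flags":
            flags.update(value.split())
    return flags


def rng_instruction_support(cpuinfo_text):
    """Report whether the rdrand and rdseed flags are present."""
    flags = cpu_flags(cpuinfo_text)
    return {name: name in flags for name in ("rdrand", "rdseed")}


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        print(rng_instruction_support(f.read()))
```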
Finally, we were able to expose the RDRAND CPU flag to all our guests by simply modifying our Ganeti cluster’s KVM cpu_type hypervisor parameter like so:
This allowed Buster guests to properly initialize their kernel CRNG and all subsequent calls to getrandom() did not block.
Trusting the CPU to provide “cryptographically secure” random numbers may raise some concerns, given that hardware vendors have been found to compromise their products’ security/integrity when pressured or instructed by high-power, high-influence institutions. ^_^ This is even highlighted by Theodore Ts’o in the aforementioned commit. Our decision to use the RDRAND instruction and trust the CPU was preceded by weighing various related parameters, the dominant one being that we already trust the CPU for everything else, followed by the fact that Debian has enabled this by default.
RDSEED helps the kernel quickly initialize its CRNG, inducing non-blocking calls to getrandom(), thus no lag during boot.
RDRAND provides an acceptable seed for randomness, not necessarily a high quality entropy flow. This should be acceptable for most applications/cases where a pseudo-random generator is sufficient.
VirtIO RNG also solves the CRNG early boot starvation issue.
VirtIO RNG is the way to go when the guest machines need high-quality (and probably high volume of) entropy.
VirtIO RNG support was not available for Ganeti at the time of our investigation, but we did work on such a feature. We therefore judged that RDRAND was an acceptable short-term solution and went for it.
If you have any questions, ideas, thoughts or considerations, feel free to leave a comment below.