As part of our standard toolkit, we provide each developer at Skroutz with a writable database snapshot against which she can develop. These snapshots are updated daily through a pipeline that involves taking an LVM snapshot of production data, anonymizing the dataset by stripping all personal data, and transferring it via rsync to the development database servers. The development servers in turn use ZFS snapshots to expose a copy-on-write snapshot to each developer, with self-service tools allowing rollback or upgrades to newer snapshots.
We use the same pipeline to expose MariaDB and MongoDB data, with full dataset sizes of 600GB and 200GB respectively, and a slightly modified pipeline for Elasticsearch. While the on-disk data changes considerably between runs for all data sources, rsync still saves significant time by transferring roughly one third of the full data set every night. This setup has worked rather well for the better part of a decade and has managed to scale from 15 developers to 150. However, as with most systems, it has had its fair share of maintenance and has given us some interesting moments.
One of the most interesting issues we encountered led to the discovery of a fairly old bug in the Linux kernel TCP implementation: every now and then, an rsync transfer from a source server would hang indefinitely for no apparent reason, as — apart from the stuck transfer — everything else seemed to be in order. What’s more, for reasons that became apparent later, the issue could not be reproduced at will, although some actions (e.g. adding an rsync-level rate limit) seemed to make the issue less frequent, with frequency ranging from once or twice per week to once every three months.
As is not unusual in these cases, we had more urgent systems and issues to attend to, so we labeled this a “race condition in rsync” that we should definitely look into at some point, and worked around it by throttling the rsync transfers.
Until it started biting us every single day.
rsync as a pipeline
While not strictly necessary, knowing how rsync works internally will help understand the analysis that follows. The rsync site contains a thorough description of rsync’s internal architecture, so I’ll try to summarize the most important points here:
- rsync starts off as a single process on the client and a single process on the server, communicating via a socket pair. When using the rsync daemon, as in our case, communication is done over a plain TCP connection.
- Based on the direction of sync, after the initial handshake is over, each end assumes a role, either that of the sender, or that of the receiver. In our case the client is the receiver, and the server is the sender.
- The receiver forks an additional process called the generator, sharing the socket with the receiver process. The generator figures out what it needs to ask from the sender, and the sender subsequently sends the data to the receiver. What we essentially have after this step is a pipeline, generator → sender → receiver, where the arrows are the two directions of the same TCP connection. While there is some signaling involved, the pipeline operates in a blocking fashion and relies on OS buffers and TCP receive windows to apply backpressure.
A ghost in the network?
Our first reaction when we encountered the issue was to suspect the network for errors, which was a reasonable thing to do since we had recently upgraded our servers and switches. After eliminating the usual suspects (NIC firmware bugs involving TSO/GSO/GRO/VLAN offloading, excessive packet drops or CRC errors at the switches etc), we came to the conclusion that everything was normal and something else had to be going on.
Attaching to the hung processes with strace and gdb told us little: the generator was hung on send() and the sender and receiver were hung on recv(), yet no data was moving. However, turning to the kernel on both systems revealed something interesting! On the client, the rsync socket shared between the generator and the receiver processes was in the following state:
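(Reconstructed for illustration — addresses and exact counters are placeholders; the ~3.5MB Send-Q and the persist timer are the relevant parts:)

```
State  Recv-Q  Send-Q      Local Address:Port   Peer Address:Port
ESTAB  0       3488840 ❶   10.x.x.x:38022       10.x.x.y:873      timer:(persist,1min56sec,0)
```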
while on the server, the socket state was the following:
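(Again reconstructed for illustration; what matters here are the empty queues:)

```
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       10.x.x.y:873         10.x.x.x:38022    timer:(keepalive,114min,0)
```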
The interesting thing here is that there are 3.5MB of data on the client, queued to be sent (❶ in the first output) by the generator to the server; however, while the server has an empty Recv-Q and can accept data, nothing seems to be moving forward. If Recv-Q in the second output were non-zero, we would be looking at rsync on the server being stuck and not reading from the network; here, however, it is obvious that rsync has consumed all incoming data and is not to blame.
So why is data queued up on one end of the connection, while the other end is obviously able to accept it? The answer is conveniently hidden in the timer fields of both ss outputs, especially in timer:(persist,1min56sec,0). Quoting ss(8):

persist means that the connection has received a zero window advertisement and is waiting for the peer to advertise a non-zero window.
TCP Zero Windows and Zero Window Probes
TCP implements flow control by limiting the data in flight using a sliding window called the receive window. Wikipedia has a good description, but in short each end of a TCP connection advertises how much data it is willing to buffer for the connection, i.e. how much data the other end may send before waiting for an acknowledgment.
When one side's receive buffer (Recv-Q) fills up (in this case because the rsync process is doing disk I/O at a speed slower than the network's), it will send out a zero window advertisement, which will put that direction of the connection on hold. When buffer space eventually frees up, the kernel will send an unsolicited window update with a non-zero window size, and the data transfer continues. To be safe, just in case this unsolicited window update is lost, the other end will regularly poll the connection state using the so-called Zero Window Probes (the persist mode we are seeing here).
The window is stuck closed
It’s now time to dive a couple of layers deeper and use tcpdump to see what’s going on at the network level:
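(Simplified reconstruction of the capture — hostnames and most numbers are placeholders; the window field in each direction is what matters:)

```
17:04:22.897748 IP client.38022 > server.873:  Flags [.], ack 1253278418, win 3511, length 0
17:04:22.898043 IP server.873 > client.38022:  Flags [.], ack 2938336125, win 13831, length 0
```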
The first packet is the rsync client’s zero window probe, the second packet is the server’s response. Surprisingly enough, the server is advertising a non-zero window size of 13831 bytes¹ which the client apparently ignores.
¹ actually multiplied by 128 because of a window scaling factor of 7
We are finally making some progress and have a case to work on! At some point the client encountered a zero window advertisement from the server as part of regular TCP flow control, but then the window failed to re-open for some reason. The client seems to be still ignoring the new window advertised by the server and this is why the transfer is stuck.
Linux TCP input processing
By now it’s obvious that the TCP connection is in a weird state on the rsync client. Since TCP flow control happens at the kernel level, to get to the root of this we need to look at how the Linux kernel handles incoming TCP acknowledgments and try to figure out why it ignores the incoming window advertisement.
Incoming TCP packet processing happens in net/ipv4/tcp_input.c. Despite the ipv4 component in the path, this is mostly shared code between IPv4 and IPv6.
Digging a bit through the code, we find out that incoming window updates are handled in tcp_ack_update_window, where actually updating the window is guarded by the following function:
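This is tcp_may_update_window, shown roughly as it appears in net/ipv4/tcp_input.c in the kernels we were running; the ❶/❷/❸ markers are annotations added to match the list below:

```c
/* Check that window update is acceptable.
 * The function assumes that snd_una<=ack<=snd_next.
 */
static inline bool tcp_may_update_window(const struct tcp_sock *tp,
					 const u32 ack, const u32 ack_seq,
					 const u32 nwin)
{
	return	after(ack, tp->snd_una) ||                        /* ❶ */
		after(ack_seq, tp->snd_wl1) ||                    /* ❷ */
		(ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);   /* ❸ */
}
```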
The ack, ack_seq, snd_wl1 and snd_una variables hold TCP sequence numbers that are used in TCP’s sliding window to keep track of the data exchanged over the wire. These sequence numbers are 32-bit unsigned integers (u32) and are incremented by 1 for each byte that is exchanged, beginning from an arbitrary initial value (the initial sequence number). In particular:

- ack_seq is the sequence number of the incoming segment.
- ack is the acknowledgment number contained in the incoming segment, i.e. it acknowledges the sequence number of the last segment the peer received from us.
- snd_wl1 is the sequence number of the incoming segment that last updated the peer’s receive window.
- snd_una is the sequence number of the first unacknowledged segment, i.e. a segment we have sent but that has not yet been acknowledged by the peer.
Being fixed-size integers, the sequence numbers will eventually wrap around, so the after() macro takes care of comparing two sequence numbers in the face of wraparounds.
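For reference, the kernel defines these comparisons in include/net/tcp.h more or less as follows, relying on modular 32-bit arithmetic:

```c
/* include/net/tcp.h (approximately) */
static inline bool before(__u32 seq1, __u32 seq2)
{
	return (__s32)(seq1 - seq2) < 0;
}
#define after(seq2, seq1)	before(seq1, seq2)
```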
For the record, the snd_una and snd_wl1 names come directly from the original TCP specification in RFC 793, back in 1981!
Translating the rather cryptic check into plain English, we are willing to accept a window update from a peer if:
- ❶ our peer acknowledges the receipt of data we previously sent; or
- ❷ our peer is sending new data since the previous window update; or
- ❸ our peer isn’t sending us new data since the previous window update, but is advertising a larger window
Note that the comparison of ack_seq with snd_wl1 is done to make sure that the window is not accidentally updated by a (retransmission of a) segment that was seen earlier.
In our case, at least condition ❸ should be able to re-open the window, but apparently it doesn’t and we need access to these variables to figure out what is happening. Unfortunately, these variables are part of the internal kernel state and are not directly exposed to userspace, so it’s time to get a bit creative.
Accessing the internal kernel state
To get access to the kernel state, we somehow need to run code inside the kernel. One way would be to patch the kernel with a few printk() calls here and there, but that would require rebooting the machine and waiting for rsync to hang again. Instead, we opted to live-patch the kernel using systemtap, with the following script:
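(A minimal sketch of the script — not the verbatim version, which dumped a few more fields, but the structure is the same:)

```
# Dump internal TCP state for sockets with a non-empty Send-Q,
# every time tcp_ack() finishes processing an incoming ACK.
probe kernel.function("tcp_ack").return
{
	if ($sk->sk_send_head != NULL) {
		printf("snd_wl1: %d, snd_wnd: %d, ack_seq: %d\n",
		       @cast($sk, "tcp_sock", "kernel")->snd_wl1,
		       @cast($sk, "tcp_sock", "kernel")->snd_wnd,
		       @cast(&@cast($skb, "sk_buff", "kernel")->cb[0],
			     "tcp_skb_cb", "kernel<net/tcp.h>")->seq)
	}
}
```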
Systemtap works by converting such scripts into C and building a kernel module that hot-patches the kernel and overrides specific instructions. Here we overrode tcp_ack(), hooked at its end, and dumped the internal TCP connection state. The $sk->sk_send_head != NULL check is a quick way to only match connections that have a non-empty Send-Q.
Loading the resulting module into the kernel gave us the following:
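(Trimmed down to the fields that matter for the stuck connection:)

```
snd_wl1: 1708927328, snd_wnd: 0, ack_seq: 1253278418
```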
The two things of interest here are snd_wl1: 1708927328 and ack_seq: 1253278418. Not only are they not identical, as we would expect them to be, but ack_seq is actually smaller than snd_wl1, indicating that ack_seq wrapped around at some point and snd_wl1 has not been updated for a while. Using the serial number arithmetic rules, we can figure out that this end has received (at least) 3.8 GB since the last update of snd_wl1.
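The arithmetic is easy to verify: subtracting the two values as 32-bit unsigned integers handles the wraparound and gives the distance ack_seq has travelled since snd_wl1 was last set. A quick stand-alone check:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t snd_wl1 = 1708927328u;
	uint32_t ack_seq = 1253278418u;

	/* Serial number arithmetic: unsigned subtraction wraps around,
	 * yielding the bytes received since snd_wl1 was last updated. */
	uint32_t distance = ack_seq - snd_wl1;

	printf("%u bytes (~%.1f GB)\n", distance, distance / 1e9);
	return 0;
}
/* Prints: 3839318386 bytes (~3.8 GB) */
```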
We already saw that snd_wl1 contains the last sequence number used to update the peer’s receive window (and thus our send window), with the ultimate purpose of guarding against window updates from old segments. It should be okay if snd_wl1 is not updated for a while, but it should not lag too far behind ack_seq, or else we risk rejecting valid window updates, as in this case. So it looks like the Linux kernel fails to update snd_wl1 under some circumstances, which leads to an inability to recover from a zero-window condition.
Having tangible proof that something was going on in the kernel, it was time to get people familiar with the networking code in the loop.
Taking things upstream
After sleeping on this, we wrote a good summary of what we knew so far and what we supposed was happening, and reached out to the Linux networking maintainers. Confirmation came less than two hours later, together with a patch by Neal Cardwell.
Apparently, the bug was in the bulk receiver fast path, a code path that skips most of the expensive, strict TCP processing to optimize for the common case of bulk data reception. This is a significant optimization, outlined 28 years ago² by Van Jacobson in his “TCP receive in 30 instructions” email. It turned out that the Linux implementation did not update snd_wl1 while in the receiver fast path. If a connection uses the fast path for too long, snd_wl1 will fall so far behind that ack_seq will wrap around with respect to it. And if this happens while the receive window is zero, there is no way to re-open the window, as demonstrated above. What’s more, this bug had been present in Linux since v2.1.8, dating back to 1996!
² This optimization is still relevant today: a relatively recent attempt to remove the header prediction code and associated fast paths to simplify the code was reverted on performance regression grounds.
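The essence of the fix (sketched below; not the verbatim upstream hunk) is a one-liner: make the bulk receiver fast path in tcp_rcv_established() keep snd_wl1 up to date, using the same helper the slow path uses:

```c
/* net/ipv4/tcp_input.c, tcp_rcv_established(), bulk receiver fast path
 * (sketch of the fix): record the incoming segment's sequence number,
 * so that later window updates are not rejected by tcp_may_update_window(). */
tcp_update_wl(tp, TCP_SKB_CB(skb)->seq);
```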
As soon as we got the patch, we applied it, rebuilt the kernel, deployed it on the affected machines and waited to see if the issue was fixed. A couple of days later we were certain that the fix was indeed correct and did not cause any ill side-effects.
After a bit of discussion, the final commit landed in linux-net, and from there it was merged into Linux mainline for 5.10-rc1. Eventually it found its way to the stable 4.9 and 4.19 kernel series that we use on our Debian systems, in 4.9.241 and 4.19.153 respectively.
Aftermath
With the fix in place, we still had a couple of questions to answer, namely:
- How is it possible for a TCP bug that leads to stuck connections to go unnoticed for 24 years?
- Out of an infrastructure with more than 600 systems running all kinds of software, how come we only witnessed this bug when using rsync?
It’s hard to give a definitive answer to these questions, but we can reason about it this way:
- This bug will not be triggered by most L7 protocols. In “synchronous” request-response protocols such as HTTP, each side will usually consume all available data before sending. In that case, even if snd_wl1 wraps around, the bulk receiver will be left with a non-zero window and will still be able to send out data, causing the next acknowledgment to update the window and adjust snd_wl1 through check ❶ in tcp_may_update_window. rsync, on the other hand, uses a pretty aggressive pipeline where the server might send out multi-GB responses without consuming incoming data in the process. Even in rsync’s case, using rsync over SSH (a rather common combination) rather than the plain TCP transport would not expose this bug, as SSH framing/signaling would most likely not allow data to queue up on the server this way.
- Regardless of the application protocol, the receiver must remain in the fast path with a zero send window long enough (receiving at least 2GB) to cause a wrap-around — but not so long that ack_seq overtakes snd_wl1 again. For this to happen, there must be no packet loss or other conditions that would cause the fast path’s header prediction to fail. This is very unlikely to happen in practice, as TCP itself determines the network capacity by actually causing packets to be lost.
- Most applications care about network timeouts and will either fail or reconnect, making the whole thing appear as a “random network glitch” and leaving no trace behind to debug.
Finally, even if none of the above gets in the way and you do end up with a stuck TCP connection, it takes a lot of annoyance before you decide to deal with it and drill deep into kernel code. And when you do, you are rewarded with a nice adventure, where you get to learn about internet protocol history, get a glimpse of kernel internals, and witness open source work in motion!
If you enjoyed reading this post and you like hunting weird bugs and looking at kernel code, you might want to drop us a line — we are always looking for talented SREs and DevOps Engineers!