As part of our standard toolkit, we provide each developer at Skroutz with a writable database snapshot against which she can develop. These snapshots are updated daily through a pipeline that involves taking an LVM snapshot of production data, anonymizing the dataset by stripping all personal data, and transferring it via rsync to the development database servers. The development servers in turn use ZFS snapshots to expose a copy-on-write snapshot to each developer, with self-service tools allowing rollback or upgrades to newer snapshots.
We use the same pipeline to expose MariaDB and MongoDB data, with full dataset sizes of 600GB and 200GB respectively, and a slightly modified pipeline for Elasticsearch. While the on-disk data changes considerably between runs for all data sources, rsync still saves significant time by transferring roughly one third of the full data set every night. This setup has worked rather well for the better part of a decade and has managed to scale from 15 developers to 150. However, as with most systems, it has had its fair share of maintenance and has given us some interesting moments.
One of the most interesting issues we encountered led to the discovery of a fairly old bug in the Linux kernel TCP implementation: every now and then, an rsync transfer from a source server would hang indefinitely for no apparent reason, as — apart from the stuck transfer — everything else seemed to be in order. What’s more, for reasons that became apparent later, the issue could not be reproduced at will, although some actions (e.g. adding an rsync-level rate limit) seemed to make the issue less frequent, with frequency ranging from once or twice per week to once every three months.
As is not unusual in these cases, we had more urgent systems and issues to attend to, so we labeled this a “race condition in rsync” that we should definitely look into at some point, and worked around it by throttling the rsync transfers.
Until it started biting us every single day.
rsync as a pipeline
While not strictly necessary, knowing how rsync works internally will help understand the analysis that follows. The rsync site contains a thorough description of rsync’s internal architecture, so I’ll try to summarize the most important points here:
- rsync starts off as a single process on the client and a single process on the server, communicating via a socket pair. When using the rsync daemon, as in our case, communication is done over a plain TCP connection.
- Based on the direction of sync, after the initial handshake is over, each end assumes a role, either that of the sender, or that of the receiver. In our case the client is the receiver, and the server is the sender.
- The receiver forks an additional process called the generator, sharing the socket with the receiver process. The generator figures out what it needs to ask from the sender, and the sender subsequently sends the data to the receiver. What we essentially have after this step is a pipeline, generator → sender → receiver, where the arrows are the two directions of the same TCP connection. While there is some signaling involved, the pipeline operates in a blocking fashion and relies on OS buffers and TCP receive windows to apply backpressure.
A ghost in the network?
Our first reaction when we encountered the issue was to suspect the network for errors, which was a reasonable thing to do since we had recently upgraded our servers and switches. After eliminating the usual suspects (NIC firmware bugs involving TSO/GSO/GRO/VLAN offloading, excessive packet drops or CRC errors at the switches etc), we came to the conclusion that everything was normal and something else had to be going on.
Attaching to the hung processes with strace and gdb told us little: the generator was hung on send() and the sender and receiver were hung on recv(), yet no data was moving. However, turning to the kernel on both systems revealed something interesting! On the client, the rsync socket shared between the generator and the receiver processes was in the following state:
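(Reconstructed for illustration — addresses and exact counters are placeholders; the ~3.5MB Send-Q and the persist timer are the relevant parts:)

```
State  Recv-Q  Send-Q      Local Address:Port   Peer Address:Port
ESTAB  0       3488840 ❶   10.x.x.x:38022       10.x.x.y:873      timer:(persist,1min56sec,0)
```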
while on the server, the socket state was the following:
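(Again reconstructed for illustration; what matters here are the empty queues:)

```
State  Recv-Q  Send-Q  Local Address:Port   Peer Address:Port
ESTAB  0       0       10.x.x.y:873         10.x.x.x:38022    timer:(keepalive,114min,0)
```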
The interesting thing here is that there are 3.5MB of data on the client, queued to be sent (❶ in the first output) by the generator to the server; however, while the server has an empty Recv-Q and can accept data, nothing seems to be moving forward. If Recv-Q in the second output were non-zero, we would be looking at rsync on the server being stuck and not reading from the network; here, however, it is obvious that rsync has consumed all incoming data and is not to blame.
So why is data queued up on one end of the connection, while the other end is obviously able to accept it? The answer is conveniently hidden in the timer fields of both ss outputs, especially in timer:(persist,1min56sec,0). Quoting ss(8):

persist means that the connection has received a zero window advertisement and is waiting for the peer to advertise a non-zero window.
TCP Zero Windows and Zero Window Probes
TCP implements flow control by limiting the data in flight using a sliding window called the receive window. Wikipedia has a good description, but in short each end of a TCP connection advertises how much data it is willing to buffer for the connection, i.e. how much data the other end may send before waiting for an acknowledgment.
When one side's receive buffer (Recv-Q) fills up (in this case because the rsync process is doing disk I/O at a speed slower than the network's), it will send out a zero window advertisement, which will put that direction of the connection on hold. When buffer space eventually frees up, the kernel will send an unsolicited window update with a non-zero window size, and the data transfer continues. To be safe, just in case this unsolicited window update is lost, the other end will regularly poll the connection state using the so-called Zero Window Probes (the persist mode we are seeing here).
The window is stuck closed
It’s now time to dive a couple of layers deeper and use tcpdump to see what’s going on at the network level:
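(Simplified reconstruction of the capture — hostnames and most numbers are placeholders; the window field in each direction is what matters:)

```
17:04:22.897748 IP client.38022 > server.873:  Flags [.], ack 1253278418, win 3511, length 0
17:04:22.898043 IP server.873 > client.38022:  Flags [.], ack 2938336125, win 13831, length 0
```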
The first packet is the rsync client’s zero window probe, the second packet is the server’s response. Surprisingly enough, the server is advertising a non-zero window size of 13831 bytes¹ which the client apparently ignores.
¹ actually multiplied by 128 because of a window scaling factor of 7
We are finally making some progress and have a case to work on! At some point the client encountered a zero window advertisement from the server as part of regular TCP flow control, but then the window failed to re-open for some reason. The client seems to be still ignoring the new window advertised by the server and this is why the transfer is stuck.
Linux TCP input processing
By now it’s obvious that the TCP connection is in a weird state on the rsync client. Since TCP flow control happens at the kernel level, to get to the root of this we need to look at how the Linux kernel handles incoming TCP acknowledgments and try to figure out why it ignores the incoming window advertisement.
Incoming TCP packet processing happens in net/ipv4/tcp_input.c. Despite the ipv4 component in the path, this is mostly shared code between IPv4 and IPv6.
Digging a bit through the code, we find out that incoming window updates are handled in tcp_ack_update_window, where actually updating the window is guarded by the following function:
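This is tcp_may_update_window, shown roughly as it appears in net/ipv4/tcp_input.c in the kernels we were running; the ❶/❷/❸ markers are annotations added to match the list below:

```c
/* Check that window update is acceptable.
 * The function assumes that snd_una<=ack<=snd_next.
 */
static inline bool tcp_may_update_window(const struct tcp_sock *tp,
					 const u32 ack, const u32 ack_seq,
					 const u32 nwin)
{
	return	after(ack, tp->snd_una) ||                        /* ❶ */
		after(ack_seq, tp->snd_wl1) ||                    /* ❷ */
		(ack_seq == tp->snd_wl1 && nwin > tp->snd_wnd);   /* ❸ */
}
```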
The ack, ack_seq, snd_wl1 and snd_una variables hold TCP sequence numbers that are used in TCP’s sliding window to keep track of the data exchanged over the wire. These sequence numbers are 32-bit unsigned integers (u32) and are incremented by 1 for each byte that is exchanged, beginning from an arbitrary initial value (the initial sequence number). In particular:

- ack_seq is the sequence number of the incoming segment.
- ack is the acknowledgment number contained in the incoming segment, i.e. it acknowledges the sequence number of the last segment the peer received from us.
- snd_wl1 is the sequence number of the incoming segment that last updated the peer’s receive window.
- snd_una is the sequence number of the first unacknowledged segment, i.e. a segment we have sent but that has not yet been acknowledged by the peer.
Being fixed-size integers, the sequence numbers will eventually wrap around, so the after() macro takes care of comparing two sequence numbers in the face of wraparounds.
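For reference, the kernel defines these comparisons in include/net/tcp.h more or less as follows, relying on modular 32-bit arithmetic:

```c
/* include/net/tcp.h (approximately) */
static inline bool before(__u32 seq1, __u32 seq2)
{
	return (__s32)(seq1 - seq2) < 0;
}
#define after(seq2, seq1)	before(seq1, seq2)
```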
For the record, the snd_una and snd_wl1 names come directly from the original TCP specification in RFC 793, back in 1981!
Translating the rather cryptic check into plain English, we are willing to accept a window update from a peer if:
- ❶ our peer acknowledges the receipt of data we previously sent; or
- ❷ our peer is sending new data since the previous window update; or
- ❸ our peer isn’t sending us new data since the previous window update, but is advertising a larger window
Note that the comparison of ack_seq with snd_wl1 is done to make sure that the window is not accidentally updated by a (retransmission of a) segment that was seen earlier.
In our case, at least condition ❸ should be able to re-open the window, but apparently it doesn’t and we need access to these variables to figure out what is happening. Unfortunately, these variables are part of the internal kernel state and are not directly exposed to userspace, so it’s time to get a bit creative.
Accessing the internal kernel state
To get access to the kernel state, we somehow need to run code inside the kernel. One way would be to patch the kernel with a few printk() calls here and there, but that would require rebooting the machine and waiting for rsync to hang again. Instead, we opted to live-patch the kernel using systemtap, with the following script:
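(A minimal sketch of the script — not the verbatim version, which dumped a few more fields, but the structure is the same:)

```
# Dump internal TCP state for sockets with a non-empty Send-Q,
# every time tcp_ack() finishes processing an incoming ACK.
probe kernel.function("tcp_ack").return
{
	if ($sk->sk_send_head != NULL) {
		printf("snd_wl1: %d, snd_wnd: %d, ack_seq: %d\n",
		       @cast($sk, "tcp_sock", "kernel")->snd_wl1,
		       @cast($sk, "tcp_sock", "kernel")->snd_wnd,
		       @cast(&@cast($skb, "sk_buff", "kernel")->cb[0],
			     "tcp_skb_cb", "kernel<net/tcp.h>")->seq)
	}
}
```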
Systemtap works by converting such scripts into C and building a kernel module that hot-patches the kernel and overrides specific instructions. Here we overrode tcp_ack(), hooked at its end, and dumped the internal TCP connection state. The $sk->sk_send_head != NULL check is a quick way to only match connections that have a non-empty Send-Q.
Loading the resulting module into the kernel gave us the following:
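(Trimmed down to the fields that matter for the stuck connection:)

```
snd_wl1: 1708927328, snd_wnd: 0, ack_seq: 1253278418
```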
The two things of interest here are snd_wl1: 1708927328 and ack_seq: 1253278418. Not only are they not identical, as we would expect them to be, but ack_seq is actually smaller than snd_wl1, indicating that ack_seq wrapped around at some point and snd_wl1 has not been updated for a while. Using the serial number arithmetic rules, we can figure out that this end has received (at least) 3.8 GB since the last update of snd_wl1.
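The arithmetic is easy to verify: subtracting the two values as 32-bit unsigned integers handles the wraparound and gives the distance ack_seq has travelled since snd_wl1 was last set. A quick stand-alone check:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint32_t snd_wl1 = 1708927328u;
	uint32_t ack_seq = 1253278418u;

	/* Serial number arithmetic: unsigned subtraction wraps around,
	 * yielding the bytes received since snd_wl1 was last updated. */
	uint32_t distance = ack_seq - snd_wl1;

	printf("%u bytes (~%.1f GB)\n", distance, distance / 1e9);
	return 0;
}
/* Prints: 3839318386 bytes (~3.8 GB) */
```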
We already saw that snd_wl1 contains the last sequence number used to update the peer’s receive window (and thus our send window), with the ultimate purpose of guarding against window updates from old segments. It should be okay if snd_wl1 is not updated for a while, but it should not lag too far behind ack_seq, or else we risk rejecting valid window updates, as in this case. So it looks like the Linux kernel fails to update snd_wl1 under some circumstances, which leads to an inability to recover from a zero-window condition.
Having tangible proof that something was going on in the kernel, it was time to get people familiar with the networking code in the loop.
Taking things upstream
After sleeping on this, we wrote a good summary of what we knew so far and what we supposed was happening, and reached out to the Linux networking maintainers. Confirmation came less than two hours later, together with a patch by Neal Cardwell.
Apparently, the bug was in the bulk receiver fast path, a code path that skips most of the expensive, strict TCP processing to optimize for the common case of bulk data reception. This is a significant optimization, outlined 28 years ago² by Van Jacobson in his “TCP receive in 30 instructions” email. It turned out that the Linux implementation did not update snd_wl1 while in the receiver fast path. If a connection uses the fast path for too long, snd_wl1 will fall so far behind that ack_seq will wrap around with respect to it. And if this happens while the receive window is zero, there is no way to re-open the window, as demonstrated above. What’s more, this bug had been present in Linux since v2.1.8, dating back to 1996!
² This optimization is still relevant today: a relatively recent attempt to remove the header prediction code and associated fast paths to simplify the code was reverted on performance regression grounds.
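The essence of the fix (sketched below; not the verbatim upstream hunk) is a one-liner: make the bulk receiver fast path in tcp_rcv_established() keep snd_wl1 up to date, using the same helper the slow path uses:

```c
/* net/ipv4/tcp_input.c, tcp_rcv_established(), bulk receiver fast path
 * (sketch of the fix): record the incoming segment's sequence number,
 * so that later window updates are not rejected by tcp_may_update_window(). */
tcp_update_wl(tp, TCP_SKB_CB(skb)->seq);
```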
As soon as we got the patch, we applied it, rebuilt the kernel, deployed it on the affected machines and waited to see if the issue was fixed. A couple of days later we were certain that the fix was indeed correct and did not cause any ill side-effects.
After a bit of discussion, the final commit landed in linux-net, and from there it was merged into Linux mainline for 5.10-rc1. Eventually it found its way to the stable 4.9 and 4.19 kernel series that we use on our Debian systems, in 4.9.241 and 4.19.153 respectively.
Aftermath
With the fix in place, we still had a couple of questions to answer, namely:
- How is it possible for a TCP bug that leads to stuck connections to go unnoticed for 24 years?
- Out of an infrastructure with more than 600 systems running all kinds of software, how come we only witnessed this bug when using rsync?
It’s hard to give a definitive answer to these questions, but we can reason about it this way:
- This bug will not be triggered by most L7 protocols. In “synchronous” request-response protocols such as HTTP, each side will usually consume all available data before sending. In that case, even if snd_wl1 wraps around, the bulk receiver will be left with a non-zero window and will still be able to send out data, causing the next acknowledgment to update the window and adjust snd_wl1 through check ❶ in tcp_may_update_window. rsync, on the other hand, uses a pretty aggressive pipeline where the server might send out multi-GB responses without consuming incoming data in the process. Even in rsync’s case, using rsync over SSH (a rather common combination) rather than the plain TCP transport would not expose this bug, as SSH framing/signaling would most likely not allow data to queue up on the server this way.
- Regardless of the application protocol, the receiver must remain in the fast path with a zero send window long enough (receiving at least 2GB) to cause a wrap-around — but not so long that ack_seq overtakes snd_wl1 again. For this to happen, there must be no packet loss or other conditions that would cause the fast path’s header prediction to fail. This is very unlikely to happen in practice, as TCP itself determines the network capacity by actually causing packets to be lost.
- Most applications care about network timeouts and will either fail or reconnect, making the whole thing appear as a “random network glitch” and leaving no trace behind to debug.
Finally, even if none of the above gets in the way and you do end up with a stuck TCP connection, it takes a lot of annoyance before you decide to deal with it and drill deep into kernel code. And when you do, you are rewarded with a nice adventure, where you get to learn about internet protocol history, get a glimpse of kernel internals, and witness open source work in motion!
If you enjoyed reading this post and you like hunting weird bugs and looking at kernel code, you might want to drop us a line — we are always looking for talented SREs and DevOps Engineers!