Detecting socket leaks with eBPF


Have you tried turning it off and on again?

Does any of this sound familiar?

Something has gone wrong. Our service is experiencing an elevated number of faults. Customers are complaining. OK, now it’s completely dead. There is no need for alarm. We just need to restart the server.

The server was rebooted, and they all lived happily ever after, until they needed to reboot the server again the next week.

The engineers who built this service claim “carrier-grade” reliability. They are self-professed adherents of The Missile Will Explode Anyway school of engineering thought. But there is no missile, and the school of thought - doctrine, superstition, whatever - affects real customers. One of the necromancers involved insisted that I was “a [redacted] liar” when I told him of a customer-facing instance on EC2 with 350-something days of uptime.

“After that much time, a call to socket will just [redacted] fail!"

I am sure that this belief is an effective adaptation to the stressors this person experienced working on broken systems. Superstitions thrive in the absence of arbiters of truth. What if we could be a bit more scientific about TCP socket leaks?

Anatomy of a TCP socket leak

TCP has an undeserved reputation for being complex and hard to understand. Compared to some other protocols for sequenced, bidirectional communication with retransmission semantics, it is neatly defined, and some implementations are accessible to a non-expert like me.

Loosely speaking, a TCP connection is an instance of the state machine governed by the rules laid out in RFC 793. There are eleven states, and a connection can be in only one of them at any given time. Normally, a connection progresses through this life cycle and ends up in the terminal CLOSED state. Connections are therefore not resumable: once a connection is CLOSED, it cannot enter any other state, and there is no value in consuming resources to track it further.

Every TCP connection maintained by a host’s network stack consumes resources. If a connection is not doing useful work, then it is wasting resources that could service other, valuable connections. A TCP socket leak is the undesirable event in which a connection becomes stranded in a non-CLOSED state, without doing anything of value. If too many leaks occur, the networking stack’s finite resources will become exhausted.
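
As a concrete illustration of one such finite resource (this snippet is mine, not part of the toy program below), the per-process limit on open file descriptors can be read with getrlimit; it is this limit that a descriptor leak eventually hits:

// Sketch: print the per-process limit on open file descriptors,
// the resource that leaked sockets eventually exhaust.
#include <iostream>
#include <sys/resource.h>

int main() {
  rlimit rl{};
  if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
    std::cout << "Soft limit on open descriptors: " << rl.rlim_cur << std::endl;
    std::cout << "Hard limit on open descriptors: " << rl.rlim_max << std::endl;
  }
  return 0;
}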

Deliberately leaking sockets

Up to now, I have left “network stack” deliberately vague. From here on, assume the network stack is Linux’s implementation of TCP over IPv4.

I wrote a toy program with a simple specification:

  1. Listen over the loopback interface for TCP connections.
  2. Repeatedly accept connections, then do nothing with them (the leak).
  3. Store the socket descriptors handed to us by accept(2).
  4. Return the leaked resources when a SIGHUP is received.

After a client disconnects, this program strands that connection in the CLOSE_WAIT state: the kernel has acknowledged the client’s FIN, but the connection cannot progress to LAST_ACK until the application closes the descriptor, which it never does.

The source code of an implementation is listed below, and is also available on GitHub. It builds with a recent C++ compiler, even with the -Wall -Werror options set.

Example of deliberate leak

/*
 * Toy TCP server demonstrating socket leak.
 */

#include <arpa/inet.h>
#include <errno.h>
#include <iostream>
#include <netinet/ip.h>
#include <signal.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>
#include <unordered_set>

std::unordered_set<int> leaked_sockets;

void upon_sighup(int) {
  std::cout << "Received SIGHUP. Cleaning up..." << std::endl;
  for (auto it = leaked_sockets.begin(); it != leaked_sockets.end();) {
    // close(2) returns 0 on success, so strerror(0) prints "Success";
    // otherwise report the errno left behind by the failed close.
    auto close_res = close(*it);
    auto err_src = close_res >= 0 ? close_res : errno;
    std::cout << "Closing socket " << *it << ": " << strerror(err_src)
              << std::endl;
    it = leaked_sockets.erase(it);
  }
  std::cout << "Clean up complete" << std::endl;
}

void setup() {
  std::cout << "PID: " << getpid() << std::endl;
  signal(SIGHUP, &upon_sighup);
}

int main() {
  setup();

  sockaddr_in server_addy = {.sin_family = AF_INET, .sin_port = htons(9001)};
  inet_aton("127.0.0.1", &server_addy.sin_addr);

  auto addy = reinterpret_cast<sockaddr *>(&server_addy);
  socklen_t size = sizeof(sockaddr_in);

  int s = socket(AF_INET, SOCK_STREAM, 0);
  bind(s, addy, size);
  // The kernel clamps the backlog, so -1 effectively requests the maximum.
  listen(s, -1);

  do {
    // Accept a connection and then never close it: the leak.
    int client = accept(s, addy, &size);
    if (client >= 0) {
      leaked_sockets.emplace(client);
      std::cout << "Accepted client " << client << std::endl;
    } else {
      std::cout << "Cannot accept client: " << strerror(errno) << std::endl;
    }
  } while (1);

  return 0;
}

Leaking sockets to cause resource exhaustion

We can easily run the program and connect to it:

$ g++ -Wall -Werror leak_ipv4.cc           | $ nc 127.0.0.1 9001
$ ./a.out                                  | $ kill -SIGHUP 62146                                
PID: 62146                                 |
Accepted client 4                          |                                    
Received SIGHUP. Cleaning up...            | 
Closing socket 4: Success                  |
Clean up complete                          |

This server can be rendered useless simply by connecting to it many times: every leaked connection consumes a file descriptor, and the process will eventually exhaust the number of open files it is permitted.

...                                        | $ ./exhaust.sh
Accepted client 141                        |   ^ this script opens connections repeatedly
Accepted client 142                        |
...                                        |
...                                        |
...                                        |
Cannot accept client: Too many open files  |
Cannot accept client: Too many open files  |
Cannot accept client: Too many open files  |
...                                        |
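
The exhaust.sh script just opens connections repeatedly. A stand-in client along these lines (a rough sketch of my own, not the exact script) does the same job by connecting to the server in a loop and never closing anything; note that a single-process client will bump into its own descriptor limit, so you may need to raise it or run several copies:

// Sketch: exhaust the leaky server by opening connections to
// 127.0.0.1:9001 in a loop and never closing them.
#include <arpa/inet.h>
#include <errno.h>
#include <iostream>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>
#include <vector>

int main() {
  sockaddr_in server_addy{};
  server_addy.sin_family = AF_INET;
  server_addy.sin_port = htons(9001);
  inet_aton("127.0.0.1", &server_addy.sin_addr);

  std::vector<int> clients; // hold descriptors so the connections stay open
  while (true) {
    int c = socket(AF_INET, SOCK_STREAM, 0);
    if (c < 0 || connect(c, reinterpret_cast<sockaddr *>(&server_addy),
                         sizeof(server_addy)) < 0) {
      std::cout << "Stopping: " << strerror(errno) << std::endl;
      break;
    }
    clients.push_back(c);
  }
  pause(); // keep the connections open until interrupted
  return 0;
}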

Diagnosing socket leaks with traditional tools

  • Naive method: listing files
    • Use lsof and some grep wizardry to list a process' open files
    • Run ls -l /proc/$PID/fd
  • Sicko mode:
    • Use sysstat
    • Run sar -n SOCK,ETCP 1
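
The same traditional interfaces can also be read directly, with no tooling at all. Here is a rough sketch (mine, and only for TCP over IPv4) that counts CLOSE_WAIT connections by parsing /proc/net/tcp; the st column holds the connection state in hex, and 08 is CLOSE_WAIT:

// Sketch: count system-wide CLOSE_WAIT connections by reading /proc/net/tcp.
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

int main() {
  std::ifstream tcp_table("/proc/net/tcp");
  std::string line;
  std::getline(tcp_table, line); // skip the header row

  int close_waits = 0;
  while (std::getline(tcp_table, line)) {
    std::istringstream fields(line);
    std::string sl, local, remote, st;
    fields >> sl >> local >> remote >> st;
    if (st == "08") { // TCP_CLOSE_WAIT
      ++close_waits;
    }
  }
  std::cout << "Connections in CLOSE_WAIT: " << close_waits << std::endl;
  return 0;
}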

Why don’t we bypass sicko mode and go right to f***o mode?

Get in loser, we’re going microkernelling

Much has been written about eBPF. In short, it is a virtual machine that runs user space-supplied bytecode in kernel space. Excellent tools such as bpftrace let us write eBPF programs in a high-level language. Using bpftrace, we can place probes on kernel functions and have our code executed whenever the kernel calls a probed function.

The benefits are enormous.

The value of eBPF (serious)

We can dig into kernel internals to investigate difficult problems that programs built on traditional kernel-user space interfaces cannot help with. Linux is famous for, amongst other things, interface stability. That stability has a cost, and for narrow use cases the cost often exceeds the benefit of adding a new syscall. With eBPF, kernel developers do not have to design, engineer and support new interfaces just to expose powerful functionality to a very small percentage of users.

Many excellent existing tracing tools such as strace and sysstat are complicated, as a direct consequence of their power and configurability. You have to invest a lot into learning them. What if we could remove this layer of indirection and get right into the kernel?

The value of eBPF (facetious)

The Solaris devotees will stop constantly one-upping you with “Solaris did this first”, “Solaris did that in the 17th century”, “DTrace can fix a broken heart”, and “There is a flag you can pass to ioctl on Solaris that makes you impervious to gunfire.”

Writing the eBPF program

bpftrace is a brilliant tool for writing eBPF programs. It can be used to place breakpoint-like probes on kernel functions, and exposes convenient syntax for accessing eBPF’s user-facing data structures such as maps. I want the kernel to tell me when it changes the state of a TCP connection. This will tell me where in the TCP state machine my misbehaving user space program is stranding its connections.

As I mentioned earlier, the kernel’s implementation of TCP over IPv4 is quite accessible, so finding the perfect function to probe was straightforward. It has the following, self-explanatory signature:

void tcp_set_state(struct sock *sk, int state)
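
The state argument takes values from the kernel’s own TCP state enum, which tracks a couple of internal states beyond RFC 793’s eleven. Paraphrased from include/net/tcp_states.h (the exact list varies slightly between kernel versions), it looks roughly like this:

/* Roughly what include/net/tcp_states.h defines; check your kernel's copy. */
enum {
	TCP_ESTABLISHED = 1,
	TCP_SYN_SENT,
	TCP_SYN_RECV,
	TCP_FIN_WAIT1,
	TCP_FIN_WAIT2,
	TCP_TIME_WAIT,
	TCP_CLOSE,
	TCP_CLOSE_WAIT, /* where our leaky server strands connections */
	TCP_LAST_ACK,
	TCP_LISTEN,
	TCP_CLOSING,
	TCP_NEW_SYN_RECV,
	TCP_MAX_STATES  /* leave at the end */
};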

As this is a proof of concept, a trivial program will work:

  1. Attach to tcp_set_state.
  2. Keep track of struct socks moving into the CLOSE_WAIT state.
  3. Periodically print a list of connections in this state.
  4. Otherwise, print nothing.

Show me the code

The implementation is unbelievably concise:

#!/usr/bin/env bpftrace

#include <net/tcp_states.h>
#include <net/sock.h>
#include <linux/socket.h>
#include <linux/tcp.h>

BEGIN
{
	@leaks = 0
}

kprobe:tcp_set_state
{
	$sk = (struct sock *)arg0;
	$newstate = arg1;
	if ($newstate == TCP_CLOSE_WAIT) {
		@closewaits[$sk] = tid;
		@leaks++;
	}
	else {
		if (@closewaits[$sk]) {
		  delete(@closewaits[$sk]);
		  @leaks--;
		}
	}
}

interval:s:1
{
	if (@leaks > 0) {
		print(@closewaits);
	}
}

END
{
	clear(@closewaits);
}

That’s it. All that was required to write this program was some knowledge of Linux’s data structures and the name of a kernel function. You can easily get all of this information directly from the kernel’s source code. For such a specific use case, I am happy to forgo any expectation of portability across different kernel versions. Having said that, the kernel developers are working hard to expose stable ways to probe the kernel that do not depend on possibly changing kernel internals; newer kernels, for example, ship a sock:inet_sock_set_state tracepoint that reports exactly these TCP state changes.

The program must be run as root, and takes no parameters or options:

# ./bubble.bt
Attaching 4 probes...
@closewaits[0xffff9534ba0cb480]: 71926

This tells me that a TCP connection has been stranded in CLOSE_WAIT. Unfortunately, the number stored in the map is not the PID of the offending process; tcp_set_state often runs in softirq context when the client’s FIN arrives, so tid identifies whatever task happened to be running, not the socket’s owner. I still can’t figure out how to recover the owning process. If you are reading this and you know a way, please let me know!

TL;DR

Resource leaks can cause services to fail. We can use Linux’s eBPF to catch them as they happen.

See also

All bpftrace programs I have written can trace (hah) their lineage back to the tools contained here.