Detecting socket leaks with eBPF

A word of caution

This post is not intended to introduce production-ready code; it won’t help you find TCP connections stranded in FIN_WAIT_2, for example. It is, rather, an exposition of some useful aspects of eBPF.

Anatomy of a TCP socket leak

TCP has an undeserved reputation for being complex and hard to understand. Compared to some other protocols for sequenced, bidirectional communication with retransmission semantics, it is neatly defined, and some implementations are accessible to a non-expert such as me.

Loosely speaking, a TCP connection is an instance of the state machine governed by the rules laid out in RFC 793. There are eleven states, and a connection can be in only one of them at any point in time. Normally, a connection progresses through this life cycle and ends up in the terminal CLOSED state; a passively opened connection closed first by its peer, for example, moves through LISTEN, SYN_RECEIVED, ESTABLISHED, CLOSE_WAIT, LAST_ACK, and finally CLOSED. Connections are therefore not resumable: once a connection is CLOSED, it cannot enter any other state, and there is no value in consuming resources to track it further.

Every TCP connection maintained by a host’s network stack consumes resources. If a connection is not doing useful work, then it is wasting resources that could service other, valuable connections. A TCP socket leak is the undesirable event in which a connection becomes stranded in a non-CLOSED state, without doing anything of value. If too many leaks occur, the networking stack’s finite resources will be exhausted.

Intentionally stranding TCP connections in CLOSE_WAIT

Up to now, I have left “network stack” deliberately vague. From here on, assume it means Linux’s implementation of TCP on IPv4. To keep things simple, only leaked sockets stranded in CLOSE_WAIT will be considered; it is the easiest TCP state in which to deliberately and permanently strand a socket.

I wrote a toy program with a simple specification:

  1. Listen on the loopback interface for TCP connections.
  2. Repeatedly accept connections, then do nothing with them (the leak).
  3. Store the socket descriptors handed to us by accept(2).
  4. Release the leaked resources when a SIGHUP is received.

After a client disconnects, this program strands the connection in the CLOSE_WAIT state: the kernel enters CLOSE_WAIT when the peer’s FIN arrives, and leaves it only when the application calls close(2) on the descriptor, which this program deliberately never does.

The source code of an implementation is listed below, and is also available on GitHub. It builds with a recent C++ compiler, even with the -Wall -Werror options set.

Example of deliberate leak

#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cerrno>
#include <csignal>
#include <cstring>
#include <iostream>
#include <unordered_set>

std::unordered_set<int> leaked_sockets;

// Not async-signal-safe, but acceptable for a toy program.
void upon_sighup(int) {
  std::cout << "Received SIGHUP. Cleaning up..." << std::endl;
  for (auto it = leaked_sockets.begin(); it != leaked_sockets.end();) {
    auto close_res = close(*it);
    // strerror(0) is "Success", so this reports both outcomes uniformly.
    auto err_src = close_res >= 0 ? close_res : errno;
    std::cout << "Closing socket " << *it << ": " << strerror(err_src)
              << std::endl;
    it = leaked_sockets.erase(it);
  }
  std::cout << "Clean up complete" << std::endl;
}

int main() {
  std::cout << "PID: " << getpid() << std::endl;
  signal(SIGHUP, &upon_sighup);

  sockaddr_in server_addr = {.sin_family = AF_INET, .sin_port = htons(9001)};
  inet_aton("127.0.0.1", &server_addr.sin_addr);

  auto addr = reinterpret_cast<sockaddr *>(&server_addr);
  socklen_t size = sizeof(sockaddr_in);

  int s = socket(AF_INET, SOCK_STREAM, 0);
  bind(s, addr, size);
  listen(s, -1);  // Linux clamps a negative backlog to SOMAXCONN

  do {
    int client = accept(s, addr, &size);
    if (client >= 0) {  // accept(2) returns -1 on error, not 0
      // The leak: store the descriptor and never close it.
      leaked_sockets.emplace(client);
      std::cout << "Accepted client " << client << std::endl;
    } else {
      std::cout << "Cannot accept client: " << strerror(errno) << std::endl;
    }
  } while (true);

  return 0;
}

Leaking sockets to cause resource exhaustion

We can easily run the program and connect to it:

$ g++ -Wall -Werror leak_ipv4.cc           | $ nc 127.0.0.1 9001
$ ./a.out                                  | $ kill -SIGHUP 62146
PID: 62146                                 |
Accepted client 4                          |
Received SIGHUP. Cleaning up...            |
Closing socket 4: Success                  |
Clean up complete                          |

This server can be rendered useless simply by connecting to it many times; this will eventually exhaust the number of open files (socket descriptors) it is permitted. A sketch of a script that does so appears after the transcript below.

...                                        | $ ./exhaust.sh
Accepted client 141                        |   ^ this script opens connections repeatedly
Accepted client 142                        |
...                                        |
...                                        |
...                                        |
Cannot accept client: Too many open files  |
Cannot accept client: Too many open files  |
Cannot accept client: Too many open files  |
...                                        |
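The exhaust.sh script is not reproduced in this post; a minimal stand-in, assuming a netcat that supports the -z (connect-then-close) flag, might look like this:

#!/usr/bin/env bash
# Hypothetical stand-in for exhaust.sh: connect to the leaky server
# repeatedly. The server keeps every accepted descriptor, so each
# iteration permanently consumes one slot in its open file table.
# 2000 comfortably exceeds the common default limit of 1024 fds.
for i in $(seq 1 2000); do
  nc -z 127.0.0.1 9001
done

Any client that connects and promptly disconnects would do just as well; the server leaks the descriptor either way.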

Get in loser, we’re going microkernelling

Much has been written about eBPF. In short, it is a virtual machine that runs user space-supplied bytecode in kernel space. Excellent tools such as bpftrace make it possible to write eBPF programs in a high-level language. Using bpftrace, we can attach probes to kernel functions and have our code executed whenever the kernel calls the probed function.

The benefits are enormous.

The value of eBPF (serious)

We can dig into kernel internals to investigate difficult problems that programs built on traditional kernel-user space interfaces cannot help with. Linux is famous for, amongst other things, interface stability. That stability has a cost, which often exceeds the benefit when the use case for a new interface is narrow. With eBPF, kernel developers do not have to design, engineer, and support new interfaces just to expose powerful functionality to a very small percentage of users.

The value of eBPF (facetious)

My Solaris-enthusiast friends will stop pitying me, and instead remind me that “Solaris did this first”, “Solaris did that in the 17th century”, “DTrace can fix a broken heart”, and “There is a flag you can pass to ioctl on Solaris that makes you impervious to gunfire.”

Writing the eBPF program

bpftrace is a tool with a low learning curve for writing eBPF programs. It can place breakpoint-like probes on kernel functions, and it exposes convenient syntax for accessing eBPF’s user-facing data structures, such as maps. I want the kernel to tell me when it changes the state of a TCP connection; that will show where in the TCP state machine my misbehaving user space program is stranding its connections.
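As a taste of how little ceremony is involved, here is the canonical bpftrace one-liner (unrelated to our leak detector) that counts syscalls per process name in a map, printed when the program is interrupted:

# bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'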

As I mentioned earlier, the kernel’s TCP/IPv4 implementation is quite accessible, so finding the right function to probe was straightforward. It has the following, self-explanatory signature:

void tcp_set_state(struct sock *sk, int state)
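Before probing it, it is worth confirming that the symbol is traceable on the running kernel; bpftrace can list matching probes:

# bpftrace -l 'kprobe:tcp_set_state'

If the function is available, the probe name is printed back; no output means the symbol cannot be probed on that kernel.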

As this is a proof of concept, a trivial program will work:

  1. Attach to tcp_set_state.
  2. Keep track of struct sock objects moving into the CLOSE_WAIT state.
  3. Periodically print a list of connections in this state.
  4. Otherwise, print nothing.

Show me the code

The implementation is unbelievably concise:

#!/usr/bin/env bpftrace

#include <net/tcp_states.h>
#include <net/sock.h>
#include <linux/socket.h>
#include <linux/tcp.h>

BEGIN
{
  // Running count of sockets currently sitting in CLOSE_WAIT.
  @leaks = 0;
}

kprobe:tcp_set_state
{
  $sk = (struct sock *)arg0;  // the socket whose state is changing
  $newstate = arg1;           // the state it is moving into
  if ($newstate == TCP_CLOSE_WAIT) {
    // Remember the socket; tid is whichever thread drove the transition.
    @closewaits[$sk] = tid;
    @leaks++;
  }
  else {
    // Leaving CLOSE_WAIT for any other state: no longer a suspect.
    if (@closewaits[$sk]) {
      delete(@closewaits[$sk]);
      @leaks--;
    }
  }
}

// Once per second, print the tracked sockets, but only if any remain.
interval:s:1
{
  if (@leaks > 0) {
    print(@closewaits);
  }
}

END
{
  // Clear the map so bpftrace does not print it again on exit.
  clear(@closewaits);
}

That’s it. All that was required to write this program was some knowledge of Linux’s data structures and the name of a kernel function, all of which can be read directly from the kernel’s source code. For such a specific use case, I am happy to forgo any expectation of portability across kernel versions.

The program takes no parameters or options; run it as root:

# ./bubble.bt
Attaching 4 probes...
@closewaits[0xffff9534ba0cb480]: 71926

This tells me that a TCP connection has been stranded in CLOSE_WAIT.
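As a cross-check from user space, ss accepts a state filter, so the same stranded connections should also show up via:

# ss -t state close-wait

If the bpftrace map and the ss listing agree, the probe is telling the truth.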

TL;DR

Resource leaks can cause services to fail. We can use Linux’s eBPF to catch them as they happen.

See also

  1. All bpftrace programs I have written can trace (hah) their lineage back to the tools contained here.
  2. DTrace