Hunting bugs in perf_event2

Background

Lately I’ve been working on dirty_mike, an experimental self-profiling library for Rust applications on Linux. I often need to take quick-and-dirty, ad-hoc measurements of arbitrary subsystems, and that is what dirty_mike does.

Linux’s perf event subsystem

To obtain data, dirty_mike uses Linux’s perf event subsystem, a powerful system introspection facility that underpins the userspace perf application. While perf can do everything dirty_mike can, it does not offer a neat, easy-to-use Rust API; ultimately, that’s why dirty_mike exists. It uses the Rust crate perf_event2 to access the perf event subsystem. This crate is a fork of the original perf_event. While many improvements have been upstreamed, the most valuable ones for our purposes have not been.

perf_event2 wraps the C API with some niceties, and is well-documented enough to make it possible to write programs that inspect systems by just reading the RustDoc. It is a natural fit for dirty_mike.

Hooking into kernel events

At the start of writing this blog post, dirty_mike could only take counting measurements. I wanted to extend it to the sampled measurements produced by the kernel. The APIs for sampling and for publishing arbitrary events are much the same. Conceptually, the perf event subsystem works relatively simply:

  1. The kernel exposes named events which can be subscribed to. See Brendan Gregg’s examples.
  2. The perf_event_open API allows you to register your interest in such events.
  3. In return, the kernel writes events to a ring buffer mapped into your process’s address space.
  4. The data put into this ring buffer follows the “schema” defined by the C structs in linux/perf_event.h.
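
The ring buffer in steps 3 and 4 is worth a closer look, because most of the trouble later in this post happens around it. Here is a minimal sketch of the consumer-side bookkeeping, assuming only the data_head/data_tail counter scheme documented in perf_event_open(2); the helper functions themselves are illustrative, not part of any real API:

```rust
/// Unread bytes in a perf ring buffer. `data_head` (written by the kernel)
/// and `data_tail` (written by the consumer) are free-running counters from
/// the real `perf_event_mmap_page` struct; the kernel keeps their difference
/// within the buffer size.
fn available(data_head: u64, data_tail: u64) -> u64 {
    data_head - data_tail
}

/// Map a free-running counter onto an offset inside a power-of-two buffer.
fn offset(counter: u64, buf_size: u64) -> u64 {
    counter & (buf_size - 1)
}

fn main() {
    // A 4096-byte buffer where the kernel has written 4100 bytes in total
    // and we have consumed 4090: 10 bytes remain, starting at offset 4090
    // and wrapping past the end of the buffer to offset 4.
    assert_eq!(available(4100, 4090), 10);
    assert_eq!(offset(4090, 4096), 4090);
    assert_eq!(offset(4100, 4096), 4);
    println!("ok");
}
```

A consumer that forgets to advance data_tail will eventually see the buffer fill up, a detail that becomes relevant later.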

The perf_event2 crate: idiomatic Rust wrapping the C API

Rather conveniently, the perf_event2 crate has safe, idiomatic Rust APIs for accessing these events. Instead of wrangling C structs, we can use a tagged union (a Rust enum), like this one:

pub enum Record<'a> {
    Mmap(Mmap<'a>),
    Lost(Lost),
    Comm(Comm<'a>),
    Exit(Exit),
    Throttle(Throttle),
    Unthrottle(Throttle),
    Fork(Fork),
    Read(Read),
    Sample(Box<Sample<'a>>),
    Mmap2(Mmap2<'a>),
    Aux(Aux),
    ITraceStart(ITraceStart),
    LostSamples(LostSamples),
    Switch,
    SwitchCpuWide(SwitchCpuWide),
    Namespaces(Namespaces<'a>),
    KSymbol(KSymbol<'a>),
    BpfEvent(BpfEvent),
    CGroup(CGroup<'a>),
    TextPoke(TextPoke<'a>),
    AuxOutputHwId(AuxOutputHwId),
    // and so on
}

Limitations, surprises, and bugs in perf_event2

For dirty_mike to succeed, I am going to need a rough idea of the “performance envelope” of its main dependency, perf_event2. Its RustDoc inherits this comment from the original crate:

Linux’s perf_event_open API can report all sorts of things this crate doesn’t yet understand: stack traces, logs of executable and shared library activity, tracepoints, kprobes, uprobes, and so on. And beyond the counters in the kernel header files, there are others that can only be found at runtime by consulting sysfs, specific to particular processors and devices. For example, modern Intel processors have counters that measure power consumption in Joules.

If you find yourself in need of something this crate doesn’t support, please consider submitting a pull request.

While this comment asserts that perf_event2 does not support tracepoints, kprobes or uprobes, it does have an Event trait, with implementations for Breakpoint, Cache, Dynamic, Hardware, KProbe, Raw, Software, Tracepoint and MSR.

This trait appears as the bound on the generic function Builder::new, so code asking for, say, tracepoints will compile, but it may not work. Experimentation is needed.

Let’s get some context switch events

I wrote some code (link) to try subscribing to the simplest, most straightforward event in the perf event API: context switches.

use perf_event::{Builder, Group, events};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let group = Group::builder().context_switch(true).build_group()?;

    let ctr = group.into_counter();

    let mut sampled = ctr.sampled(128)?;

    sampled.enable()?;

    for i in 0..10 {
        println!("Waiting for event {}...", i);
        match sampled.next_blocking(Some(std::time::Duration::from_millis(500))) {
            Some(sample) => match sample.parse_record() {
                Ok(record) => println!("Event record: {:?}", record),
                Err(e) => eprintln!("Failed to parse record: {}", e),
            },
            None => {
                eprintln!("timeout");
            }
        }
    }

    sampled.disable()?;

    Ok(())
}

It did not work as I expected. Running it, I got:

❯ sudo ./target/debug/examples/experiment
Waiting for event 0...
timeout
Waiting for event 1...
Failed to parse record: unexpected EOF during parsing
Waiting for event 2...
Failed to parse record: unexpected EOF during parsing
Waiting for event 3...
timeout

It looked like the primary value-add of perf_event2, a neat API, was not working as expected.

I’d now identified two unpleasant surprises, though not bugs per se.

  1. The first call to next_blocking always timed out, and if supplied with None it would hang indefinitely.
  2. I expected parse_record to return a Record::Switch variant, but instead it failed.

Why is this code not working?

Spend two minutes looking for dumb, obvious mistakes.

When troubleshooting, I often like to spend a few minutes in a methodology-free, exploratory search. In my experience, powerful kernel APIs (such as KVM, eBPF, perf event) can be frustrating to use. It’s very easy to supply faulty inputs, or violate subtle conditions documented in the headers. As such, the very first thing to do is to look for stupid mistakes. I scanned the perf_event.h header for admonitions, or easy-to-miss requirements. I discovered that the perf event subsystem provides two distinct ways to gather context switch events:

  1. Set the context_switch bit in perf_event_attr (the main argument to the syscall perf_event_open).
  2. Alternatively, in the same struct, set the config field to PERF_COUNT_SW_CONTEXT_SWITCHES and set the type field to PERF_TYPE_SOFTWARE.

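In terms of perf_event_attr fields, the two mechanisms look roughly like this. The stand-in struct below is a pared-down illustration, not the real binding; the constant values are the ones defined in linux/perf_event.h:

```rust
// Constant values as defined in linux/perf_event.h.
const PERF_TYPE_SOFTWARE: u32 = 1;
const PERF_COUNT_SW_CONTEXT_SWITCHES: u64 = 3;

/// Pared-down stand-in for `perf_event_attr`, for illustration only.
#[derive(Default)]
struct Attr {
    type_: u32,
    config: u64,
    context_switch: bool, // a one-bit bitfield in the real C struct
}

/// Mechanism 1: set the context_switch bit. The event's ring buffer then
/// receives PERF_RECORD_SWITCH side-band records.
fn via_context_switch_bit(attr: &mut Attr) {
    attr.context_switch = true;
}

/// Mechanism 2: treat context switches as a software event in their own
/// right, to be counted or sampled like any other event.
fn via_software_event(attr: &mut Attr) {
    attr.type_ = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
}

fn main() {
    let mut a = Attr::default();
    via_context_switch_bit(&mut a);
    assert!(a.context_switch);

    let mut b = Attr::default();
    via_software_event(&mut b);
    assert_eq!((b.type_, b.config), (1, 3));
    println!("ok");
}
```
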
The existence of two distinct mechanisms to obtain context switches suggested a possible explanation for the surprising behavior. My hypothesis was now that the kernel reported context switches in two distinct ways, depending on how it was asked for them. In other words, I had a skill issue.

Am I even getting context switch events?

Testing this hypothesis was very straightforward. The kernel had promised to deposit events into the ring buffer managed by perf_event2, and I knew from the output that some iterations of the loop took the branch that received an event. Therefore, something was appearing in the ring buffer, even though perf_event2 couldn’t parse it.

Looking at the return value of next_blocking, I saw Option<Record<'a>>, which had some methods I could use to inspect what was being pulled off the ring buffer. Important to note: this Record struct is distinct from the enum of the same name.

Changing the code to print out the ty (link), I saw that the record’s type was 14. I consulted the header to confirm that 14 indeed corresponds to a context switch event. Then I used the len method to confirm what I already strongly suspected: the record was 0-sized, which led parse_record to return an Err variant. Great! Now I knew I was getting events.

While I had not found a significant bug, I did find an opportunity to add some more diagnostically useful methods to the Record type. The pro-social thing to do is to make it easier for the next person to debug. A bit of clicking around docs.rs showed that, as expected, perf_event2 depends on some bindgen glue code. Happily, bindgen translated the constants from the header into Rust, but they were sequestered in the perf_event_open_sys crate. It would therefore be relatively easy to make this code nicer to use.
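
For instance, a small lookup from the raw ty value back to its PERF_RECORD_* name (numbering per the perf_event_type enum in linux/perf_event.h) would have saved me the trip to the header. This sketch is the kind of diagnostic helper I have in mind, not anything the crate currently provides:

```rust
/// Name the PERF_RECORD_* variant for a raw record type number, following
/// the perf_event_type enumeration in linux/perf_event.h.
fn record_type_name(ty: u32) -> Option<&'static str> {
    Some(match ty {
        1 => "MMAP",
        2 => "LOST",
        3 => "COMM",
        4 => "EXIT",
        5 => "THROTTLE",
        6 => "UNTHROTTLE",
        7 => "FORK",
        8 => "READ",
        9 => "SAMPLE",
        10 => "MMAP2",
        11 => "AUX",
        12 => "ITRACE_START",
        13 => "LOST_SAMPLES",
        14 => "SWITCH",
        15 => "SWITCH_CPU_WIDE",
        16 => "NAMESPACES",
        17 => "KSYMBOL",
        18 => "BPF_EVENT",
        19 => "CGROUP",
        20 => "TEXT_POKE",
        21 => "AUX_OUTPUT_HW_ID",
        _ => return None,
    })
}

fn main() {
    // The mystery record from the experiment above:
    assert_eq!(record_type_name(14), Some("SWITCH"));
    println!("ok");
}
```
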

Should perf_event2 unify the APIs for both ways of obtaining context switches? From my perspective it violates the least-surprise principle, but others may be accustomed to, and indeed depend on, this idiosyncrasy of perf_event_open. I defer the decision on what color to paint this particular bikeshed, at least for now.

Static tracepoints also not working

I then tried to subscribe to static tracepoints, against the advice of the documentation. The first step was to sample the CPU every 1000 cycles, confirm that code worked, and then swap the CPU_CYCLES event out for a static tracepoint with a much lower sample period.

let tp = events::Hardware::CPU_CYCLES;

let mut sampler = Builder::new(tp)
    .sample_period(1_000)
    .build()?
    .sampled(8192)?;

Clearly, the CPU sampling code was working:

❯ sudo ./target/debug/examples/experiment
Event record 0: Sample(
    Sample { .. },
)

However, when I tried the static tracepoint block/block_rq_complete, the loop just hung. In addition, I found that the CPU sampling code would hang after 67 iterations. Perhaps I was not pulling items off the circular buffer after all.

To rule out permissions issues or a misconfigured kernel, I ran the perf CLI sampling this static tracepoint:

❯ sudo perf record -e block:block_rq_complete -a -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.725 MB perf.data (203 samples) ]
❯ sudo perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 203  of event 'block:block_rq_complete'
# Event count (approx.): 203
#
# Overhead  Command          Shared Object      Symbol
# ........  ...............  .................  ......................
#
    92.12%  swapper          [kernel.kallsyms]  [k] blk_update_request
     2.46%  usb-storage      [kernel.kallsyms]  [k] blk_update_request
     1.97%  kworker/u8:24-f  [kernel.kallsyms]  [k] blk_update_request
     0.99%  ksoftirqd/0      [kernel.kallsyms]  [k] blk_update_request
     0.99%  kworker/u8:0-bt  [kernel.kallsyms]  [k] blk_update_request
     0.49%  DefaultDispatch  [kernel.kallsyms]  [k] blk_update_request
     0.49%  ThreadPoolForeg  [kernel.kallsyms]  [k] blk_update_request
     0.49%  kworker/u8:6-bt  [kernel.kallsyms]  [k] blk_update_request

Since this worked, I concluded that the problem was with my program.

Analysis

I had picked up a few issues using perf_event2. To make effective use of it, I needed to sort all the data I had into what was a perf_event2 bug, what was my fault, and what was simply an artifact of how the perf event subsystem works.

To summarize, the following issues were in play:

  1. The spurious failures parsing 0-sized samples.
  2. The mystery of the gummed up ring buffer.
  3. Static tracepoints don’t work at all, and they don’t explicitly fail either.

At this point I decided that the highest-leverage avenue of investigation would be to check out perf_event2. I cloned a copy of the repo onto my machine and set up a git worktree to conveniently vendor it into my workspace.

Finding call sites of perf_event_open

Fortunately, the centre-of-mass of the perf event subsystem is concentrated around the perf_event_open syscall. I needed to identify how perf_event2 converted my inputs to calls to this function. Conveniently, there is only one call site of this syscall in the entire crate:

let result = check_errno_syscall(|| unsafe {
    sys::perf_event_open(&mut attrs, pid, cpu, group_fd, flags as c_ulong)
});

As a formality, I ran the perf CLI under strace to double-check that it definitely used the same syscall and that I wasn’t completely off base:

❯ sudo cat /sys/kernel/debug/tracing/events/block/block_rq_complete/id
1367
❯ sudo strace -v -e perf_event_open -- perf record -e block:block_rq_complete -a -- sleep 10 2>&1 >/dev/null | rg 1367
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 1, -1, PERF_FLAG_FD_CLOEXEC) = 5
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 6
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 3, -1, PERF_FLAG_FD_CLOEXEC) = 8

This was an unexpected discovery: perf makes one perf_event_open call for each logical CPU on my system (a nine-year-old dual-core ThinkPad). My code was not doing that:

❯ sudo strace -v -e perf_event_open -- ./target/debug/examples/experiment
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1000, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=0, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=1, exclude_hv=1, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=0, exclude_host=0, exclude_guest=0, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3

I was now in a position to test the hypothesis that perf_event2 was not faithfully mapping my inputs to perf_event_open.

Aside: what do the em-dash enthusiasts think?

At this point I asked Claude (Sonnet 4) and ChatGPT (o3) for an explanation of which parameters made a difference. Claude merely style-transferred a diff of the two entries into a wall of unhelpful markdown. Impressively, or so I thought, o3 saw the problem immediately in its chain of thought, and then explained it to me:

TL;DR

#1 is a sampler – it fires an interrupt every single time the trace-point hits and pushes a full sample record into an mmap ring.

#2 is a counter – it never generates samples at all; you poll/read it later to get an aggregate hit-count (+ timing scalars).

Everything else (inheritance flags, what execution modes are excluded, what lands in a read(), etc.) follows from that basic design choice.

It then expanded:

Setting sample_type = 0 for a trace-point almost always means “I meant to use a counter, not a trace-point”

That is not stated in the man page. A useful piece of information, if true.

Reproducing perf’s behavior

The thorough thing to do at this point was to set all arguments in my code to match the arguments that perf used, and then to experiment with what effect they had.

Maximum laziness

But if o3 was correct, then all I had to do was ensure that perf_event_attr’s sample_type field carried the right bit flags. Looking at the arguments to perf_event_open, I learned that Builder has a field that is precisely the perf_event_attr struct:

#[derive(Clone)]
pub struct Builder<'a> {
    attrs: perf_event_attr,
    ...
}

Searching for code that writes to the sample_type field, I found the Builder::sample method. How convenient! Disappointingly, merely setting the sample_type field to match the working example did not help:

.sample(
    SampleFlag::RAW
        | SampleFlag::IP
        | SampleFlag::TID
        | SampleFlag::TIME
        | SampleFlag::CPU
        | SampleFlag::PERIOD
        | SampleFlag::IDENTIFIER,
)

No shortcuts for me

The shortcut suggested by o3 did not work. I made sure that the final perf_event_open syscall would have identical arguments to the working example:

let mut sampler = Builder::new(tp)
    .any_pid()
    .one_cpu(1)
    .include_kernel()
    .inherit(true)
    .sample_id_all(true)
    .exclude_guest(true)
    .sample(
        SampleFlag::RAW
            | SampleFlag::IP
            | SampleFlag::TID
            | SampleFlag::TIME
            | SampleFlag::CPU
            | SampleFlag::PERIOD
            | SampleFlag::IDENTIFIER,
    )
    .read_format(ReadFormat::ID | ReadFormat::LOST)
    .sample_period(1)
    .build()?
    .sampled(8192)?;

After confirming with strace that this had the intended effect on the syscall arguments, I tried this incantation and, as expected, the tracepoint events appeared in the ring buffer. The event record parser worked, too:

Event record 26: Sample(
    Sample {
        ip: 18446744071855823341,
        pid: 499,
        tid: 499,
        time: 900888146338066,
        id: 3554,
        cpu: 1,
        period: 1

I had established that perf_event2 did indeed support tracepoints. The hypothesis that it was not faithfully passing my inputs to perf_event_open was now disconfirmed. Providing excruciatingly specific flags worked.

Obtaining a minimal reproducible sample of the bug

Of the dozen-or-so flags and arguments that could have been responsible for the failure to pick up tracepoints, only a few actually mattered. I minimized the code required to receive tracepoints:

let mut sampler = Builder::new(tp)
    .any_pid()
    .one_cpu(1)
    .include_kernel()
    .sample(SampleFlag::RAW)
    .sample_period(1)
    .build()?
    .sampled(8192)?;

Sure, this code “works”, but it would be better if Builder did what it said it would when asked for tracepoints. Ideally, you’d only have to call the new function.

Fixing the bugs

I noticed that the Event trait, which Tracepoint implements, simply requires its implementors to place the correct settings in the perf_event_attr struct:

pub trait Event: Sized {
    // Required method
    fn update_attrs(self, attr: &mut perf_event_attr);

    // Provided method
    fn update_attrs_with_data(
        self,
        attr: &mut perf_event_attr,
    ) -> Option<Arc<dyn EventData>> { ... }
}

I had hoped that changing just one impl block (code) would populate Builder with the right settings:

impl Event for Tracepoint {
    fn update_attrs(self, attr: &mut bindings::perf_event_attr) {
        attr.set_exclude_kernel(0);
        attr.sample_type |= crate::SampleFlag::RAW.bits();
        attr.type_ = bindings::PERF_TYPE_TRACEPOINT;
        attr.config = self.id;
    }
}

Alas, Builder::new immediately overwrote several values set by update_attrs:

// Do the update_attrs bit before we set any of the default state so
// that user code can't break configuration we really care about.
let data = event.update_attrs_with_data(&mut attrs);

// Setting `size` accurately will not prevent the code from working
// on older kernels. The module comments for `perf_event_open_sys`
// explain why in far too much detail.
attrs.size = std::mem::size_of::<perf_event_attr>() as u32;

let mut builder = Self {
    attrs,
    who: EventPid::ThisProcess,
    cpu: None,
    event_data: data,
};

builder.enabled(false);
builder.exclude_kernel(true);
builder.exclude_hv(true);

I ran git blame on the comment that says that overwriting settings provided by the event is the right thing to do:

commit 70729063dd10462f33d049007ba074374f8a9ead
Author: Phantomical <phantom@lynches.ca>
Date:   Wed Apr 12 23:54:38 2023 -0700

    Convert Event from an enum to a trait

    This opens up uses of enum so that others outside the crate can add
    their own event types. Ideally we would support all the relevant events
    within the crate but that's not always feasible. Having this be a trait
    which just sets some fields within the group will mean that users of the
    crate have the ability to implement something even if we haven't added
    support explicitly.

I understood this to mean that implementors of Event should be able to overwrite whatever fields they want, so I removed the code that overwrites fields set by the event. Coupled with the Tracepoint change above, that meant the following code was sufficient to subscribe to tracepoints:

let mut sampler = Builder::new(Tracepoint::with_name("block/block_rq_complete")?)
    .any_pid()
    .one_cpu(1)
    .sample_period(1)
    .build()?
    .sampled(8192)?;

Much better! Static tracepoints now worked!

There is a clear alternative to changing the new function: I could have added a new function, named something like Builder::new_no_default. That would preserve the existing behavior of Builder::new, and it would probably be easier to upstream than a breaking change. But I was not convinced that preserving API compatibility was worthwhile when the API is actively broken. To lessen the impact, each implementor of Event could write the correct values it needs. This necessitated a larger change, and it was a good opportunity to introduce a number of related improvements to the code.
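
For the record, the shape of that rejected alternative would be something like the following. The names and fields here are hypothetical, not perf_event2’s actual code:

```rust
// Hypothetical two-constructor design: `new` keeps today's opinionated
// defaults, while `new_no_default` applies none of them and trusts the
// Event implementation entirely.
struct Builder {
    exclude_kernel: bool,
    exclude_hv: bool,
}

impl Builder {
    /// Current behavior: event settings, then defaults layered on top.
    fn new() -> Self {
        let mut b = Self::new_no_default();
        b.exclude_kernel = true;
        b.exclude_hv = true;
        b
    }

    /// Applies no defaults; whatever the event set is what the kernel sees.
    fn new_no_default() -> Self {
        Builder { exclude_kernel: false, exclude_hv: false }
    }
}

fn main() {
    assert!(Builder::new().exclude_kernel);
    assert!(!Builder::new_no_default().exclude_kernel);
    println!("ok");
}
```
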

Breaking changes break tests

Fortunately, the change I made broke two of perf_event2’s unit tests. The first broken test was this one:

#[test]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn test_sampler_rdpmc() {
    let mut sampler = Builder::new(events::Hardware::INSTRUCTIONS)
        .enabled(true)
        .build()
        .expect("failed to build counter")
        .sampled(1024)
        .expect("failed to build sampler");

    let read = sampler.read_user();
    sampler.disable().unwrap();
    let value = sampler.read_full().unwrap();

    assert!(read.time_running() <= value.time_running().unwrap());
    assert!(read.time_enabled() <= value.time_enabled().unwrap());

    if let Some(count) = read.count() {
        assert!(count <= value.count(), "{count} <= {}", value.count());
    }
}

Usefully, libtest pointed out that the time_running method was returning a None variant. That did not surprise me, because I had removed this line from Builder::new:

builder.read_format(ReadFormat::TOTAL_TIME_ENABLED | ReadFormat::TOTAL_TIME_RUNNING);

Simply restoring this field for Hardware fixed this test. Restoring it for Software fixed the other. (Code)

Are these ‘fixes’ necessary?

At this point, I had established that perf_event2 was fully controllable, in the sense that it allowed me to set every field in every argument to perf_event_open explicitly. Were fixes necessary? In my view, yes: perf_event2 adds value by providing low-friction, unsurprising Rust APIs for the perf event subsystem. Without these changes, users of this crate would have to discover undocumented requirements of the perf_event_open syscall by experiment, as I had. However, my fixes were not yet in a state that I could try to upstream in good conscience.

Working backwards from the user

To genuinely improve the perf_event2 crate, and to upstream my changes, I will need to define what a good experience with this crate looks like, and then work backwards from that.

Requirements

  1. Small, simple, working examples for all event types, including:
    • Tracepoints
    • Probes
    • Software events
    • Hardware events
  2. Return diagnostically useful errors if the user tries to ask for something impossible.
  3. Opinionated defaults, so that this crate is as easy to use as the perf CLI.
  4. Greatly expanded test coverage.

Nice-to-have: removal of some unsafe code

While unsafe code is not intrinsically bad, it is more burdensome to maintain than the safe subset of Rust. I have identified a few usages of unsafe that are probably unnecessary. In particular, some of the unsafe transmutations in perf_event2 could be replaced with safe equivalents using zerocopy. See Jack’s blog post for an explanation of how zerocopy’s automated reasoning engine works.
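
To make the shape of that change concrete, the underlying pattern is reading struct fields out of a byte slice instead of transmuting the slice. Here is a stdlib-only sketch for struct perf_event_header (whose layout, a u32 type followed by a u16 misc and a u16 size, is defined in linux/perf_event.h); zerocopy automates and verifies this same pattern with derive macros:

```rust
/// Read the fields of a `perf_event_header` (u32 type, u16 misc, u16 size,
/// per linux/perf_event.h) out of a byte slice without any `unsafe`.
/// Returns None instead of panicking if the slice is too short.
fn parse_header(bytes: &[u8]) -> Option<(u32, u16, u16)> {
    let ty = u32::from_ne_bytes(bytes.get(0..4)?.try_into().ok()?);
    let misc = u16::from_ne_bytes(bytes.get(4..6)?.try_into().ok()?);
    let size = u16::from_ne_bytes(bytes.get(6..8)?.try_into().ok()?);
    Some((ty, misc, size))
}

fn main() {
    // A header with type 14 (context switch), misc 0, and a 24-byte record.
    let mut buf = Vec::new();
    buf.extend_from_slice(&14u32.to_ne_bytes());
    buf.extend_from_slice(&0u16.to_ne_bytes());
    buf.extend_from_slice(&24u16.to_ne_bytes());
    assert_eq!(parse_header(&buf), Some((14, 0, 24)));
    assert_eq!(parse_header(&buf[..4]), None); // too short: no panic, no UB
    println!("ok");
}
```
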


Thanks for reading! If you found this useful, check out dirty_mike on GitHub. I will also be publishing a follow-up with larger changes fixing perf_event2.