Background
Lately I’ve been working on dirty_mike, which is an experimental self-profiling library for Rust applications on Linux.
I have a need to take quick-and-dirty, ad-hoc measurements of arbitrary subsystems.
This is what dirty_mike does.
Linux’s perf event subsystem
To obtain data, dirty_mike uses Linux's perf event subsystem, a powerful system introspection facility that is the basis for the userspace perf application.
While perf can do everything dirty_mike can, it does not have a neat, easy-to-use Rust API.
Ultimately, that's why dirty_mike exists.
It makes use of the Rust crate perf_event2 to access the perf event subsystem.
This crate is a fork of the original perf_event.
While many improvements have been upstreamed, the most valuable ones for our purposes have not been.
perf_event2 wraps the C API with some niceties, and is well-documented enough to make it possible to write programs that inspect systems by just reading the RustDoc.
It is a natural fit for dirty_mike.
Hooking into kernel events
At the start of writing this blog post, dirty_mike could only take counting measurements.
I wanted to extend this capability to the sampled measurements produced by the kernel.
The API for sampling and for receiving arbitrary events published by the kernel is much the same.
The perf event subsystem works relatively simply, at least conceptually (a sketch of the raw flow follows this list):
- The kernel exposes named events which can be subscribed to. See Brendan Gregg’s examples.
- The perf_event_open API allows you to register your interest in such events.
- The kernel also agrees to write events to a ring buffer mapped into your process’s address space.
- The data put into this ring buffer follows the “schema” defined by the C structs in linux/perf_event.h.
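To make that flow concrete, here is a minimal sketch of the raw, crate-free version, written against the perf_event_open_sys bindings that perf_event2 itself builds on. The exact constant and field names are my best reading of the bindgen output, so treat it as illustrative rather than authoritative:
use perf_event_open_sys as sys;

fn main() {
    // 1. Describe the event we want in a perf_event_attr struct.
    let mut attrs = sys::bindings::perf_event_attr::default();
    attrs.size = std::mem::size_of::<sys::bindings::perf_event_attr>() as u32;
    attrs.type_ = sys::bindings::PERF_TYPE_SOFTWARE;
    attrs.config = sys::bindings::PERF_COUNT_SW_CONTEXT_SWITCHES as u64;

    // 2. Register our interest: this process (pid = 0), any CPU (cpu = -1).
    let fd = unsafe { sys::perf_event_open(&mut attrs, 0, -1, -1, 0) };
    assert!(fd >= 0, "perf_event_open failed");

    // 3. The kernel will now write records into a ring buffer that we would
    //    mmap from `fd`; their layout follows the structs in linux/perf_event.h.
    //    perf_event2 wraps steps 2 and 3 behind a safe API.
}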
The perf_event2 crate: idiomatic Rust wrapping the C API
Rather conveniently, the perf_event2 crate has safe, idiomatic Rust APIs for accessing these events.
Instead of wrangling C structs, we can just use tagged unions, like this one:
pub enum Record<'a> {
Mmap(Mmap<'a>),
Lost(Lost),
Comm(Comm<'a>),
Exit(Exit),
Throttle(Throttle),
Unthrottle(Throttle),
Fork(Fork),
Read(Read),
Sample(Box<Sample<'a>>),
Mmap2(Mmap2<'a>),
Aux(Aux),
ITraceStart(ITraceStart),
LostSamples(LostSamples),
Switch,
SwitchCpuWide(SwitchCpuWide),
Namespaces(Namespaces<'a>),
KSymbol(KSymbol<'a>),
BpfEvent(BpfEvent),
CGroup(CGroup<'a>),
TextPoke(TextPoke<'a>),
AuxOutputHwId(AuxOutputHwId),
// ... and so on
}
Limitations, surprises, and bugs in perf_event2
For dirty_mike to succeed, I am going to need a rough idea of the “performance envelope” of its main dependency, perf_event2.
Its RustDoc inherits this comment from the original crate:
Linux’s perf_event_open API can report all sorts of things this crate doesn’t yet understand: stack traces, logs of executable and shared library activity, tracepoints, kprobes, uprobes, and so on. And beyond the counters in the kernel header files, there are others that can only be found at runtime by consulting sysfs, specific to particular processors and devices. For example, modern Intel processors have counters that measure power consumption in Joules.
If you find yourself in need of something this crate doesn’t support, please consider submitting a pull request.
While this comment asserts that perf_event2 does not support tracepoints, kprobes or uprobes, it does have an Event trait, with implementations for Breakpoint, Cache, Dynamic, Hardware, KProbe, Raw, Software, Tracepoint and MSR.
This trait appears in the bound on the generic function Builder::new, so code asking for, say, tracepoints will compile, but it may not work.
Experimentation is needed.
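For example, this compiles happily (assuming Tracepoint lives under the events module alongside the other event types); whether it produces events at runtime is exactly what the rest of this post investigates:
use perf_event::events::Tracepoint;
use perf_event::Builder;

fn tracepoint_builder() -> Result<(), Box<dyn std::error::Error>> {
    // Type-checks, because Tracepoint implements the Event trait.
    let _builder = Builder::new(Tracepoint::with_name("block/block_rq_complete")?);
    Ok(())
}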
Let’s get some context switch events
I wrote some code (link) to try to subscribe to the simplest, most straightforward event in the perf event API: context switches.
use perf_event::{Builder, Group, events};
fn main() -> Result<(), Box<dyn std::error::Error>> {
let group = Group::builder().context_switch(true).build_group()?;
let ctr = group.into_counter();
let mut sampled = ctr.sampled(128)?;
sampled.enable()?;
for i in 0..10 {
println!("Waiting for event {}...", i);
match sampled.next_blocking(Some(std::time::Duration::from_millis(500))) {
Some(sample) => match sample.parse_record() {
Ok(record) => println!("Event record: {:?}", record),
Err(e) => eprintln!("Failed to parse record: {}", e),
},
None => {
eprintln!("timeout");
}
}
}
sampled.disable()?;
Ok(())
}
It did not work as I expected. Running it, I got:
❯ sudo ./target/debug/examples/experiment
Waiting for event 0...
timeout
Waiting for event 1...
Failed to parse record: unexpected EOF during parsing
Waiting for event 2...
Failed to parse record: unexpected EOF during parsing
Waiting for event 3...
timeout
It looked like the primary value-add of perf_event2 - a neat API - was not working as expected.
I’d now identified two unpleasant surprises, though not bugs per se.
- The first call to next_blocking always timed out, and if supplied with None it would hang indefinitely.
- I expected parse_record to return a Record::Switch variant, but instead it failed.
Why is this code not working?
Spend two minutes looking for dumb, obvious mistakes.
When troubleshooting, I often like to spend a few minutes in a methodology-free, exploratory search.
In my experience, powerful kernel APIs (such as KVM, eBPF, perf event) can be frustrating to use.
It’s very easy to supply faulty inputs, or violate subtle conditions documented in the headers.
As such, the very first thing to do is to look for stupid mistakes.
I scanned the perf_event.h header for admonitions, or easy-to-miss requirements.
I discovered that the perf event subsystem provides two distinct ways to gather context switch events (both sketched below):
- Set the context_switch bit in perf_event_attr (the main argument to the syscall perf_event_open).
- Alternatively, in the same struct, set the config field to PERF_COUNT_SW_CONTEXT_SWITCHES and set the type field to PERF_TYPE_SOFTWARE.
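Here is a rough sketch of how I understand the two mechanisms map onto perf_event2's builders; the context_switch knob is what my failing example already uses, while Software::CONTEXT_SWITCHES is my reading of the events module and may not be spelled exactly like this:
use perf_event::{events, Builder, Group};

fn both_mechanisms() -> Result<(), Box<dyn std::error::Error>> {
    // Mechanism 1: set the context_switch bit in perf_event_attr.
    let group = Group::builder().context_switch(true).build_group()?;
    let mut switches_a = group.into_counter().sampled(128)?;

    // Mechanism 2: ask for the software event PERF_COUNT_SW_CONTEXT_SWITCHES
    // (i.e. type = PERF_TYPE_SOFTWARE) and give it a sample period.
    let mut switches_b = Builder::new(events::Software::CONTEXT_SWITCHES)
        .sample_period(1)
        .build()?
        .sampled(128)?;

    switches_a.enable()?;
    switches_b.enable()?;
    Ok(())
}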
The existence of two distinct mechanisms to obtain context switches suggested a possible explanation for the surprising behavior. My hypothesis was now that the kernel reported context switches in two distinct ways, depending on how it was asked for them. In other words, I had a skill issue.
Am I even getting context switch events?
Testing this hypothesis was very straightforward.
The kernel had promised to deposit events into the ring buffer managed by perf_event2.
I knew, based on the output, that some iterations of the loop were taking the branch that received some event.
Therefore, something was appearing in this ring buffer, even though perf_event2 couldn't parse it.
Looking at the return value of next_blocking, I saw Option<Record<'a>>, which had some methods I could use to inspect what was being pulled off the ring buffer.
Important to note: this Record struct is distinct from the enum of the same name.
Changing the code to print out the ty (link), I saw that the record's type was 14.
I consulted the header to confirm that 14 indeed corresponded to a context switch event.
Then I used the len method to confirm what I already strongly suspected: the record was 0-sized, which led parse_record to return an Err variant.
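In outline, the inspection looked something like this (ty and len being the Record accessors mentioned above; see the linked code for the precise version):
use perf_event::Group;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let group = Group::builder().context_switch(true).build_group()?;
    let mut sampled = group.into_counter().sampled(128)?;
    sampled.enable()?;

    for _ in 0..10 {
        if let Some(record) = sampled.next_blocking(Some(std::time::Duration::from_millis(500))) {
            // On my machine this printed: ty = 14 (PERF_RECORD_SWITCH), len = 0
            println!("ty = {}, len = {}", record.ty(), record.len());
        }
    }
    Ok(())
}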
Great!
Now I knew I was getting events.
While I had not found a significant bug, I did find an opportunity to add some more diagnostically useful methods to the Record type.
The pro-social thing to do is to make it easier for the next person to debug.
Doing a bit of clicking around docs.rs, I found that, as expected, perf_event2 depended on some bindgen glue code.
Happily, bindgen translated the constants from the header to Rust code, but they were sequestered in the perf_event_open_sys crate.
It would therefore be relatively easy to make this code nicer to use.
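For instance, a small helper along these lines would already help, assuming the bindgen constant keeps the header's name (PERF_RECORD_SWITCH is 14 in linux/perf_event.h):
use perf_event_open_sys::bindings;

// Hypothetical helper: translate a raw record type back into something readable.
fn is_context_switch(ty: u32) -> bool {
    ty == bindings::PERF_RECORD_SWITCH
}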
Should perf_event2 unify the APIs for both ways of obtaining context switches?
From my perspective it violates the least-surprise principle, but others may be accustomed to, and indeed depend on, this idiosyncrasy of perf_event_open.
I defer the decision on what color to paint this particular bikeshed, at least for now.
Static tracepoints also not working
I then tried to subscribe to static tracepoints, against the advice of the documentation.
The first thing I did was to sample the CPU every 1000 cycles, check that that code worked, and then swap out the CPU_CYCLES event for a static tracepoint, but with a much lower sample period.
let tp = events::Hardware::CPU_CYCLES;
let mut sampler = Builder::new(tp)
.sample_period(1_000)
.build()?
.sampled(8192)?;
Clearly, the CPU sampling code was working:
❯ sudo ./target/debug/examples/experiment
Event record 0: Sample(
Sample { .. },
)
However, when I tried the static tracepoint block/block_rq_complete, the loop just hung.
In addition, I found that the CPU sampling code would hang after 67 iterations.
Perhaps I was not pulling items off the circular buffer after all.
To rule out permissions issues or a misconfigured kernel, I ran the perf CLI sampling this static tracepoint:
❯ sudo perf record -e block:block_rq_complete -a -- sleep 10
[ perf record: Woken up 1 times to write data ]
[ perf record: Captured and wrote 1.725 MB perf.data (203 samples) ]
❯ sudo perf report --stdio
# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 203 of event 'block:block_rq_complete'
# Event count (approx.): 203
#
# Overhead Command Shared Object Symbol
# ........ ............... ................. ......................
#
92.12% swapper [kernel.kallsyms] [k] blk_update_request
2.46% usb-storage [kernel.kallsyms] [k] blk_update_request
1.97% kworker/u8:24-f [kernel.kallsyms] [k] blk_update_request
0.99% ksoftirqd/0 [kernel.kallsyms] [k] blk_update_request
0.99% kworker/u8:0-bt [kernel.kallsyms] [k] blk_update_request
0.49% DefaultDispatch [kernel.kallsyms] [k] blk_update_request
0.49% ThreadPoolForeg [kernel.kallsyms] [k] blk_update_request
0.49% kworker/u8:6-bt [kernel.kallsyms] [k] blk_update_request
Since this worked, I concluded that the problem was with my program.
Analysis
I had picked up a few issues using perf_event2.
To make effective use of it, I needed to sort through all the data I had and determine what was a perf_event2 bug, what was my fault, and what was simply an artifact of how the perf event subsystem works.
To summarize, the following issues were in play:
- The spurious failures parsing 0-sized samples.
- The mystery of the gummed up ring buffer.
- Static tracepoints don’t work at all, and they don’t explicitly fail either.
At this point I decided that the highest-leverage avenue of investigation would be to check out perf_event2.
I cloned a copy of the repo onto my machine and set up a git worktree to conveniently vendor it into my workspace.
Finding call sites of perf_event_open
Fortunately, the centre-of-mass of the perf event subsystem is concentrated around the perf_event_open syscall.
I needed to identify how perf_event2 converted my inputs to calls to this function.
Conveniently, there is only one call site of this syscall in the entire crate:
let result = check_errno_syscall(|| unsafe {
sys::perf_event_open(&mut attrs, pid, cpu, group_fd, flags as c_ulong)
});
As a formality, I ran the perf CLI under strace to double check that it definitely used the same syscall, and that I wasn't completely off-base:
❯ sudo cat /sys/kernel/debug/tracing/events/block/block_rq_complete/id
1367
❯ sudo strace -v -e perf_event_open -- perf record -e block:block_rq_complete -a -- sleep 10 2>&1 >/dev/null | rg 1367
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 0, -1, PERF_FLAG_FD_CLOEXEC) = 4
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 1, -1, PERF_FLAG_FD_CLOEXEC) = 5
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 2, -1, PERF_FLAG_FD_CLOEXEC) = 6
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1, sample_type=PERF_SAMPLE_IP|PERF_SAMPLE_TID|PERF_SAMPLE_TIME|PERF_SAMPLE_CPU|PERF_SAMPLE_PERIOD|PERF_SAMPLE_RAW|PERF_SAMPLE_IDENTIFIER, read_format=PERF_FORMAT_ID|PERF_FORMAT_LOST, disabled=1, inherit=1, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=0, exclude_hv=0, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=1, exclude_host=0, exclude_guest=1, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, -1, 3, -1, PERF_FLAG_FD_CLOEXEC) = 8
This was an unexpected discovery: perf makes one perf_event_open call for each logical CPU on my system (a nine-year-old dual-core ThinkPad).
My code was not doing that:
❯ sudo strace -v -e perf_event_open -- ./target/debug/examples/experiment
perf_event_open({type=PERF_TYPE_TRACEPOINT, size=0x88 /* PERF_ATTR_SIZE_??? */, config=1367, sample_period=1000, sample_type=0, read_format=PERF_FORMAT_TOTAL_TIME_ENABLED|PERF_FORMAT_TOTAL_TIME_RUNNING, disabled=1, inherit=0, pinned=0, exclusive=0, exclude_user=0, exclude_kernel=1, exclude_hv=1, exclude_idle=0, mmap=0, comm=0, freq=0, inherit_stat=0, enable_on_exec=0, task=0, watermark=0, precise_ip=0 /* arbitrary skid */, mmap_data=0, sample_id_all=0, exclude_host=0, exclude_guest=0, exclude_callchain_kernel=0, exclude_callchain_user=0, mmap2=0, comm_exec=0, use_clockid=0, context_switch=0, write_backward=0, namespaces=0, ksymbol=0, bpf_event=0, aux_output=0, cgroup=0, text_poke=0, build_id=0, inherit_thread=0, remove_on_exec=0, sigtrap=0, wakeup_events=0, config1=0, config2=0, sample_regs_user=0, sample_regs_intr=0, aux_watermark=0, sample_max_stack=0, aux_sample_size=0, sig_data=0, config3=0}, 0, -1, -1, PERF_FLAG_FD_CLOEXEC) = 3
I was now in a position to test the hypothesis that perf_event2 was not faithfully mapping my inputs to perf_event_open.
Aside: what do the em-dash enthusiasts think?
At this point I asked Claude (Sonnet 4) and ChatGPT (o3) for an explanation as to which parameters made a difference. Claude merely style-transferred a diff of the two entries to a wall of unhelpful markdown. Impressively – or so I thought – o3 saw the problem immediately in its chain of thought, and then explained it to me:
TL;DR
#1 is a sampler – it fires an interrupt every single time the trace-point hits and pushes a full sample record into an mmap ring.
#2 is a counter – it never generates samples at all; you poll/read it later to get an aggregate hit-count (+ timing scalars).
Everything else (inheritance flags, what execution modes are excluded, what lands in a read(), etc.) follows from that basic design choice.
It then expanded:
Setting sample_type = 0 for a trace-point almost always means “I meant to use a counter, not a trace-point”
That is not stated in the man page. A useful piece of information, if true.
Reproducing perf's behavior
The thorough thing to do at this point was to set all arguments in my code to match the arguments that perf used, and then to experiment with what effect they had.
Maximum laziness
But, if o3 was correct, then all I had to do was ensure that perf_event_attr's sample_type field was given the right bit flags.
Looking at the arguments to perf_event_open, I learned that Builder has a field which is precisely the perf_event_attr struct:
#[derive(Clone)]
pub struct Builder<'a> {
attrs: perf_event_attr,
...
}
Searching for code that writes to the sample_type field, I found the Builder::sample method.
How convenient!
Disappointingly, merely setting the sample_type field to be identical to the working example did not help:
.sample(
SampleFlag::RAW
| SampleFlag::IP
| SampleFlag::TID
| SampleFlag::TIME
| SampleFlag::CPU
| SampleFlag::PERIOD
| SampleFlag::IDENTIFIER,
)
No shortcuts for me
The shortcut suggested by o3 did not work.
I made sure that the final perf_event_open syscall would have identical arguments to the working example:
let mut sampler = Builder::new(tp)
.any_pid()
.one_cpu(1)
.include_kernel()
.inherit(true)
.sample_id_all(true)
.exclude_guest(true)
.sample(
SampleFlag::RAW
| SampleFlag::IP
| SampleFlag::TID
| SampleFlag::TIME
| SampleFlag::CPU
| SampleFlag::PERIOD
| SampleFlag::RAW
| SampleFlag::IDENTIFIER,
)
.read_format(ReadFormat::ID | ReadFormat::LOST)
.sample_period(1)
.build()?
.sampled(8192)?;
After confirming with strace that this had the intended effect on the arguments provided to the syscall, I tried this incantation out and, as expected, the tracepoint events appeared in the ring buffer.
Also, the event record parser worked:
Event record 26: Sample(
Sample {
ip: 18446744071855823341,
pid: 499,
tid: 499,
time: 900888146338066,
id: 3554,
cpu: 1,
period: 1
I had established that perf_event2 did indeed support tracepoints.
The hypothesis that it was not faithfully providing my inputs to perf_event_open was now disconfirmed.
Providing excruciatingly specific flags worked.
Obtaining a minimal reproducible sample of the bug
Of the dozen-or-so flags and arguments that could be responsible for failing to pick up tracepoints, only a few actually wound up mattering. I minimized the code required to receive tracepoints:
let mut sampler = Builder::new(tp)
.any_pid()
.one_cpu(1)
.include_kernel()
.sample(SampleFlag::RAW)
.sample_period(1)
.build()?
.sampled(8192)?;
Sure, this code “works”, but it would be better if Builder did what it said it would do when asking for tracepoints.
Ideally, you'd only have to call the new function.
Fixing the bugs
I noticed that the Event trait, which Tracepoint implements, simply requires its implementors to place the correct settings in the perf_event_attr struct.
pub trait Event: Sized {
// Required method
fn update_attrs(self, attr: &mut perf_event_attr);
// Provided method
fn update_attrs_with_data(
self,
attr: &mut perf_event_attr,
) -> Option<Arc<dyn EventData>> { ... }
}
I had hoped that changing just one impl block (code) would have populated Builder with the right settings.
impl Event for Tracepoint {
fn update_attrs(self, attr: &mut bindings::perf_event_attr) {
attr.set_exclude_kernel(0);
attr.sample_type |= crate::SampleFlag::RAW.bits();
attr.type_ = bindings::PERF_TYPE_TRACEPOINT;
attr.config = self.id;
}
}
Alas, Builder::new immediately overwrote several values set by update_attrs:
// Do the update_attrs bit before we set any of the default state so
// that user code can't break configuration we really care about.
let data = event.update_attrs_with_data(&mut attrs);
// Setting `size` accurately will not prevent the code from working
// on older kernels. The module comments for `perf_event_open_sys`
// explain why in far too much detail.
attrs.size = std::mem::size_of::<perf_event_attr>() as u32;
let mut builder = Self {
attrs,
who: EventPid::ThisProcess,
cpu: None,
event_data: data,
};
builder.enabled(false);
builder.exclude_kernel(true);
builder.exclude_hv(true);
I ran git blame on the comment that says that overwriting settings provided by the event is the right thing to do:
commit 70729063dd10462f33d049007ba074374f8a9ead
Author: Phantomical <phantom@lynches.ca>
Date: Wed Apr 12 23:54:38 2023 -0700
Convert Event from an enum to a trait
This opens up uses of enum so that others outside the crate can add
their own event types. Ideally we would support all the relevant events
within the crate but that's not always feasible. Having this be a trait
which just sets some fields within the group will mean that users of the
crate have the ability to implement something even if we haven't added
support explicitly.
I understood this to mean that implementors of Event should actually be able to overwrite whatever fields they want.
I removed the code that overwrites fields set by the event.
Coupled with the change above, it meant that the following code was sufficient to subscribe to tracepoints:
let mut sampler = Builder::new(Tracepoint::with_name("block/block_rq_complete")?)
.any_pid()
.one_cpu(1)
.sample_period(1)
.build()?
.sampled(8192)?;
Much better! Static tracepoints now worked!
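To watch every CPU the way perf does, I would presumably open one sampler per logical CPU, along these lines (a sketch, with num_cpus supplied by the caller):
use perf_event::events::Tracepoint;
use perf_event::Builder;

fn sampler_per_cpu(num_cpus: usize) -> Result<(), Box<dyn std::error::Error>> {
    let mut samplers = Vec::new();
    for cpu in 0..num_cpus {
        // One perf_event_open call per logical CPU, as the strace of perf showed.
        let sampler = Builder::new(Tracepoint::with_name("block/block_rq_complete")?)
            .any_pid()
            .one_cpu(cpu)
            .sample_period(1)
            .build()?
            .sampled(8192)?;
        samplers.push(sampler);
    }
    // ... drain each sampler's ring buffer as before ...
    Ok(())
}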
There is a clear alternative to changing the new function.
I could have added a new function, named something like Builder::new_no_default.
That would preserve the existing behavior of Builder::new.
It would also probably be easier to upstream than the breaking change.
But, I was not convinced that preserving API-compatibility would be worthwhile when the API is actively broken.
To lessen the impact, each implementor of Event could write the correct values it needs.
This necessitated a larger change.
It was a good opportunity to introduce a number of related improvements to the code.
Breaking changes break tests
Fortunately, the change I made broke two of perf_event2's unit tests.
The first broken test was this one:
#[test]
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn test_sampler_rdpmc() {
let mut sampler = Builder::new(events::Hardware::INSTRUCTIONS)
.enabled(true)
.build()
.expect("failed to build counter")
.sampled(1024)
.expect("failed to build sampler");
let read = sampler.read_user();
sampler.disable().unwrap();
let value = sampler.read_full().unwrap();
assert!(read.time_running() <= value.time_running().unwrap());
assert!(read.time_enabled() <= value.time_enabled().unwrap());
if let Some(count) = read.count() {
assert!(count <= value.count(), "{count} <= {}", value.count());
}
}
Usefully, libtest pointed out that the time_running method was returning a None variant.
That did not surprise me, because I had removed this line from Builder::new:
builder.read_format(ReadFormat::TOTAL_TIME_ENABLED | ReadFormat::TOTAL_TIME_RUNNING);
Simply restoring this field for Hardware fixed this test.
Restoring it for Software fixed the other.
(Code)
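The shape of the fix, very roughly (the linked code is authoritative; ReadFormat::bits is my assumption, mirroring the SampleFlag::bits call shown earlier):
use perf_event::ReadFormat;
use perf_event_open_sys::bindings;

// Sketch: a counting event re-adds the read_format bits that Builder::new
// used to set unconditionally, so the old tests keep passing.
fn restore_default_read_format(attr: &mut bindings::perf_event_attr) {
    attr.read_format |=
        (ReadFormat::TOTAL_TIME_ENABLED | ReadFormat::TOTAL_TIME_RUNNING).bits();
}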
Are these ‘fixes’ necessary?
At this point, I had established that perf_event2 was fully controllable, in the sense that it allowed me to explicitly set every field in every argument to perf_event_open.
Were fixes necessary?
It was my opinion that these fixes were necessary, because perf_event2 adds value by providing low-friction, unsurprising Rust APIs to access the perf event subsystem.
Without these changes, users of this crate would have to discover undocumented requirements of the perf_event_open syscall by experiment, as I had.
However, my fixes were not yet in a state that I could try to upstream in good conscience.
Working backwards from the user
To genuinely improve the perf_event2 crate, and to upstream my changes, I will need to define what a good experience using this crate is, and then work backwards from that.
Requirements
- Small, simple, working examples for all event types, including:
  - Tracepoints
  - Probes
  - Software events
  - Hardware events
- Return diagnostically useful errors if the user tries to ask for something impossible.
- Opinionated defaults, so that this crate is as easy to use as the perf CLI.
- Greatly expanded test coverage.
Nice-to-have: removal of some unsafe code
While unsafe code is not intrinsically bad, it is more burdensome to maintain than the safe subset of Rust.
I have identified a few usages of unsafe which are probably unnecessary.
In particular, some of the unsafe transmutations in perf_event2 could be replaced with safe equivalents, using zerocopy.
See Jack's blog post for an explanation of how zerocopy's automated reasoning engine works.
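To give a flavour of what that could look like, here is a minimal sketch using zerocopy 0.8's derives and read_from_prefix, with a made-up RecordHeader standing in for the real perf_event.h structs:
use zerocopy::{FromBytes, Immutable, KnownLayout};

// Hypothetical stand-in for a perf_event.h record header; the derives let us
// decode it from ring-buffer bytes without a hand-written transmute.
#[derive(FromBytes, KnownLayout, Immutable, Debug)]
#[repr(C)]
struct RecordHeader {
    ty: u32,
    misc: u16,
    size: u16,
}

fn parse_header(bytes: &[u8]) -> Option<RecordHeader> {
    // read_from_prefix copies the leading bytes into an owned value and
    // returns the rest of the slice alongside it.
    RecordHeader::read_from_prefix(bytes)
        .ok()
        .map(|(header, _rest)| header)
}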
Thanks for reading! If you found this useful, check out dirty_mike on GitHub.
I will also be publishing a follow-up with larger changes fixing perf_event2.