Notes on porting Wild - One must imagine /sys/fs happy

Introduction

Wild

Wild is a new open source linker, written by David Lattimore. It overdelivers on its promises to be wildly fast. I’m obsessed with toolchain speed, so naturally, I gravitated to Wild. At first, I was cautiously optimistic, but found that it just worked as a drop-in replacement for LLD and Mold. Integrating Wild into my development workflow has made a noticeable difference to my iteration speed. For example, Wild can link Clang in 120ms, which is incredibly fast. That’s less than the time it takes a packet to leave my laptop in Johannesburg and arrive at my development machine in Dublin.

Breaking out of the Linux monoculture

A few months ago, Colin Percival wrote up a blog post detailing his recent work on FreeBSD. As I read this blog post, it dawned on me that I have spent almost all of my career immersed in a Linux monoculture, and that this was probably to my detriment. At the same time, I’ve been meaning to learn how to use DTrace for a while. For the last few weeks, I’ve been running FreeBSD and Illumos (OmniOS), and it’s been an incredibly fun experience. While writing Rust on Illumos, I noticed my builds getting snagged on something slow, right at the end, again and again. I quickly confirmed my suspicions that it was the linker, and set about replacing it with Wild.

Running Wild on Illumos

Like Linux and BSD, binaries on the Illumos system are linked based on the System V ABI. It was reasonable to simply try Wild and see what happened. I checked out and ran Wild and its test suite (at f2f9776):

test result: FAILED. 10 passed; 67 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.25s

Of the 77 integration tests, only 10 passed. These tests are small, self-contained programs that are linked by Wild. Each integration test consists of four steps:

Compile the source code.
Link with Wild and GNU ld (and possibly Mold and Gold if present).
Run the resulting executable.
Compare the output of Wild against the other linkers.

I decided to try to identify why these tests were failing and to fix them, one by one. On analyzing the first test case, I found that the program Wild linked received a SIGKILL as soon as I tried to run it. That is, it was failing at the third step:

❯ /home/omnios/wild/wild/tests/build/trivial.c-default-host.wild
zsh: killed     /home/omnios/wild/wild/tests/build/trivial.c-default-host.wild

When the exact same program was linked by GNU ld, it ran just fine. Wild’s test suite comes with an invaluable tool - linker-diff - for comparing the output of Wild to other linkers. I ran the linker-diff tool, and found that there were dozens of differences between the binaries, and nothing stood out as clearly wrong in the binary Wild produced. At this point, my instinct was to trace this program in GDB - surely the program was doing something illegal, and I could single step until I found it? After all, it was a trivial program of just a few hundred instructions. I ran it under GDB and placed a breakpoint on the entry point, which was the symbol _start:

(gdb) b _start
Breakpoint 1 at 0x401374
(gdb) r
Starting program: /home/omnios/wild/wild/tests/build/trivial.c-default-host.wild
During startup program terminated with signal SIGKILL, Killed.
(gdb)

I double checked that _start was indeed the entry point as designated in the ELF file. SIGKILL? What gives?
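For readers who want to repeat the entry point check without readelf, here is a minimal Python sketch, assuming a little-endian ELF64 file. It reads the e_entry field straight out of the ELF header. The helper names are hypothetical, and the header is built synthetically so that the example runs anywhere; the address 0x401370 is the entry point that readelf reports for this binary later in the post.

```python
import struct

ELF_MAGIC = b"\x7fELF"


def elf64_entry_point(header: bytes) -> int:
    """Return e_entry from the first 64 bytes of a little-endian ELF64 file."""
    if header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    if header[4] != 2 or header[5] != 1:  # EI_CLASS=ELFCLASS64, EI_DATA=ELFDATA2LSB
        raise ValueError("only little-endian ELF64 is handled in this sketch")
    # e_entry is the 8-byte field at offset 24, after e_ident (16 bytes),
    # e_type (2 bytes), e_machine (2 bytes) and e_version (4 bytes).
    (entry,) = struct.unpack_from("<Q", header, 24)
    return entry


def synthetic_header(entry: int) -> bytes:
    """Build a fake 64-byte ELF64 header for demonstration (hypothetical helper)."""
    e_ident = ELF_MAGIC + bytes([2, 1, 1, 0]) + bytes(8)
    # e_type=ET_EXEC (2), e_machine=EM_X86_64 (62), e_version=1, then e_entry.
    fields = struct.pack("<HHIQ", 2, 62, 1, entry)
    return e_ident + fields + bytes(32)  # remaining header fields zeroed


print(hex(elf64_entry_point(synthetic_header(0x401370))))  # 0x401370
```

Against a real file, the same function can be fed the first 64 bytes read from disk. GDB reports the breakpoint a few bytes past this address (0x401374) because it places breakpoints after the function prologue.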

Enter DTrace

At this point, I began to try to identify what was sending my program the SIGKILL by tracing it with DTrace. I knew that the system had tens of thousands of DTrace probes. I grepped around for signal-related probes, and filtered out those belonging to the function boundary tracing (fbt) provider:

❯ sudo dtrace -l | grep -i signal | grep -iv fbt
  314    syscall                                     lwp_cond_signal entry
  315    syscall                                     lwp_cond_signal return
  702       proc           genunix                      sigtimedwait signal-clear
  705       proc           genunix                         sigtoproc signal-send
  706       proc           genunix                         sigtoproc signal-discard
  707       proc           genunix                              psig signal-handle
 2203        sdt                ip                    cc_cong_signal cwnd-cc-cong-signal

This felt a little bit like cheating. It couldn’t be that easy to hook into whatever is sending a signal, right? Wrong! It is that easy. In fact, the probe even comes with documentation:

❯ sudo dtrace -lv -n signal-send
   ID   PROVIDER            MODULE                          FUNCTION NAME
  705       proc           genunix                         sigtoproc signal-send

        Probe Description Attributes
                Identifier Names: Private
                Data Semantics:   Private
                Dependency Class: Unknown

        Argument Attributes
                Identifier Names: Evolving
                Data Semantics:   Evolving
                Dependency Class: ISA

        Argument Types
                args[0]: lwpsinfo_t *
                args[1]: struct psinfo *
                args[2]: int

I confirmed in the online documentation that the third argument was the signal number. I then wrote a one-liner to print a backtrace when this probe fired, and then re-ran my program in another tmux pane:

❯ sudo dtrace -n 'proc:::signal-send /args[2] == 9/ { stack(); }'
dtrace: description 'proc:::signal-send ' matched 1 probe
CPU     ID                    FUNCTION:NAME
  4    705            sigtoproc:signal-send
              genunix`psignal+0x34
              elfexec`elfexec+0x480
              genunix`gexec+0x667
              genunix`exec_common+0x73b
              genunix`exece+0x58
              unix`sys_syscall+0x1a8

The second frame in the call stack - elfexec+0x480 - is clearly in the ELF loader. I now had the exact address of the code that was rejecting the binary Wild produced.

At this point I asked Claude for advice, and it suggested using mdb to disassemble the code at this kernel address, which I duly ran:

❯ TERM=xterm sudo mdb -k
Loading modules: [ unix genunix specfs dtrace mac cpu.generic uppc apix scsi_vhci zfs ip hook neti sockfs arp usba stmf stmf_sbd mm lofs sata sd random cpc ufs logindmux ptm klmmod nfs ]
> elfexec+0x480::dis
elfexec+0x450:                  xorl   %r13d,%r13d
elfexec+0x453:                  xorl   %ebx,%ebx
elfexec+0x455:                  call   +0x7b39286       <uprintf>
elfexec+0x45a:                  movl   -0xf0(%rbp),%edi
elfexec+0x460:                  cmpl   $-0x1,%edi       <0xffffffff>
elfexec+0x463:                  jne    +0x97    <elfexec+0x500>
elfexec+0x469:                  movq   0xffffffffffffff00(%rbp),%rdi
elfexec+0x470:                  movl   $0x9,%esi
elfexec+0x475:                  movl   $0x8,%r15d
elfexec+0x47b:                  call   +0x7b73ea0       <psignal>
elfexec+0x480:                  testl  %r13d,%r13d
elfexec+0x483:                  je     +0x11    <elfexec+0x496>
elfexec+0x485:                  movq   -0x98(%rbp),%rdi
elfexec+0x48c:                  movl   $0x38,%esi
elfexec+0x491:                  call   +0x7afc35a       <kmem_free>
elfexec+0x496:                  movq   -0xe0(%rbp),%rdi
elfexec+0x49d:                  testq  %rdi,%rdi
elfexec+0x4a0:                  je     +0x9     <elfexec+0x4ab>
elfexec+0x4a2:                  movq   -0x80(%rbp),%rsi
elfexec+0x4a6:                  call   +0x7afc345       <kmem_free>
elfexec+0x4ab:                  testq  %rbx,%rbx

Unfortunately, the OmniOS kernel that I was using did not have debug symbols built in, and I could not find an easy way to obtain them. Had I had debug symbols, the next few steps I took would have been unnecessary.

Anyhow, I opened up the source code at the branch used to stamp out this kernel (usefully, the exact commit is obtainable from uname -a: omnios-r151054-6ad70ba62c). I then navigated to the definition of elfexec and searched for callsites of psignal. Luckily, there was only one:

bad:
    if (fd != -1)        /* did we open the a.out yet */
        (void) execclose(fd);

    psignal(p, SIGKILL);

Now, all I had to do was find the branch that jumped here with a goto bad. Searching for goto bad, I was disheartened to find 24 results, but I very quickly noticed most of them were gated by this if statement:

if (intphdr != NULL) {
    /// ...
    // Many of the gotos live here.
    /// ...
}

Tracing the data flow into this intphdr pointer, I found that it was passed to the function mapelfexec:

static int
mapelfexec(
    vnode_t *vp,
    Ehdr *ehdr,
    uint_t nphdrs,
    caddr_t phdrbase,
    Phdr **uphdr,
    Phdr **intphdr,
    Phdr **stphdr,
    Phdr **dtphdr,
    Phdr *dataphdrp,
    caddr_t *bssbase,
    caddr_t *brkbase,
    intptr_t *voffset,
    uintptr_t *minaddrp,
    size_t len,
    size_t *execsz,
    size_t *brksize)

and within this function, the pointer behind the intphdr argument was set inside this branch:

        case PT_INTERP:
            /*
             * The ELF specification is unequivocal about the
             * PT_INTERP program header with respect to any PT_LOAD
             * program header:  "If it is present, it must precede
             * any loadable segment entry." Linux, however, makes
             * no attempt to enforce this -- which has allowed some
             * binary editing tools to get away with generating
             * invalid ELF binaries in the respect that PT_INTERP
             * occurs after the first PT_LOAD program header.  This
             * is unfortunate (and of course, disappointing) but
             * it's no worse than that: there is no reason that we
             * can't process the PT_INTERP entry (if present) after
             * one or more PT_LOAD entries.  We therefore
             * deliberately do not check ptload here and always
             * store dyphdr to be the PT_INTERP program header.
             */
            *intphdr = phdr;
            break;

How interesting! I’d found a warning about dodgy ELF files being allowed on Linux and not Illumos. I was getting closer. I then checked the binary Wild produced, to see what program headers it actually had:

❯ readelf -l /home/omnios/wild/wild/tests/build/trivial.c-default-host.wild

Elf file type is EXEC (Executable file)
Entry point 0x401370
There are 4 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000400040 0x0000000000400040
                 0x00000000000000e0 0x00000000000000e0  R      0x8
  LOAD           0x0000000000000000 0x0000000000400000 0x0000000000400000
                 0x0000000000000370 0x0000000000000370  R      0x1000
  LOAD           0x0000000000000370 0x0000000000401370 0x0000000000401370
                 0x0000000000000031 0x0000000000000031  R E    0x1000
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10

 Section to Segment mapping:
  Segment Sections...
   00
   01     .eh_frame
   02     .text
   03

Well, that was disappointing: the particulars of the block comment advising caution were actually irrelevant. The branch that sets intphdr could not possibly have been taken, because this binary doesn’t have a PT_INTERP program header. Using this information, I now knew that the if (intphdr != NULL) branch also could not have been taken. That left only three branches:

if (error != 0)
    goto bad;

if (uphdr != NULL && intphdr == NULL)
    goto bad;

if (dtrphdr != NULL && dtrace_safe_phdr(dtrphdr, args, voffset) != 0) {
    uprintf("%s: Bad DTrace phdr in %s\n", exec_file, exec_file);
    goto bad;
}

The first and third branches were trivial to rule out following the exact same flow-of-data analysis I did for intphdr. Again, I would have preferred to have debug symbols, and to have used mdb, but I did not wind up needing them.

Nonetheless, I was astounded. It took me only 15 minutes to find the line of code in the kernel that rejected the binary produced by Wild. Concretely, Illumos disallowed the loading of a binary that had a PHDR program header without a corresponding PT_INTERP program header. Note that the nomenclature can be quite confusing - PHDR self-referentially refers to the byte slice within the ELF file that contains the program headers.
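The rejected condition is easy to test for from userland. The following is a minimal Python sketch, assuming a little-endian ELF64 image; it is not the kernel's code, and the function names are hypothetical. It walks the program header table and reports whether a PT_PHDR entry is present without a PT_INTERP entry, which is the combination that the uphdr != NULL && intphdr == NULL branch turns into a SIGKILL. The demo image is synthetic and mirrors the four header types that readelf listed above.

```python
import struct

PT_LOAD, PT_INTERP, PT_PHDR = 1, 3, 6
PT_GNU_STACK = 0x6474E551


def program_header_types(image: bytes) -> list[int]:
    """Return p_type for each program header of a little-endian ELF64 image."""
    if image[:4] != b"\x7fELF" or image[4] != 2 or image[5] != 1:
        raise ValueError("expected a little-endian ELF64 image")
    (e_phoff,) = struct.unpack_from("<Q", image, 32)  # program header table offset
    e_phentsize, e_phnum = struct.unpack_from("<HH", image, 54)
    # p_type is the first 4-byte field of every program header entry.
    return [
        struct.unpack_from("<I", image, e_phoff + i * e_phentsize)[0]
        for i in range(e_phnum)
    ]


def phdr_without_interp(types: list[int]) -> bool:
    """True for the combination described above: PT_PHDR but no PT_INTERP."""
    return PT_PHDR in types and PT_INTERP not in types


def synthetic_image(types: list[int]) -> bytes:
    """Build a fake ELF64 image holding only headers (hypothetical helper)."""
    e_ident = b"\x7fELF" + bytes([2, 1, 1, 0]) + bytes(8)
    header = e_ident + struct.pack(
        "<HHIQQQIHHHHHH",
        2, 62, 1,         # e_type=ET_EXEC, e_machine=EM_X86_64, e_version
        0x401370, 64, 0,  # e_entry, e_phoff, e_shoff
        0, 64, 56,        # e_flags, e_ehsize, e_phentsize
        len(types), 0, 0, 0,  # e_phnum, e_shentsize, e_shnum, e_shstrndx
    )
    phdrs = b"".join(struct.pack("<I", t) + bytes(52) for t in types)
    return header + phdrs


wild_like = synthetic_image([PT_PHDR, PT_LOAD, PT_LOAD, PT_GNU_STACK])
print(phdr_without_interp(program_header_types(wild_like)))  # True
```

Adding a PT_INTERP entry to the list, or dropping the PT_PHDR entry, makes the check return False.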

I duly filed an issue explaining my findings, and then followed up with a very janky workaround PR. This PR ignored the -C flag expected by the system linker, and manually set the expected dynamic linker on Illumos. We were now passing 27 tests:

test result: FAILED. 27 passed; 51 failed; 0 ignored; 0 measured; 0 filtered out; finished in 0.49s

Linking Rust programs with Wild

The simplest way to link Rust programs with Wild is to instruct Cargo to delegate linking to the Clang driver, and to pass the Clang driver the name of a linker:

[target.x86_64-unknown-illumos]
linker = "clang"
rustflags = [
    "-C", "link-arg=--ld-path=/home/omnios/.cargo/bin/wild",
]

Unfortunately, this simply did not work. First, I noticed the painful slowness of the link, and then I checked for the calling card Wild leaves in the .comment section. It was nowhere to be found:

❯ readelf -p.comment target/debug/rg

String dump of section '.comment':
  [     1]  rustc version 1.90.0 (1159e78c4 2025-09-14)
  [    2d]  GCC: (OmniOS 151054/14.2.0-il-1) 14.2.0
  [    55]  @(#)illumos  May 2025

I then tried using the older -fuse-ld argument, to no avail:

  = note: clang: error: invalid linker name in argument '-fuse-ld=wild'

After confirming Wild was indeed on the $PATH, I passed in its absolute path:

[target.x86_64-unknown-illumos]
linker = "clang"
rustflags = [
    "-C", "link-arg=-fuse-ld=/home/omnios/.cargo/bin/wild",
]

Success!

❯ readelf -p.comment target/debug/rg

String dump of section '.comment':
  [     0]  GCC: (OmniOS 151054/14.2.0-il-1) 14.2.0
  [    29]  @(#)illumos  May 2025
  [    3f]  rustc version 1.90.0 (1159e78c4 2025-09-14)
  [    6b]  Linker: Wild version 0.6.0
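The calling card check can also be automated. This is a minimal Python sketch, assuming the raw bytes of the .comment section have already been extracted from the binary by some other means. The section is a run of NUL-terminated strings, so the check is a split and a prefix match. The function names are hypothetical, the "Linker: Wild" prefix is taken from the dump above rather than from Wild's documentation, and the sample bytes are retyped from the two readelf dumps in this post.

```python
def comment_strings(section: bytes) -> list[str]:
    """Split the raw bytes of a .comment section into its NUL-terminated strings."""
    return [s.decode("utf-8", errors="replace") for s in section.split(b"\x00") if s]


def linked_by_wild(section: bytes) -> bool:
    """True if any string in the section starts with Wild's calling card."""
    return any(s.startswith("Linker: Wild") for s in comment_strings(section))


# Sample data retyped from the first dump (system linker was used).
before = (
    b"\x00rustc version 1.90.0 (1159e78c4 2025-09-14)\x00"
    b"GCC: (OmniOS 151054/14.2.0-il-1) 14.2.0\x00"
    b"@(#)illumos  May 2025\x00"
)
# Sample data retyped from the second dump (Wild was used).
after = (
    b"GCC: (OmniOS 151054/14.2.0-il-1) 14.2.0\x00"
    b"@(#)illumos  May 2025\x00"
    b"rustc version 1.90.0 (1159e78c4 2025-09-14)\x00"
    b"Linker: Wild version 0.6.0\x00"
)

print(linked_by_wild(before))  # False
print(linked_by_wild(after))   # True
```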

Despite the apparent success, this left me with more questions than answers:

Why is --ld-path silently ignored by Clang?
Why does -fuse-ld work in such a surprising way?
- This argument doesn’t even show up in the --help for the Clang driver.
- It accepts ld as a non-absolute path, but nothing else I tried.
Why does the driver pass -C to the provided linker regardless of what it is?
- It passes -C to GNU ld, which of course explodes with the error /usr/gnu/bin/ld: unrecognized option '-C'.

Clang driver

At this point, I opened the source code for the Clang driver on GitHub. I immediately discovered that Solaris and its derivatives such as Illumos have their own driver, with specific logic:

// Accept 'bfd' and 'gld' as aliases for the GNU linker.
if (UseLinker == "bfd" || UseLinker == "gld")
  // FIXME: Could also use /usr/bin/gld here.
  return "/usr/gnu/bin/ld";

I checked out and built the driver from source, and ran a few experiments with it to determine answers to my questions.

The --ld-path argument was ignored because it had simply never been implemented in the Solaris driver.
The -fuse-ld argument is undocumented presumably because it is “deprecated”.
The driver is bimodal: it has one mode for running the Solaris Link Editor and one for running GNU ld.
The -fuse-ld argument accepts absolute paths to linkers but drives the linker as if it is the Solaris Link Editor, no matter what.

The Solaris driver code was full of FIXME and TODO comments, so I resolved to fix them. I changed the driver logic to accept --ld-path as an argument, and to drive that linker in a way that is compatible with GNU ld - which all of the alternative linkers aim to be.

I changed the Cargo config to use the modified driver:

[target.x86_64-unknown-illumos]
linker = "/home/omnios/llvm-project/build/bin/clang"
rustflags = [
    "-C", "link-arg=--ld-path=/home/omnios/.cargo/bin/wild",
]

Wild then errored out as follows:

wild: error: -m elf_x86_64_sol2 is not yet supported

This was good news. It meant that the modified driver was indeed driving Wild as if it were GNU ld! I raised a small PR in Wild to accept elf_x86_64_sol2 as a valid value for the so-called “emulation” argument and to set the dynamic linker to /lib/amd64/ld.so.1 if this value is provided. This also let us remove the janky workaround from earlier. I confirmed that the produced ripgrep binary worked as expected.
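The idea behind that PR can be illustrated with a small lookup table. This is a hypothetical Python sketch, not Wild's implementation (Wild is written in Rust, and its actual data structures are not shown in this post). The elf_x86_64_sol2 entry comes from the description above; the elf_x86_64 entry is included only for contrast and uses the conventional glibc loader path on x86-64 Linux, which is an assumption rather than something taken from Wild's source.

```python
# Hypothetical table: emulation name (the value of -m) -> default dynamic linker.
DEFAULT_DYNAMIC_LINKER = {
    "elf_x86_64": "/lib64/ld-linux-x86-64.so.2",  # assumption: usual glibc path
    "elf_x86_64_sol2": "/lib/amd64/ld.so.1",      # from the PR described above
}


def dynamic_linker_for(emulation: str) -> str:
    """Return the default dynamic linker for an emulation, or raise if unknown."""
    try:
        return DEFAULT_DYNAMIC_LINKER[emulation]
    except KeyError:
        # Mirrors the shape of the error message Wild printed before the PR.
        raise ValueError(f"-m {emulation} is not yet supported") from None


print(dynamic_linker_for("elf_x86_64_sol2"))  # /lib/amd64/ld.so.1
```

Keying the default on the emulation means the driver does not need to pass the dynamic linker path separately for the common case.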

Finally, Wild was working properly on Illumos!

I raised PR #163000 in Clang to change the driver to work sensibly. Hopefully this PR will get merged, so that using Wild on Illumos will be as seamless as it is on Linux.