Lessons Learned Building High-Performance Rust Profiler

hotpath Rust performance profiler is represented by a flame Photo by Paul Bulai on Unsplash

The Rust performance book features over a dozen different profiling tools. So I’m not sure if the world needed a new Rust profiler. Still, I spent the last 6+ months building hotpath-rs. In this post, I’ll describe the design decisions behind the library and share a few performance challenges I encountered while working on it. We’ll go deep into the low-level details: cache-line contention, async futures instrumentation, and decoding raw CPU traces back into Rust symbols.

hotpath profiler 101

The next section is a brief overview of the library. Click here if you want to jump straight into the implementation details.

Over the last months, the hotpath profiler has grown to over 100k downloads on crates.io and is slowly gaining more adoption in the Rust ecosystem.

hotpath-rs download stats on crates.io

Before diving into implementation details, let’s quickly look at what hotpath does and why I built it. It’s an “all-in-one” Rust profiler / debugging toolkit designed to quickly identify performance bottlenecks.

The core idea is to combine multiple sources of data into reports that are quick and easy to mentally parse. You need only two macros, hotpath::main and hotpath::measure, to get started with instrumenting your codebase.

Let’s see it in action:

examples/overview.rs

#[hotpath::measure]
fn sync_work() {
    let mut result: u64 = 1;
    for i in 0..20000 {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
}

#[hotpath::measure]
fn sync_alloc() {
    for _ in 0..1000 {
        let buf: Vec<u8> = vec![1; 1024];
        std::hint::black_box(&buf);
    }
}

#[hotpath::measure]
async fn async_sleep() {
    tokio::time::sleep(Duration::from_millis(10)).await;
}

#[tokio::main]
#[hotpath::main]
async fn main() {
    for _ in 0..1000 {
        sync_work();
        sync_alloc();
        async_sleep().await;
    }
}

This example features 3 functions, showcasing different modes of execution present in any Rust program:

sync_work - synchronous function that executes a CPU-bound code
sync_alloc - synchronous function that allocates memory
async_sleep - function that sleeps asynchronously. A bit artificial, but it’s meant to simulate a slow async I/O. Waiting for an SQL query or an HTTP endpoint would yield similar performance results.

Looking at the example, can you tell what is the REAL bottleneck?

Let’s see a hotpath report output before we answer this question:

cargo run --example overview --features='hotpath,hotpath-alloc,hotpath-cpu'

[hotpath] 15.61s | timing, alloc, cpu

timing - Function execution time metrics.
+------------------------+-------+----------+----------+---------+---------+
| Function               | Calls | Avg      | P95      | Total   | % Total |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::main        | 1     | 15.62 s  | 15.62 s  | 15.61 s | 100.00% |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::async_sleep | 1000  | 11.77 ms | 12.10 ms | 11.77 s | 75.42%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_work   | 1000  | 3.14 ms  | 4.59 ms  | 3.14 s  | 20.11%  |
+------------------------+-------+----------+----------+---------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.23 µs  | 2.62 µs  | 1.23 ms | 0.01%   |
+------------------------+-------+----------+----------+---------+---------+

alloc-bytes - Exclusive allocation bytes by each function.
Total: 1.1 MB
+------------------------+-------+----------+----------+-----------+---------+
| Function               | Calls | Avg      | P95      | Total     | % Total |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_alloc  | 1000  | 1.0 KB   | 1.0 KB   | 1000.0 KB | 89.28%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::main        | 1     | 120.1 KB | 120.1 KB | 120.1 KB  | 10.72%  |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::async_sleep | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+
| cpu_basic::sync_work   | 1000  | 0 B      | 0 B      | 0 B       | 0.00%   |
+------------------------+-------+----------+----------+-----------+---------+

cpu - CPU sampling attribution per function (exclusive). 
+------------------------+---------+---------+
| Function               | Samples | % Total |
+------------------------+---------+---------+
| cpu_basic::sync_work   | 1915914 | 56.13%  |
+------------------------+---------+---------+
| cpu_basic::async_sleep | 14056   | 0.41%   |
+------------------------+---------+---------+
| cpu_basic::sync_alloc  | 1581    | 0.05%   |
+------------------------+---------+---------+
samply load /tmp/hotpath/61089-1778083683167502000/hp.json.gz

hotpath timing, alloc, and CPU usage report.

Optionally, you can run samply load to see an interactive call tree or flamegraph generated with samply:

Samply optional full interactive call tree

We can identify a different bottleneck depending on which report section we analyze:

async_sleep - takes ~75% of total program execution time, but less than 1% of CPU.
sync_alloc - took less than 1% of processing time but is responsible for ~90% of memory allocations.
sync_work - burned ~56% of total CPU cycles but took only ~20% of processing time

I came up with this example to exaggerate the core idea behind hotpath’s approach to profiling: different performance signals tell different stories.

This distinction is especially important for async-heavy Rust programs, where “slow” often means waiting on I/O, timers, channels, or the runtime scheduler rather than actually burning CPU. Traditional CPU profilers like samply or perf can show where CPU time is spent, but they cannot reliably capture time futures spend suspended while waiting to be polled by the runtime.

While working on Rust performance issues, I often found myself jumping between multiple specialized tools - samply for CPU usage, cargo-instruments for allocations, and tokio-console for async behavior… hotpath was built to bring those different signals together into a single report that’s fast to read and easy to reason about.

Naturally, this famous XKCD comic felt very relevant…

How standards proliferate

Source: xkcd.com/927

Live profiling TUI

Another unique feature of hotpath is live profiling and debugging capabilities. You can see it in action by running:

ssh demo.hotpath.rs

It executes an interactive TUI program that displays its own live performance metrics.

hotpath-rs TUI showing its own performance metrics

hotpath-rs TUI displaying its own instrumented future performance and lifetime metrics

You can think of it as “debugging on steroids”. hotpath::channel!, hotpath::stream! and hotpath::future! macros allow you to inspect data flowing in your system in real time. The required code changes are minimal because macros return exactly the same types and are 100% no-op unless the --features=hotpath is enabled.

In my current workflow, I don’t use hotpath TUI extensively; I prefer the static one-off reports and A/B benchmarks. But if you’re working on a long-running interactive program, you might find it helpful.

Where I do find the TUI dashboard useful is for optimizing the hotpath itself. I maintain an instrumented fork of the Zed editor, and test every new feature of hotpath against it. Given the scale and complexity of Zed, I treat it as the ultimate stress test, ensuring that lib should work with most other Rust programs.

You can check out the fork with instructions on how to run it yourself from this PR, or see instrumented Zed in action in the short demo:

Hopefully, I have your curiosity now. Read on if you’re interested in how the library works under the hood and the techniques I use to keep it performant.

Profiling the profiler

The core mechanism behind hotpath instrumentation is RAII (Resource Acquisition Is Initialization) guards. Here’s a simplified explanation of what hotpath::measure macro compiles to:

#[hotpath::measure]
fn sync_alloc() {
    for _ in 0..1000 {
        let buf: Vec<u8> = vec![1; 1024];
        std::hint::black_box(&buf);
    }
}

becomes:

fn sync_alloc() {
    let _guard = MeasurementGuard::new();
    for _ in 0..1000 {
        let buf: Vec<u8> = vec![1; 1024];
        std::hint::black_box(&buf);
    }
}

And here’s a simplified implementation of the MeasurementGuard struct:

pub struct MeasurementGuard {
    name: &'static str,
    start: Instant,
    sender: Sender<Measurement>,
}

impl Drop for MeasurementGuard {
    fn drop(&mut self) {
        let duration = self.start.elapsed();

        let measurement = Measurement {
            name: self.name,
            duration: duration
        };

        let _ = self.sender.send(measurement);
    }
}

MeasurementGuard uses std::time::Instant to measure time between its creation and drop, which is determined by how long an instrumented function executes. It’s the so-called “wall-clock timing” approach to profiling programs, which is significantly different from CPU sampling. The primary difference is in the results you’ll get when measuring async functions, which are notoriously hard to benchmark using only CPU profiling. Check out this docs section for an in-depth analysis of these two approaches.

Fixing the MPSC channel bottleneck

The MeasurementGuard’s Drop trait implementation is the hottest hot path in the project. It’s executed inline with the instrumented code, so it directly impacts the performance of measured programs.

In the initial library versions, the code resembled the above implementation, i.e., measurements were sent for background processing one by one. I’ve measured (using benchmark_noop.rs) the per-instrumentation impact to be ~500ns. Far from ideal if you want to instrument functions that execute on a microsecond scale.

I’ve used samply to inspect overhead, and it produced the following output:

Samply Thread::unpark bottleneck

samply call tree Thread::unpark bottleneck

As you can see ~22%, of all CPU traces landed in the Thread::unpark method. Apparently, even if messages are sent to a crossbeam MPSC channel every 1 microsecond, a receiver thread enters the parked state and has to be unparked to receive the next message. The solution was to send messages in batches, gathered per thread. It completely eliminated the Thread::unpark samples and reduced the avg. overhead per measurement to ~80ns.

I did follow up on this issue, and found out that there’s a mismatch in how crossbeam-channel handles receivers depending on whether they are used with a select! macro. A directly subscribed receiver uses a spin wait with backoff logic to rectify exactly this Thread::park issue. But select! macro bypasses this mechanism and causes the receiver thread to park even if data arrives in microsecond-scale intervals. I’ve submitted a PR trying to address this. Not sure if it’s the correct solution, but if you’re using crossbeam_channel::select! macro with high-frequency data, then, batching will likely measurably improve performance.

BTW, sending instrumentation data to a separate thread for processing is the most performance-critical area in the hotpath’s codebase. It directly impacts the performance of benchmarked programs and affects measurements. If you have an idea for a data-sharing mechanism with lower overhead than crossbeam channels, plz get in touch!

Alloc metrics CPU cache busting

Measuring memory allocations uses a system that is slightly more complicated than RAII guards with an internal timer. I’ve borrowed the initial implementation from the allocation-counter crate.

It uses a thread-local stack to attribute allocations and deallocations to each thread:

pub(crate) struct ThreadAllocStats {
    pub(crate) tid: AtomicU64,
    pub(crate) alloc_bytes: AtomicU64,
    pub(crate) dealloc_bytes: AtomicU64,
}

const MAX_THREADS: usize = 256;

static THREAD_ALLOC_STATS: [ThreadAllocStats; MAX_THREADS] = {
    const INIT: ThreadAllocStats = ThreadAllocStats::new();
    [INIT; MAX_THREADS]
};

A static preallocated array of AtomicU64, one per thread. When the hotpath-alloc feature is enabled, we hook into a system allocator and proxy allocations and deallocations through our custom CountingAllocator:

cfg_if::cfg_if! {
    if #[cfg(feature = "hotpath-alloc")] {
        #[global_allocator]
        static GLOBAL: functions::alloc::allocator::CountingAllocator = functions::alloc::allocator::CountingAllocator {};
    }

    unsafe impl GlobalAlloc for CountingAllocator {
        unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
            crate::lib_on::functions::alloc::core::track_alloc(layout.size());

            unsafe { System.alloc(layout) }
        }

        unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
            crate::lib_on::functions::alloc::core::track_dealloc(layout.size());

            unsafe {
                System.dealloc(ptr, layout);
            }
        }
    }
}

The GlobalAlloc trait defines alloc and dealloc, called for EVERY single allocation and deallocation in the instrumented Rust program.

That’s how we can attribute total alloc/dealloc stats for each thread. I use the benchmark_alloc.rs script to measure the overhead of an exaggerated function that does nothing but allocate memory:

#[hotpath::measure]
fn alloc() {
    for _ in 0..1000 {
        let vec = vec![1u8; 128];
        std::hint::black_box(vec);
    }
}

Here’s the benchmark execution time depending on the number of threads:

cargo build --example benchmark_alloc --features='hotpath,hotpath-alloc' --release 
hyperfine --warmup 3 'HOTPATH_ALLOC_NUM_THREADS=$THREADS_NUM ./target/release/examples/benchmark_alloc'

Threads	Time (ms)
1	~210
2	~720
4	~1000
8	~3700
12	~4400

This benchmark simulates an extreme case of a Rust program that does nothing but allocate. Surprisingly, the initial implementation worked without issues, even with the Zed debug build, which spawns ~60 threads. But you wouldn’t believe, a simple trick to instantly speed up Rust programs with multiple threads competing for a shared static resource:

#[repr(align(64))]
pub(crate) struct ThreadAllocStats {
    /// Thread ID (0 unused)
    pub(crate) tid: AtomicU64,
    pub(crate) alloc_bytes: AtomicU64,
    pub(crate) dealloc_bytes: AtomicU64,
}

Here are the same benchmark results after prepending ThreadAllocStats with #[repr(align(64))] annotation:

Threads	Before (ms)	After (ms)	% Improvement
1	~210	~210	0%
2	~720	~215	70%
4	~1000	~280	72%
8	~3700	~500	86%
12	~4400	~660	85%

This technique, called “cache padding”, forces each ThreadAllocStats instance to start at a 64-byte boundary, matching a common CPU cache-line size. Without padding, ThreadAllocStats in memory size was 24 bytes, so two instances could fit in a single CPU cache line. With multiple threads invalidating each other’s caches, performance was getting worse with each added thread. Check out the story of optimizing zlib-rb implementation for an in-depth exploration of this technique.

Async functions memory tracking

Another challenge was tracking memory allocations for instrumented functions. For synchronous functions, the implementation is still largely based on the allocation-counter crate. It uses thread_local data (AllocationInfoStack if you’re interested in the impl details) for tracking allocations. We use a RAII guard pattern similar to wall-clock time measurement:

impl AllocMeasurementGuard {
    pub fn new(name: &'static str, sender: Sender<Measurement>) -> Self {
        push_alloc_frame();
        Self { name, sender }
    }
}

impl Drop for AllocMeasurementGuard {
    fn drop(&mut self) {
        let (bytes, count) = pop_alloc_frame();
        let _ = self.sender.send(Measurement { 
          name: self.name, bytes, count 
        });
    }
}

It’s a simplified implementation to explain how it works. We maintain a thread_local stack and use push_alloc_frame to mark which synchronous function is currently executing. pop_alloc_frame marks functions as completed and reads allocation metrics for background processing.

But async Rust functions, depending on the runtime implementation, may migrate between threads. For example, the default #[tokio::main] macro spawns threads for each virtual CPU. You can easily check it in the Threads tab of a hotpath TUI:

hotpath-rs TUI showing per thread metrics

hotpath TUI displays per-thread CPU and memory metrics

So, for async functions our thread-local data won’t work. Allocation info gathered for the function on thread A is no longer available if it executes pop_alloc_frame on a thread B.

The previous version of hotpath allowed async allocations tracking only for #[tokio::main(flavor = "current_thread")] runtime. I thought that as long as we use a single thread, measurements should be accurate. Unit tests returned correct results, and it supposedly worked. But benchmarking an Axum server app using the current_thread runtime, under a multithreaded oha loadtest, produced “interesting” results.

hotpath-rs buggy benchmark of async functions memory

Can you spot an issue here?

A ~60s benchmark supposedly churned ~400GB of memory with a single routes::serve_doc_page method call allocating ~23MB on avg.

The bug was caused by using an RAII guard for instrumenting an asynchronous function, which, contrary to synchronous calls, does not execute continuously. Runtime can yield control of a future at any time and start executing a different one, even on a single thread. It caused the allocation metrics of measured functions to intertwine, massively inflating the numbers.

The solution was to instrument async functions as futures, which they really are. The current version of hotpath uses a similar implementation for tracking async memory usage:

pub struct InstrumentedFuture<F: Future> {
    #[pin]
    inner: F,
    sender: Option<&'static Sender<FutureEvent>>,
    future_id: u32,
    completed: bool,
}

impl<F: Future> Future for InstrumentedFuture<F> {
    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Self::Output> {
          crate::functions::alloc::guard::push_alloc_stack();
          this.inner.poll(cx)
          let (bytes, count) = crate::functions::alloc::guard::pop_alloc_stack();
          let _ = self.sender.send(Measurement { name: self.future_id, bytes, count });
    }
}

We wrap every instrumented function/future into an InstrumentedFuture struct, which holds a pinned reference (inner) to the original future. The hotpath::measure macro handles all the wrapping. And we push/pop the allocations data stack when runtime calls poll on our wrapper. Counterintuitively, the poll function is synchronous, so it’s guaranteed to execute continuously, and our allocation info no longer gets mixed up with other asynchronous calls.

And here’s the same benchmark after a fix:

hotpath-rs fixed benchmark of async functions memory

This mechanism does not use any runtime-specific APIs, nor does it require setting additional config like RUSTFLAGS="--cfg tokio_unstable", or any changes to your project other than the hotpath::measure macro. The newest version of hotpath can correctly attribute memory allocations regardless of which runtime you’re using, tokio, smol, or anything custom.

CPU profiling integration

CPU profiling is a recent addition to the library. The core idea behind hotpath is combining different performance signals into easy-to-understand reports. Wall-clock timing and memory allocation metrics are useful on their own, but they don’t show the full picture.

An initial prototype used the pprof-rs crate, allowing sampling of CPU execution within the same process. I liked the simplicity of this approach, but it did not scale. The debug build of Zed slowed down to a crawl with pprof-rs profiling enabled.

A solution was to integrate samply profiler. It’s running in a separate process, so it works even with the debug profile Zed. Integration required a hacky workaround, because launching samply record --pid from the same process or its direct child caused it to get stuck. My best guess is that attaching the profiler temporarily suspends the target process and all its children. A workaround consists of a separate hotpath-samply binary, that spawns in a detached mode, and only then can we launch samply record without crashing the host process. “It’s broken, but it works”, and hotpath-cpu mode is available since the recent 0.16.0 release.

One caveat is that on Linux you have to prefix profiled programs with setsid -w, otherwise they exit before a final report is produced. See docs for more info.

Interpreting raw CPU traces

The host process and hotpath-samply communicate via a custom file-based API. Once hotpath allocations and timing profiling are completed, lib creates a stop-profiling file in a predefined location, and it communicates that we should also finish the CPU sampling by sending a SIGINT signal to samply. Later a done file creation communicates that the hp.json.gz file with profiling data is ready to be parsed. I came up with the hacky protocol by chatting with my AI friends. Let me know if you can think of a cleaner way to establish cross-process communication.

hp.json.gz is where things start to get interesting. This file contains raw data about CPU samples collected by samply profiler. Very raw:

{
  "threads": [{
    "name": "cpu_nested",
    "tid": 218490,
    "isMainThread": true,

    "stringArray": ["cpu_nested", "0x0ea0", "0x0c10"],

    "resourceTable": {
      "length": 1,
      "name": [0],
      "lib": [0],
      "type": [1]
    },

    "funcTable": {
      "length": 2,
      "name": [1, 2],
      "resource": [0, 0]
    },

    "frameTable": {
      "length": 2,
      "address": [0x0ea0, 0x0c10],
      "func": [0, 1],
      "nativeSymbol": [null, null]
    },

    "stackTable": {
      "length": 2,
      "prefix": [null, 0],
      "frame": [0, 1]
    },

    "samples": {
      "length": 1,
      "stack": [1],
      "timeDeltas": [3.12],
      "threadCPUDelta": [950],
      "weight": [1]
    }
  }]
}

A minimal example of samply profiling output

You’ll notice that this JSON does not include any information about function symbols, so to attribute collected samples to our instrumented functions, we have to map them manually.

We have to start by mapping symbol data to addresses. We can do it by extracting symbol ranges from the program binary. That’s why you have to use a custom profile if you want to profile programs running in a --release mode:

[profile.profiling]
inherits = "release"
debug = true

Otherwise, the symbol data is not in the binary. hotpath uses object crate to extract ranges corresponding to known function symbols.

examples/cpu_nested.rs

#[hotpath::measure]
fn heavy_work(iterations: u32) -> u64 {
    let mut result: u64 = 1;
    for i in 0..iterations {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
    result
}

#[hotpath::measure]
fn light_work(iterations: u32) -> u64 {
    let mut result: u64 = 0;
    for i in 0..iterations {
        result = result.wrapping_add(black_box(i as u64));
    }
    result
}

#[hotpath::measure]
fn outer_work() -> u64 {
    let mut total: u64 = 0;
    total = total.wrapping_add(heavy_work(500_000));
    total = total.wrapping_add(light_work(100_000));
    total
}

#[hotpath::main]
fn main() {
    let mut total: u64 = 0;
    for _ in 0..1000 {
        total = total.wrapping_add(outer_work());
    }
    black_box(total);
}

It produces the following data structure:

[
    (
        2980,
        3348,
        "cpu_nested::heavy_work",
    ),
    (
        3348,
        3656,
        "cpu_nested::light_work",
    ),
    (
        3656,
        3904,
        "cpu_nested::outer_work",
    ),
]

The JSON profile does not directly say: “this sample belongs to heavy_work“. Instead, it stores sampled stack traces as compact table references.

Here’s how we can resolve the symbol:

samples.stack[0] => 1
stackTable.frame[1] => 1
frameTable.address[1] => 0x0c10 => 3088
symbol_ranges.contains(3088) => cpu_nested::heavy_work

And here’s our leaf function symbol!

Exclusive vs inclusive CPU attribution

For exclusive CPU attribution, this is enough. The sample is attributed to the first instrumented function found in the leaf frame.

hotpath-cpu supports HOTPATH_CPU_INCLUSIVE=1 config, which attributes CPU usage to both instrumented children and parent functions.

Running cpu_nested in a default exclusive CPU attributions mode produces:

cargo run --example cpu_nested --features='hotpath,hotpath-cpu'

+------------------------+---------+---------+
| Function               | Samples | % Total |
+------------------------+---------+---------+
| cpu_nested::heavy_work | 4420575 | 86.44%  |
+------------------------+---------+---------+
| cpu_nested::light_work | 627107  | 12.26%  |
+------------------------+---------+---------+

and in inclusive mode, we get:

HOTPATH_CPU_INCLUSIVE=1 cargo run --example cpu_nested --features='hotpath,hotpath-cpu'

+------------------------+---------+---------+
| Function               | Samples | % Total |
+------------------------+---------+---------+
| cpu_nested::main       | 5235886 | 99.06%  |
+------------------------+---------+---------+
| cpu_nested::outer_work | 5234714 | 99.04%  |
+------------------------+---------+---------+
| cpu_nested::heavy_work | 4515297 | 85.43%  |
+------------------------+---------+---------+
| cpu_nested::light_work | 719417  | 13.61%  |
+------------------------+---------+---------+

These two modes are useful for analyzing how children’s instrumented functions usage relates to their parents.

So how can we attribute CPU traces to non-leaf instrumented functions? The answer is… LINKED LISTS! (╯°□°）╯︵ ┻━┻

Apparently, they do exist outside of LeetCode…

So coming back to our example:

samples.stack[0] => 1
stackTable.frame[1] => 1
frameTable.address[1] => 0x0c10 => 3088
symbol_ranges.contains(3088) => cpu_nested::heavy_work

To find the parent frame, we have to follow the prefix field:

stackTable.prefix[1] => 0

stack node 0:
  stackTable.frame[0] => 0
  frameTable.address[0] => 0x0ea0 => 3744
  symbol_ranges.contains(3744) => cpu_nested::outer_work

stackTable.prefix[0] => null

null means we’ve reached the root of the sampled stack and can stop walking the linked list. This is the core mechanism behind inclusive CPU attribution in hotpath-cpu mode.

I owe linked lists an apology. ┬─┬ノ( º _ ºノ)

CPU delta vs. samples count

So we can now attribute CPU samples to our instrumented functions. Now consider the following example:

examples/cpu_park_vs_busy.rs

const COMPUTE_BATCH: u32 = 200_000;

#[hotpath::measure]
fn busy_compute(iterations: u32) -> u64 {
    let mut result: u64 = 1;
    for i in 0..iterations {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
    result
}

#[hotpath::measure]
fn park_main(duration: Duration) {
    thread::park_timeout(duration);
}

#[hotpath::main]
fn main() {
    let start = Instant::now();
    while start.elapsed() < Duration::from_secs(5) {
        black_box(busy_compute(COMPUTE_BATCH));
    }

    park_main(Duration::from_secs(5));
}

cargo run --example cpu_park_vs_busy --features='hotpath,hotpath-cpu'

timing - Execution duration of functions.
+--------------------------------+-------+-----------+-----------+---------+---------+
| Function                       | Calls | Avg       | P95       | Total   | % Total |
+--------------------------------+-------+-----------+-----------+---------+---------+
| cpu_park_vs_busy::main         | 1     | 10.22 s   | 10.23 s   | 10.22 s | 100.00% |
+--------------------------------+-------+-----------+-----------+---------+---------+
| cpu_park_vs_busy::park_main    | 1     | 5.00 s    | 5.00 s    | 5.00 s  | 48.95%  |
+--------------------------------+-------+-----------+-----------+---------+---------+
| cpu_park_vs_busy::busy_compute | 16415 | 304.46 µs | 333.57 µs | 5.00 s  | 48.91%  |
+--------------------------------+-------+-----------+-----------+---------+---------+

cpu - CPU sampling attribution per function (exclusive)
+--------------------------------+---------+---------+
| Function                       | Samples | % Total |
+--------------------------------+---------+---------+
| cpu_park_vs_busy::busy_compute | 5044956 | 99.05%  |
+--------------------------------+---------+---------+
| cpu_park_vs_busy::park_main    | 1780    | 0.03%   |
+--------------------------------+---------+---------+

So, in terms of timing, park_main and busy_compute each took ~50% of the processing time. But CPU usage is reported at 99% vs 0.03%. Let’s compare it to the samply UI report:

samply UI reporting parked thread data

It reports a 50/50 split similar to the hotpath wall-clock time report. The only hint of CPU usage is a yellow chart at the top of the report.

I’ve considered different approaches and, for now, settled on using threadCPUDelta rather than timing or sample count to calculate CPU attribution %. So the default samply call traces UI attributes samples to a function symbol, even if the CPU is dormant (i.e., thread in parked state) when the sample is collected. Recreating this approach would replicate the wall-clock timing (except for async functions! see sampling comparison for an in-depth info on this topic).

So now hotpath-cpu highlights areas where the CPU really “goes brrr” and augments the report with this new perspective.

Future hotpath releases will likely add more CPU instrumentation modes; feedback on which approach is best is appreciated!

`inline` compiler hints

The current hotpath-cpu implementation uses one more hack to correctly resolve the function symbols:

examples/always_inlined.rs

#[hotpath::measure]
#[inline(always)]
fn always_inlined(iterations: u32) -> u64 {
    let mut result: u64 = 1;
    for i in 0..iterations {
        result = result.wrapping_mul(black_box(i as u64).wrapping_add(7));
        result ^= result >> 3;
    }
    result
}

#[hotpath::main]
fn main() {
    let mut total: u64 = 0;
    for _ in 0..2000 {
        total = total.wrapping_add(always_inlined(50_000));
    }
    black_box(total);
}

heavy_work uses an #[inline(always)] annotation, which causes its address symbol range to disappear from the binary. The current implementation of hotpath::measure macro implicitly strips user-provided inline annotation, and applies inline(never) to prevent this issue. Working on this post, I’ve noticed that samply can still correctly attribute samples even to inlined functions, it even reports in the UI that the function was inlined:

samply UI reporting an inlined function

Apparently, samply uses a gimli crate to resolve symbols based on DWARF format. I did not get around to researching this topic yet, but future hotpath releases will likely drop the inline(never) hack and use DWARF format parsing in case symbol ranges are missing from the binary.

See this GH issue if you would like to contribute this improvement.

Summary

A large chunk of my career was dedicated to optimizing Ruby on Rails applications, where numbers below a millisecond rarely even appeared in logs. So working on problems where nanosecond-scale improvements can have a significant impact has completely changed my perspective on performance engineering. Profiling the profiler is fun. I’m looking forward to continuing work on hotpath-rs.

BTW, as much as I enjoy feeding LLMs with my writing, if you’re a human reading this, please leave a comment here or on Reddit so I know you exist.