- Configurable Sampling Triggers: Typed triggers now fully support hardware-specific configuration:
  - Intel: `perf::MemoryLoads` supports configurable `min_latency` for PEBS load latency filtering. `perf::MemoryStores` and `perf::MemoryLoadsAux` provide type-safe alternatives to string-based `mem-stores` and `mem-loads-aux` triggers. String-based triggers remain available and documented.
  - AMD: `perf::IbsOp` and `perf::IbsFetch` now build their counter configuration directly from hardware capabilities, fully supporting `is_uop`, `is_l3_miss_only`, and `is_rand` flags. String-based trigger variants (`ibs_op_uops`, `ibs_op_l3missonly`, `ibs_op_uops_l3missonly`, `ibs_fetch_l3missonly`) still work but are no longer documented. Use typed triggers for full configurability.
- Fixed-Function PMC Scheduling: On Intel processors, the built-in events `instructions`, `cycles`, `cpu-cycles`, and `ref-cycles` are now automatically scheduled to dedicated pinned groups. This prevents the kernel scheduler from placing fixed-function PMC events alongside generic events, which would distort multiplexing ratios. Fixed groups do not count against the generic PMC limit.
- Fixed PMC Detection: Added `HardwareInfo::physical_fixed_performance_counters_per_logical_core()`, which reads the number of fixed-function performance counters on Intel.
- Built-in Event: Added `ref-cycles` (reference cycles at a fixed frequency, unaffected by turbo boost or power-saving states) to the built-in hardware event list.
- Bugfixes:
  - Fixed out-of-bounds access when reading live counter values.
  - Fixed wrong parsing order for throttle events in the sample decoder.
  - Fixed sample decoder always reporting a data access source even when none was available (e.g., when sampling stores on Intel hardware).
  - Fixed various decoding bugs and hardened the sample decoder against malformed records.
  - Adding events to an already-opened `EventCounter` now raises an exception instead of silently failing.
- Header Restructuring: Headers have been reorganized into `counter/`, `sample/`, `metric/`, `analyzer/`, and `util/` subdirectories and renamed from `.h` to `.hpp`. The previous `.h` headers remain as forwarding includes with deprecation notices and will be removed in v1.0.
- Breaking: `EventCounter::add()` and `start()` (including `Sampler::start()` and all multi-thread/core variants) now return `void` instead of `bool`. Errors are communicated via exceptions; the return values were unused.
- Compile Flag for AUX Buffer Support: Added `PERFCPP_NO_SAMPLE_AUX` compile flag to disable auxiliary buffer sampling on systems with Linux kernels older than 5.5 that lack `PERF_SAMPLE_AUX` support. Thanks to @rconnorlawson.
- Perf File Export: Fixed bugs in the perf format when materializing samples into a file that can be read via `perf [mem] report`.
- NMI Watchdog Detection: Hardware counter detection now accounts for the NMI watchdog permanently consuming one hw-PMU counter, fixing incorrect counter counts on systems with the watchdog enabled.
- RAPL Power Metrics: Added built-in `watts-pkg`, `watts-cores`, and `watts-ram` metrics for measuring power consumption via RAPL energy counters (see the documentation).
- Conan Package: Added Conan 2.x package recipe for easier integration.
- Config Setter Naming: Standardized `Config` setter naming; setters no longer use the `is_` prefix (e.g., `pinned(bool)` instead of `is_pinned(bool)`). The old `is_pinned(bool)` and `is_debug(bool)` setters are deprecated and will be removed in v1.0.
- SampleResult CSV Export: Added `SampleResult::to_csv()` returning a `std::string`, complementing the existing file-based overload.
- Per-Element Results: Added `result_of_thread(thread_id)`, `result_of_process(process_id)`, and `result_of_core(core_id)` to query individual results from `MultiThreadEventCounter`, `MultiProcessEventCounter`, and `MultiCoreEventCounter`. Process and core variants return `std::optional<CounterResult>` since the ID may not be present.
- Documentation: Rewrote and restructured all documentation pages for consistency, conciseness, and correctness. Documentation is now hosted at jmuehlig.github.io/perf-cpp.
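The `watts-pkg`, `watts-cores`, and `watts-ram` metrics listed above report power, whereas RAPL counters expose energy. As a rough sketch of the arithmetic involved (not the library's implementation), average power is the energy difference divided by the elapsed time. The helper `average_watts` and its parameters are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of perf-cpp): average power in watts,
// computed from two cumulative energy readings in joules and the
// wall-clock time that elapsed between them.
double average_watts(double energy_start_joules,
                     double energy_end_joules,
                     std::chrono::duration<double> elapsed)
{
    assert(elapsed.count() > 0.0 && "elapsed time must be positive");
    return (energy_end_joules - energy_start_joules) / elapsed.count();
}
```

For example, 30 joules consumed over two seconds correspond to an average of 15 watts.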
- CSV Export: Samples can now be exported to CSV format using `SampleResult::to_csv()`, enabling custom analysis with statistical tools, spreadsheets, or data processing pipelines (see the documentation).
- Flamegraphs: Samples can be exported to flamegraphs via `SampleResult::to_flamegraphs(std::string&&)` (see the documentation).
- Fine-Grained Sampling Configuration: The sampling API now provides more granular control over which data fields are recorded. Previously coarse-grained options have been split into specific setters, allowing precise selection of hardware-specific metrics. See the sampling documentation for more information. New setters include:
  - Physical Instruction Pointer (AMD IBS Fetch PMU only)
  - Instruction Type (AMD IBS Op PMU for `Return` and `Branch` types)
  - Branch Type (AMD IBS Op PMU only)
  - Instruction Latency (Intel: instruction retirement cycles; AMD IBS Op PMU: uOp tag-to-retirement, completion-to-retirement, tag-to-completion; AMD IBS Fetch PMU: fetch latency)
  - Instruction Cache (AMD IBS Fetch PMU only)
  - Instruction TLB (AMD IBS Fetch PMU only)
  - Instruction Fetch (AMD IBS Fetch PMU only)
  - Data TLB Page Size (AMD IBS Op PMU only)
  - Data TLB Latency (AMD IBS Op PMU only)
  - Data Access Width (AMD IBS Op PMU only)
  - Data Access Misalignment Penalty (AMD IBS Op PMU only)
  - MHB/MAB Allocations (AMD IBS Op PMU only)
  - Auxiliary Values
- Added `perf::Config::include_host(bool)` to enable excluding host events when counting hardware events in virtual machines (see the documentation).
- Added `perf::Config::is_pinned(bool)` to enable pinning events to the PMU (see the documentation).
- Bugfix: The library could not compile for specific Linux kernels (see #10).
- Symbol Translation: Improved translation from instruction pointer to symbol.
This update enables exporting sampled data to the standard perf format for analysis with existing tools.
- Bugfix: The library crashed when events loaded from an external CSV file contained empty spaces (see #8). Thanks to @Liteom.
- Bugfix: The library could not compile for specific Linux kernels not providing `PERF_MEM_LVLNUM_UNC` and `PERF_MEM_SNOOPX_PEER` (see #7). Thanks to @Raphalex46 for pointing this out.
- Perf Data Export: Samples can now be written as perf data files using `Sampler::to_perf_file()`, enabling analysis with standard perf ecosystem tools like `perf report` (see the documentation). Note that this feature is experimental.
This update simplifies the handling of counter definitions by introducing a default instance.
- Default Counter Definitions: Supplying a user-defined `perf::CounterDefinition` to each `perf::EventCounter` or `perf::Sampler` is no longer required. If none is provided, a default instance is used automatically. Custom definitions now extend the default set of events instead of duplicating them.
- Metric Functions: Metrics now support built-in functions such as `ratio(A, B)` and `sum(A, B, C, ...)`, enabling more expressive and reusable formulas (see the documentation).
- Optimized Compile-time Event Injection: The generated runtime event registration class is now only created if it does not already exist, reducing unnecessary recompilation.
- Improved Live Event Accuracy: Live event values now account for partial runtime durations via time scaling, improving accuracy when counters were not active for the full measurement window.
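The time scaling mentioned above corresponds to the usual correction for multiplexed counters: the kernel can report how long an event was enabled and how long it was actually running on the hardware, and the raw count is extrapolated by the ratio of the two. The following is a minimal sketch of that calculation, assuming these two durations are available; `scale_count` is a hypothetical helper, not a perf-cpp function.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: extrapolate a raw counter value to the full
// measurement window. time_enabled is how long the event was enabled,
// time_running is how long it was actually scheduled on a hardware counter.
double scale_count(std::uint64_t raw_count,
                   std::uint64_t time_enabled,
                   std::uint64_t time_running)
{
    assert(time_running <= time_enabled);
    if (time_running == 0U) {
        return 0.0; // the event was never scheduled, nothing to extrapolate
    }
    const double ratio = static_cast<double>(time_enabled) /
                         static_cast<double>(time_running);
    return static_cast<double>(raw_count) * ratio;
}
```

A counter that observed 1000 events while running for half of the enabled time is therefore reported as 2000.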
This update extends event discovery to ARM platforms, improves hardware counter introspection, and enhances the flexibility of metric definitions.
- Automatic Event Discovery on ARM: Hardware event types are now automatically detected on ARM architectures when initializing a `perf::CounterDefinition` instance.
- Hardware Counter Introspection: The number of available physical performance counters per logical core, along with the number of events each counter can multiplex, is now determined automatically when creating a `perf::EventCounter`.
- Recursive and Scientific Metrics: Metric expressions can now reference other metrics recursively. Support for scientific notation (e.g., `1e5`) in formula-based metrics has also been added.
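To illustrate what recursive metric references and scientific notation mean in practice, here is a small self-contained sketch. It is not perf-cpp's metric parser: `RatioMetric` and `MetricTable` are hypothetical types, and metrics are restricted to simple ratios whose operands are an event name, another metric name, or a numeric literal.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical types for illustration only.
struct RatioMetric
{
    std::string numerator;   // event name, metric name, or numeric literal
    std::string denominator; // event name, metric name, or numeric literal
};

class MetricTable
{
public:
    void set_event(const std::string& name, double value) { events_[name] = value; }
    void define(const std::string& name, RatioMetric metric) { metrics_[name] = std::move(metric); }

    // Resolves a name in this order: recorded event, metric (recursively),
    // numeric literal.
    double evaluate(const std::string& name, int depth = 0) const
    {
        assert(depth >= 0);
        if (depth > 32) {
            throw std::runtime_error("metric nesting too deep (cyclic definition?)");
        }

        const auto event = events_.find(name);
        if (event != events_.end()) {
            return event->second;
        }

        const auto metric = metrics_.find(name);
        if (metric != metrics_.end()) {
            const double denominator = evaluate(metric->second.denominator, depth + 1);
            if (denominator == 0.0) {
                throw std::runtime_error("division by zero in metric: " + name);
            }
            return evaluate(metric->second.numerator, depth + 1) / denominator;
        }

        // std::stod accepts scientific notation such as "1e5".
        std::size_t parsed = 0U;
        const double literal = std::stod(name, &parsed);
        if (parsed != name.size()) {
            throw std::invalid_argument("unknown event or metric: " + name);
        }
        return literal;
    }

private:
    std::map<std::string, double> events_;
    std::map<std::string, RatioMetric> metrics_;
};
```

With events `instructions` and `cycles` recorded, a metric `ipc` defined as `{"instructions", "cycles"}` can itself appear as an operand of another metric, and a literal such as `"1e5"` evaluates to 100000.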
This release expands symbolic analysis capabilities, introduces FlameGraph generation, and improves hardware event management through both runtime and compile-time support.
- Symbol Resolution: Instruction pointers captured during sampling can now be resolved to function names using `perf::SymbolResolver` (see the documentation).
- FlameGraph Export: Sampling data can be converted into formats compatible with visualization tools such as Brendan Gregg's FlameGraph, Speedscope, and flamegraph.com using `perf::analyzer::FlameGraphGenerator` (see the documentation).
- Built-in Event Definitions: A set of x86-specific hardware events is now bundled in events/x86 and can be loaded at runtime using `perf::CounterDefinition`. This serves as an alternative to the `make perf-list` target.
- Compile-time Event Injection: Processor-specific event definitions can now be embedded directly at build time by configuring CMake with `-DGEN_PROCESSOR_EVENTS=1`. These are immediately available via `perf::CounterDefinition` (see the documentation).
- Automatic Event Discovery: Additional event types, including RAPL energy counters and AMD IO MMU events, are now automatically detected during the creation of a `perf::CounterDefinition` instance (issue #6).
- Unified the behaviour of the `time` and `timestamp` fields in the sampling API, removing discrepancies between the two.
This version rolls out a redesigned sampling API.
Recorded data are now grouped into dedicated sub-structures (such as Metadata, InstructionExecution, and DataAccess) inside perf::Sample (see the documentation).
The previous flat API is still available but deprecated and will be removed in v0.12.
- New Sampling Interface: Work with clearly separated sample sections, exposing additional AMD IBS fields that are not surfaced by the `perf_event_open` records.
- Explicit Latency Attributes: Vendor-specific latency signals (cache-access on Intel and cache-miss on AMD) are now surfaced as distinct fields.
- Heterogeneous-core Support: Sampling can target multiple PMU domains (e.g., cpu_core and cpu_atom) on hybrid Intel processors.
- New feature: The auxiliary event is added automatically if required by the (Intel) hardware (see the documentation).
- New feature: The Memory Access Analyzer lets you describe complex data objects and maps sampled memory addresses to them in order to report latency and access information (see the documentation).
- The number of pages for the sampling buffer is now aligned automatically if it is not configured properly, i.e., if it is not a power of two plus one page for the header.
- New feature: Copy sampled data from the mmap-ed perf buffer into an application-level buffer whenever the perf buffer comes close to full (see the documentation).
- Removed deprecated warnings about the sampling interface (and the old sampling interface).
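The automatic buffer alignment mentioned above follows from the layout the kernel expects for a perf ring buffer: a power-of-two number of data pages plus one header page. A minimal sketch of such a rounding step is shown below. The helper `align_buffer_pages` is hypothetical, and both the choice to round up and the assumption that the requested count already includes the header page are illustrative, not confirmed library behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: round a requested total page count to
// (power of two data pages) + 1 header page.
// Assumption: requested_pages is meant to include the header page.
std::uint64_t align_buffer_pages(std::uint64_t requested_pages)
{
    assert(requested_pages < (std::uint64_t{1} << 62)); // keeps the shift below from overflowing

    // Strip the header page; always keep at least one data page.
    const std::uint64_t data_pages = requested_pages > 1U ? requested_pages - 1U : 1U;

    // Round the data pages up to the next power of two.
    std::uint64_t aligned = 1U;
    while (aligned < data_pages) {
        aligned <<= 1U;
    }

    return aligned + 1U; // add the header page back
}
```

Under these assumptions, a request for 9 pages is already valid (8 + 1), while a request for 10 pages becomes 17 (16 + 1).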
- New feature: Access interim results from counters without stopping the counter using live counters.
- New feature: Sampling the user stack (see the documentation).
- New feature: Create custom metrics using expressions, e.g., `"instructions/cycles"` (see the documentation).
- New feature: Use metrics when sampling counter values.
- New feature: Control scheduling of events to physical hardware counters (see the documentation).
- New feature: Added time events (e.g., `seconds` and `milliseconds`) as virtual counters (see the documentation).
- Fixed multiple compatibility issues where the code relied on Linux kernel features that might not be available in every kernel version.
- Fixed compatibility for older Linux versions that don't provide `PERF_MEM_BLK`, `PERF_MEM_LVLNUM`, and `PERF_MEM_REMOTE`.
- Fixed a compilation error by using `decltype` instead of `typeof` (by @toge).
- Restructured the build system (thanks to @foolnotion):
  - Examples are no longer included in the default build and must be activated with `-DBUILD_EXAMPLES=1` (see documentation).
- New feature: Added option to install the library using `-DCMAKE_INSTALL_PREFIX=/path/to/libperf-cpp` (see documentation).
- New feature: Define period or frequency along with trigger events when sampling (see documentation).
- New feature: `cgroup` sampling (see documentation).
- New feature: Sampling for context switches (see documentation).
- New feature: Sampling for throttle events (see documentation).
- New feature: Sampling for raw values (see documentation).
- New feature: Sampling for transaction aborts (see documentation).
- New feature: Print results from `perf::EventCounter` as a table using `perf::CounterResult::to_string()`.
- Automatically discover AMD Instruction Based Sampling (IBS) PMUs when running on AMD hardware (see documentation).
- Automatically discover Intel Processor Event Based Sampling (PEBS) memory events when running on Intel hardware (see documentation).
- Enable Intel PEBS by default (until now, interrupt-based sampling was used unless specified otherwise via `perf::SampleConfig::precise_ip()`).
- Support Linux Kernel down to `4.0`; kernels no longer need to be specified via compiler defines.
- Close sampler automatically (i.e., free all buffers and close counters) when destructing.
- Fixed compilation error on ARM machines (`__builtin_cpu_is()` is not supported); thanks to @Tratori.
This release comes with many new features, especially focusing on the interface for sampling and error handling using exceptions.
Please note that we will maintain backward compatibility for the "old"-styled interface until v0.8.0.
Deprecated interfaces are marked as such using [[deprecated()]] annotations and may yield warnings during compilation.
Changelog:
- Samples can now be asked if they contain losses (and if so, how many). Sample records can be lost, e.g., if the buffer is out of capacity or the CPU is too busy.
- Errors when adding performance counters and opening/starting samplers are now communicated via exceptions instead of an error variable.
- Introduced a new interface for specifying the data that should be recorded for triggers through `Sampler::values()`.
- Introduced a new interface for specifying the triggers for sampling through `Sampler::trigger()`.
- Added the option to use multiple triggers for sampling (including example).
- Added the option to use different precisions for each trigger.
- Added the option to `open()` the sampler separately. If the sampler is not opened separately, `start()` will open the sampler.
- Added option to ask samples if they are precise (depends on the precision level for triggers).
- Using counter names from `perf::CounterDefinition` (via `std::string_view`) instead of copying strings for better performance.
- Switched from `PERF_MEM_LVL_*` to the newer `PERF_MEM_LVLNUM` namespace as `PERF_MEM_LVL_*` is marked as deprecated in `linux/perf_event.h`.
- Added multithread and multicore recording.
- Added multithread and multicore sampling.
- Switched to LGPL (instead of AGPL).
- Added more complex `WeightStruct` sampling (via `PERF_SAMPLE_WEIGHT_STRUCT`) to enable sampling for instruction latencies on newer hardware (e.g., Intel's Sapphire Rapids).
- Implemented debug output for counters by setting an `is_debug` flag in the config.
- Added more complex branch sampling.
- Implemented autocorrect of `precise_ip` configuration if the hardware rejects the initial user-set config.
- Implemented auxiliary counter to enable memory sampling on Intel's Sapphire Rapids.
- Disabled counter `cgroup-switches` for Linux Kernel `< 5.13` (was first introduced with that version).
- Disabled sampling for Data Page Size and Code Page Size for Linux Kernel `< 5.11` (was first introduced with that version).
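One way to picture the `precise_ip` autocorrection listed above is a fallback loop: start at the requested precision and lower it until the hardware accepts the configuration. The sketch below illustrates that idea under stated assumptions; it is not taken from the library. `open_with_fallback` and the `try_open` callback are hypothetical, and stepping downward one level at a time is an assumed strategy. The valid range of 0 to 3 matches the two-bit `precise_ip` field of `perf_event_attr`.

```cpp
#include <cassert>
#include <functional>

// Hypothetical helper: try the requested precise_ip level first and lower it
// step by step until try_open reports success.
// Returns the accepted level, or -1 if no level was accepted.
int open_with_fallback(unsigned requested_precise_ip,
                       const std::function<bool(unsigned)>& try_open)
{
    assert(requested_precise_ip <= 3U); // precise_ip is a two-bit field

    for (unsigned level = requested_precise_ip;; --level) {
        if (try_open(level)) {
            return static_cast<int>(level);
        }
        if (level == 0U) {
            return -1; // even the least precise setting was rejected
        }
    }
}
```

In a real setting, `try_open` would wrap the attempt to open the event with the given precision; here it is only a stand-in so the loop can be exercised in isolation.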
- Added support for register sampling.
- Added `make perf-list` to automatically extract perf counters from the underlying hardware.
- Added support for sampling data and code page sizes.
- Added support for event sampling.
- Added full documentation.
- Fixed `std::move` on `perf::CounterDefinition`.
- Added metrics (e.g., CPI).
- Added json/csv conversion from results.
- Added examples.