C++ application with complex control flow (many branches, virtual calls, deep call trees) where -O3 alone leaves significant performance on the table. The metric was end-to-end throughput of a compiler-like workload (parsing + optimization + code generation). The optimizer makes guesses about branch probabilities and inlining without runtime data, often getting it wrong.
PGO (Profile-Guided Optimization) feeds actual runtime profiling data back into the compiler, enabling:
- Accurate branch probability annotations (hot paths get fall-through layout)
- Informed inlining decisions (inline functions on hot paths, skip cold ones)
- Hot/cold code splitting (frequently executed code packed together for better I-cache utilization)
- Better register allocation along hot paths
LTO (Link-Time Optimization) performs whole-program optimization across translation units, enabling cross-module inlining, dead code elimination, and interprocedural constant propagation.
Combined PGO+LTO achieved 22% throughput improvement on a real-world workload. PGO alone gave ~15%, LTO alone ~8%, but they compound because LTO exposes more inlining opportunities for PGO-guided decisions.
| Configuration | Throughput (ops/s) | Binary Size |
|---|---|---|
| -O3 baseline | 1,000 | 12.1 MB |
| -O3 + LTO | 1,082 | 10.8 MB |
| -O3 + PGO | 1,148 | 12.4 MB |
| -O3 + PGO + LTO | 1,221 | 11.2 MB |
# GCC PGO workflow (three-step):
# 1. Build instrumented binary
g++ -O3 -fprofile-generate=./profdata -flto -o app_instrumented *.cpp
# 2. Run with representative workload to collect profile
./app_instrumented < representative_input.txt
# 3. Rebuild using profile data
g++ -O3 -fprofile-use=./profdata -flto -o app_optimized *.cpp
# Clang uses -fprofile-instr-generate / -fprofile-instr-use instead- Non-representative training data: PGO with synthetic benchmarks that don't match production traffic led to worse performance than baseline (-3%) because the optimizer optimized for the wrong hot paths. The training workload must closely match production.
- PGO on very small programs: The overhead of instrumentation and the three-step build process isn't worth it for programs under ~10K lines where
-O3already does well.
GCC 13.1 / Clang 17, Linux. PGO is supported by all major compilers. LTO requires all translation units to be compiled with the same compiler.