| 1 | +# Profile-Guided Optimization and Link-Time Optimization |
| 2 | + |
| 3 | +## Problem |
| 4 | + |
| 5 | +The target was a C++ application with complex control flow (many branches, virtual calls, deep call trees) where `-O3` alone left significant performance on the table. The metric was end-to-end throughput of a compiler-like workload (parsing + optimization + code generation). Without runtime data, the optimizer has to guess branch probabilities and inlining candidates, and it often guesses wrong. |
| 6 | + |
| 7 | +## What Worked |
| 8 | + |
| 9 | +**PGO (Profile-Guided Optimization)** feeds actual runtime profiling data back into the compiler, enabling: |
| 10 | +- Accurate branch probability annotations (hot paths get fall-through layout) |
| 11 | +- Informed inlining decisions (inline functions on hot paths, skip cold ones) |
| 12 | +- Hot/cold code splitting (frequently executed code packed together for better I-cache utilization) |
| 13 | +- Better register allocation along hot paths |
| 14 | + |
| 15 | +**LTO (Link-Time Optimization)** performs whole-program optimization across translation units, enabling cross-module inlining, dead code elimination, and interprocedural constant propagation. |
| 16 | + |
| 17 | +Combined, PGO+LTO improved throughput by 22% over the `-O3` baseline on a real-world workload. PGO alone gave ~15% and LTO alone ~8%. The two gains stack almost additively rather than overlapping, because LTO exposes more cross-module inlining candidates for PGO-guided decisions. |
| 18 | + |
| 19 | +## Experiment Data |
| 20 | + |
| 21 | +| Configuration | Throughput (ops/s) | Binary Size | |
| 22 | +|--------------|-------------------|-------------| |
| 23 | +| -O3 baseline | 1,000 | 12.1 MB | |
| 24 | +| -O3 + LTO | 1,082 | 10.8 MB | |
| 25 | +| -O3 + PGO | 1,148 | 12.4 MB | |
| 26 | +| -O3 + PGO + LTO | 1,221 | 11.2 MB | |
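
The percentages quoted in this note can be recomputed from the table. The sketch below is only an illustration: the throughput values are copied from the table, the `-O3` row is taken as the baseline, and the helper name `gains` is made up for this example. It also prints what the combined gain would be if the two individual gains simply multiplied, as a reference point for the measured PGO + LTO row.

```bash
#!/usr/bin/env bash
# Recompute relative throughput gains from the Experiment Data table.
# Input rows are "label|ops_per_s", copied from the table; base is the -O3 row.
gains() {
  awk -F'|' -v base=1000 '
    { r[$1] = $2 / base
      printf "%s: %+.1f%%\n", $1, (r[$1] - 1) * 100 }
    END {
      # Hypothetical reference point: the two gains multiplying independently.
      printf "LTO x PGO (if independent): %+.1f%%\n", (r["LTO"] * r["PGO"] - 1) * 100
    }
  ' <<'EOF'
LTO|1082
PGO|1148
PGO+LTO|1221
EOF
}

gains
```

Run as written, this should report +8.2%, +14.8%, and +22.1% for the three rows and +24.2% for the independent-multiplication reference, so the measured combined gain sits slightly below both the sum and the product of the individual gains.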
| 27 | + |
| 28 | +## Code Example |
| 29 | + |
| 30 | +```bash |
| 31 | +# GCC PGO workflow (three-step): |
| 32 | +# 1. Build instrumented binary |
| 33 | +g++ -O3 -fprofile-generate=./profdata -flto -o app_instrumented *.cpp |
| 34 | + |
| 35 | +# 2. Run with representative workload to collect profile |
| 36 | +./app_instrumented < representative_input.txt |
| 37 | + |
| 38 | +# 3. Rebuild using profile data |
| 39 | +g++ -O3 -fprofile-use=./profdata -flto -o app_optimized *.cpp |
| 40 | + |
| 41 | +# Clang uses -fprofile-instr-generate / -fprofile-instr-use=<file> instead, with an llvm-profdata merge step between steps 2 and 3 |
| 42 | +``` |
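
For Clang, the same workflow needs one extra command between the training run and the rebuild: the raw profile has to be converted with `llvm-profdata merge` before the compiler can consume it. The following is a sketch rather than the build used for the measurements in this note. The profile file names (`app.profraw`, `app.profdata`), the `run` helper, and the `DRY_RUN` switch are assumptions introduced for illustration. By default it only prints the commands.

```bash
#!/usr/bin/env bash
# Clang instrumentation-based PGO + LTO, mirroring the GCC steps.
# Assumed names (not from the original build): app.profraw, app.profdata,
# run(), DRY_RUN.
# DRY_RUN=1 (the default) prints each command; DRY_RUN=0 executes it.

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

clang_pgo_build() {
  # 1. Build the instrumented binary.
  run clang++ -O3 -flto -fprofile-instr-generate -o app_instrumented ./*.cpp &&
  # 2. Run the representative workload; LLVM_PROFILE_FILE names the raw profile.
  run sh -c 'LLVM_PROFILE_FILE=app.profraw ./app_instrumented < representative_input.txt' &&
  # 3. Convert the raw profile into the indexed format the compiler reads.
  run llvm-profdata merge -o app.profdata app.profraw &&
  # 4. Rebuild using the merged profile.
  run clang++ -O3 -flto -fprofile-instr-use=app.profdata -o app_optimized ./*.cpp
}

clang_pgo_build
```

Setting `DRY_RUN=0` executes the commands instead of printing them; the steps are chained with `&&` so a failed build or training run stops the sequence.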
| 43 | + |
| 44 | +## What Didn't Work |
| 45 | + |
| 46 | +- **Non-representative training data**: PGO trained on synthetic benchmarks that don't match production traffic performed *worse* than the baseline (-3%), because the compiler tuned code layout and inlining for the wrong hot paths. The training workload must closely match production. |
| 47 | +- **PGO on very small programs**: The overhead of instrumentation and the three-step build process isn't worth it for programs under ~10K lines where `-O3` already does well. |
| 48 | + |
| 49 | +## Environment |
| 50 | + |
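| 51 | +GCC 13.1 / Clang 17, Linux. PGO is supported by all major compilers (GCC, Clang, MSVC). LTO requires every translation unit to be built with the same compiler, because the intermediate representation stored in LTO object files is compiler-specific. |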