benchmark: alloc profile testitem (Ipopt + MadNLP) #95

Open
jack-champagne wants to merge 4 commits into benchmarks/directtrajopt-initial-v2 from benchmarks/alloc-profile-v1


@jack-champagne jack-champagne commented May 20, 2026

Summary

Adds an allocation profile of the bilinear-N=51 solve under each solver
(Ipopt + MadNLP), using HBJ's new benchmark_memory! + report_alloc_profile
API. Produces a JLD2 AllocProfileResult artifact per solver and prints a
top-K breakdown by allocated type, leaf call site, and frame.

Picks up where closed DTO#71
left off — that PR's analyzer logic has since shipped in HBJ as src/analyze.jl;
this PR is the DTO-side wiring that exercises it.

Stacked on #75 so
reviewers see only this PR's diff. GitHub will auto-retarget to main once
#75 merges.

What's in the PR

  • benchmark/alloc_profile.jl: new testitem "Alloc profile: bilinear N=51 (Ipopt + MadNLP)". Uses max_iter = 30 and sample_rate = 0.01. Full-rate sampling hangs the bilinear solve for more than 15 min (per closed DTO#71's notes), and even at 0.01 each solver takes ~30-40 min; that overhead is inherent to Julia's Profile.Allocs and does not shrink with max_iter.
  • .github/workflows/alloc-profile.yml (new): dedicated workflow for the alloc profile testitem. Paths-filtered to benchmark/alloc_profile.jl, benchmark/problem_utils.jl, benchmark/Project.toml, and itself, so it only runs when the alloc-profile config changes. 90-min timeout. Uploads benchmark/results/allocs/ as a separate artifact.
  • .github/workflows/benchmark.yml: filters the alloc-profile testitem out of the main benchmark workflow. The main workflow's timeout reverts to 60 min (back from the brief 180 min experiment) since the heavy testitem has moved.
  • benchmark/Project.toml: HBJ pin bumped from 5401542c (v0.2.0 prep) to c38418cb (post-HBJ#12, ships the analyzer).

Why split workflows

Measured locally on a fast workstation, the alloc-profile testitem alone takes 47m19s; on GH Actions it takes ~74 min. Profile.Allocs adds ~30-40 min per solve regardless of max_iter; the per-allocation check overhead is the bottleneck, not the sampling rate. Keeping the testitem in the main benchmark workflow forced every src/ or benchmark/ PR to pay that cost (or, before the split, to hit the timeout).

With the split, the main benchmark workflow stays at ~32 min for the fast green-light signal, and the alloc profile runs in its own 90-min-budget workflow only when the alloc-profile config actually changes. This is the same TestItemRunner filter pattern already used to keep main-package CI from picking up benchmark testitems.
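
The filter split can be sketched as a single name-based predicate, used positively in one workflow and negated in the other. This is a minimal illustration, not the PR's actual code: `is_alloc_profile`, the second item's file name, and its title are hypothetical. The NamedTuples stand in for the testitem metadata (`filename`, `name`, `tags`) that TestItemRunner passes to its `filter` callback.

```julia
# Hypothetical predicate: match the alloc-profile testitem by its title prefix.
is_alloc_profile(ti) = startswith(ti.name, "Alloc profile")

# Stand-ins for testitem metadata. The second entry is an invented example.
items = [
    (filename = "benchmark/alloc_profile.jl",
     name = "Alloc profile: bilinear N=51 (Ipopt + MadNLP)",
     tags = Symbol[]),
    (filename = "benchmark/timing.jl",
     name = "Ipopt vs MadNLP timing",
     tags = Symbol[]),
]

main_items  = filter(!is_alloc_profile, items)  # what the main benchmark workflow would run
alloc_items = filter(is_alloc_profile, items)   # what the alloc-profile workflow would run

println("main:  ", [ti.name for ti in main_items])
println("alloc: ", [ti.name for ti in alloc_items])

# With TestItemRunner loaded, the corresponding calls would look roughly like:
#   @run_package_tests filter = ti -> !is_alloc_profile(ti)   # main benchmark workflow
#   @run_package_tests filter = ti -> is_alloc_profile(ti)    # alloc-profile workflow
```

Keeping one predicate and negating it means the two workflows cannot both pick up, or both skip, the same testitem.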

Sample output (from local run, identical shape on CI)

=== Alloc profile: Ipopt (bilinear N=51, sample_rate=0.01) ===
  samples=26597  total≈13.13 MB (scaled to 1.28 GB via 1/sample_rate)

Top 10 allocated types (scaled ×100):
  21.47 MB  Memory{ForwardDiff.Dual{...eval_jacobian...}}
  19.41 MB  Memory{ForwardDiff.Dual{...eval_hessian_of_lagrangian...}}
  13.23 MB  ForwardDiff.HessianConfig{...eval_hessian_of_lagrangian...}
  ...

Hot spot: Memory{ForwardDiff.Dual{...}} buffers in the integrators' jacobian + hessian AD pipelines. Same conclusion as the evaluator micro-bench (eval_hessian_lagrangian dominates per-iter time). Real, actionable.
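
The scaling in the header line of the sample output can be checked by hand. The sketch below reproduces the arithmetic, assuming the log's "GB" means 1024 MB. That convention is an assumption; the source does not say which one report_alloc_profile uses.

```julia
sample_rate = 0.01          # from the testitem config
sampled_mb  = 13.13         # "total≈13.13 MB" in the sample output

scale     = 1 / sample_rate         # 100, the "×100" in the table header
scaled_mb = sampled_mb * scale      # ≈ 1313 MB
scaled_gb = scaled_mb / 1024        # assumption: 1 GB = 1024 MB

println(round(scaled_gb; digits = 2))   # prints 1.28
```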

Test plan

  • Main Benchmarks workflow green (~32 min) — confirms the alloc-profile filter removed it cleanly
  • New Alloc Profile workflow green (~75 min) — confirms the split workflow runs the testitem and uploads artifacts
  • Artifact alloc-profile-95-<sha> contains alloc_bilinear_N51_ipopt_<sha>_allocs.jld2 + the MadNLP equivalent
  • Workflow log shows === Alloc profile: ... === sections with populated "Top N allocated types" tables (top entry not a Profile.Allocs.* noise type)

Follow-ups

Wires the new HBJ analyzer (`benchmark_memory!` + `report_alloc_profile`)
into the benchmark suite as a fourth testitem. Produces JLD2
allocation-profile artifacts under `benchmark/results/allocs/` (already
covered by the existing workflow's `benchmark/results/` upload) and
prints a top-K breakdown by type / leaf call site / frame for each
solver.

The testitem uses the same bilinear N=51 problem as the existing
Ipopt-vs-MadNLP timing testitem so allocation hotspots line up with
the timing numbers.

`sample_rate = 0.01` keeps the trace tractable — `Profile.Allocs` slows
the solve roughly linearly in number of allocations, and the bilinear
solve produces millions of allocs; full-rate sampling on N=10 hung
>15 min in earlier experiments (cf. closed DTO#71). The 1/sample_rate
extrapolation applied by `report_alloc_profile` scales the sampled totals
back up to an estimate of the full-rate totals.
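
For readers unfamiliar with sampled allocation profiling, here is a minimal sketch using only Julia's stdlib `Profile.Allocs`. It does not use the HBJ API; `workload` is a hypothetical stand-in for the solve. The grouping only illustrates the general idea of a per-type breakdown with 1/sample_rate scaling, not how `report_alloc_profile` is implemented.

```julia
using Profile

# Hypothetical stand-in workload: each iteration allocates a small vector.
function workload(n)
    acc = 0.0
    for _ in 1:n
        acc += sum(rand(16))
    end
    return acc
end

workload(10)  # warm up so compilation allocations are not profiled

Profile.Allocs.clear()
Profile.Allocs.@profile sample_rate=0.1 workload(100_000)
results = Profile.Allocs.fetch()

sample_rate = 0.1  # must match the rate passed to @profile above
sampled_bytes   = sum((a.size for a in results.allocs); init = 0)
estimated_bytes = sampled_bytes / sample_rate  # 1/sample_rate extrapolation

# Group sampled bytes by allocated type, largest first.
by_type = Dict{Any,Int}()
for a in results.allocs
    by_type[a.type] = get(by_type, a.type, 0) + a.size
end
top = sort(collect(by_type); by = last, rev = true)

println("samples=", length(results.allocs),
        "  estimated total=", round(estimated_bytes / 2^20; digits = 2), " MiB")
for (T, bytes) in first(top, 5)
    println("  ", round(bytes / sample_rate / 2^20; digits = 2), " MiB  ", T)
end
```

Because sampling is random, the sample count and the estimated totals vary slightly from run to run.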

Bumps HBJ pin from 5401542c (v0.2.0 prep) to c38418cb (post-#12,
analyzer + Piccolo-aligned CI) since the analyzer's exports
(`top_alloc_types`, `report_alloc_profile`, …) didn't exist at the
v0.2.0 prep commit. Other consumers of the bench env (timing,
scaling) already work against c38418cb — no behavioral change for
them.

Stacked on `benchmarks/directtrajopt-initial-v2` (PR #75) so reviewers
see only this testitem's diff. Will retarget to main when #75 lands.

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


Profile.Allocs adds ~30-40 min per solve regardless of max_iter: local timing on a fast workstation showed 47 min for the testitem alone, and ~74 min on GH Actions runners. That makes it impractical to gate every PR on the alloc profile (raising the timeout to 180 min would burn ~115 min of CI per benchmark/ or src/ change). Measured timings for bilinear N=51, max_iter=30, sample_rate=0.01: Ipopt section ~22 min local / ~42 min CI; MadNLP section ~25 min local / ~32 min CI.

Solution: split into a dedicated alloc-profile workflow with a paths filter (benchmark/alloc_profile.jl, benchmark/problem_utils.jl, benchmark/Project.toml, .github/workflows/alloc-profile.yml). Main benchmark workflow filters the alloc-profile testitem out and reverts to timeout-minutes: 60 (now ~32min wall time again). The alloc-profile workflow gets its own 90min budget, uploads artifacts under benchmark/results/allocs/.

TestItemRunner filter mirrors the same approach already used in test/runtests.jl to keep main-package CI from picking up benchmark testitems.