Skip to content

benchmarks: Initial DirectTrajOpt benchmark suite (Ipopt vs MadNLP)#75

Open
jack-champagne wants to merge 15 commits into
mainfrom
benchmarks/directtrajopt-initial-v2
Open

benchmarks: Initial DirectTrajOpt benchmark suite (Ipopt vs MadNLP)#75
jack-champagne wants to merge 15 commits into
mainfrom
benchmarks/directtrajopt-initial-v2

Conversation

@jack-champagne
Copy link
Copy Markdown
Member

Re-open of #67 (was merged into wrong base, reverted, re-applied).

Summary

Initial benchmarking infrastructure for DirectTrajOpt.jl comparing Ipopt and MadNLP solvers.

Depends on: HarmoniqsBenchmarks.jl

What's included

  • benchmark/Project.toml — Benchmark environment with HarmoniqsBenchmarks, BenchmarkTools, TestItems, MadNLP
  • benchmark/benchmarks.jl — Three @testitem suites:
    • Evaluator micro-benchmarks (bilinear N=51)
    • Ipopt vs MadNLP macro-benchmark (bilinear N=51)
    • Memory scaling study — sweep N ∈ {25, 51, 101} × state_dim ∈ {4, 8, 16}
  • .github/workflows/benchmark.yml — CI workflow

Test plan

  • Run benchmark suite — all 3 benchmarks produce JLD2 artifacts
  • Verify Ipopt vs MadNLP timing numbers
  • CI green

🤖 Generated with Claude Code

@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

jack-champagne and others added 5 commits May 15, 2026 16:18
- Add docs/src/benchmarks.md mirroring CuQuantum.jl pattern: problem
  description, result tables, environment info, reproduction instructions
- Wire benchmarks page into docs/make.jl
- Extract shared problem constructors to benchmark/problem_utils.jl
  (make_bilinear_problem, make_scaled_problem) to eliminate duplication
- Remove hardcoded package_version, pass runner consistently
- Update README with docs link and regression detection example

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Remove fragile Base.loaded_modules_order lookup for MadNLPSolverExt;
  MadNLPOptions is exported directly from DirectTrajOpt
- Change MadNLP print_level from 1 (TRACE) to 6 (ERROR) to match
  Ipopt's silent output and avoid polluting benchmark stdout
- Derive state_dim/control_dim from problem_dims(prob) instead of
  hardcoding in micro-benchmark MicroBenchmarkResult
- Set BENCHMARK_RUNNER=github-actions in CI workflow so results
  are distinguishable from local runs
- Remove unused ForwardDiff and LinearAlgebra from benchmark/Project.toml

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Pin HarmoniqsBenchmarks source to fix/widen-dto-compat branch which
  has DirectTrajOpt compat "0.8, 0.9" (fixes Pkg.instantiate failure)
- Run JuliaFormatter on benchmark files

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…merge

The compat-widening branch has merged to HBJ main, so point the benchmark
Project.toml at main. HBJ is not yet registered in General, so we still
need the explicit URL+rev source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@jack-champagne jack-champagne force-pushed the benchmarks/directtrajopt-initial-v2 branch from 8d3ff5f to ff9d6f6 Compare May 15, 2026 20:21
@jack-champagne jack-champagne changed the base branch from feat/madnlp-gpu-passthroughs to main May 20, 2026 05:01
The uncommented top-level `Base.loaded_modules_order` lookup and the
`for seed = 0:99` wins-counting loop are scope creep from the original
PR — the file is not picked up by `@run_package_tests` nor referenced
anywhere, so the only visible effect of uncommenting was a future
landmine: the top-level lookup will throw `BoundsError` if
MadNLPSolverExt is not loaded, and the loop contains `exit(1)` which
would kill the test runner outright.

Restore the original commented-out state. If we want to revive this as
a real test later, wrap it in `@testitem` with `@test err < 1e-3`
instead of `exit(1)`.
`contains(ti.filename, "benchmark")` would silently exclude any future
testitem whose filename happens to contain "benchmark" (e.g.
`test/regressions/hessian_benchmark.jl`). Switch to a path-component
match via `splitpath` so only files actually under a `benchmark/`
directory are excluded.
Tracking `rev = "main"` makes benchmark CI results non-reproducible: a
force-push or follow-up commit on HBJ silently changes the benchmark
schema for every old commit's CI re-run. Pin to the v0.2.0 prep tip
(5401542c — convergence schema + extractors), bump as needed when we
adopt new HBJ helpers.

Drop in favor of `[compat]` once HBJ registers in General.
`make_scaled_problem` accepted `seed::Int = 42` but the scaling sweep
never passed one — so every cell used seed 42, which means the timing
table was effectively measuring nine variations of the same random
instance rather than a sweep across instances.

Generate a deterministic per-cell seed (`1000 + 100*N + dim`). Both
solvers still receive the same instance for that cell so the
Ipopt-vs-MadNLP comparison stays apples-to-apples; only across cells
does the underlying problem change.

This is still single-shot per cell — a follow-up could sweep K seeds
and report median if we want statistical robustness.
The bare `try … catch end` blocks for `pkg_version` and the inline
`try … catch; ; "unknown" end` for the git SHA both hid real failures
silently (and the latter had a syntactically dubious `;` continuation).
Capture the exception with `@warn` so CI surfaces a "Failed to look up
…" message if the lookup ever breaks, rather than letting downstream
results carry "unknown" with no explanation.

Also hoist the git rev-parse into a named local for readability.
First-call JIT compile (Ipopt/MadNLP extension load, KKT/AD codegen)
dominates wall-time on small problems and biases the Ipopt-vs-MadNLP
comparison by solver order — whichever solver runs first pays the MOI
extension JIT that the second avoids.

Add a tiny `max_iter = 2` warmup solve per solver before the timed
section in both the macro-benchmark and the memory-scaling sweep. Each
@testitem runs in its own Julia process, so the warmup needs to be
inside the testitem (compilation cache is per-process).

This makes the reported wall-times steady-state numbers rather than
"cold start + a few iterations of solve work".
The numbers in the benchmark page were single-run local samples that
got captured into the docs build verbatim. Without a workflow wiring
real CI artifacts into the docs, presenting them as bolded headline
claims ("MadNLP is 33% faster", "40–45% less memory") would let them
become the canonical DirectTrajOpt figures people quote — which is
misleading because they shift with hardware, BLAS, MUMPS, Julia
version, and even solver order before the warmup fix.

- Add an upfront note that the tables are example output showing the
  schema of each benchmark, not pinned reference measurements.
- Remove the "33% faster / 43% fewer / 40–45% less / 30–35% faster"
  bolded comparison sentences.
- Reframe the scaling table around what it's actually good for: each
  solver vs itself over time, not a single-shot solver comparison.

Real numbers should come from the CI workflow's JLD2 artifacts. Wiring
those into the docs build is a follow-up.
JuliaFormatter line-broke the `DirectTrajOpt.solve!(...; options=...)`
calls in the warmup blocks. Apply the formatter pass so the Formatter
CI check stays green.
jack-champagne added a commit that referenced this pull request May 20, 2026
The convergence suite under `benchmark/convergence/` uses `@testitem`
blocks that depend on `HarmoniqsBenchmarks` — which isn't in the
package's main test environment. `@run_package_tests` was picking those
up and failing with `ArgumentError: Package HarmoniqsBenchmarks not
found in current path` on Julia 1.10/1.11/1.12.

Filter out anything whose path includes a `benchmark` component. Match
the path component (via `splitpath`) rather than a substring so future
test files like `foo_benchmark.jl` aren't accidentally excluded.

Mirrors the same filter added in PR #75 (initial benchmark suite).
MadNLP's `print_level` is a LogLevels enum where 0 isn't a valid value
(`ArgumentError: invalid value for Enum LogLevels: 0`). The timed
benchmark calls in this file already use `print_level=6` (silent at
ERROR level) — make the warmup solves match.

Caught by the benchmark CI workflow.
Single-shot timing on random instances was noisy enough to be
misleading. The previous CI run on this PR produced one anomalous cell:

  N=25, dim=8: Ipopt 0.011s / 22KB alloc  ← almost certainly an
                                            iter-0 termination on a
                                            seed-degenerate initial
                                            point

The seed for that cell was `1000 + 100*25 + 8 = 3508`. Other cells
trended roughly as expected with problem size, but cell-to-cell single
shots are noisy enough that the comparison numbers aren't trustworthy
(e.g. MadNLP at 25×16 took 29s but at 51×16 took 39s; MadNLP at 25×4
was slower than at 51×4).

Switch to median over `n_seeds = 3` independent random instances per
(N, dim) cell:

  - Seed scheme: `1000 + 100*N + dim + 10_000*(k-1)` for k ∈ 1:3.
    Per-cell distinct, per-sample distinct.
  - Both solvers receive the *same* instance for each (cell, seed) so
    per-seed Ipopt-vs-MadNLP comparisons stay fair.
  - All 3×K BenchmarkResults per cell are saved to JLD2 — the printed
    table shows medians but the raw distribution is preserved for
    downstream analysis.

Adds Statistics to benchmark/Project.toml for `median`.

Runtime cost: ~3x the previous scaling-sweep section, which puts the
whole benchmark workflow at ~15 min wall time (previously ~9 min,
still well under the workflow's default 6h timeout).

Docs page (`docs/src/benchmarks.md`) updated to describe the
median-of-K scheme and note that per-seed results are persisted to
the JLD2 artifact.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant