Move benchmarks to daily cron (#1302)

ludfjig · web-flow · commit 98a9030af451 · 2026-03-11T16:29:04.000-07:00
* refactor: consolidate Benchmarks.yml into dep_benchmarks.yml

Delete Benchmarks.yml and add its features (artifact upload, baseline_tag,
baseline_run_id, retention_days inputs) to dep_benchmarks.yml. Update
CreateRelease.yml to call dep_benchmarks.yml with a matrix directly.

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

* feat: move benchmarks from per-PR to daily cron

Remove the benchmarks job from ValidatePullRequest.yml and add a new
DailyBenchmarks.yml workflow that runs benchmarks daily, comparing against
the previous day's run artifacts with 90-day retention.
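
To reproduce the day-over-day baseline lookup locally, something like the
following sketch may help. It assumes the GitHub CLI (`gh`) is installed and
authenticated against this repository. The function names are hypothetical
and are not part of this change. The `gh run list` query and the artifact
naming scheme mirror what DailyBenchmarks.yml and dep_benchmarks.yml do.

```shell
# Hypothetical helpers (not part of this commit) for fetching a baseline by hand.

# Build the artifact name used by dep_benchmarks.yml:
#   benchmarks_<OS>_<hypervisor>_<cpu>
artifact_name() {
  printf 'benchmarks_%s_%s_%s\n' "$1" "$2" "$3"
}

# Print the ID of the most recent successful DailyBenchmarks.yml run.
# Prints nothing if no successful run exists yet.
latest_baseline_run() {
  gh run list --workflow DailyBenchmarks.yml --status success --limit 1 \
    --json databaseId --jq '.[0].databaseId // empty'
}

# Download that run's artifact for one OS/hypervisor/cpu combination into
# ./target/criterion/, the directory the workflow uses for its baseline.
download_baseline() {
  local run_id
  run_id="$(latest_baseline_run)"
  if [ -z "$run_id" ]; then
    echo "no successful baseline run found" >&2
    return 1
  fi
  gh run download "$run_id" \
    --name "$(artifact_name "$1" "$2" "$3")" \
    --dir ./target/criterion/
}
```

For example, `download_baseline Linux kvm amd` would fetch the artifact
`benchmarks_Linux_kvm_amd`, assuming `runner.os` resolved to `Linux` on that
runner.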

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

* docs: update benchmarking docs to reflect daily cron workflow

Replace references to per-PR benchmarks and Benchmarks.yml with the new
DailyBenchmarks.yml and dep_benchmarks.yml workflows.

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

* Add permission and fix docs

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

* Update issue title

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>

---------

Signed-off-by: Ludvig Liljenberg <4257730+ludfjig@users.noreply.github.com>
diff --git a/.github/workflows/Benchmarks.yml b/.github/workflows/Benchmarks.yml
diff --git a/.github/workflows/CreateRelease.yml b/.github/workflows/CreateRelease.yml
@@ -72,8 +72,16 @@ jobs:
 
   benchmarks:
     needs: [build-guests]
-    uses: ./.github/workflows/Benchmarks.yml
+    strategy:
+      fail-fast: true
+      matrix:
+        hypervisor: [hyperv, 'hyperv-ws2025', mshv3, kvm]
+        cpu: [amd, intel]
+    uses: ./.github/workflows/dep_benchmarks.yml
     secrets: inherit
+    with:
+      hypervisor: ${{ matrix.hypervisor }}
+      cpu: ${{ matrix.cpu }}
     permissions:
       contents: read
 
diff --git a/.github/workflows/DailyBenchmarks.yml b/.github/workflows/DailyBenchmarks.yml
@@ -0,0 +1,71 @@
+# yaml-language-server: $schema=https://json.schemastore.org/github-workflow.json
+
+name: Daily Benchmarks
+
+on:
+  schedule:
+    - cron: '0 0 * * *' # Runs at 00:00 UTC every day
+  workflow_dispatch: # Allow manual triggering
+
+permissions:
+  contents: read
+  actions: read
+
+jobs:
+  # Find the most recent successful run of this workflow so we can download
+  # its benchmark artifacts as a baseline for day-over-day comparison.
+  find-baseline:
+    runs-on: ubuntu-latest
+    outputs:
+      run-id: ${{ steps.find-run.outputs.run_id }}
+    steps:
+      - name: Find latest successful run
+        id: find-run
+        # gh run list returns runs sorted by creation date descending (implicit).
+        # If there is no successful run yet, this outputs empty and dep_benchmarks.yml
+        # falls back to the release baseline download (which is continue-on-error).
+        run: |
+          run_id=$(gh run list --repo "${{ github.repository }}" --workflow DailyBenchmarks.yml --status success --limit 1 --json databaseId --jq '.[0].databaseId // empty')
+          echo "run_id=$run_id" >> "$GITHUB_OUTPUT"
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
+
+  # Build release guest binaries needed by the benchmark suite.
+  build-guests:
+    uses: ./.github/workflows/dep_build_guests.yml
+    secrets: inherit
+    with:
+      config: release
+
+  # Run benchmarks across all hypervisor/cpu combos, comparing against
+  # the previous day's results. Artifacts are retained for 90 days.
+  benchmarks:
+    needs: [build-guests, find-baseline]
+    strategy:
+      fail-fast: true
+      matrix:
+        hypervisor: [hyperv, 'hyperv-ws2025', mshv3, kvm]
+        cpu: [amd, intel]
+    uses: ./.github/workflows/dep_benchmarks.yml
+    secrets: inherit
+    with:
+      hypervisor: ${{ matrix.hypervisor }}
+      cpu: ${{ matrix.cpu }}
+      baseline_run_id: ${{ needs.find-baseline.outputs.run-id }}
+      retention_days: 90
+
+  # File a GitHub issue if the build-guests or benchmarks job fails.
+  notify-failure:
+    runs-on: ubuntu-latest
+    needs: [build-guests, benchmarks]
+    if: always() && (needs.build-guests.result == 'failure' || needs.benchmarks.result == 'failure')
+    permissions:
+      issues: write
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v6
+
+      - name: Notify Benchmark Failure
+        run: ./dev/notify-ci-failure.sh --title="Benchmark Failure - ${{ github.run_number }}" --labels="area/benchmarks,area/testing,lifecycle/needs-review,release-blocker"
+        env:
+          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
diff --git a/.github/workflows/ValidatePullRequest.yml b/.github/workflows/ValidatePullRequest.yml
@@ -125,27 +125,6 @@ jobs:
       cpu: ${{ matrix.cpu }}
       config: ${{ matrix.config }}
 
-  # Run benchmarks - release only, needs guest artifacts, runs in parallel with build-test
-  benchmarks:
-    needs:
-      - docs-pr
-      - build-guests
-    # Required because update-guest-locks is skipped on non-dependabot PRs,
-    # and a skipped dependency transitively skips all downstream jobs.
-    # See: https://github.com/actions/runner/issues/2205
-    if: ${{ !cancelled() && !failure() }}
-    strategy:
-      fail-fast: true
-      matrix:
-        hypervisor: [hyperv, 'hyperv-ws2025', mshv3, kvm]
-        cpu: [amd, intel]
-    uses: ./.github/workflows/dep_benchmarks.yml
-    secrets: inherit
-    with: 
-      docs_only: ${{ needs.docs-pr.outputs.docs-only }}
-      hypervisor: ${{ matrix.hypervisor }}
-      cpu: ${{ matrix.cpu }}
-
   fuzzing:
     needs:
       - docs-pr
@@ -187,7 +166,6 @@ jobs:
       - code-checks
       - build-test
       - run-examples
-      - benchmarks
       - fuzzing
       - spelling
       - license-headers
diff --git a/.github/workflows/dep_benchmarks.yml b/.github/workflows/dep_benchmarks.yml
@@ -1,5 +1,28 @@
 # yaml-language-server: $schema=https://json.schemastore.org/github-workflow.json
 
+# Reusable workflow to run benchmarks on a single hypervisor/cpu combination.
+#
+# Baseline comparison:
+#   The workflow supports two mutually exclusive ways to load a baseline for
+#   Criterion to compare against:
+#
+#   1. baseline_run_id — Downloads benchmark artifacts from a previous workflow
+#      run (by run ID). Used by DailyBenchmarks.yml for day-over-day comparison.
+#
+#   2. baseline_tag — Downloads benchmark tarballs from a GitHub Release (by tag).
+#      If empty (the default), `gh release download` fetches from the latest
+#      stable release. Used by CreateRelease.yml.
+#
+#   If baseline_run_id is set, baseline_tag is ignored.
+#   If neither is set, the latest stable release is used.
+#   Both downloads use continue-on-error so the first-ever run (no baseline
+#   available) succeeds without comparison.
+#
+# Artifact upload:
+#   Benchmark results are always uploaded as workflow artifacts named
+#   benchmarks_<OS>_<hypervisor>_<cpu>. The retention_days input controls
+#   how long they are kept (default: 5 days).
+
 name: Run Benchmarks
 
 on:
@@ -18,6 +41,21 @@ on:
         description: CPU architecture for the build (passed from caller matrix)
         required: true
         type: string
+      baseline_tag:
+        description: Release tag to download baseline benchmarks from (e.g. dev-latest). Ignored if baseline_run_id is set. If empty, downloads from the latest stable release.
+        required: false
+        type: string
+        default: ""
+      baseline_run_id:
+        description: Workflow run ID to download baseline benchmark artifacts from. Takes precedence over baseline_tag.
+        required: false
+        type: string
+        default: ""
+      retention_days:
+        description: Number of days to retain benchmark artifacts
+        required: false
+        type: number
+        default: 5
 
 env:
   CARGO_TERM_COLOR: always
@@ -74,11 +112,29 @@ jobs:
       - name: Build
         run: just build release
 
-      - name: Download benchmarks from "latest"
-        run: just bench-download ${{ runner.os }} ${{ inputs.hypervisor }} ${{ inputs.cpu }} dev-latest # compare to prerelease
+      - name: Download baseline from previous run
+        if: ${{ inputs.baseline_run_id != '' }}
+        uses: actions/download-artifact@v8
+        with:
+          name: benchmarks_${{ runner.os }}_${{ inputs.hypervisor }}_${{ inputs.cpu }}
+          path: ./target/criterion/
+          run-id: ${{ inputs.baseline_run_id }}
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+        continue-on-error: true
+
+      - name: Download baseline from release
+        if: ${{ inputs.baseline_run_id == '' }}
+        run: just bench-download ${{ runner.os }} ${{ inputs.hypervisor }} ${{ inputs.cpu }} ${{ inputs.baseline_tag }}
         env:
           GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
         continue-on-error: true
 
       - name: Run benchmarks
         run: just bench-ci main
+
+      - uses: actions/upload-artifact@v7
+        with:
+          name: benchmarks_${{ runner.os }}_${{ inputs.hypervisor }}_${{ inputs.cpu }}
+          path: ./target/criterion/
+          if-no-files-found: error
+          retention-days: ${{ inputs.retention_days }}
diff --git a/docs/benchmarking-hyperlight.md b/docs/benchmarking-hyperlight.md
@@ -2,10 +2,10 @@
 
 Hyperlight uses the [Criterion](https://bheisler.github.io/criterion.rs/book/index.html) framework to run and analyze benchmarks. A benefit to this framework is that it doesn't require the nightly toolchain.
 
-## When Benchmarks are ran
+## When Benchmarks are run
 
-1. Every time a branch gets a push
-    - Compares the current branch benchmarking results to the "dev-latest" release (which is the most recent push to "main" branch). This is done as part of `dep_rust.yml`, which is invoked by `ValidatePullRequest.yml`. These benchmarks are for the developer to compare their branch to main, and the results can only be seen in the GitHub action logs, and nothing is saved. 
+1. Daily (scheduled)
+    - Benchmarks run daily via `DailyBenchmarks.yml`, comparing results against the most recent successful daily run (normally the previous day's). Results are stored as workflow artifacts with 90-day retention.
 
     ```
     sandboxes/create_sandbox
@@ -15,9 +15,9 @@ Hyperlight uses the [Criterion](https://bheisler.github.io/criterion.rs/book/ind
     ```
    
 2. For each release
-    - For each release, benchmarks are ran as part of the release pipeline in `CreateRelease.yml`, which invokes `Benchmarks.yml`. These benchmark results are compared to the previous release, and are uploaded as port of the "Release assets" on the GitHub release page.
+    - Benchmarks are run as part of the release pipeline in `CreateRelease.yml`, which invokes `dep_benchmarks.yml`. The results are compared to the previous release and uploaded as part of the "Release assets" on the GitHub release page.
 
-Currently, benchmarks are ran on windows, linux-kvm (ubuntu), and linux-hyperv (mariner). Only release builds are benchmarked, not debug.
+Currently, benchmarks are run on Windows, linux-kvm (Ubuntu), and linux-hyperv (Mariner). Only release builds are benchmarked, not debug.
 
 ## Criterion artifacts