-
Notifications
You must be signed in to change notification settings - Fork 250
[scheduler] Add end-to-end stress tests to scheduler #2425
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,150 @@ | ||
| name: OpenCue Scheduler Stress Pipeline | ||
|
|
||
| # Runs the scheduler booking + accounting stress suite | ||
| # (rust/crates/scheduler/tests/stress_tests.rs): a full pipeline::run against a | ||
| # seeded farm, with an end-of-run audit that cross-checks the Redis acct:* | ||
| # hashes against SUM(proc) in Postgres and asserts cap enforcement. | ||
| # | ||
| # When it runs — and when it deliberately doesn't: | ||
| # - Pull requests: only when the scheduler crate, its proto dependency, the | ||
| # DB migrations, or this workflow change. The suite needs a migrated | ||
| # Postgres plus a Redis container and takes several minutes — running it | ||
| # for Python/CueGUI/docs changes would burn runner time for zero signal. | ||
| # - Nightly on master: catches drift from changes that slipped past the | ||
| # paths filter (e.g. shared workspace dependencies) and gives a daily | ||
| # throughput data point under fixed scale. | ||
| # - Manually (workflow_dispatch): for benchmarking a branch at custom scale. | ||
| # | ||
| # What is a gate vs. what is informational: | ||
| # - The job FAILS on correctness regressions: accounting drift between Redis | ||
| # and Postgres, cap breaches (subscription burst / job max-cores), booking | ||
| # liveness (<90% drain, no saturation rejections), or leftover test data. | ||
| # - The throughput numbers (frames/s) are reported in the step summary but | ||
| # are NOT asserted on: shared runners are too noisy for perf gating. For | ||
| # real benchmarking run the suite locally in release mode (see | ||
| # docs/_docs/developer-guide/scheduler-stress-testing.md). | ||
|
|
||
| on: | ||
| pull_request: | ||
| branches: ["master"] | ||
| paths: | ||
| - "rust/crates/scheduler/**" | ||
| - "rust/crates/opencue-proto/**" | ||
| - "rust/Cargo.toml" | ||
| - "cuebot/src/main/resources/conf/ddl/postgres/migrations/**" | ||
| - ".github/workflows/scheduler-stress-pipeline.yml" | ||
| schedule: | ||
| # Nightly on master. Odd minute to avoid the top-of-hour scheduling rush. | ||
| - cron: "23 9 * * *" | ||
| workflow_dispatch: | ||
| inputs: | ||
| stress_jobs: | ||
| description: "Drain-phase job count (default 300)" | ||
| required: false | ||
| stress_hosts: | ||
| description: "Drain-phase host count (default 1200)" | ||
| required: false | ||
| stress_frames_per_layer: | ||
| description: "Drain-phase frames per layer (default 5)" | ||
| required: false | ||
| stress_timeout_secs: | ||
| description: "Per-phase hard timeout in seconds (default 600)" | ||
| required: false | ||
|
|
||
| concurrency: | ||
| group: ${{ github.workflow }}-${{ github.ref }} | ||
| cancel-in-progress: true | ||
|
|
||
| env: | ||
| CARGO_TERM_COLOR: always | ||
| CARGO_INCREMENTAL: 0 | ||
| CARGO_NET_RETRY: 10 | ||
| RUSTUP_MAX_RETRIES: 10 | ||
|
|
||
| jobs: | ||
| stress: | ||
| runs-on: ubuntu-22.04 | ||
| timeout-minutes: 45 | ||
|
|
||
| # The suite's Postgres half. Redis is NOT listed here on purpose: the test | ||
| # starts its own throwaway Redis via testcontainers, so all accounting | ||
| # state is guaranteed to die with the test process. | ||
| services: | ||
| postgres: | ||
| image: postgres:15.1 | ||
| env: | ||
| POSTGRES_USER: cuebot | ||
| POSTGRES_PASSWORD: cuebot_password | ||
| POSTGRES_DB: cuebot | ||
| ports: | ||
| - 5432:5432 | ||
| options: >- | ||
| --health-cmd "pg_isready -U cuebot -d cuebot" | ||
| --health-interval 5s | ||
| --health-timeout 5s | ||
| --health-retries 10 | ||
|
|
||
| steps: | ||
| - uses: actions/checkout@v4 | ||
|
|
||
| - name: Install build dependencies | ||
| run: | | ||
| sudo apt-get update && sudo apt-get install -y libx11-dev protobuf-compiler libcurl4-openssl-dev postgresql-client | ||
| curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rust-install.sh | ||
| bash ./rust-install.sh -y | ||
|
|
||
| - name: Cache cargo deps | ||
| uses: Swatinem/rust-cache@v2 | ||
| with: | ||
| workspaces: rust | ||
| cache-on-failure: true | ||
|
|
||
| # Plain psql instead of the sandbox Flyway image: the migrations are | ||
| # versioned plain-SQL files, so applying them in numeric order is exactly | ||
| # what Flyway would do, without building a JDK image first. | ||
| - name: Apply database migrations | ||
| working-directory: cuebot/src/main/resources/conf/ddl/postgres/migrations | ||
| env: | ||
| PGPASSWORD: cuebot_password | ||
| run: | | ||
| for f in $(ls V*.sql | sort -t V -k2 -n); do | ||
| echo "== $f" | ||
| psql -q -v ON_ERROR_STOP=1 -h localhost -U cuebot -d cuebot -f "$f" | ||
| done | ||
|
|
||
| - name: Run stress suite | ||
| working-directory: rust | ||
| shell: bash | ||
| env: | ||
| STRESS_JOBS: ${{ inputs.stress_jobs }} | ||
| STRESS_HOSTS: ${{ inputs.stress_hosts }} | ||
| STRESS_FRAMES_PER_LAYER: ${{ inputs.stress_frames_per_layer }} | ||
| STRESS_TIMEOUT_SECS: ${{ inputs.stress_timeout_secs }} | ||
| run: | | ||
| cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture 2>&1 | tee stress-output.log | ||
|
|
||
| - name: Publish phase report | ||
| if: always() | ||
| working-directory: rust | ||
| shell: bash | ||
| run: | | ||
| if [ -f stress-output.log ] && grep -q "^================ phase" stress-output.log; then | ||
| { | ||
| echo "## Scheduler stress suite" | ||
| echo '```' | ||
| sed -n '/^================ phase/,$p' stress-output.log | sed -n '1,80p' | ||
| echo '```' | ||
| echo "_Throughput numbers are informational; only the accounting/enforcement assertions gate this job._" | ||
| } >> "$GITHUB_STEP_SUMMARY" | ||
| else | ||
| echo "## Scheduler stress suite: no phase report produced (failed before the run?)" >> "$GITHUB_STEP_SUMMARY" | ||
| fi | ||
|
|
||
| - name: Upload full output | ||
| if: always() | ||
| uses: actions/upload-artifact@v4 | ||
| with: | ||
| name: scheduler-stress-output | ||
| path: rust/stress-output.log | ||
| if-no-files-found: ignore | ||
| retention-days: 30 | ||
| Original file line number | Diff line number | Diff line change | ||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,204 @@ | ||||||||||||||||||
| --- | ||||||||||||||||||
| title: "Scheduler Stress Testing" | ||||||||||||||||||
| nav_order: 102 | ||||||||||||||||||
| parent: Reference | ||||||||||||||||||
| layout: default | ||||||||||||||||||
| linkTitle: "Scheduler Stress Testing" | ||||||||||||||||||
| date: 2026-06-12 | ||||||||||||||||||
| description: > | ||||||||||||||||||
| How to run, tune, and interpret the Rust scheduler's booking and accounting | ||||||||||||||||||
| stress suite, locally and in CI | ||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| # Scheduler Stress Testing | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Running and interpreting the booking + accounting stress suite | ||||||||||||||||||
|
Comment on lines
+13
to
+15
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Keep the heading hierarchy to one level at a time.
♻️ Proposed fix-### Running and interpreting the booking + accounting stress suite
+## Running and interpreting the booking + accounting stress suite📝 Committable suggestion
Suggested change
🧰 Tools🪛 markdownlint-cli2 (0.22.1)[warning] 15-15: Heading levels should only increment by one level at a time (MD001, heading-increment) 🤖 Prompt for AI AgentsSource: Linters/SAST tools |
||||||||||||||||||
|
|
||||||||||||||||||
| --- | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Overview | ||||||||||||||||||
|
|
||||||||||||||||||
| The stress suite (`rust/crates/scheduler/tests/stress_tests.rs`) exercises the | ||||||||||||||||||
| [Rust scheduler](/docs/developer-guide/scheduler/)'s full production dispatch | ||||||||||||||||||
| path at scale — `pipeline::run` end to end: Redis accounting bootstrap → | ||||||||||||||||||
| cluster feed → pending-job query → host matching → dispatch (proc insert, host | ||||||||||||||||||
| ledger decrement, frame start) — against a deterministic, bulk-seeded farm. | ||||||||||||||||||
|
|
||||||||||||||||||
| It is both a **correctness gate** and a **benchmark harness**: | ||||||||||||||||||
|
|
||||||||||||||||||
| - **Correctness**: after each phase an audit cross-checks every Redis `acct:*` | ||||||||||||||||||
| hash the run touched against `SUM(proc)` in Postgres (the canonical record — | ||||||||||||||||||
| see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/)), | ||||||||||||||||||
| and verifies cap enforcement and ledger invariants. | ||||||||||||||||||
| - **Benchmark**: it reports booking throughput (frames/s over the active | ||||||||||||||||||
| booking window), host-matching efficiency (wasted attempt %), host-cache hit | ||||||||||||||||||
| ratio, and Redis Lua op counts. | ||||||||||||||||||
|
|
||||||||||||||||||
| The suite runs two phases in one process: | ||||||||||||||||||
|
|
||||||||||||||||||
| | Phase | Shape | What it proves | | ||||||||||||||||||
| |---|---|---| | ||||||||||||||||||
| | **drain** | Farm capacity comfortably exceeds demand (default: 1,200 hosts, 6,000 frames) | ≥90% of frames book; throughput measured; accounting stays exact under concurrency, including the force-rollback compensation path | | ||||||||||||||||||
| | **saturation** | Demand vastly exceeds tight subscription bursts and per-job core caps (default: 400 hosts, 3,000 frames, 150-core bursts) | The Redis Lua cap check is the binding constraint: bookings stop exactly at burst, caps are never breached, rejections flow through the hot path | | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Invariants the audit asserts | ||||||||||||||||||
|
|
||||||||||||||||||
| 1. Every `acct:{sub,folder,job,layer,point}` hash holds exactly | ||||||||||||||||||
| `SUM(proc.int_cores_reserved)/100` cores and `SUM(proc.int_gpus_reserved)` | ||||||||||||||||||
| GPUs for its grouping — the same 5-dimension grouping and centicore→core | ||||||||||||||||||
| conversion the recompute loop uses. The suite pushes the recompute and | ||||||||||||||||||
| limit-reseed loops out to a 1-hour interval, so agreement here proves the | ||||||||||||||||||
| *dispatch hot path alone* (Lua book + force-rollback) kept Redis exact — | ||||||||||||||||||
| reconciliation never got a chance to paper over drift. | ||||||||||||||||||
| 2. Jobs with no bookings have no leaked Redis counters. | ||||||||||||||||||
| 3. Per-(show, alloc) booked cores never exceed the subscription burst. | ||||||||||||||||||
| 4. Per-job booked cores never exceed `job_resource.int_max_cores`. | ||||||||||||||||||
| 5. Host ledger: `int_cores - int_cores_idle == SUM(proc)` per host, never negative. | ||||||||||||||||||
| 6. One `RUNNING` frame per proc row. | ||||||||||||||||||
| 7. Trigger-maintained `job_stat.int_waiting_count` matches the frame table. | ||||||||||||||||||
| 8. After teardown, zero `stress_%` rows remain in any table the suite touches. | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Running locally | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Prerequisites | ||||||||||||||||||
|
|
||||||||||||||||||
| - A migrated Postgres on `localhost:5432` (`cuebot` / `cuebot_password`). | ||||||||||||||||||
| From the repo root: `docker compose up -d flyway` (brings up `db` and applies | ||||||||||||||||||
| migrations). If the Flyway image won't build in your environment (e.g. | ||||||||||||||||||
| SSL-inspecting proxies break its package mirrors), apply the migrations | ||||||||||||||||||
| directly — they are plain SQL: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| cd cuebot/src/main/resources/conf/ddl/postgres/migrations | ||||||||||||||||||
| for f in $(ls V*.sql | sort -t V -k2 -n); do | ||||||||||||||||||
| docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f" | ||||||||||||||||||
| done | ||||||||||||||||||
|
Comment on lines
+71
to
+75
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Don't hard-code the Compose container name.
♻️ Proposed fix- docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
+ docker compose exec -T db psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"📝 Committable suggestion
Suggested change
🤖 Prompt for AI Agents |
||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| - A running Docker daemon. The suite starts its own throwaway Redis container | ||||||||||||||||||
| via testcontainers; all accounting state dies with it. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Run | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| cd rust | ||||||||||||||||||
| cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| For meaningful benchmark numbers, use a release build: | ||||||||||||||||||
|
|
||||||||||||||||||
| ```bash | ||||||||||||||||||
| cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Tuning | ||||||||||||||||||
|
|
||||||||||||||||||
| | Env var | Default | Meaning | | ||||||||||||||||||
| |---|---|---| | ||||||||||||||||||
| | `STRESS_JOBS` | 300 | drain-phase job count | | ||||||||||||||||||
| | `STRESS_LAYERS` | 4 | drain-phase layers per job | | ||||||||||||||||||
| | `STRESS_FRAMES_PER_LAYER` | 5 | drain-phase frames per layer | | ||||||||||||||||||
| | `STRESS_HOSTS` | 1200 | drain-phase host count | | ||||||||||||||||||
| | `STRESS_TAGS` | 8 | drain-phase manual tag count | | ||||||||||||||||||
| | `STRESS_SAT_JOBS` | 150 | saturation-phase job count | | ||||||||||||||||||
| | `STRESS_SAT_HOSTS` | 400 | saturation-phase host count | | ||||||||||||||||||
| | `STRESS_DRAIN_TARGET` | 0.9 | fraction of drain frames that must book | | ||||||||||||||||||
| | `STRESS_STALL_SECS` | 30 | watchdog: pause jobs after this long without a new booking | | ||||||||||||||||||
| | `STRESS_TIMEOUT_SECS` | 600 | watchdog: per-phase hard timeout | | ||||||||||||||||||
|
|
||||||||||||||||||
| Seeding is deterministic for a given scale (fixed RNG seed), so consecutive | ||||||||||||||||||
| runs at the same scale book the same workload — diffs in throughput between | ||||||||||||||||||
| runs reflect the code, not the data. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Reading the report | ||||||||||||||||||
|
|
||||||||||||||||||
| ``` | ||||||||||||||||||
| ================ phase: drain ================ | ||||||||||||||||||
| frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12 | ||||||||||||||||||
| throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s) | ||||||||||||||||||
| matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98% | ||||||||||||||||||
| accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...] | ||||||||||||||||||
| audit : OK | ||||||||||||||||||
| ``` | ||||||||||||||||||
|
Comment on lines
+115
to
+122
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Label the sample-output fence. This fenced block is plain output, so give it a language tag ( ♻️ Proposed fix-```
+```text📝 Committable suggestion
Suggested change
🧰 Tools🪛 markdownlint-cli2 (0.22.1)[warning] 115-115: Fenced code blocks should have a language specified (MD040, fenced-code-language) 🤖 Prompt for AI AgentsSource: Linters/SAST tools |
||||||||||||||||||
|
|
||||||||||||||||||
| - **throughput** is measured from the first to the last `proc.ts_booked`, so it | ||||||||||||||||||
| excludes the post-drain shutdown tail of the feed (the `wall` figure includes it). | ||||||||||||||||||
| - **redis lua ops** above the dispatch count means the compensation path ran: | ||||||||||||||||||
| each failed dispatch costs one book plus one force-rollback. The audit | ||||||||||||||||||
| passing alongside a surplus is a *positive* signal — rollbacks netted out. | ||||||||||||||||||
| - In the saturation phase, expect large `subscription=` rejection counts and | ||||||||||||||||||
| every subscription pinned at exactly `burst/burst` cores. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Cleanup guarantees | ||||||||||||||||||
|
|
||||||||||||||||||
| All database rows the suite creates are prefixed `stress_`. The suite sweeps | ||||||||||||||||||
| that prefix **before** seeding (so leftovers from a crashed earlier run never | ||||||||||||||||||
| skew results) and **after** the run, then asserts zero residue. Redis state | ||||||||||||||||||
| needs no cleanup — the container is destroyed with the test. If a run is | ||||||||||||||||||
| killed hard (e.g. SIGKILL mid-phase), the next run's pre-sweep removes the | ||||||||||||||||||
| leftovers. | ||||||||||||||||||
|
|
||||||||||||||||||
| ## CI integration | ||||||||||||||||||
|
|
||||||||||||||||||
| The suite runs in the | ||||||||||||||||||
| [`scheduler-stress-pipeline.yml`](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/.github/workflows/scheduler-stress-pipeline.yml) | ||||||||||||||||||
| workflow. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### When it runs | ||||||||||||||||||
|
|
||||||||||||||||||
| | Trigger | Scale | Purpose | | ||||||||||||||||||
| |---|---|---| | ||||||||||||||||||
| | Pull request touching `rust/crates/scheduler/**`, `rust/crates/opencue-proto/**`, `rust/Cargo.toml`, the Postgres migrations, or the workflow itself | defaults | Gate scheduler changes on booking/accounting correctness | | ||||||||||||||||||
| | Nightly (cron, master) | defaults | Catch drift from changes outside the paths filter; daily throughput data point | | ||||||||||||||||||
| | Manual (`workflow_dispatch`) | custom via inputs | Benchmark a branch at chosen scale | | ||||||||||||||||||
|
|
||||||||||||||||||
| ### When it deliberately does not run | ||||||||||||||||||
|
|
||||||||||||||||||
| - **PRs that don't touch the scheduler or schema** (Python, CueGUI, CueWeb, | ||||||||||||||||||
| docs, …). The suite needs a migrated Postgres, a Docker daemon, and several | ||||||||||||||||||
| minutes of runner time; for those changes it produces zero signal. | ||||||||||||||||||
| - **As a performance gate.** Shared CI runners have noisy CPU/IO, so the | ||||||||||||||||||
| workflow never asserts on frames/s — throughput is published in the job's | ||||||||||||||||||
| step summary (and the full log as an artifact) for humans to eyeball trends. | ||||||||||||||||||
| Benchmark conclusions should come from local release-mode runs on quiet | ||||||||||||||||||
| hardware. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### What fails the job | ||||||||||||||||||
|
|
||||||||||||||||||
| Only correctness regressions: accounting drift between Redis and Postgres, a | ||||||||||||||||||
| cap breach, booking liveness failures (drain below target, or a saturated farm | ||||||||||||||||||
| producing no Redis rejections), a phase that never converges (hard-timeout), | ||||||||||||||||||
| or test data left behind after cleanup. | ||||||||||||||||||
|
|
||||||||||||||||||
| ### Launching a manual benchmark run | ||||||||||||||||||
|
|
||||||||||||||||||
| GitHub → Actions → *OpenCue Scheduler Stress Pipeline* → *Run workflow*, then | ||||||||||||||||||
| optionally override the job/host/frame counts and timeout. Results appear in | ||||||||||||||||||
| the run's step summary; the complete log is attached as the | ||||||||||||||||||
| `scheduler-stress-output` artifact (kept 30 days). | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Scope and limitations | ||||||||||||||||||
|
|
||||||||||||||||||
| - **RQD is not exercised.** The suite runs in `dry_run_mode`: the full booking | ||||||||||||||||||
| path executes (Redis Lua, proc insert, host ledger, frame start) but no gRPC | ||||||||||||||||||
| launch is sent. Frame *completion* and the Cuebot release path are out of | ||||||||||||||||||
| scope — see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/) | ||||||||||||||||||
| for how releases are reconciled. | ||||||||||||||||||
| - **Only scheduler-managed shows** (`show.b_scheduler_managed = true`) are | ||||||||||||||||||
| covered; Cuebot-managed accounting is Cuebot's test territory. | ||||||||||||||||||
| - The recompute / limit-reseed loops are intentionally dormant during the run | ||||||||||||||||||
| (see invariant 1); their CAS semantics are covered separately by | ||||||||||||||||||
| `tests/redis_integration.rs` (`--features redis-tests`). | ||||||||||||||||||
|
|
||||||||||||||||||
| ## Schema gotchas the suite encodes | ||||||||||||||||||
|
|
||||||||||||||||||
| These bit during development and are asserted/documented in the test code — | ||||||||||||||||||
| keep them in mind when extending the seeding: | ||||||||||||||||||
|
|
||||||||||||||||||
| - `alloc.str_tag` is `VARCHAR(24)` and `host.str_name` is `VARCHAR(30)`: | ||||||||||||||||||
| generated names must stay short. | ||||||||||||||||||
| - The pending-job query `INNER JOIN`s `folder_resource`: a folder without that | ||||||||||||||||||
| row makes every job in it silently unbookable. | ||||||||||||||||||
| - The `vs_waiting` view requires `job_resource.int_max_cores - int_cores >= 100` | ||||||||||||||||||
| (centicores): `int_max_cores = 0` does **not** mean "unlimited" on the query | ||||||||||||||||||
| path — use a large value instead. | ||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
Repository: AcademySoftwareFoundation/OpenCue
Length of output: 712
🏁 Script executed:
Repository: AcademySoftwareFoundation/OpenCue
Length of output: 543
🌐 Web query:
GitHub Actions security best practices pinning commit SHA vs version tag💡 Result:
In GitHub Actions, pinning to a full-length commit SHA is considered the primary security best practice, whereas using version tags introduces significant security risks due to their mutability [1][2][3]. Core Differences and Security Implications Commit SHA (Recommended): - Immutability: A commit SHA is a unique, immutable identifier for a specific version of code [4][2][5]. Once pinned, the action's code cannot be changed or replaced, ensuring that your workflow consistently runs the exact same code [2][5][6]. - Security: It protects against supply chain attacks where a malicious actor gains control of an action's repository and updates an existing tag (e.g., v1) to point to malicious code [7][6][8]. - Enforcement: GitHub now supports organization and repository-level policies that can mandate the use of full-length commit SHAs, causing any workflows using non-pinned references to fail [9][5][6]. Version Tag (Less Secure): - Mutability: Tags (e.g., v1.0.0, v1) are mutable references [10][5][8]. A maintainer or an attacker who gains access to the repository can delete or move a tag to point to a different commit [1][2][5]. - Convenience vs. Risk: While tagging is more convenient for tracking updates, it relies entirely on trusting the action's creator and the security of their infrastructure [1][2][3]. Best Practices for Implementation - Use Automated Dependency Management: Because pinning to a SHA can create maintenance overhead, use tools like Dependabot or Renovate to automate updates [7][6][8]. For example, Renovate offers presets (e.g., helpers:pinGitHubActionDigests) that keep commit SHAs up to date while maintaining a human-readable version string in comments [7][6]. - Include Metadata: When pinning to a SHA, include the version tag as a comment in your YAML file to maintain readability and make manual auditing easier [6]. - Implement Security Policies: Administrators should enable GitHub's organization-level policies to enforce full-length commit SHA pinning across all repositories to prevent the accidental use of mutable tags [9][5][6]. - Use Auditing Tools: Incorporate security tools such as zizmor, Scorecard, or other action-pinning checkers into your CI pipeline to automatically verify that dependencies are pinned correctly [6][8][11].
Citations:
Pin all GitHub Actions to commit SHAs and disable persisted credentials for checkout.
This workflow uses floating version tags (
@v4,@v2) for multiple actions (lines 88, 97, 145), creating supply chain integrity risks. Actions must be pinned to full commit SHAs to prevent malicious code injection if a tag is reassigned. Additionally,actions/checkoutshould disable credential persistence to limit token exposure.Suggested patch for line 88:
Patch
Apply the same pinning pattern to lines 97 and 145, or use a tool like Dependabot or Renovate to automate SHA updates.
🧰 Tools
🪛 zizmor (1.25.2)
[warning] 88-88: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false
(artipacked)
[error] 88-88: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Source: Linters/SAST tools