Skip to content

[scheduler] Add end-to-end stress tests to scheduler #2425

Closed
DiegoTavares wants to merge 3 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:stress-test
Closed

[scheduler] Add end-to-end stress tests to scheduler #2425
DiegoTavares wants to merge 3 commits into
AcademySoftwareFoundation:masterfrom
DiegoTavares:stress-test

Conversation

@DiegoTavares

@DiegoTavares DiegoTavares commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Booking + accounting stress suite for the Rust scheduler.
cluster feed → job query → host matching → dispatch) against a realistic farm in
two phases inside one process:

  1. drain — farm capacity comfortably exceeds demand. Benchmarks booking
    throughput (frames/s over the active booking window) and requires ≥90%
    (STRESS_DRAIN_TARGET) of frames to dispatch.
  2. saturation — demand vastly exceeds tight subscription bursts and per-job
    core caps, so the Redis Lua cap check becomes the binding constraint. Verifies
    enforcement (no booking above burst / job max-cores) and that rejections
    actually flowed through the accounting hot path.
    After each phase, an audit cross-checks every Redis acct:* hash against
    SUM(proc) in Postgres plus host/frame/stat invariants — with the recompute and
    limit-reseed loops pushed beyond the test horizon, agreement proves the dispatch
    hot path (Lua book + force-rollback) kept accounting exact on its own.

Requirements

  • Postgres with migrations applied, from the repo root: docker compose up -d flyway
  • A Docker daemon (the suite starts a throwaway Redis container via testcontainers;
    all Redis state dies with the container)

Summary by CodeRabbit

New Features

  • Added comprehensive scheduler stress-testing framework with GitHub Actions workflow and documentation for end-to-end booking and accounting validation.
  • Improved limit reseeding mechanism using unconditional writes for enhanced reliability.

Bug Fixes

  • Fixed frame version handling during dispatch compensation to prevent stale row updates.

Documentation

  • Added scheduler stress-testing developer guide with setup, tuning, and CI integration details.
  • Updated scheduler testing documentation with smoke and stress test instructions.

Tests

  • Added integration test suite for stress testing and frame compensation validation.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2da2f05e-f53a-462f-bd73-d32b4afb844b

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

Booking + accounting stress suite for the Rust scheduler.
cluster feed → job query → host matching → dispatch) against a realistic farm in
two phases inside one process:
1. **drain** — farm capacity comfortably exceeds demand. Benchmarks booking
throughput (frames/s over the active booking window) and requires ≥90%
(`STRESS_DRAIN_TARGET`) of frames to dispatch.
2. **saturation** — demand vastly exceeds tight subscription bursts and per-job
core caps, so the Redis Lua cap check becomes the binding constraint. Verifies
enforcement (no booking above burst / job max-cores) and that rejections
actually flowed through the accounting hot path.
After each phase, an audit cross-checks every Redis `acct:*` hash against
`SUM(proc)` in Postgres plus host/frame/stat invariants — with the recompute and
limit-reseed loops pushed beyond the test horizon, agreement proves the dispatch
hot path (Lua book + force-rollback) kept accounting exact on its own.

Requirements
 - Postgres with migrations applied, from the repo root: `docker compose up -d flyway`
 - A Docker daemon (the suite starts a throwaway Redis container via testcontainers;
   all Redis state dies with the container)
When a show was handed to an already-running scheduler, frames booked far
slower than under Cuebot despite folder/subscription headroom, with a flood
of resource-limit-exceeded{table=subscription} rejections.

Root cause: the booking hot path enforces caps against Redis-cached limits
that are seeded asynchronously, while Cuebot reads them live from PG. The
booking Lua treats a missing subscription burst as 0 = "reject all" —
folder/job caps fail open via their `> 0` guard, so subscription is the only
cap that fails closed on an unseeded limit. A show managed after startup was
never covered by the bootstrap reseed, and the periodic limit reseed —
CAS-guarded on acct:seq — was starved by force-rollback churn from flaky RQD
launches, so burst stayed unseeded and every booking was rejected. Units,
operators and field names between Cuebot and the scheduler otherwise match.

- Decouple the limit reseed from the acct:seq CAS. Limit fields are disjoint
  from the booked counters the hot path mutates, so they can be written with
  unconditional HSETs (RedisAccounting::reseed_unconditional) and no longer
  starve; the booked-counter recompute keeps its CAS.
- Seed a show's limits synchronously in the managed-shows refresh, before
  publishing it into the cache, so "managed implies seeded" holds and
  enforcement never sees an unseeded burst.

Also fix a related compensation bug: a dispatch bumps frame.int_version
V->V+1, but clear_frame was called with the stale pre-dispatch V, so its
guard never matched and a failed RQD launch left the frame stuck in RUNNING
forever (worst on manual/hostname tags pinned to flaky hosts). These flaky
launches are also the churn that starved the reseed above. update_frame_started
now returns the new version and the dispatcher threads it into
virtual_proc.frame.version for compensation.

Add tests/frame_dao_compensation.rs covering the clear_frame version boundary.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 11

🧹 Nitpick comments (1)
rust/crates/scheduler/tests/frame_dao_compensation.rs (1)

174-175: ⚡ Quick win

Use transaction-scoped replication-role override to avoid pooled connection leakage.

Line 174 and Line 192 use session-level SET session_replication_role on a pooled connection while suppressing errors. If the reset path ever fails, subsequent tests on that connection can run with triggers/FKs disabled. Prefer SET LOCAL ... = 'replica' so it auto-resets at transaction end.

Proposed change
-        let _ = sqlx::query("SET session_replication_role = 'replica'")
+        let _ = sqlx::query("SET LOCAL session_replication_role = 'replica'")
             .execute(&mut *tx)
             .await;
@@
-        let _ = sqlx::query("SET session_replication_role = 'origin'")
-            .execute(&mut *tx)
-            .await;

Also applies to: 192-193

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/frame_dao_compensation.rs` around lines 174 -
175, The session-level SET session_replication_role command at line 174 and line
192 uses a pooled connection and suppresses errors, which can cause subsequent
tests to run with triggers and foreign keys disabled if the reset fails. Replace
both instances of the sqlx::query call with SET session_replication_role =
'replica' to instead use SET LOCAL session_replication_role = 'replica', which
automatically resets at transaction end and prevents leakage of disabled
replication role state across pooled connection reuses in subsequent tests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/scheduler-stress-pipeline.yml:
- Line 88: Replace all floating version tags in the
scheduler-stress-pipeline.yml workflow with full commit SHAs to prevent supply
chain attacks. Update the actions/checkout action at line 88 from `@v4` to a
pinned commit SHA, and apply the same pinning pattern to the other floating-tag
actions at lines 97 and 145 (currently using `@v2`). Additionally, for the
actions/checkout action specifically, add the persist-credentials option set to
false to disable credential persistence and limit token exposure.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md`:
- Around line 115-122: The fenced code block containing the sample scheduler
output does not have a language tag, which violates markdownlint rule MD040. Add
a language tag to the opening triple backticks of the code fence containing the
phase: drain output. Use either `text` or `console` as the language tag (e.g.,
change the opening ``` to ```text or ```console) to properly label this as plain
output rather than code.
- Around line 13-15: The markdown heading hierarchy is incorrect, jumping from
level 1 (the page title "Scheduler Stress Testing") directly to level 3 (the
"Running and interpreting the booking + accounting stress suite" heading). This
violates markdown structure rules and makes navigation harder. Change the
"Running and interpreting the booking + accounting stress suite" heading from
### to ## to maintain proper sequential heading hierarchy from level 1 to level
2.
- Around line 71-75: The bash script hard-codes the Docker Compose container
name opencue-db-1, which only works with the default Compose project name and
breaks in other local setups. Replace the hard-coded container name in the
docker exec command with the Compose service name instead. Consider using docker
compose exec command which can reference the service directly by name, or define
the service name as a variable that can be reused across the migration loop.

In `@rust/crates/scheduler/src/accounting/managed_shows.rs`:
- Around line 88-100: When limit_reseed::reseed_limits fails for newly added
managed shows, the current code returns early without tracking these shows,
causing apply_booking to treat them as unmanaged and skip Redis enforcement
(fail-open behavior). Instead of returning immediately, store the failed `added`
shows in a pending or blocked state that can be checked in apply_booking to
enforce fail-closed behavior (reject bookings or prevent dispatch) until the
limits are successfully seeded on a subsequent tick.

In `@rust/crates/scheduler/tests/stress_tests.rs`:
- Around line 163-176: Add a field to the PhaseResult struct to capture pipeline
errors. Then, at the locations where pipeline errors occur (around line 275 and
other affected areas mentioned), instead of panicking or early returning, store
the error in the newly added PhaseResult field. This allows the existing failure
aggregation and cleanup code paths to execute properly, ensuring the watchdog is
stopped and final cleanup runs before the test completes, preventing pollution
of the shared Postgres test database with leftover stress_ rows.

In `@rust/crates/scheduler/tests/util/mod.rs`:
- Around line 142-157: The SQL SUM functions in both the WaitingFrameClause::All
and WaitingFrameClause::JobPrefix branches can return NULL when no rows match
the query condition, causing the .expect() calls to panic instead of gracefully
handling empty results. Modify both queries to wrap the SUM expressions with
COALESCE(..., 0) to ensure NULL values are converted to zero, and change the
sqlx::query_scalar type parameter from i32 to i64 for consistent handling of the
aggregate result.

In `@rust/crates/scheduler/tests/util/stress.rs`:
- Around line 1145-1148: The frame query uses a pattern matching approach that
does not account for the seeded frame naming convention, where frames are named
like `0001-{prefix}_...`. The current LIKE pattern with just the prefix will not
detect leftover frame rows because of the numeric prefix. Update the frame query
to use a pattern that matches the full seeded frame name shape, such as using
`____-{escaped_prefix}%` to account for the four-digit numeric prefix, or
alternatively count frames through stable stored IDs instead of name pattern
matching.
- Around line 821-851: The current leak check only validates acct:job keys for
absent entries by verifying their counters are zero. You need to extend this
audit to also cover acct:sub, acct:folder, acct:layer, and acct:point keys.
Query Redis for all keys matching each of these prefixes (similar to how you
query all_jobs from the database), and for each key not present in the expected
map, verify that both int_cores and int_gpus counters are zero. Add violations
to out.violations when nonzero values are found for any of these missing
accounting dimensions, using the same format as the existing acct:job leak
check.
- Around line 746-748: The redis_hash_i64 function uses unwrap_or(None) to
silently convert Redis read errors into None values, which then become 0, hiding
connection issues and missing counters during audits. Instead of using
unwrap_or(None) on the result of redis.hget(key, field).await, properly
propagate Redis errors by using expect or a Result return type so that actual
errors are surfaced rather than masked as zero values, ensuring the audit can
detect unhealthy Redis connections.
- Line 544: The LIKE pattern at line 544 using format!("{prefix}%") does not
escape underscore characters, which are SQL wildcards that match any single
character. This causes unintended matches across different test prefixes (for
example, stress_a_ would match the pattern stress_%). You need to escape the
underscore characters in the prefix variable by replacing them with escaped
versions (using backslash escaping) before creating the LIKE pattern, and then
add ESCAPE '\\' to all the affected LIKE clauses at lines 544, 674, 773, 1024,
1142 and their dependent queries (job, layer, host, alloc, show, dept, facility,
host_tag, and proc subqueries). Alternatively, switch those LIKE patterns to
literal comparisons using multiple OR conditions like LIKE '{prefix}' OR LIKE
'{prefix}_%' to match only names that literally start with the intended prefix.

---

Nitpick comments:
In `@rust/crates/scheduler/tests/frame_dao_compensation.rs`:
- Around line 174-175: The session-level SET session_replication_role command at
line 174 and line 192 uses a pooled connection and suppresses errors, which can
cause subsequent tests to run with triggers and foreign keys disabled if the
reset fails. Replace both instances of the sqlx::query call with SET
session_replication_role = 'replica' to instead use SET LOCAL
session_replication_role = 'replica', which automatically resets at transaction
end and prevents leakage of disabled replication role state across pooled
connection reuses in subsequent tests.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1575a799-04bc-40db-92e3-dfbc9c5fe47c

📥 Commits

Reviewing files that changed from the base of the PR and between ca5f70f and 6593712.

📒 Files selected for processing (15)
  • .github/workflows/scheduler-stress-pipeline.yml
  • docs/_docs/developer-guide/scheduler-stress-testing.md
  • docs/_docs/developer-guide/scheduler.md
  • rust/crates/scheduler/Cargo.toml
  • rust/crates/scheduler/src/accounting/limit_reseed.rs
  • rust/crates/scheduler/src/accounting/managed_shows.rs
  • rust/crates/scheduler/src/accounting/mod.rs
  • rust/crates/scheduler/src/accounting/redis_client.rs
  • rust/crates/scheduler/src/dao/frame_dao.rs
  • rust/crates/scheduler/src/pipeline/dispatcher/actor.rs
  • rust/crates/scheduler/tests/frame_dao_compensation.rs
  • rust/crates/scheduler/tests/stress_tests.rs
  • rust/crates/scheduler/tests/util.rs
  • rust/crates/scheduler/tests/util/mod.rs
  • rust/crates/scheduler/tests/util/stress.rs
💤 Files with no reviewable changes (1)
  • rust/crates/scheduler/tests/util.rs

--health-retries 10

steps:
- uses: actions/checkout@v4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/scheduler-stress-pipeline.yml | sed -n '85,100p'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 712


🏁 Script executed:

cat -n .github/workflows/scheduler-stress-pipeline.yml | sed -n '140,150p'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 543


🌐 Web query:

GitHub Actions security best practices pinning commit SHA vs version tag

💡 Result:

In GitHub Actions, pinning to a full-length commit SHA is considered the primary security best practice, whereas using version tags introduces significant security risks due to their mutability [1][2][3]. Core Differences and Security Implications Commit SHA (Recommended): - Immutability: A commit SHA is a unique, immutable identifier for a specific version of code [4][2][5]. Once pinned, the action's code cannot be changed or replaced, ensuring that your workflow consistently runs the exact same code [2][5][6]. - Security: It protects against supply chain attacks where a malicious actor gains control of an action's repository and updates an existing tag (e.g., v1) to point to malicious code [7][6][8]. - Enforcement: GitHub now supports organization and repository-level policies that can mandate the use of full-length commit SHAs, causing any workflows using non-pinned references to fail [9][5][6]. Version Tag (Less Secure): - Mutability: Tags (e.g., v1.0.0, v1) are mutable references [10][5][8]. A maintainer or an attacker who gains access to the repository can delete or move a tag to point to a different commit [1][2][5]. - Convenience vs. Risk: While tagging is more convenient for tracking updates, it relies entirely on trusting the action's creator and the security of their infrastructure [1][2][3]. Best Practices for Implementation - Use Automated Dependency Management: Because pinning to a SHA can create maintenance overhead, use tools like Dependabot or Renovate to automate updates [7][6][8]. For example, Renovate offers presets (e.g., helpers:pinGitHubActionDigests) that keep commit SHAs up to date while maintaining a human-readable version string in comments [7][6]. - Include Metadata: When pinning to a SHA, include the version tag as a comment in your YAML file to maintain readability and make manual auditing easier [6]. - Implement Security Policies: Administrators should enable GitHub's organization-level policies to enforce full-length commit SHA pinning across all repositories to prevent the accidental use of mutable tags [9][5][6]. - Use Auditing Tools: Incorporate security tools such as zizmor, Scorecard, or other action-pinning checkers into your CI pipeline to automatically verify that dependencies are pinned correctly [6][8][11].

Citations:


Pin all GitHub Actions to commit SHAs and disable persisted credentials for checkout.

This workflow uses floating version tags (@v4, @v2) for multiple actions (lines 88, 97, 145), creating supply chain integrity risks. Actions must be pinned to full commit SHAs to prevent malicious code injection if a tag is reassigned. Additionally, actions/checkout should disable credential persistence to limit token exposure.

Suggested patch for line 88:

Patch
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@<FULL_COMMIT_SHA>  # v4
+        with:
+          persist-credentials: false

Apply the same pinning pattern to lines 97 and 145, or use a tool like Dependabot or Renovate to automate SHA updates.

🧰 Tools
🪛 zizmor (1.25.2)

[warning] 88-88: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 88-88: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/scheduler-stress-pipeline.yml at line 88, Replace all
floating version tags in the scheduler-stress-pipeline.yml workflow with full
commit SHAs to prevent supply chain attacks. Update the actions/checkout action
at line 88 from `@v4` to a pinned commit SHA, and apply the same pinning pattern
to the other floating-tag actions at lines 97 and 145 (currently using `@v2`).
Additionally, for the actions/checkout action specifically, add the
persist-credentials option set to false to disable credential persistence and
limit token exposure.

Source: Linters/SAST tools

Comment on lines +13 to +15
# Scheduler Stress Testing

### Running and interpreting the booking + accounting stress suite

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the heading hierarchy to one level at a time.

### jumps from the page title (#) straight to level 3, which trips MD001 and makes the section structure harder to scan. Use ## here, or fold it into the next section if the extra heading is unnecessary.

♻️ Proposed fix
-### Running and interpreting the booking + accounting stress suite
+## Running and interpreting the booking + accounting stress suite
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Scheduler Stress Testing
### Running and interpreting the booking + accounting stress suite
# Scheduler Stress Testing
## Running and interpreting the booking + accounting stress suite
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 15-15: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 13 - 15,
The markdown heading hierarchy is incorrect, jumping from level 1 (the page
title "Scheduler Stress Testing") directly to level 3 (the "Running and
interpreting the booking + accounting stress suite" heading). This violates
markdown structure rules and makes navigation harder. Change the "Running and
interpreting the booking + accounting stress suite" heading from ### to ## to
maintain proper sequential heading hierarchy from level 1 to level 2.

Source: Linters/SAST tools

Comment on lines +71 to +75
```bash
cd cuebot/src/main/resources/conf/ddl/postgres/migrations
for f in $(ls V*.sql | sort -t V -k2 -n); do
docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't hard-code the Compose container name.

docker exec -i opencue-db-1 ... only works with the default Compose project name, so this fallback will break in many local setups. Use the Compose service name instead.

♻️ Proposed fix
-  docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
+  docker compose exec -T db psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
cd cuebot/src/main/resources/conf/ddl/postgres/migrations
for f in $(ls V*.sql | sort -t V -k2 -n); do
docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 71 - 75,
The bash script hard-codes the Docker Compose container name opencue-db-1, which
only works with the default Compose project name and breaks in other local
setups. Replace the hard-coded container name in the docker exec command with
the Compose service name instead. Consider using docker compose exec command
which can reference the service directly by name, or define the service name as
a variable that can be reused across the migration loop.

Comment on lines +115 to +122
```
================ phase: drain ================
frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit : OK
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Label the sample-output fence.

This fenced block is plain output, so give it a language tag (text or console) to satisfy markdownlint MD040.

♻️ Proposed fix
-```
+```text
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
================ phase: drain ================
frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit : OK
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 115-115: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 115 -
122, The fenced code block containing the sample scheduler output does not have
a language tag, which violates markdownlint rule MD040. Add a language tag to
the opening triple backticks of the code fence containing the phase: drain
output. Use either `text` or `console` as the language tag (e.g., change the
opening ``` to ```text or ```console) to properly label this as plain output
rather than code.

Source: Linters/SAST tools

Comment on lines +88 to +100
if !added.is_empty() {
if let Err(err) = limit_reseed::reseed_limits(&redis, &dao).await {
// Defer publishing this tick: a managed show that is
// not yet in the cache dispatches without Redis
// enforcement (silent over-count, healed by the next
// recompute) - strictly safer than enforcing against an
// unseeded burst. Retried on the next tick.
warn!(
"Limit seed for newly-managed show(s) {:?} failed; \
deferring cache publish to next tick: {err}",
added
);
return;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Do not leave newly managed shows fail-open after reseed failure.

When this branch returns, the cache still excludes added; apply_booking then treats those DB-managed shows as unmanaged and skips Redis accounting/cap enforcement until a later successful tick. Track these shows as pending/blocked and make booking fail closed, or otherwise prevent dispatch, until their limits are seeded.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/src/accounting/managed_shows.rs` around lines 88 - 100,
When limit_reseed::reseed_limits fails for newly added managed shows, the
current code returns early without tracking these shows, causing apply_booking
to treat them as unmanaged and skip Redis enforcement (fail-open behavior).
Instead of returning immediately, store the failed `added` shows in a pending or
blocked state that can be checked in apply_booking to enforce fail-closed
behavior (reject bookings or prevent dispatch) until the limits are successfully
seeded on a subsequent tick.

Comment on lines +142 to +157
WaitingFrameClause::All => {
sqlx::query_scalar::<_, i32>("SELECT sum(int_waiting_count)::INTEGER FROM job_stat")
.fetch_one(&*pool)
.await
.expect("Failed to query job stats") as usize
}
WaitingFrameClause::JobPrefix(prefix) => sqlx::query_scalar::<_, i32>(
"SELECT sum(job_stat.int_waiting_count)::INTEGER \
FROM job_stat JOIN job ON job.pk_job = job_stat.pk_job \
WHERE job.str_name LIKE $1",
)
.bind(format!("{}%", prefix))
.fetch_one(&*pool)
.await
.expect("Failed to query job stats")
as usize,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Handle empty SUM results as zero in waiting-count queries.

Line 143 and Line 148 can return NULL when no rows match, causing .expect(...) to panic. Use COALESCE(..., 0) and query as i64 to keep this helper deterministic.

Suggested patch
-        WaitingFrameClause::All => {
-            sqlx::query_scalar::<_, i32>("SELECT sum(int_waiting_count)::INTEGER FROM job_stat")
+        WaitingFrameClause::All => {
+            sqlx::query_scalar::<_, i64>(
+                "SELECT COALESCE(sum(int_waiting_count), 0)::BIGINT FROM job_stat",
+            )
                 .fetch_one(&*pool)
                 .await
                 .expect("Failed to query job stats") as usize
         }
-        WaitingFrameClause::JobPrefix(prefix) => sqlx::query_scalar::<_, i32>(
-            "SELECT sum(job_stat.int_waiting_count)::INTEGER \
+        WaitingFrameClause::JobPrefix(prefix) => sqlx::query_scalar::<_, i64>(
+            "SELECT COALESCE(sum(job_stat.int_waiting_count), 0)::BIGINT \
                    FROM job_stat JOIN job ON job.pk_job = job_stat.pk_job \
                    WHERE job.str_name LIKE $1",
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/util/mod.rs` around lines 142 - 157, The SQL SUM
functions in both the WaitingFrameClause::All and WaitingFrameClause::JobPrefix
branches can return NULL when no rows match the query condition, causing the
.expect() calls to panic instead of gracefully handling empty results. Modify
both queries to wrap the SUM expressions with COALESCE(..., 0) to ensure NULL
values are converted to zero, and change the sqlx::query_scalar type parameter
from i32 to i64 for consistent handling of the aggregate result.

}

// --- Stat / resource rows (set-based; tolerate trigger-created rows) ---
let like = format!("{prefix}%");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Description: Locate unescaped LIKE prefix scopes in the stress utility.
# Expected: every prefix-based LIKE either escapes %, _, and \ with an ESCAPE clause,
# or is replaced with a literal prefix predicate.
rg -n -C2 'format!\("\{prefix\}%"\)|format!\("\{\}%", farm\.prefix\)|LIKE \$1' rust/crates/scheduler/tests/util/stress.rs

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 6761


🏁 Script executed:

# Find where these functions are called and what prefixes are used
rg -n "seed_stress_data|audit_stress_data|watchdog_|clean_up_stress_data|residue_counts" rust/crates/scheduler/tests/util/stress.rs -A 2 | head -50

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 591


🏁 Script executed:

# Look for test files that call the stress functions
fd -e rs -path "*/tests/*" rust/crates/scheduler | head -20

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 246


🏁 Script executed:

# Search for prefix values passed to these functions
rg -n "stress_" rust/crates/scheduler/tests/ --type rust | grep -E "(prefix|PREFIX)" | head -30

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 682


🏁 Script executed:

# See how the stress functions are actually called in tests
rg -n "clean_up_stress_data|seed_stress_data|audit_stress_data|watchdog_stress_data|residue_counts" rust/crates/scheduler/tests/stress_tests.rs -B 2 -A 5

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 1298


🏁 Script executed:

# Check if there are other naming patterns that might conflict with underscores in prefixes
rg -n "str_name|str_tag" rust/crates/scheduler/tests/stress_tests.rs | head -20

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 148


🏁 Script executed:

# Look at the actual test to see what entities are being created
cat -n rust/crates/scheduler/tests/stress_tests.rs | head -100

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 5565


🏁 Script executed:

# Look at what data is generated and how it's named
rg -n "name|str_name" rust/crates/scheduler/tests/util/stress.rs | grep -E "(format|String|Name)" | head -30

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 596


🏁 Script executed:

# Check what naming patterns are used in seeding
cat -n rust/crates/scheduler/tests/util/stress.rs | sed -n '200,400p'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 9134


🏁 Script executed:

# Understand the SQL LIKE underscore wildcard impact more clearly
# Let's look for name generation patterns
rg -n "format!\(" rust/crates/scheduler/tests/util/stress.rs | grep -E "(name|tag|prefix)" | head -20

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 780


Escape unescaped _ wildcard characters in LIKE prefixes to prevent cross-test data pollution.

In SQL LIKE patterns, _ matches any single character. The stress prefixes (stress_d_, stress_s_, broad stress_) and generated names all contain underscores. The current pattern {prefix}% will incorrectly match names that do not literally start with the intended prefix—for example, stress_a_* would match the cleanup pattern stress_%, and stress_x_abc123_* would match the specific prefix pattern stress_d_abc123%. This causes concurrent tests or runs with different stress IDs to delete or audit each other's data.

Affects lines 544, 674, 773, 1024, 1142 and all dependent LIKE queries (see initial rg output for full scope: job, layer, host, alloc, show, dept, facility, host_tag, and proc subqueries). Escape the prefix with LIKE ... ESCAPE or switch to literal comparisons (LIKE '{prefix}' OR LIKE '{prefix}_%' etc.).

Possible helper
+fn escaped_like_prefix(prefix: &str) -> String {
+    let escaped = prefix
+        .replace('\\', r"\\")
+        .replace('%', r"\%")
+        .replace('_', r"\_");
+    format!("{escaped}%")
+}
+
-    let like = format!("{prefix}%");
+    let like = escaped_like_prefix(&prefix);

Then add ESCAPE '\\' to the affected LIKE $1 predicates.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/util/stress.rs` at line 544, The LIKE pattern at
line 544 using format!("{prefix}%") does not escape underscore characters, which
are SQL wildcards that match any single character. This causes unintended
matches across different test prefixes (for example, stress_a_ would match the
pattern stress_%). You need to escape the underscore characters in the prefix
variable by replacing them with escaped versions (using backslash escaping)
before creating the LIKE pattern, and then add ESCAPE '\\' to all the affected
LIKE clauses at lines 544, 674, 773, 1024, 1142 and their dependent queries
(job, layer, host, alloc, show, dept, facility, host_tag, and proc subqueries).
Alternatively, switch those LIKE patterns to literal comparisons using multiple
OR conditions like LIKE '{prefix}' OR LIKE '{prefix}_%' to match only names that
literally start with the intended prefix.

Comment on lines +746 to +748
async fn redis_hash_i64(redis: &mut ConnectionManager, key: &str, field: &str) -> i64 {
let value: Option<i64> = redis.hget(key, field).await.unwrap_or(None);
value.unwrap_or(0)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Fail the audit on Redis read errors.

unwrap_or(None) turns a failed HGET into 0, which can hide missing/leaked counters and make the audit report a clean result from an unhealthy Redis connection.

Proposed fix
 async fn redis_hash_i64(redis: &mut ConnectionManager, key: &str, field: &str) -> i64 {
-    let value: Option<i64> = redis.hget(key, field).await.unwrap_or(None);
+    let value: Option<i64> = redis
+        .hget(key, field)
+        .await
+        .expect("Redis accounting HGET failed during stress audit");
     value.unwrap_or(0)
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
async fn redis_hash_i64(redis: &mut ConnectionManager, key: &str, field: &str) -> i64 {
let value: Option<i64> = redis.hget(key, field).await.unwrap_or(None);
value.unwrap_or(0)
async fn redis_hash_i64(redis: &mut ConnectionManager, key: &str, field: &str) -> i64 {
let value: Option<i64> = redis
.hget(key, field)
.await
.expect("Redis accounting HGET failed during stress audit");
value.unwrap_or(0)
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/util/stress.rs` around lines 746 - 748, The
redis_hash_i64 function uses unwrap_or(None) to silently convert Redis read
errors into None values, which then become 0, hiding connection issues and
missing counters during audits. Instead of using unwrap_or(None) on the result
of redis.hget(key, field).await, properly propagate Redis errors by using expect
or a Result return type so that actual errors are surfaced rather than masked as
zero values, ensuring the audit can detect unhealthy Redis connections.

Comment on lines +821 to +851
for (key, (cores_centi, gpus)) in &expected {
let want_cores = cores_centi / CORE_MULT;
let got_cores = redis_hash_i64(redis, key, "int_cores").await;
let got_gpus = redis_hash_i64(redis, key, "int_gpus").await;
if got_cores != want_cores {
out.violations.push(format!(
"{key}: int_cores={got_cores} but SUM(proc) says {want_cores}"
));
}
if got_gpus != *gpus {
out.violations.push(format!(
"{key}: int_gpus={got_gpus} but SUM(proc) says {gpus}"
));
}
}

let all_jobs: Vec<String> = sqlx::query_scalar("SELECT pk_job FROM job WHERE pk_show = $1")
.bind(&show)
.fetch_all(pool)
.await
.expect("job list query");
for job in &all_jobs {
if !booked_jobs.contains_key(job) {
let got = redis_hash_i64(redis, &format!("acct:job:{job}"), "int_cores").await;
if got != 0 {
out.violations.push(format!(
"acct:job:{job}: int_cores={got} leaked for a job with no procs"
));
}
}
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Audit zero counters for every accounting dimension.

The absent-key leak check only covers acct:job:*. A leaked nonzero acct:sub, acct:folder, acct:layer, or acct:point counter with no matching proc rows never enters expected and currently passes the audit. Build the full set of farm accounting keys and assert both counters are zero when a key is absent from expected.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/util/stress.rs` around lines 821 - 851, The
current leak check only validates acct:job keys for absent entries by verifying
their counters are zero. You need to extend this audit to also cover acct:sub,
acct:folder, acct:layer, and acct:point keys. Query Redis for all keys matching
each of these prefixes (similar to how you query all_jobs from the database),
and for each key not present in the expected map, verify that both int_cores and
int_gpus counters are zero. Add violations to out.violations when nonzero values
are found for any of these missing accounting dimensions, using the same format
as the existing acct:job leak check.

Comment on lines +1145 to +1148
("job", "SELECT count(*) FROM job WHERE str_name LIKE $1"),
("layer", "SELECT count(*) FROM layer WHERE str_name LIKE $1"),
("frame", "SELECT count(*) FROM frame WHERE str_name LIKE $1"),
("host", "SELECT count(*) FROM host WHERE str_name LIKE $1"),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Count frame residue using the seeded frame-name shape.

Seeded frames are named like 0001-{prefix}_..., so frame.str_name LIKE '{prefix}%' will not detect leftover frame rows. Use a frame-specific pattern such as ____-{escaped_prefix}%, or count frames through stable stored IDs.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@rust/crates/scheduler/tests/util/stress.rs` around lines 1145 - 1148, The
frame query uses a pattern matching approach that does not account for the
seeded frame naming convention, where frames are named like `0001-{prefix}_...`.
The current LIKE pattern with just the prefix will not detect leftover frame
rows because of the numeric prefix. Update the frame query to use a pattern that
matches the full seeded frame name shape, such as using `____-{escaped_prefix}%`
to account for the four-digit numeric prefix, or alternatively count frames
through stable stored IDs instead of name pattern matching.

@DiegoTavares DiegoTavares force-pushed the stress-test branch 2 times, most recently from 1decf38 to e7089bb Compare June 23, 2026 21:28
@DiegoTavares

Copy link
Copy Markdown
Collaborator Author

This PR became a mess. I'm rebasing this work into two separate PRs and reorganizing their commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant