Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
150 changes: 150 additions & 0 deletions .github/workflows/scheduler-stress-pipeline.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,150 @@
name: OpenCue Scheduler Stress Pipeline

# Runs the scheduler booking + accounting stress suite
# (rust/crates/scheduler/tests/stress_tests.rs): a full pipeline::run against a
# seeded farm, with an end-of-run audit that cross-checks the Redis acct:*
# hashes against SUM(proc) in Postgres and asserts cap enforcement.
#
# When it runs — and when it deliberately doesn't:
# - Pull requests: only when the scheduler crate, its proto dependency, the
# DB migrations, or this workflow change. The suite needs a migrated
# Postgres plus a Redis container and takes several minutes — running it
# for Python/CueGUI/docs changes would burn runner time for zero signal.
# - Nightly on master: catches drift from changes that slipped past the
# paths filter (e.g. shared workspace dependencies) and gives a daily
# throughput data point under fixed scale.
# - Manually (workflow_dispatch): for benchmarking a branch at custom scale.
#
# What is a gate vs. what is informational:
# - The job FAILS on correctness regressions: accounting drift between Redis
# and Postgres, cap breaches (subscription burst / job max-cores), booking
# liveness (<90% drain, no saturation rejections), or leftover test data.
# - The throughput numbers (frames/s) are reported in the step summary but
# are NOT asserted on: shared runners are too noisy for perf gating. For
# real benchmarking run the suite locally in release mode (see
# docs/_docs/developer-guide/scheduler-stress-testing.md).

on:
pull_request:
branches: ["master"]
paths:
- "rust/crates/scheduler/**"
- "rust/crates/opencue-proto/**"
- "rust/Cargo.toml"
- "cuebot/src/main/resources/conf/ddl/postgres/migrations/**"
- ".github/workflows/scheduler-stress-pipeline.yml"
schedule:
# Nightly on master. Odd minute to avoid the top-of-hour scheduling rush.
- cron: "23 9 * * *"
workflow_dispatch:
inputs:
stress_jobs:
description: "Drain-phase job count (default 300)"
required: false
stress_hosts:
description: "Drain-phase host count (default 1200)"
required: false
stress_frames_per_layer:
description: "Drain-phase frames per layer (default 5)"
required: false
stress_timeout_secs:
description: "Per-phase hard timeout in seconds (default 600)"
required: false

concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

env:
CARGO_TERM_COLOR: always
CARGO_INCREMENTAL: 0
CARGO_NET_RETRY: 10
RUSTUP_MAX_RETRIES: 10

jobs:
stress:
runs-on: ubuntu-22.04
timeout-minutes: 45

# The suite's Postgres half. Redis is NOT listed here on purpose: the test
# starts its own throwaway Redis via testcontainers, so all accounting
# state is guaranteed to die with the test process.
services:
postgres:
image: postgres:15.1
env:
POSTGRES_USER: cuebot
POSTGRES_PASSWORD: cuebot_password
POSTGRES_DB: cuebot
ports:
- 5432:5432
options: >-
--health-cmd "pg_isready -U cuebot -d cuebot"
--health-interval 5s
--health-timeout 5s
--health-retries 10

steps:
- uses: actions/checkout@v4

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n .github/workflows/scheduler-stress-pipeline.yml | sed -n '85,100p'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 712


🏁 Script executed:

cat -n .github/workflows/scheduler-stress-pipeline.yml | sed -n '140,150p'

Repository: AcademySoftwareFoundation/OpenCue

Length of output: 543


🌐 Web query:

GitHub Actions security best practices pinning commit SHA vs version tag

💡 Result:

In GitHub Actions, pinning to a full-length commit SHA is considered the primary security best practice, whereas using version tags introduces significant security risks due to their mutability [1][2][3]. Core Differences and Security Implications Commit SHA (Recommended): - Immutability: A commit SHA is a unique, immutable identifier for a specific version of code [4][2][5]. Once pinned, the action's code cannot be changed or replaced, ensuring that your workflow consistently runs the exact same code [2][5][6]. - Security: It protects against supply chain attacks where a malicious actor gains control of an action's repository and updates an existing tag (e.g., v1) to point to malicious code [7][6][8]. - Enforcement: GitHub now supports organization and repository-level policies that can mandate the use of full-length commit SHAs, causing any workflows using non-pinned references to fail [9][5][6]. Version Tag (Less Secure): - Mutability: Tags (e.g., v1.0.0, v1) are mutable references [10][5][8]. A maintainer or an attacker who gains access to the repository can delete or move a tag to point to a different commit [1][2][5]. - Convenience vs. Risk: While tagging is more convenient for tracking updates, it relies entirely on trusting the action's creator and the security of their infrastructure [1][2][3]. Best Practices for Implementation - Use Automated Dependency Management: Because pinning to a SHA can create maintenance overhead, use tools like Dependabot or Renovate to automate updates [7][6][8]. For example, Renovate offers presets (e.g., helpers:pinGitHubActionDigests) that keep commit SHAs up to date while maintaining a human-readable version string in comments [7][6]. - Include Metadata: When pinning to a SHA, include the version tag as a comment in your YAML file to maintain readability and make manual auditing easier [6]. - Implement Security Policies: Administrators should enable GitHub's organization-level policies to enforce full-length commit SHA pinning across all repositories to prevent the accidental use of mutable tags [9][5][6]. - Use Auditing Tools: Incorporate security tools such as zizmor, Scorecard, or other action-pinning checkers into your CI pipeline to automatically verify that dependencies are pinned correctly [6][8][11].

Citations:


Pin all GitHub Actions to commit SHAs and disable persisted credentials for checkout.

This workflow uses floating version tags (@v4, @v2) for multiple actions (lines 88, 97, 145), creating supply chain integrity risks. Actions must be pinned to full commit SHAs to prevent malicious code injection if a tag is reassigned. Additionally, actions/checkout should disable credential persistence to limit token exposure.

Suggested patch for line 88:

Patch
-      - uses: actions/checkout@v4
+      - uses: actions/checkout@<FULL_COMMIT_SHA>  # v4
+        with:
+          persist-credentials: false

Apply the same pinning pattern to lines 97 and 145, or use a tool like Dependabot or Renovate to automate SHA updates.

🧰 Tools
🪛 zizmor (1.25.2)

[warning] 88-88: credential persistence through GitHub Actions artifacts (artipacked): does not set persist-credentials: false

(artipacked)


[error] 88-88: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/scheduler-stress-pipeline.yml at line 88, Replace all
floating version tags in the scheduler-stress-pipeline.yml workflow with full
commit SHAs to prevent supply chain attacks. Update the actions/checkout action
at line 88 from `@v4` to a pinned commit SHA, and apply the same pinning pattern
to the other floating-tag actions at lines 97 and 145 (currently using `@v2`).
Additionally, for the actions/checkout action specifically, add the
persist-credentials option set to false to disable credential persistence and
limit token exposure.

Source: Linters/SAST tools


- name: Install build dependencies
run: |
sudo apt-get update && sudo apt-get install -y libx11-dev protobuf-compiler libcurl4-openssl-dev postgresql-client
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs > rust-install.sh
bash ./rust-install.sh -y

- name: Cache cargo deps
uses: Swatinem/rust-cache@v2
with:
workspaces: rust
cache-on-failure: true

# Plain psql instead of the sandbox Flyway image: the migrations are
# versioned plain-SQL files, so applying them in numeric order is exactly
# what Flyway would do, without building a JDK image first.
- name: Apply database migrations
working-directory: cuebot/src/main/resources/conf/ddl/postgres/migrations
env:
PGPASSWORD: cuebot_password
run: |
for f in $(ls V*.sql | sort -t V -k2 -n); do
echo "== $f"
psql -q -v ON_ERROR_STOP=1 -h localhost -U cuebot -d cuebot -f "$f"
done

- name: Run stress suite
working-directory: rust
shell: bash
env:
STRESS_JOBS: ${{ inputs.stress_jobs }}
STRESS_HOSTS: ${{ inputs.stress_hosts }}
STRESS_FRAMES_PER_LAYER: ${{ inputs.stress_frames_per_layer }}
STRESS_TIMEOUT_SECS: ${{ inputs.stress_timeout_secs }}
run: |
cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture 2>&1 | tee stress-output.log

- name: Publish phase report
if: always()
working-directory: rust
shell: bash
run: |
if [ -f stress-output.log ] && grep -q "^================ phase" stress-output.log; then
{
echo "## Scheduler stress suite"
echo '```'
sed -n '/^================ phase/,$p' stress-output.log | sed -n '1,80p'
echo '```'
echo "_Throughput numbers are informational; only the accounting/enforcement assertions gate this job._"
} >> "$GITHUB_STEP_SUMMARY"
else
echo "## Scheduler stress suite: no phase report produced (failed before the run?)" >> "$GITHUB_STEP_SUMMARY"
fi

- name: Upload full output
if: always()
uses: actions/upload-artifact@v4
with:
name: scheduler-stress-output
path: rust/stress-output.log
if-no-files-found: ignore
retention-days: 30
204 changes: 204 additions & 0 deletions docs/_docs/developer-guide/scheduler-stress-testing.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,204 @@
---
title: "Scheduler Stress Testing"
nav_order: 102
parent: Reference
layout: default
linkTitle: "Scheduler Stress Testing"
date: 2026-06-12
description: >
How to run, tune, and interpret the Rust scheduler's booking and accounting
stress suite, locally and in CI
---

# Scheduler Stress Testing

### Running and interpreting the booking + accounting stress suite
Comment on lines +13 to +15

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the heading hierarchy to one level at a time.

### jumps from the page title (#) straight to level 3, which trips MD001 and makes the section structure harder to scan. Use ## here, or fold it into the next section if the extra heading is unnecessary.

♻️ Proposed fix
-### Running and interpreting the booking + accounting stress suite
+## Running and interpreting the booking + accounting stress suite
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Scheduler Stress Testing
### Running and interpreting the booking + accounting stress suite
# Scheduler Stress Testing
## Running and interpreting the booking + accounting stress suite
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 15-15: Heading levels should only increment by one level at a time
Expected: h2; Actual: h3

(MD001, heading-increment)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 13 - 15,
The markdown heading hierarchy is incorrect, jumping from level 1 (the page
title "Scheduler Stress Testing") directly to level 3 (the "Running and
interpreting the booking + accounting stress suite" heading). This violates
markdown structure rules and makes navigation harder. Change the "Running and
interpreting the booking + accounting stress suite" heading from ### to ## to
maintain proper sequential heading hierarchy from level 1 to level 2.

Source: Linters/SAST tools


---

## Overview

The stress suite (`rust/crates/scheduler/tests/stress_tests.rs`) exercises the
[Rust scheduler](/docs/developer-guide/scheduler/)'s full production dispatch
path at scale — `pipeline::run` end to end: Redis accounting bootstrap →
cluster feed → pending-job query → host matching → dispatch (proc insert, host
ledger decrement, frame start) — against a deterministic, bulk-seeded farm.

It is both a **correctness gate** and a **benchmark harness**:

- **Correctness**: after each phase an audit cross-checks every Redis `acct:*`
hash the run touched against `SUM(proc)` in Postgres (the canonical record —
see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/)),
and verifies cap enforcement and ledger invariants.
- **Benchmark**: it reports booking throughput (frames/s over the active
booking window), host-matching efficiency (wasted attempt %), host-cache hit
ratio, and Redis Lua op counts.

The suite runs two phases in one process:

| Phase | Shape | What it proves |
|---|---|---|
| **drain** | Farm capacity comfortably exceeds demand (default: 1,200 hosts, 6,000 frames) | ≥90% of frames book; throughput measured; accounting stays exact under concurrency, including the force-rollback compensation path |
| **saturation** | Demand vastly exceeds tight subscription bursts and per-job core caps (default: 400 hosts, 3,000 frames, 150-core bursts) | The Redis Lua cap check is the binding constraint: bookings stop exactly at burst, caps are never breached, rejections flow through the hot path |

### Invariants the audit asserts

1. Every `acct:{sub,folder,job,layer,point}` hash holds exactly
`SUM(proc.int_cores_reserved)/100` cores and `SUM(proc.int_gpus_reserved)`
GPUs for its grouping — the same 5-dimension grouping and centicore→core
conversion the recompute loop uses. The suite pushes the recompute and
limit-reseed loops out to a 1-hour interval, so agreement here proves the
*dispatch hot path alone* (Lua book + force-rollback) kept Redis exact —
reconciliation never got a chance to paper over drift.
2. Jobs with no bookings have no leaked Redis counters.
3. Per-(show, alloc) booked cores never exceed the subscription burst.
4. Per-job booked cores never exceed `job_resource.int_max_cores`.
5. Host ledger: `int_cores - int_cores_idle == SUM(proc)` per host, never negative.
6. One `RUNNING` frame per proc row.
7. Trigger-maintained `job_stat.int_waiting_count` matches the frame table.
8. After teardown, zero `stress_%` rows remain in any table the suite touches.

## Running locally

### Prerequisites

- A migrated Postgres on `localhost:5432` (`cuebot` / `cuebot_password`).
From the repo root: `docker compose up -d flyway` (brings up `db` and applies
migrations). If the Flyway image won't build in your environment (e.g.
SSL-inspecting proxies break its package mirrors), apply the migrations
directly — they are plain SQL:

```bash
cd cuebot/src/main/resources/conf/ddl/postgres/migrations
for f in $(ls V*.sql | sort -t V -k2 -n); do
docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
done
Comment on lines +71 to +75

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don't hard-code the Compose container name.

docker exec -i opencue-db-1 ... only works with the default Compose project name, so this fallback will break in many local setups. Use the Compose service name instead.

♻️ Proposed fix
-  docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
+  docker compose exec -T db psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```bash
cd cuebot/src/main/resources/conf/ddl/postgres/migrations
for f in $(ls V*.sql | sort -t V -k2 -n); do
docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 71 - 75,
The bash script hard-codes the Docker Compose container name opencue-db-1, which
only works with the default Compose project name and breaks in other local
setups. Replace the hard-coded container name in the docker exec command with
the Compose service name instead. Consider using docker compose exec command
which can reference the service directly by name, or define the service name as
a variable that can be reused across the migration loop.

```

- A running Docker daemon. The suite starts its own throwaway Redis container
via testcontainers; all accounting state dies with it.

### Run

```bash
cd rust
cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture
```

For meaningful benchmark numbers, use a release build:

```bash
cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture
```

### Tuning

| Env var | Default | Meaning |
|---|---|---|
| `STRESS_JOBS` | 300 | drain-phase job count |
| `STRESS_LAYERS` | 4 | drain-phase layers per job |
| `STRESS_FRAMES_PER_LAYER` | 5 | drain-phase frames per layer |
| `STRESS_HOSTS` | 1200 | drain-phase host count |
| `STRESS_TAGS` | 8 | drain-phase manual tag count |
| `STRESS_SAT_JOBS` | 150 | saturation-phase job count |
| `STRESS_SAT_HOSTS` | 400 | saturation-phase host count |
| `STRESS_DRAIN_TARGET` | 0.9 | fraction of drain frames that must book |
| `STRESS_STALL_SECS` | 30 | watchdog: pause jobs after this long without a new booking |
| `STRESS_TIMEOUT_SECS` | 600 | watchdog: per-phase hard timeout |

Seeding is deterministic for a given scale (fixed RNG seed), so consecutive
runs at the same scale book the same workload — diffs in throughput between
runs reflect the code, not the data.

### Reading the report

```
================ phase: drain ================
frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit : OK
```
Comment on lines +115 to +122

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Label the sample-output fence.

This fenced block is plain output, so give it a language tag (text or console) to satisfy markdownlint MD040.

♻️ Proposed fix
-```
+```text
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
```
================ phase: drain ================
frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit : OK
```
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 115-115: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/_docs/developer-guide/scheduler-stress-testing.md` around lines 115 -
122, The fenced code block containing the sample scheduler output does not have
a language tag, which violates markdownlint rule MD040. Add a language tag to
the opening triple backticks of the code fence containing the phase: drain
output. Use either `text` or `console` as the language tag (e.g., change the
opening ``` to ```text or ```console) to properly label this as plain output
rather than code.

Source: Linters/SAST tools


- **throughput** is measured from the first to the last `proc.ts_booked`, so it
excludes the post-drain shutdown tail of the feed (the `wall` figure includes it).
- **redis lua ops** above the dispatch count means the compensation path ran:
each failed dispatch costs one book plus one force-rollback. The audit
passing alongside a surplus is a *positive* signal — rollbacks netted out.
- In the saturation phase, expect large `subscription=` rejection counts and
every subscription pinned at exactly `burst/burst` cores.

### Cleanup guarantees

All database rows the suite creates are prefixed `stress_`. The suite sweeps
that prefix **before** seeding (so leftovers from a crashed earlier run never
skew results) and **after** the run, then asserts zero residue. Redis state
needs no cleanup — the container is destroyed with the test. If a run is
killed hard (e.g. SIGKILL mid-phase), the next run's pre-sweep removes the
leftovers.

## CI integration

The suite runs in the
[`scheduler-stress-pipeline.yml`](https://github.com/AcademySoftwareFoundation/OpenCue/blob/master/.github/workflows/scheduler-stress-pipeline.yml)
workflow.

### When it runs

| Trigger | Scale | Purpose |
|---|---|---|
| Pull request touching `rust/crates/scheduler/**`, `rust/crates/opencue-proto/**`, `rust/Cargo.toml`, the Postgres migrations, or the workflow itself | defaults | Gate scheduler changes on booking/accounting correctness |
| Nightly (cron, master) | defaults | Catch drift from changes outside the paths filter; daily throughput data point |
| Manual (`workflow_dispatch`) | custom via inputs | Benchmark a branch at chosen scale |

### When it deliberately does not run

- **PRs that don't touch the scheduler or schema** (Python, CueGUI, CueWeb,
docs, …). The suite needs a migrated Postgres, a Docker daemon, and several
minutes of runner time; for those changes it produces zero signal.
- **As a performance gate.** Shared CI runners have noisy CPU/IO, so the
workflow never asserts on frames/s — throughput is published in the job's
step summary (and the full log as an artifact) for humans to eyeball trends.
Benchmark conclusions should come from local release-mode runs on quiet
hardware.

### What fails the job

Only correctness regressions: accounting drift between Redis and Postgres, a
cap breach, booking liveness failures (drain below target, or a saturated farm
producing no Redis rejections), a phase that never converges (hard-timeout),
or test data left behind after cleanup.

### Launching a manual benchmark run

GitHub → Actions → *OpenCue Scheduler Stress Pipeline* → *Run workflow*, then
optionally override the job/host/frame counts and timeout. Results appear in
the run's step summary; the complete log is attached as the
`scheduler-stress-output` artifact (kept 30 days).

## Scope and limitations

- **RQD is not exercised.** The suite runs in `dry_run_mode`: the full booking
path executes (Redis Lua, proc insert, host ledger, frame start) but no gRPC
launch is sent. Frame *completion* and the Cuebot release path are out of
scope — see the [Redis-Backed Accounting Reference](/docs/developer-guide/redis-accounting/)
for how releases are reconciled.
- **Only scheduler-managed shows** (`show.b_scheduler_managed = true`) are
covered; Cuebot-managed accounting is Cuebot's test territory.
- The recompute / limit-reseed loops are intentionally dormant during the run
(see invariant 1); their CAS semantics are covered separately by
`tests/redis_integration.rs` (`--features redis-tests`).

## Schema gotchas the suite encodes

These bit during development and are asserted/documented in the test code —
keep them in mind when extending the seeding:

- `alloc.str_tag` is `VARCHAR(24)` and `host.str_name` is `VARCHAR(30)`:
generated names must stay short.
- The pending-job query `INNER JOIN`s `folder_resource`: a folder without that
row makes every job in it silently unbookable.
- The `vs_waiting` view requires `job_resource.int_max_cores - int_cores >= 100`
(centicores): `int_max_cores = 0` does **not** mean "unlimited" on the query
path — use a large value instead.
20 changes: 11 additions & 9 deletions docs/_docs/developer-guide/scheduler.md
Original file line number Diff line number Diff line change
Expand Up @@ -998,18 +998,20 @@ cargo build --release -p scheduler
cargo test -p scheduler
```

**Integration tests** (requires database):
**Smoke tests** (requires a migrated local Postgres, see `docker compose up -d flyway`):
```bash
# Set up test database
export DATABASE_URL=postgresql://user:pass@localhost/test_db

# Run integration tests
cargo test -p scheduler --test integration_tests
cargo test -p scheduler --features smoke-tests --test smoke_tests
```

**Stress tests**:
**Stress tests** (requires a migrated local Postgres plus a Docker daemon for
the throwaway Redis container; see the
[Scheduler Stress Testing](/docs/developer-guide/scheduler-stress-testing/)
guide for tuning, CI behavior, and how to read the report):
```bash
cargo test -p scheduler --test stress_tests --release -- --nocapture
cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture

# Release mode for representative benchmark numbers
cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture
```

### Code Quality
Expand Down Expand Up @@ -1128,7 +1130,7 @@ INFO scheduler: Host cache size: 50000 hosts
**CPU profiling**:
```bash
cargo install samply
samply record cargo test --test stress_tests --release -- --nocapture
samply record cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture
```

## API and Extensibility
Expand Down
Loading
Loading