[scheduler] Resource accounting on Redis #2323

Draft
DiegoTavares wants to merge 12 commits into AcademySoftwareFoundation:master from DiegoTavares:sched_accounting_redis
@DiegoTavares
Collaborator

WIP

Scc silently drops an insert when the key already exists.
Also ensure all_sleeping_rounds is reset at the end of each full iteration.
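
To illustrate the insert behavior described above, here is a minimal sketch that models insert-if-absent semantics with the standard library's HashMap rather than with Scc itself. The helper names `insert_if_absent` and `upsert` are hypothetical and are not taken from this PR or from the Scc API; the point is only that ignoring the result of an insert-if-absent call leaves the stale value in place.

```rust
use std::collections::hash_map::Entry;
use std::collections::HashMap;

/// Insert-if-absent: keeps the existing value and hands the rejected pair
/// back in `Err`. Hypothetical helper that models the behavior noted above.
pub fn insert_if_absent(
    map: &mut HashMap<String, u32>,
    key: &str,
    value: u32,
) -> Result<(), (String, u32)> {
    match map.entry(key.to_string()) {
        // Key already present: the new value is NOT stored.
        Entry::Occupied(slot) => Err((slot.key().clone(), value)),
        Entry::Vacant(slot) => {
            slot.insert(value);
            Ok(())
        }
    }
}

/// Upsert: always stores the new value and returns the previous one, if any.
pub fn upsert(map: &mut HashMap<String, u32>, key: &str, value: u32) -> Option<u32> {
    map.insert(key.to_string(), value)
}
```

A caller that writes `let _ = insert_if_absent(...)` never learns that the update was dropped, so either the `Err` has to be handled or an upsert-style call is needed when the intent is to overwrite.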
…queries

Phase 1 scheduler quick wins: empty-cluster sleep, LIMIT, refresh guard

- Empty-cluster sleep now configurable (cluster_empty_sleep, default 30s).
- QUERY_PENDING_BY_SHOW_FACILITY_TAG capped via max_jobs_per_cluster_pass
  (default 20). Strict ORDER BY priority DESC; low-priority jobs deferred.
- HostCacheService skips overlapping refresh ticks via an AtomicBool guard.
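
The refresh guard in the last bullet can be sketched as follows. This is an assumption-laden illustration, not the code from this PR: `RefreshGuard` and `try_refresh` are hypothetical names, and only the AtomicBool pattern itself is modeled. A tick that arrives while a refresh is still running fails the compare-and-exchange and is skipped.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical stand-in for the cache service; only the guard is modeled.
pub struct RefreshGuard {
    in_progress: AtomicBool,
}

impl RefreshGuard {
    pub const fn new() -> Self {
        Self { in_progress: AtomicBool::new(false) }
    }

    /// Runs `refresh` unless another refresh is already in flight.
    /// Returns true if the refresh ran, false if this tick was skipped.
    pub fn try_refresh<F: FnOnce()>(&self, refresh: F) -> bool {
        // Atomically flip false -> true; failure means a refresh is running.
        if self
            .in_progress
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_err()
        {
            return false;
        }

        // Clears the flag when dropped, even if `refresh` panics.
        struct Reset<'a>(&'a AtomicBool);
        impl<'a> Drop for Reset<'a> {
            fn drop(&mut self) {
                self.0.store(false, Ordering::Release);
            }
        }
        let _reset = Reset(&self.in_progress);

        refresh();
        true
    }
}
```

Resetting the flag from a drop guard rather than with a trailing `store` keeps a panicking refresh from leaving the guard stuck in the "running" state.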

Add V40 indexes for scheduler pending-job query

GIN on layer.str_tags (array overlap), composite partial on
job(pk_show, pk_facility, str_state, b_paused) WHERE PENDING/not paused,
partial on layer_stat(pk_layer) WHERE int_waiting_count > 0.

The migration uses plain CREATE INDEX because Flyway 5.2.0 wraps each
migration in a transaction, and Postgres rejects CREATE INDEX CONCURRENTLY
inside a transaction block. When running against populated production
tables, create the indexes with CONCURRENTLY via psql before running Flyway.

Drop LOWER(pk_facility) hack and rewrite QUERY_PENDING with EXISTS

Scheduler-side facility id is now String (was Uuid). The dao::helpers
parse_uuid path was lower-casing every facility round-trip, which forced
LOWER() compares in 6 SQL sites. Cuebot writes canonical casing on insert,
so a String swap removes the hack at the source.

QUERY_PENDING_BY_SHOW_FACILITY_TAG rewritten to a single bookable_shows
CTE plus EXISTS subquery, removing the layer ⨝ layer_stat ⨝ DISTINCT
cardinality blowup. Folder cap split into outer early-out and per-layer
fit inside the EXISTS.

Shows can now be moved to the scheduler using cueadmin:

```
cueadmin -show foo -setSchedulerManaged true
```

The following properties have been removed:

```
dispatcher.scheduler_manages_resources=false
dispatcher.exclusion_list=show1,show2:facility.allocation,show3:facility.allocation
```
@coderabbitai
Contributor

coderabbitai Bot commented May 14, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: efae4d01-bbec-4fdc-b215-cc26bce3a615

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

When a proc belonging to a show managed by the Scheduler is destroyed, its core and gpu counts are
sent to Redis to update the cached version of the resource accounting tables. See
design/SCHED_REDIS_DECISIONS.md for more details.
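
The accounting update described above can be sketched roughly as follows. Everything here is an assumption made for illustration: an in-memory map stands in for Redis, and `AccountingCache`, `Usage`, `on_proc_booked`, and `on_proc_destroyed` are hypothetical names. The booking counterpart and the clamp at zero are also assumptions; the actual key layout and commands are described in the design document, not here.

```rust
use std::collections::HashMap;

/// Hypothetical per-show counters for cores and GPUs currently in use.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct Usage {
    pub cores: i64,
    pub gpus: i64,
}

/// In-memory stand-in for the Redis-cached accounting tables (assumption).
pub struct AccountingCache {
    by_show: HashMap<String, Usage>,
}

impl AccountingCache {
    pub fn new() -> Self {
        Self { by_show: HashMap::new() }
    }

    /// Assumed counterpart to destruction: add the proc's resources.
    pub fn on_proc_booked(&mut self, show: &str, cores: i64, gpus: i64) {
        let usage = self.by_show.entry(show.to_string()).or_default();
        usage.cores += cores;
        usage.gpus += gpus;
    }

    /// On proc destruction, give the proc's cores and GPUs back.
    /// Clamping at zero is a choice made for this sketch only.
    pub fn on_proc_destroyed(&mut self, show: &str, cores: i64, gpus: i64) {
        if let Some(usage) = self.by_show.get_mut(show) {
            usage.cores = (usage.cores - cores).max(0);
            usage.gpus = (usage.gpus - gpus).max(0);
        }
    }

    /// Current usage for a show; unknown shows report zero.
    pub fn usage(&self, show: &str) -> Usage {
        self.by_show.get(show).copied().unwrap_or_default()
    }
}
```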
@DiegoTavares DiegoTavares force-pushed the sched_accounting_redis branch from 96b309f to d14e52e Compare May 20, 2026 22:58