Commit 5865269

Release minor fixes - Monday May 4 2026 (#793)
Change summary:

### cortex v0.0.46 (sha-ab6eb45d)

Non-breaking changes:
- Fix capacity filter to correctly account for multi-VM CommittedResource reservation slots: confirmed VMs are now summed (not just the last one), blocks are clamped to zero when confirmed exceeds slot size, and spec-only VMs larger than remaining slot are fully covered
- Expose `prometheusDatasourceControllerParallelReconciles` config option to allow parallel reconciles in the Prometheus datasource controller, reducing initial sync latency
- Remove `Conf` field from PrometheusDatasourceReconciler; config is now loaded internally via `conf.GetConfig` during `SetupWithManager`
- Add operator-controlled per-resource-type config (`flavorGroupResourceConfig`) for committed resources, replacing runtime derivation from flavor group metadata; supports wildcard (`*`) catch-all for unknown groups
- Propagate `AnnotationCreatorRequestID` from the change-commitments API to the CommittedResource CRD and through the reservation controller for end-to-end request tracing

### cortex-nova v0.0.59 (sha-ab6eb45d)

Includes updated chart cortex v0.0.46.

Non-breaking changes:
- Remove all committed resource related Prometheus alerts (info API, change API, usage API, capacity API, and syncer alerts)
- Add `flavorGroupResourceConfig` to cortex-nova values.yaml with a wildcard default that sets `hasCapacity: true` for ram, cores, and instances
2 parents d3f4be7 + 2c7ae89 commit 5865269
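
The capacity filter fix in the summary can be illustrated with a minimal sketch. It is not the code from this commit: the function name `remainingBlock`, the integer units, and the slice of confirmed VM sizes are assumptions made for illustration. It only shows the two behaviors the summary names, summing every confirmed VM in a slot and clamping the remaining block at zero.

```go
package main

import "fmt"

// remainingBlock is a hypothetical helper, not part of cortex.
// It returns how much of a reservation slot is still blocked after
// subtracting ALL confirmed VMs (not only the last one), and never
// returns a negative value.
func remainingBlock(slotSize int64, confirmed []int64) int64 {
	var used int64
	for _, c := range confirmed {
		used += c // sum every confirmed VM in the slot
	}
	if used >= slotSize {
		return 0 // clamp: confirmed usage meets or exceeds the slot
	}
	return slotSize - used
}

func main() {
	fmt.Println(remainingBlock(16, []int64{4, 8}))  // prints 4
	fmt.Println(remainingBlock(16, []int64{12, 8})) // prints 0
}
```

Considering only the last confirmed VM in the first call would leave 8 units blocked instead of 4, which is the kind of over-blocking the fix description points at.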

29 files changed

Lines changed: 1295 additions & 1283 deletions

CHANGELOG.md

Lines changed: 19 additions & 0 deletions
@@ -1,5 +1,24 @@
 # Changelog
 
+## 2026-05-04 — [#793](https://github.com/cobaltcore-dev/cortex/pull/793)
+
+### cortex v0.0.46 (sha-ab6eb45d)
+
+Non-breaking changes:
+- Fix capacity filter to correctly account for multi-VM CommittedResource reservation slots — confirmed VMs are now summed (not just the last one), blocks are clamped to zero when confirmed exceeds slot size, and spec-only VMs larger than remaining slot are fully covered
+- Expose `prometheusDatasourceControllerParallelReconciles` config option to allow parallel reconciles in the Prometheus datasource controller, reducing initial sync latency
+- Remove `Conf` field from PrometheusDatasourceReconciler — config is now loaded internally via `conf.GetConfig` during `SetupWithManager`
+- Add operator-controlled per-resource-type config (`flavorGroupResourceConfig`) for committed resources, replacing runtime derivation from flavor group metadata; supports wildcard (`*`) catch-all for unknown groups
+- Propagate `AnnotationCreatorRequestID` from the change-commitments API to the CommittedResource CRD and through the reservation controller for end-to-end request tracing
+
+### cortex-nova v0.0.59 (sha-ab6eb45d)
+
+Includes updated chart cortex v0.0.46.
+
+Non-breaking changes:
+- Remove all committed resource related Prometheus alerts (info API, change API, usage API, capacity API, and syncer alerts)
+- Add `flavorGroupResourceConfig` to cortex-nova values.yaml with a wildcard default that sets `hasCapacity: true` for ram, cores, and instances
+
 ## 2026-05-04 — [#779](https://github.com/cobaltcore-dev/cortex/pull/779)
 
 ### cortex v0.0.45 (sha-1fb35660)
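
The wildcard behavior of `flavorGroupResourceConfig` described in the changelog entry can be sketched as a two-step lookup. The types and the `Lookup` method below are hypothetical and only mirror the wording of the changelog (per-group, per-resource-type settings with a `*` catch-all for unknown groups); the real schema in cortex may differ.

```go
package main

import "fmt"

// ResourceConfig is a hypothetical per-resource-type setting.
type ResourceConfig struct {
	HasCapacity bool
}

// FlavorGroupResourceConfig maps flavor group -> resource type -> config.
// This shape is an assumption made for illustration.
type FlavorGroupResourceConfig map[string]map[string]ResourceConfig

// Lookup prefers an exact flavor group entry and falls back to the "*"
// entry only when the group itself is unknown.
func (c FlavorGroupResourceConfig) Lookup(group, resource string) (ResourceConfig, bool) {
	byResource, ok := c[group]
	if !ok {
		byResource, ok = c["*"] // catch-all for unknown groups
		if !ok {
			return ResourceConfig{}, false
		}
	}
	rc, found := byResource[resource]
	return rc, found
}

func main() {
	cfg := FlavorGroupResourceConfig{
		"*": {
			"ram":       {HasCapacity: true},
			"cores":     {HasCapacity: true},
			"instances": {HasCapacity: true},
		},
		"special": {"ram": {HasCapacity: false}},
	}
	fmt.Println(cfg.Lookup("unknown-group", "cores")) // falls back to "*"
	fmt.Println(cfg.Lookup("special", "ram"))         // exact match wins
}
```

In this sketch a known group that lacks a resource type does not fall through to the wildcard. Whether cortex merges the two levels is not stated in the changelog, so that choice is an assumption.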

cmd/manager/main.go

Lines changed: 0 additions & 1 deletion
@@ -573,7 +573,6 @@ func main() {
 		Client:  multiclusterClient,
 		Scheme:  mgr.GetScheme(),
 		Monitor: monitor,
-		Conf:    conf.GetConfigOrDie[prometheus.PrometheusDatasourceReconcilerConfig](),
 	}).SetupWithManager(mgr, multiclusterClient); err != nil {
 		setupLog.Error(err, "unable to create controller", "controller", "PrometheusDatasourceReconciler")
 		os.Exit(1)
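
The hunk above drops the `Conf` field because, per the changelog, the reconciler now loads its own config during `SetupWithManager`. The pattern might look roughly like the following sketch, which uses only the standard library. `getConfig`, `reconcilerConfig`, `reconciler` and `setup` are hypothetical stand-ins, not the cortex or controller-runtime APIs, and the JSON source and the default of one worker are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// reconcilerConfig is a hypothetical config struct; only the key name
// is taken from the changelog.
type reconcilerConfig struct {
	ParallelReconciles int `json:"prometheusDatasourceControllerParallelReconciles"`
}

// getConfig is a stand-in for a generic config loader. Unlike an
// "OrDie" variant it returns the error to the caller.
func getConfig[T any](raw []byte) (T, error) {
	var c T
	err := json.Unmarshal(raw, &c)
	return c, err
}

// reconciler has no exported Conf field; callers cannot inject config.
type reconciler struct {
	maxConcurrent int
}

// setup loads the config internally and derives the concurrency limit.
func (r *reconciler) setup(raw []byte) error {
	c, err := getConfig[reconcilerConfig](raw)
	if err != nil {
		return fmt.Errorf("load reconciler config: %w", err)
	}
	r.maxConcurrent = c.ParallelReconciles
	if r.maxConcurrent < 1 {
		r.maxConcurrent = 1 // assumed default: a single worker
	}
	return nil
}

func main() {
	r := &reconciler{}
	err := r.setup([]byte(`{"prometheusDatasourceControllerParallelReconciles": 4}`))
	fmt.Println(r.maxConcurrent, err)
}
```

Returning the error from setup instead of exiting inside the loader lets `main` keep a single failure path, which matches the `err != nil` check that remains in the hunk.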

helm/bundles/cortex-cinder/Chart.yaml

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ apiVersion: v2
 name: cortex-cinder
 description: A Helm chart deploying Cortex for Cinder.
 type: application
-version: 0.0.58
+version: 0.0.59
 appVersion: 0.1.0
 dependencies:
   # from: file://../../library/cortex-postgres
@@ -16,12 +16,12 @@ dependencies:
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-knowledge-controllers
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-scheduling-controllers
 
 # Owner info adds a configmap to the kubernetes cluster with information on

helm/bundles/cortex-crds/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@ apiVersion: v2
 name: cortex-crds
 description: A Helm chart deploying Cortex CRDs.
 type: application
-version: 0.0.58
+version: 0.0.59
 appVersion: 0.1.0
 dependencies:
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
 
 # Owner info adds a configmap to the kubernetes cluster with information on
 # the service owner. This makes it easier to find out who to contact in case

helm/bundles/cortex-ironcore/Chart.yaml

Lines changed: 2 additions & 2 deletions
@@ -5,13 +5,13 @@ apiVersion: v2
 name: cortex-ironcore
 description: A Helm chart deploying Cortex for IronCore.
 type: application
-version: 0.0.58
+version: 0.0.59
 appVersion: 0.1.0
 dependencies:
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
 
 # Owner info adds a configmap to the kubernetes cluster with information on
 # the service owner. This makes it easier to find out who to contact in case

helm/bundles/cortex-manila/Chart.yaml

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ apiVersion: v2
 name: cortex-manila
 description: A Helm chart deploying Cortex for Manila.
 type: application
-version: 0.0.58
+version: 0.0.59
 appVersion: 0.1.0
 dependencies:
   # from: file://../../library/cortex-postgres
@@ -16,12 +16,12 @@ dependencies:
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-knowledge-controllers
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-scheduling-controllers
 
 # Owner info adds a configmap to the kubernetes cluster with information on

helm/bundles/cortex-nova/Chart.yaml

Lines changed: 3 additions & 3 deletions
@@ -5,7 +5,7 @@ apiVersion: v2
 name: cortex-nova
 description: A Helm chart deploying Cortex for Nova.
 type: application
-version: 0.0.58
+version: 0.0.59
 appVersion: 0.1.0
 dependencies:
   # from: file://../../library/cortex-postgres
@@ -16,12 +16,12 @@ dependencies:
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-knowledge-controllers
   # from: file://../../library/cortex
   - name: cortex
     repository: oci://ghcr.io/cobaltcore-dev/cortex/charts
-    version: 0.0.45
+    version: 0.0.46
     alias: cortex-scheduling-controllers
 
 # Owner info adds a configmap to the kubernetes cluster with information on

helm/bundles/cortex-nova/alerts/nova.alerts.yaml

Lines changed: 0 additions & 252 deletions
@@ -287,258 +287,6 @@ groups:
         configuration. It is recommended to investigate the
         pipeline status and logs for more details.
 
-  # Committed Resource Info API Alerts
-  - alert: CortexNovaCommittedResourceInfoHttpRequest500sTooHigh
-    expr: rate(cortex_committed_resource_info_api_requests_total{service="cortex-nova-metrics", status_code=~"5.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource info API HTTP 500 errors too high"
-      description: >
-        The committed resource info API (Limes LIQUID integration) is responding
-        with HTTP 5xx errors. This indicates internal problems building service info,
-        such as invalid flavor group data. Limes will not be able to discover available
-        resources until the issue is resolved.
-
-  # Committed Resource Change API Alerts
-  - alert: CortexNovaCommittedResourceHttpRequest400sTooHigh
-    expr: rate(cortex_committed_resource_change_api_requests_total{service="cortex-nova-metrics", status_code=~"4.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource change API HTTP 400 errors too high"
-      description: >
-        The committed resource change API (Limes LIQUID integration) is responding
-        with HTTP 4xx errors. This may happen when Limes sends a request with
-        an outdated info version (409), the API is temporarily unavailable,
-        or the request format is invalid. Limes will typically retry these
-        requests, so no immediate action is needed unless the errors persist.
-
-  - alert: CortexNovaCommittedResourceHttpRequest500sTooHigh
-    expr: rate(cortex_committed_resource_change_api_requests_total{service="cortex-nova-metrics", status_code=~"5.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource change API HTTP 500 errors too high"
-      description: >
-        The committed resource change API (Limes LIQUID integration) is responding
-        with HTTP 5xx errors. This is not expected and indicates that Cortex
-        is having an internal problem processing commitment changes. Limes will
-        continue to retry, but new commitments may not be fulfilled until the
-        issue is resolved.
-
-  - alert: CortexNovaCommittedResourceLatencyTooHigh
-    expr: |
-      histogram_quantile(0.95, sum(rate(cortex_committed_resource_change_api_request_duration_seconds_bucket{service="cortex-nova-metrics"}[5m])) by (le)) > 30
-      and on() rate(cortex_committed_resource_change_api_requests_total{service="cortex-nova-metrics"}[5m]) > 0
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource change API latency too high"
-      description: >
-        The committed resource change API (Limes LIQUID integration) is experiencing
-        high latency (p95 > 30s). This may indicate that the scheduling pipeline
-        is under heavy load or that reservation scheduling is taking longer than
-        expected. Limes requests may time out, causing commitment changes to fail.
-
-  - alert: CortexNovaCommittedResourceRejectionRateTooHigh
-    expr: |
-      (
-        sum(rate(cortex_committed_resource_change_api_commitment_changes_total{service="cortex-nova-metrics", result="rejected"}[5m]))
-        / sum(rate(cortex_committed_resource_change_api_commitment_changes_total{service="cortex-nova-metrics"}[5m]))
-      ) > 0.5
-      and on() sum(rate(cortex_committed_resource_change_api_commitment_changes_total{service="cortex-nova-metrics"}[5m])) > 0
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource rejection rate too high"
-      description: >
-        More than 50% of commitment change requests are being rejected.
-        This may indicate insufficient capacity in the datacenter to fulfill
-        new commitments, or issues with the commitment scheduling logic.
-        Rejected commitments are rolled back, so Limes will see them as failed
-        and may retry or report the failure to users.
-
-  - alert: CortexNovaCommittedResourceTimeoutsTooHigh
-    expr: increase(cortex_committed_resource_change_api_timeouts_total{service="cortex-nova-metrics"}[5m]) > 0
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource change API timeout detected"
-      description: >
-        The committed resource change API (Limes LIQUID integration) timed out
-        while waiting for reservations to become ready. This indicates that the
-        scheduling pipeline is overloaded or reservations are taking too long
-        to be scheduled. Affected commitment changes are rolled back and Limes
-        will see them as failed. Consider investigating the scheduler performance
-        or increasing the timeout configuration.
-
-  # Committed Resource Usage API Alerts
-  - alert: CortexNovaCommittedResourceUsageHttpRequest400sTooHigh
-    expr: rate(cortex_committed_resource_usage_api_requests_total{service="cortex-nova-metrics", status_code=~"4.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource usage API HTTP 400 errors too high"
-      description: >
-        The committed resource usage API (Limes LIQUID integration) is responding
-        with HTTP 4xx errors. This may indicate invalid project IDs or malformed
-        requests from Limes. Limes will typically retry these requests.
-
-  - alert: CortexNovaCommittedResourceUsageHttpRequest500sTooHigh
-    expr: rate(cortex_committed_resource_usage_api_requests_total{service="cortex-nova-metrics", status_code=~"5.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource usage API HTTP 500 errors too high"
-      description: >
-        The committed resource usage API (Limes LIQUID integration) is responding
-        with HTTP 5xx errors. This indicates internal problems fetching reservations
-        or Nova server data. Limes may receive stale or incomplete usage data.
-
-  - alert: CortexNovaCommittedResourceUsageLatencyTooHigh
-    expr: |
-      histogram_quantile(0.95, sum(rate(cortex_committed_resource_usage_api_request_duration_seconds_bucket{service="cortex-nova-metrics"}[5m])) by (le)) > 10
-      and on() rate(cortex_committed_resource_usage_api_requests_total{service="cortex-nova-metrics"}[5m]) > 0
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource usage API latency too high"
-      description: >
-        The committed resource usage API (Limes LIQUID integration) is experiencing
-        high latency (p95 > 10s). This may indicate slow Nova API responses or
-        database queries. Limes scrapes may time out, affecting quota reporting.
-
-  # Committed Resource Capacity API Alerts
-  - alert: CortexNovaCommittedResourceCapacityHttpRequest400sTooHigh
-    expr: rate(cortex_committed_resource_capacity_api_requests_total{service="cortex-nova-metrics", status_code=~"4.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource capacity API HTTP 400 errors too high"
-      description: >
-        The committed resource capacity API (Limes LIQUID integration) is responding
-        with HTTP 4xx errors. This may indicate malformed requests from Limes.
-
-  - alert: CortexNovaCommittedResourceCapacityHttpRequest500sTooHigh
-    expr: rate(cortex_committed_resource_capacity_api_requests_total{service="cortex-nova-metrics", status_code=~"5.."}[5m]) > 0.1
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource capacity API HTTP 500 errors too high"
-      description: >
-        The committed resource capacity API (Limes LIQUID integration) is responding
-        with HTTP 5xx errors. This indicates internal problems calculating cluster
-        capacity. Limes may receive stale or incomplete capacity data.
-
-  - alert: CortexNovaCommittedResourceCapacityLatencyTooHigh
-    expr: |
-      histogram_quantile(0.95, sum(rate(cortex_committed_resource_capacity_api_request_duration_seconds_bucket{service="cortex-nova-metrics"}[5m])) by (le)) > 10
-      and on() rate(cortex_committed_resource_capacity_api_requests_total{service="cortex-nova-metrics"}[5m]) > 0
-    for: 5m
-    labels:
-      context: committed-resource-api
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource capacity API latency too high"
-      description: >
-        The committed resource capacity API (Limes LIQUID integration) is experiencing
-        high latency (p95 > 10s). This may indicate slow database queries or knowledge
-        CRD retrieval. Limes scrapes may time out, affecting capacity reporting.
-
-  # Committed Resource Syncer Alerts
-  # These alerts only fire when the syncer is enabled (metrics are only registered when enabled).
-  # Absent metrics = syncer disabled = alerts inactive by design.
-  - alert: CortexNovaCommittedResourceSyncerNotRunning
-    expr: increase(cortex_committed_resource_syncer_duration_seconds_count{service="cortex-nova-metrics"}[3h]) < 1
-    for: 15m
-    labels:
-      context: committed-resource-syncer
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource syncer has not run in 3 hours"
-      description: >
-        No commitment sync has completed in the last 3 hours. The syncer runs hourly,
-        so at least 2 runs should appear in this window. Check that the syncer task
-        is healthy and Limes is reachable.
-
-  - alert: CortexNovaCommittedResourceSyncerErrors
-    expr: increase(cortex_committed_resource_syncer_errors_total{service="cortex-nova-metrics"}[1h]) > 3
-    for: 5m
-    labels:
-      context: committed-resource-syncer
-      dashboard: cortex-status-dashboard/cortex-status-dashboard
-      service: cortex
-      severity: warning
-      support_group: workload-management
-    annotations:
-      summary: "Committed Resource syncer is repeatedly failing"
-      description: >
-        The committed resource syncer has encountered more than 3 errors in the last
-        hour. Check syncer logs for details; common causes are connectivity issues
-        with Limes or failures writing CommittedResource CRDs.
-
   - alert: CortexNovaDoesntFindValidKVMHosts
     expr: sum by (az, hvtype) (increase(cortex_vm_faults{hvtype=~"CH|QEMU",faultmsg=~".*No valid host was found.*"}[5m])) > 0
     for: 5m
