Commit 71d49c9

chore: add stale issue/pr check; fix data (#21)
Closes

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
- Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
- Will Killian (https://github.com/willkill07)

Approvers:
- David Gardner (https://github.com/dagardner-nv)

URL: #21
1 parent cde89d2 commit 71d49c9

3 files changed

Lines changed: 47 additions & 43 deletions

.github/workflows/stale.yaml

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: 'Close stale issues and PRs'
+on:
+  schedule:
+    - cron: '0 12 * * *'
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      actions: write
+      issues: write
+      pull-requests: write
+    steps:
+      - uses: actions/stale@v10
+        with:
+          stale-issue-message: 'This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.'
+          stale-pr-message: 'This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.'
+          close-issue-message: 'This issue was closed because it has been stalled for 7 days with no activity.'
+          close-pr-message: 'This PR was closed because it has been stalled for 14 days with no activity.'
+          days-before-issue-stale: 60
+          days-before-issue-close: 7
+          days-before-pr-stale: 30
+          days-before-pr-close: 14
+          exempt-issue-labels: 'Needs Triage'
+          exempt-pr-labels: 'Under Review'
+          operations-per-run: 100
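As a rough illustration of the thresholds configured in this workflow, the sketch below computes the earliest point at which an inactive issue or PR would be labelled stale and then closed. It is a simplification under stated assumptions, not the logic of `actions/stale` itself: it ignores exempt labels, the daily cron granularity, and the action's own rules for what counts as activity. The names `THRESHOLDS` and `stale_timeline` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Day counts copied from the workflow's `with:` block.
THRESHOLDS = {
    "issue": {"stale": 60, "close": 7},
    "pr": {"stale": 30, "close": 14},
}


def stale_timeline(last_activity, kind):
    """Return (stale_at, close_at) for an item with no further activity.

    Assumption: the close countdown starts when the item becomes stale.
    """
    days = THRESHOLDS[kind]
    stale_at = last_activity + timedelta(days=days["stale"])
    close_at = stale_at + timedelta(days=days["close"])
    return stale_at, close_at


if __name__ == "__main__":
    last = datetime(2026, 1, 1, tzinfo=timezone.utc)
    for kind in ("issue", "pr"):
        stale_at, close_at = stale_timeline(last, kind)
        print(kind, "stale:", stale_at.date(), "close:", close_at.date())
```

Under these assumptions an untouched issue is closed no earlier than 67 days after its last activity, and an untouched PR no earlier than 44 days.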
Lines changed: 3 additions & 17 deletions
@@ -1,17 +1,3 @@
-[
-  {
-    "question": "{\"scenario_id\": \"node-not-ready\", \"query\": \"Worker node worker-2 appears to be down. Investigate the cluster health and identify the impact.\"}",
-    "answer": "warning: Worker node worker-2 is in NotReady state. Kubelet has stopped sending heartbeats (last seen 8m ago). Pods previously scheduled on worker-2 are being evicted or stuck in Pending. No cluster-wide outage — other nodes are healthy. Recommended: investigate kubelet logs on worker-2, check SSH connectivity, verify VM/hardware status.",
-    "label": "warning"
-  },
-  {
-    "question": "{\"scenario_id\": \"memory-pressure\", \"query\": \"Multiple pods are crashing in the ml-serving namespace. Check what is happening.\"}",
-    "answer": "critical: Worker-1 is under severe memory pressure (93% utilization, MemoryPressure=True). Multiple pods OOMKilled: inference-server, api-gateway, metrics-collector. The inference server is consuming excessive memory, triggering kernel OOM killer. Recommended: increase memory limits for the inference server deployment, consider moving ML workloads to a dedicated node with more RAM, or reduce model size.",
-    "label": "critical"
-  },
-  {
-    "question": "{\"scenario_id\": \"healthy-cluster\", \"query\": \"Run a routine health check on the Kubernetes cluster.\"}",
-    "answer": "informational: All 4 nodes are in Ready state. 68 pods running across 12 namespaces with no unhealthy pods detected. No warning events. Resource utilization is within normal thresholds across all nodes (highest: worker-1 at 53% memory). Cluster is operating normally.",
-    "label": "informational"
-  }
-]
+version https://git-lfs.github.com/spec/v1
+oid sha256:675db2b74104f65ce027b061b3997bbb930fd8ba7ca01be0b62c3ed559fdc09f
+size 1627
Lines changed: 3 additions & 26 deletions
@@ -1,26 +1,3 @@
-[
-  {
-    "scenario_id": "node-not-ready",
-    "description": "Worker node becomes NotReady due to kubelet crash",
-    "node_status_check": "## Node Status\n- control-plane-1: Ready, SchedulingDisabled\n- worker-1: Ready, 6 vCPU, 30Gi RAM\n- worker-2: NotReady, 6 vCPU, 20Gi RAM — last heartbeat 8m ago\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 380m 9% 3200Mi 40%\nworker-1 2100m 35% 18500Mi 62%\nworker-2 <unknown> <unknown> <unknown> <unknown>\nworker-3 1800m 30% 9600Mi 48%",
-    "pod_health_check": "## Unhealthy Pods\nmonitoring prometheus-node-exporter-abc12 0/1 NodeAffinity 0 worker-2\nproduction cache-redis-0 0/1 Terminating 0 worker-2\nproduction api-server-5f8b9c7d4-x2k9p 0/1 Pending 0 <none>\n\n## High Restart Pods\n monitoring/alertmanager-main-0: 12 restarts",
-    "event_collector": "## Warning Events (6 most recent)\nmonitoring Warning NodeNotReady Node/worker-2 Node worker-2 status is now: NodeNotReady\nproduction Warning FailedScheduling Pod/api-server-5f8b9c7d4-x2k9p 0/3 nodes are available: 1 node had untolerated taint node.kubernetes.io/not-ready\nmonitoring Warning Unhealthy Pod/prometheus-node-exporter-abc12 Readiness probe failed: connection refused\nproduction Warning Evicted Pod/cache-redis-0 The node was low on resource: ephemeral-storage\nkube-system Warning NodeNotReady Node/worker-2 Controller detected that node worker-2 is not ready\nmonitoring Warning BackOff Pod/alertmanager-main-0 Back-off restarting failed container\n\n## Recent Events (10 most recent)\nkube-system Normal NodeHasSufficientMemory Node/worker-1 Node worker-1 status is now: NodeHasSufficientMemory\nkube-system Normal Starting Node/worker-3 Starting kubelet\nmonitoring Normal Pulled Pod/grafana-6d8f9c8b7-abc Container image already present\nmonitoring Normal Created Pod/grafana-6d8f9c8b7-abc Created container grafana",
-    "resource_pressure_check": "## Node Pressure Conditions\n- worker-2: MemoryPressure=Unknown DiskPressure=Unknown PIDPressure=Unknown Ready=False\n\n## Node Resource Utilization\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 380m 9% 3200Mi 40%\nworker-1 2100m 35% 18500Mi 62%\nworker-3 1800m 30% 9600Mi 48%\nNote: worker-2 metrics unavailable (node NotReady)\n\n## Nodes Exceeding Thresholds\nNone (excluding offline worker-2)"
-  },
-  {
-    "scenario_id": "memory-pressure",
-    "description": "Multiple pods OOMKilled due to memory pressure on a worker node",
-    "node_status_check": "## Node Status\nAll 4 nodes are present.\n- control-plane-1: Ready, SchedulingDisabled\n- worker-1: Ready, 8 vCPU, 64Gi RAM, MemoryPressure=True\n- worker-2: Ready, 6 vCPU, 20Gi RAM\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 420m 10% 3100Mi 39%\nworker-1 5800m 72% 59800Mi 93%\nworker-2 1200m 20% 12400Mi 62%\nworker-3 900m 15% 8200Mi 41%",
-    "pod_health_check": "## Unhealthy Pods\nml-serving inference-server-7f9b8c6d5-kl2m3 0/1 OOMKilled 3 worker-1\nproduction api-gateway-5d7b8a9c2-pq4r5 0/1 CrashLoopBackOff 7 worker-1\nmonitoring metrics-collector-6c8d9e7f1-st6u7 0/1 OOMKilled 2 worker-1\n\n## High Restart Pods\n ml-serving/inference-server-7f9b8c6d5-kl2m3: 8 restarts\n production/api-gateway-5d7b8a9c2-pq4r5: 12 restarts\n monitoring/metrics-collector-6c8d9e7f1-st6u7: 6 restarts",
-    "event_collector": "## Warning Events (8 most recent)\nml-serving Warning OOMKilling Pod/inference-server-7f9b8c6d5-kl2m3 Memory cgroup out of memory: Killed process 4521 (inference-server)\nproduction Warning OOMKilling Pod/api-gateway-5d7b8a9c2-pq4r5 Memory cgroup out of memory: Killed process 3892 (python)\nmonitoring Warning OOMKilling Pod/metrics-collector-6c8d9e7f1-st6u7 Memory cgroup out of memory: Killed process 5123\nml-serving Warning BackOff Pod/inference-server-7f9b8c6d5-kl2m3 Back-off restarting failed container\nproduction Warning BackOff Pod/api-gateway-5d7b8a9c2-pq4r5 Back-off restarting failed container\nkube-system Warning SystemOOM Node/worker-1 System OOM encountered, victim process: inference-server\nkube-system Warning EvictionThresholdMet Node/worker-1 Attempting to reclaim memory\nml-serving Warning Unhealthy Pod/inference-server-7f9b8c6d5-kl2m3 Liveness probe failed: connection refused",
-    "resource_pressure_check": "## Nodes Under Pressure\n- worker-1: MemoryPressure=True DiskPressure=False PIDPressure=False Ready=True\n\n## Node Resource Utilization\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 420m 10% 3100Mi 39%\nworker-1 5800m 72% 59800Mi 93%\nworker-2 1200m 20% 12400Mi 62%\nworker-3 900m 15% 8200Mi 41%\n\n## Nodes Exceeding Thresholds\n - worker-1: Memory at 93% (threshold: 85%)"
-  },
-  {
-    "scenario_id": "healthy-cluster",
-    "description": "Normal cluster operations with no issues detected",
-    "node_status_check": "## Node Status\nAll 4 nodes are in Ready state.\n- control-plane-1: Ready, SchedulingDisabled (control-plane taint)\n- worker-1: Ready, 6 vCPU, 30Gi RAM\n- worker-2: Ready, 6 vCPU, 20Gi RAM\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 350m 8% 3000Mi 37%\nworker-1 1500m 25% 16000Mi 53%\nworker-2 900m 15% 9200Mi 46%\nworker-3 1100m 18% 8800Mi 44%",
-    "pod_health_check": "## Pod Health Summary\nAll pods are in Running or Succeeded state across all namespaces.\n68 pods running across 12 namespaces.\n\n## High Restart Pods\nNo pods with excessive restart counts detected.",
-    "event_collector": "## Warning Events\nNo warning events found.\n\n## Recent Events (5 most recent)\nml-serving Normal Pulled Pod/inference-server-7f9b8c6d5-abc Container image already present on machine\nmonitoring Normal Started Pod/grafana-6d8f9c8b7-def Started container grafana\ndefault Normal Scheduled Pod/test-job-xyz-123 Successfully assigned to worker-2\nkube-system Normal Starting Node/worker-1 Starting kubelet\nkube-system Normal NodeReady Node/worker-3 Node worker-3 status is now: NodeReady",
-    "resource_pressure_check": "## Node Pressure Conditions\nNo nodes reporting pressure conditions (MemoryPressure, DiskPressure, PIDPressure all False).\n\n## Node Resource Utilization\nNAME CPU(cores) CPU% MEMORY(bytes) MEMORY%\ncontrol-plane-1 350m 8% 3000Mi 37%\nworker-1 1500m 25% 16000Mi 53%\nworker-2 900m 15% 9200Mi 46%\nworker-3 1100m 18% 8800Mi 44%\n\n## Nodes Exceeding Thresholds\nNone."
-  }
-]
+version https://git-lfs.github.com/spec/v1
+oid sha256:29429c8bec6bf001536a2b9e5549378064b12f9482cc036ae40aaacea2c80411
+size 7780
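
The two data files now contain Git LFS pointers instead of inline JSON: three `key value` lines giving the spec version, the SHA-256 object ID, and the object size in bytes. The sketch below shows one way such a pointer could be parsed and compared against a candidate blob. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical and not part of Git LFS; real checkouts should rely on `git lfs` itself.

```python
import hashlib

# Pointer text taken from the first replaced data file in this commit.
POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:675db2b74104f65ce027b061b3997bbb930fd8ba7ca01be0b62c3ed559fdc09f
size 1627
"""


def parse_lfs_pointer(text):
    """Split a pointer file into its version, hash algorithm, digest and size."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def matches_pointer(pointer, data):
    """Return True if `data` has the size and SHA-256 digest the pointer records."""
    if pointer["algo"] != "sha256":
        raise ValueError("this sketch only handles sha256 object IDs")
    return (
        len(data) == pointer["size"]
        and hashlib.sha256(data).hexdigest() == pointer["digest"]
    )


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER)
    print(pointer["algo"], pointer["size"])
```

Checking the size before hashing is only a cheap early exit; the digest comparison is what actually ties the blob to the pointer.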
