chore: add stale issue/pr check; fix data (#21)

willkill07 · web-flow · commit 71d49c9f07fa · 2026-04-01T15:13:01.000Z
Closes

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/NVIDIA/NeMo-Agent-Toolkit/blob/develop/docs/source/resources/contributing/index.md).
- We require that all contributors "sign-off" on their commits. This certifies that the contribution is your original work, or you have rights to submit it under the same license, or a compatible license.
  - Any contribution which contains commits that are not Signed-Off will not be accepted.
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - Will Killian (https://github.com/willkill07)

Approvers:
  - David Gardner (https://github.com/dagardner-nv)

URL: #21
diff --git a/.github/workflows/stale.yaml b/.github/workflows/stale.yaml
@@ -0,0 +1,41 @@
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+name: 'Close stale issues and PRs'
+on:
+  schedule:
+    - cron: '0 12 * * *'
+
+jobs:
+  stale:
+    runs-on: ubuntu-latest
+    permissions:
+      actions: write
+      issues: write
+      pull-requests: write
+    steps:
+      - uses: actions/stale@v10
+        with:
+          stale-issue-message: 'This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 7 days.'
+          stale-pr-message: 'This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 14 days.'
+          close-issue-message: 'This issue was closed because it has been stalled for 7 days with no activity.'
+          close-pr-message: 'This PR was closed because it has been stalled for 14 days with no activity.'
+          days-before-issue-stale: 60
+          days-before-issue-close: 7
+          days-before-pr-stale: 30
+          days-before-pr-close: 14
+          exempt-issue-labels: 'Needs Triage'
+          exempt-pr-labels: 'Under Review'
+          operations-per-run: 100
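Reviewer note: the four `days-before-*` values above imply a simple timeline for each item. The sketch below is a simplified model of that timeline, not the `actions/stale` implementation. The helper name `stale_timeline` and the `THRESHOLDS` table are hypothetical. It assumes no further activity after the given date, and it ignores the daily 12:00 UTC schedule and the `operations-per-run` cap, so the returned dates are the earliest possible ones.

```python
from datetime import date, timedelta

# Thresholds copied from the workflow's `with:` block above.
THRESHOLDS = {
    "issue": {"stale": 60, "close": 7, "exempt": "Needs Triage"},
    "pr": {"stale": 30, "close": 14, "exempt": "Under Review"},
}


def stale_timeline(kind, last_activity, labels=()):
    """Hypothetical helper: return (stale_on, close_on), or None if exempt.

    Assumes the item sees no activity after `last_activity`.
    """
    cfg = THRESHOLDS[kind]
    if cfg["exempt"] in labels:
        # Items carrying the exempt label are never marked stale.
        return None
    stale_on = last_activity + timedelta(days=cfg["stale"])
    close_on = stale_on + timedelta(days=cfg["close"])
    return stale_on, close_on


if __name__ == "__main__":
    print(stale_timeline("issue", date(2026, 1, 1)))
    print(stale_timeline("pr", date(2026, 1, 1)))
    print(stale_timeline("pr", date(2026, 1, 1), labels=["Under Review"]))
```

Under this model, an untouched PR is closed 44 days after its last activity and an untouched issue after 67 days, unless it carries the exempt label.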
diff --git a/examples/k8s_infra_monitor/src/nat_k8s_infra_monitor/data/eval_dataset.json b/examples/k8s_infra_monitor/src/nat_k8s_infra_monitor/data/eval_dataset.json
@@ -1,17 +1,3 @@
-[
-  {
-    "question": "{\"scenario_id\": \"node-not-ready\", \"query\": \"Worker node worker-2 appears to be down. Investigate the cluster health and identify the impact.\"}",
-    "answer": "warning: Worker node worker-2 is in NotReady state. Kubelet has stopped sending heartbeats (last seen 8m ago). Pods previously scheduled on worker-2 are being evicted or stuck in Pending. No cluster-wide outage — other nodes are healthy. Recommended: investigate kubelet logs on worker-2, check SSH connectivity, verify VM/hardware status.",
-    "label": "warning"
-  },
-  {
-    "question": "{\"scenario_id\": \"memory-pressure\", \"query\": \"Multiple pods are crashing in the ml-serving namespace. Check what is happening.\"}",
-    "answer": "critical: Worker-1 is under severe memory pressure (93% utilization, MemoryPressure=True). Multiple pods OOMKilled: inference-server, api-gateway, metrics-collector. The inference server is consuming excessive memory, triggering kernel OOM killer. Recommended: increase memory limits for the inference server deployment, consider moving ML workloads to a dedicated node with more RAM, or reduce model size.",
-    "label": "critical"
-  },
-  {
-    "question": "{\"scenario_id\": \"healthy-cluster\", \"query\": \"Run a routine health check on the Kubernetes cluster.\"}",
-    "answer": "informational: All 4 nodes are in Ready state. 68 pods running across 12 namespaces with no unhealthy pods detected. No warning events. Resource utilization is within normal thresholds across all nodes (highest: worker-1 at 53% memory). Cluster is operating normally.",
-    "label": "informational"
-  }
-]
+version https://git-lfs.github.com/spec/v1
+oid sha256:675db2b74104f65ce027b061b3997bbb930fd8ba7ca01be0b62c3ed559fdc09f
+size 1627
diff --git a/examples/k8s_infra_monitor/src/nat_k8s_infra_monitor/data/offline_scenarios.json b/examples/k8s_infra_monitor/src/nat_k8s_infra_monitor/data/offline_scenarios.json
@@ -1,26 +1,3 @@
-[
-  {
-    "scenario_id": "node-not-ready",
-    "description": "Worker node becomes NotReady due to kubelet crash",
-    "node_status_check": "## Node Status\n- control-plane-1: Ready, SchedulingDisabled\n- worker-1: Ready, 6 vCPU, 30Gi RAM\n- worker-2: NotReady, 6 vCPU, 20Gi RAM — last heartbeat 8m ago\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    380m         9%     3200Mi          40%\nworker-1           2100m        35%    18500Mi         62%\nworker-2           <unknown>    <unknown>  <unknown>   <unknown>\nworker-3           1800m        30%    9600Mi          48%",
-    "pod_health_check": "## Unhealthy Pods\nmonitoring   prometheus-node-exporter-abc12   0/1   NodeAffinity   0   worker-2\nproduction   cache-redis-0                    0/1   Terminating    0   worker-2\nproduction   api-server-5f8b9c7d4-x2k9p      0/1   Pending        0   <none>\n\n## High Restart Pods\n  monitoring/alertmanager-main-0: 12 restarts",
-    "event_collector": "## Warning Events (6 most recent)\nmonitoring   Warning   NodeNotReady          Node/worker-2              Node worker-2 status is now: NodeNotReady\nproduction   Warning   FailedScheduling      Pod/api-server-5f8b9c7d4-x2k9p   0/3 nodes are available: 1 node had untolerated taint node.kubernetes.io/not-ready\nmonitoring   Warning   Unhealthy             Pod/prometheus-node-exporter-abc12  Readiness probe failed: connection refused\nproduction   Warning   Evicted               Pod/cache-redis-0               The node was low on resource: ephemeral-storage\nkube-system  Warning   NodeNotReady          Node/worker-2              Controller detected that node worker-2 is not ready\nmonitoring   Warning   BackOff               Pod/alertmanager-main-0    Back-off restarting failed container\n\n## Recent Events (10 most recent)\nkube-system  Normal    NodeHasSufficientMemory  Node/worker-1  Node worker-1 status is now: NodeHasSufficientMemory\nkube-system  Normal    Starting                 Node/worker-3  Starting kubelet\nmonitoring   Normal    Pulled                   Pod/grafana-6d8f9c8b7-abc  Container image already present\nmonitoring   Normal    Created                  Pod/grafana-6d8f9c8b7-abc  Created container grafana",
-    "resource_pressure_check": "## Node Pressure Conditions\n- worker-2: MemoryPressure=Unknown DiskPressure=Unknown PIDPressure=Unknown Ready=False\n\n## Node Resource Utilization\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    380m         9%     3200Mi          40%\nworker-1           2100m        35%    18500Mi         62%\nworker-3           1800m        30%    9600Mi          48%\nNote: worker-2 metrics unavailable (node NotReady)\n\n## Nodes Exceeding Thresholds\nNone (excluding offline worker-2)"
-  },
-  {
-    "scenario_id": "memory-pressure",
-    "description": "Multiple pods OOMKilled due to memory pressure on a worker node",
-    "node_status_check": "## Node Status\nAll 4 nodes are present.\n- control-plane-1: Ready, SchedulingDisabled\n- worker-1: Ready, 8 vCPU, 64Gi RAM, MemoryPressure=True\n- worker-2: Ready, 6 vCPU, 20Gi RAM\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    420m         10%    3100Mi          39%\nworker-1           5800m        72%    59800Mi         93%\nworker-2           1200m        20%    12400Mi         62%\nworker-3           900m         15%    8200Mi          41%",
-    "pod_health_check": "## Unhealthy Pods\nml-serving   inference-server-7f9b8c6d5-kl2m3      0/1   OOMKilled     3   worker-1\nproduction   api-gateway-5d7b8a9c2-pq4r5            0/1   CrashLoopBackOff  7   worker-1\nmonitoring   metrics-collector-6c8d9e7f1-st6u7      0/1   OOMKilled     2   worker-1\n\n## High Restart Pods\n  ml-serving/inference-server-7f9b8c6d5-kl2m3: 8 restarts\n  production/api-gateway-5d7b8a9c2-pq4r5: 12 restarts\n  monitoring/metrics-collector-6c8d9e7f1-st6u7: 6 restarts",
-    "event_collector": "## Warning Events (8 most recent)\nml-serving   Warning   OOMKilling            Pod/inference-server-7f9b8c6d5-kl2m3     Memory cgroup out of memory: Killed process 4521 (inference-server)\nproduction   Warning   OOMKilling            Pod/api-gateway-5d7b8a9c2-pq4r5          Memory cgroup out of memory: Killed process 3892 (python)\nmonitoring   Warning   OOMKilling            Pod/metrics-collector-6c8d9e7f1-st6u7    Memory cgroup out of memory: Killed process 5123\nml-serving   Warning   BackOff               Pod/inference-server-7f9b8c6d5-kl2m3     Back-off restarting failed container\nproduction   Warning   BackOff               Pod/api-gateway-5d7b8a9c2-pq4r5          Back-off restarting failed container\nkube-system  Warning   SystemOOM             Node/worker-1                  System OOM encountered, victim process: inference-server\nkube-system  Warning   EvictionThresholdMet  Node/worker-1                  Attempting to reclaim memory\nml-serving   Warning   Unhealthy             Pod/inference-server-7f9b8c6d5-kl2m3     Liveness probe failed: connection refused",
-    "resource_pressure_check": "## Nodes Under Pressure\n- worker-1: MemoryPressure=True DiskPressure=False PIDPressure=False Ready=True\n\n## Node Resource Utilization\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    420m         10%    3100Mi          39%\nworker-1           5800m        72%    59800Mi         93%\nworker-2           1200m        20%    12400Mi         62%\nworker-3           900m         15%    8200Mi          41%\n\n## Nodes Exceeding Thresholds\n  - worker-1: Memory at 93% (threshold: 85%)"
-  },
-  {
-    "scenario_id": "healthy-cluster",
-    "description": "Normal cluster operations with no issues detected",
-    "node_status_check": "## Node Status\nAll 4 nodes are in Ready state.\n- control-plane-1: Ready, SchedulingDisabled (control-plane taint)\n- worker-1: Ready, 6 vCPU, 30Gi RAM\n- worker-2: Ready, 6 vCPU, 20Gi RAM\n- worker-3: Ready, 6 vCPU, 20Gi RAM\n\n## Node Resource Usage\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    350m         8%     3000Mi          37%\nworker-1           1500m        25%    16000Mi         53%\nworker-2           900m         15%    9200Mi          46%\nworker-3           1100m        18%    8800Mi          44%",
-    "pod_health_check": "## Pod Health Summary\nAll pods are in Running or Succeeded state across all namespaces.\n68 pods running across 12 namespaces.\n\n## High Restart Pods\nNo pods with excessive restart counts detected.",
-    "event_collector": "## Warning Events\nNo warning events found.\n\n## Recent Events (5 most recent)\nml-serving   Normal   Pulled    Pod/inference-server-7f9b8c6d5-abc    Container image already present on machine\nmonitoring   Normal   Started   Pod/grafana-6d8f9c8b7-def   Started container grafana\ndefault      Normal   Scheduled Pod/test-job-xyz-123        Successfully assigned to worker-2\nkube-system  Normal   Starting  Node/worker-1               Starting kubelet\nkube-system  Normal   NodeReady Node/worker-3               Node worker-3 status is now: NodeReady",
-    "resource_pressure_check": "## Node Pressure Conditions\nNo nodes reporting pressure conditions (MemoryPressure, DiskPressure, PIDPressure all False).\n\n## Node Resource Utilization\nNAME               CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%\ncontrol-plane-1    350m         8%     3000Mi          37%\nworker-1           1500m        25%    16000Mi         53%\nworker-2           900m         15%    9200Mi          46%\nworker-3           1100m        18%    8800Mi          44%\n\n## Nodes Exceeding Thresholds\nNone."
-  }
-]
+version https://git-lfs.github.com/spec/v1
+oid sha256:29429c8bec6bf001536a2b9e5549378064b12f9482cc036ae40aaacea2c80411
+size 7780
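Reviewer note: both JSON files are now Git LFS pointer files, so a checkout without the LFS objects fetched will contain the three-line pointer text instead of JSON. The sketch below shows one way a loader could detect that case. The function names `parse_lfs_pointer` and `is_lfs_pointer` are hypothetical and are not part of this repository. It relies only on the pointer layout visible in the diff: space-separated `key value` lines for `version`, `oid`, and `size`.

```python
LFS_V1 = "https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text):
    """Hypothetical helper: parse a Git LFS v1 pointer into a dict."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    if fields.get("version") != LFS_V1:
        raise ValueError("not a Git LFS v1 pointer")
    algo, _, digest = fields["oid"].partition(":")
    return {"oid_algo": algo, "oid": digest, "size": int(fields["size"])}


def is_lfs_pointer(text):
    """Return True if the text looks like an unfetched LFS pointer."""
    return text.startswith("version " + LFS_V1)


if __name__ == "__main__":
    # Pointer text copied from the offline_scenarios.json hunk above.
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:29429c8bec6bf001536a2b9e5549378064b12f9482cc036ae40aaacea2c80411\n"
        "size 7780\n"
    )
    print(is_lfs_pointer(pointer))
    print(parse_lfs_pointer(pointer))
```

A loader could call `is_lfs_pointer` before `json.loads` and raise a message suggesting `git lfs pull`, rather than surfacing a JSON decode error.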