Commit c8e7d99

Updated the redundant exit code extraction in _check_pods_health and added a health_role pod check in create_engine
1 parent a21efe9 commit c8e7d99

3 files changed

Lines changed: 90 additions & 13 deletions

areal/infra/scheduler/kubernetes.py

Lines changed: 8 additions & 13 deletions
@@ -585,26 +585,20 @@ def _check_pods_health(self, role: str) -> None:
                     -1,
                     f"Pod {name} {waiting_reason}\n{self._pod_diagnostics(role)}",
                 )
-            exit_code = int(
-                _obj_get(
-                    terminated,
-                    "exit_code",
-                    _obj_get(terminated, "exitCode", 0),
-                )
-            )
-            if terminated and exit_code != 0:
+            if terminated:
                 exit_code = int(
                     _obj_get(
                         terminated,
                         "exit_code",
                         _obj_get(terminated, "exitCode", -1),
                     )
                 )
-                raise WorkerFailedError(
-                    f"{role}/*",
-                    exit_code,
-                    f"Pod {name} exited\n{self._pod_diagnostics(role)}",
-                )
+                if exit_code != 0:
+                    raise WorkerFailedError(
+                        f"{role}/*",
+                        exit_code,
+                        f"Pod {name} exited\n{self._pod_diagnostics(role)}",
+                    )
             if phase == "Failed":
                 raise WorkerFailedError(
                     f"{role}/*",
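Review note: the hunk above can be read as the following standalone sketch. It is not the scheduler's code. `WorkerFailedError` and `_obj_get` are re-declared here as hypothetical stand-ins, and `_obj_get` is assumed to read a dict key or an attribute and fall back to the default; `check_terminated` is an illustrative name. Under that assumption, the refactor also changes one edge case: a non-empty `terminated` record that carries neither `exit_code` nor `exitCode` now resolves to `-1` and raises, whereas the old guard resolved it to `0` and skipped the raise.

```python
class WorkerFailedError(RuntimeError):
    """Hypothetical stand-in for the scheduler's WorkerFailedError."""

    def __init__(self, worker_id, exit_code, message):
        super().__init__(message)
        self.worker_id = worker_id
        self.exit_code = exit_code


def _obj_get(obj, key, default=None):
    """Assumed behavior: read `key` from a dict or attribute, else `default`."""
    if obj is None:
        return default
    if isinstance(obj, dict):
        return obj.get(key, default)
    return getattr(obj, key, default)


def check_terminated(name, terminated):
    """Mirror of the new control flow: extract the exit code only once,
    and only when a terminated state is present."""
    if terminated:
        exit_code = int(
            _obj_get(
                terminated,
                "exit_code",
                _obj_get(terminated, "exitCode", -1),
            )
        )
        if exit_code != 0:
            raise WorkerFailedError(name, exit_code, f"Pod {name} exited")
```

A running container (`terminated` is `None`) and a clean exit (`{"exitCode": 0}`) both pass silently; `{"exit_code": 137}` raises with `exit_code == 137`.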
@@ -1116,6 +1110,7 @@ async def create_engine(
         port = int(wi.worker.worker_ports[0])
         url = f"http://{format_hostport(wi.worker.ip, port)}/create_engine"

+        self._check_pods_health(health_role)
         try:
             timeout = aiohttp.ClientTimeout(total=300.0)
             async with aiohttp.ClientSession(
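Review note: the added call appears to make `create_engine` fail fast, so that an already-crashed pod surfaces as an error before the HTTP request starts its 300-second timeout. A minimal sketch of that pattern follows, using only the standard library. `create_engine_sketch`, `check_health`, and `send_request` are hypothetical names for illustration, and `WorkerFailedError` is a stand-in for the real exception class.

```python
import asyncio


class WorkerFailedError(RuntimeError):
    """Hypothetical stand-in for the scheduler's WorkerFailedError."""


async def create_engine_sketch(check_health, send_request, timeout=300.0):
    # Fail fast: a synchronous health check runs before any network I/O,
    # so a dead pod raises here instead of waiting out the timeout.
    check_health()
    # Only a healthy target reaches the (potentially slow) request.
    return await asyncio.wait_for(send_request(), timeout=timeout)


calls = []


def unhealthy():
    raise WorkerFailedError("pod exited")


def healthy():
    return None


async def request():
    calls.append("sent")
    return "ok"
```

With `unhealthy` as the check, `asyncio.run(create_engine_sketch(unhealthy, request))` raises before `request` is ever awaited; with `healthy`, the request is sent and its result returned.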

docs/en/tutorial/kubernetes.md

Lines changed: 41 additions & 0 deletions
@@ -40,6 +40,47 @@ schedulers:
 - Provide a service account with permission to create, patch, list, and delete
   `Services`, `StatefulSets`, `Pods`, pod logs, and pod events in the target namespace.

+## RBAC Permissions
+
+If you are running the AReaL controller with a service account other than a cluster admin, you must provide a `Role` (or `ClusterRole`) with sufficient permissions.
+
+Below is a minimal `Role` and `RoleBinding` example for a namespace named `areal`:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  namespace: areal
+  name: areal-scheduler
+rules:
+  - apiGroups: [""]
+    resources: ["services", "pods"]
+    verbs: ["get", "list", "watch", "create", "patch", "delete"]
+  - apiGroups: ["apps"]
+    resources: ["statefulsets"]
+    verbs: ["get", "list", "watch", "create", "patch", "delete"]
+  - apiGroups: [""]
+    resources: ["pods/log"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["list"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  namespace: areal
+  name: areal-scheduler-binding
+subjects:
+  - kind: ServiceAccount
+    name: default
+    namespace: areal
+roleRef:
+  kind: Role
+  name: areal-scheduler
+  apiGroup: rbac.authorization.k8s.io
+```
+
 ## Minimal Launch

 Use the normal training entrypoint and override the scheduler type:
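Review note: one way to sanity-check the `Role` in this hunk against the API calls the scheduler makes is a small lookup over its rules. The sketch below copies the rules into Python dicts and answers whether a given (API group, resource, verb) triple is granted. `ROLE_RULES` and `is_allowed` are hypothetical names, and the matching is deliberately simplified: it ignores `*` wildcards, `resourceNames`, and `ClusterRole` aggregation, so it is not a substitute for the cluster's own authorization check.

```python
FULL_VERBS = ["get", "list", "watch", "create", "patch", "delete"]

# Rules transcribed from the Role in the hunk above.
ROLE_RULES = [
    {"apiGroups": [""], "resources": ["services", "pods"], "verbs": FULL_VERBS},
    {"apiGroups": ["apps"], "resources": ["statefulsets"], "verbs": FULL_VERBS},
    {"apiGroups": [""], "resources": ["pods/log"], "verbs": ["get"]},
    {"apiGroups": [""], "resources": ["events"], "verbs": ["list"]},
]


def is_allowed(rules, api_group, resource, verb):
    """Return True if any single rule grants the exact group/resource/verb."""
    return any(
        api_group in rule["apiGroups"]
        and resource in rule["resources"]
        and verb in rule["verbs"]
        for rule in rules
    )
```

For example, `is_allowed(ROLE_RULES, "apps", "statefulsets", "create")` is true, while `is_allowed(ROLE_RULES, "", "events", "watch")` is false because the role grants only `list` on events.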

docs/zh/tutorial/kubernetes.md

Lines changed: 41 additions & 0 deletions
@@ -37,6 +37,47 @@ worker 的发现与 RPC 流程和 local / Slurm scheduler 保持一致:
 - service account 需要有创建、更新、列出和删除 `Services`、`StatefulSets`、
   `Pods`,以及读取 pod logs/events 的权限。

+## RBAC 权限
+
+如果你在一个没有集群管理员权限的 service account 下运行 AReaL controller,你需要为其配置一个具有足够权限的 `Role`(或 `ClusterRole`)。
+
+以下是在名为 `areal` 的 namespace 下配置 `Role`、`RoleBinding` 的示例:
+
+```yaml
+apiVersion: rbac.authorization.k8s.io/v1
+kind: Role
+metadata:
+  namespace: areal
+  name: areal-scheduler
+rules:
+  - apiGroups: [""]
+    resources: ["services", "pods"]
+    verbs: ["get", "list", "watch", "create", "patch", "delete"]
+  - apiGroups: ["apps"]
+    resources: ["statefulsets"]
+    verbs: ["get", "list", "watch", "create", "patch", "delete"]
+  - apiGroups: [""]
+    resources: ["pods/log"]
+    verbs: ["get"]
+  - apiGroups: [""]
+    resources: ["events"]
+    verbs: ["list"]
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: RoleBinding
+metadata:
+  namespace: areal
+  name: areal-scheduler-binding
+subjects:
+  - kind: ServiceAccount
+    name: default
+    namespace: areal
+roleRef:
+  kind: Role
+  name: areal-scheduler
+  apiGroup: rbac.authorization.k8s.io
+```
+
 ## 最小启动命令

 使用正常的训练入口,并覆盖 scheduler 类型:
