Commit a334495

voltjia, Ziminli, zhangyue207, bitzyz, and gongchensu authored

feat: build core operator framework with multi-device backends, Python bindings, testing, and CI (#39)

Please check the [PR](#39) for more details. Co-authored-by: Ziminli <coollizimin@gmail.com>, Ziminli <70735843+Ziminli@users.noreply.github.com>, zhangyue <138768300+zhangyue207@users.noreply.github.com>, zhangyunze <93699316+bitzyz@users.noreply.github.com>, gongchensu <zhuyue_134@qq.com>, zhuyue <zhuyue@qiyuanlab.com>, zhangyue <zhangyue@qiyuanlab.com>

1 parent d4f0f72 commit a334495

File tree: 150 files changed, +12991 −0 lines changed

.ci/README.md

Lines changed: 386 additions & 0 deletions
# .ci — CI Images and Pipeline

```
.ci/
├── config.yaml          # Unified config (images, jobs, agent definitions)
├── utils.py             # Shared utilities (load_config, normalize_config, get_git_commit)
├── agent.py             # Runner Agent (scheduler, webhooks, remote dispatch)
├── build.py             # Image builder
├── run.py               # CI pipeline runner (Docker layer)
├── ci_resource.py       # GPU/memory detection and allocation
├── github_status.py     # GitHub Commit Status reporting
├── images/
│   ├── nvidia/Dockerfile
│   ├── iluvatar/Dockerfile
│   ├── metax/Dockerfile
│   ├── moore/Dockerfile
│   ├── cambricon/Dockerfile
│   └── ascend/Dockerfile
└── tests/               # Unit tests
    ├── conftest.py
    ├── test_agent.py
    ├── test_build.py
    ├── test_run.py
    ├── test_resource.py
    ├── test_github_status.py
    └── test_utils.py
```

**Prerequisites**: Docker, Python 3.10+, `pip install pyyaml`

---

## Configuration `config.yaml`

The config uses a **platform-centric** top-level structure. Each platform defines its image, platform-level defaults, and job list.
At load time, jobs are flattened to `{platform}_{job}` format (e.g., `nvidia_gpu`).

```yaml
repo:
  url: https://github.com/InfiniTensor/InfiniOps.git
  branch: master

github:
  status_context_prefix: "ci/infiniops"

agents:                  # Remote agent URLs (used by CLI for cross-machine dispatch)
  nvidia:
    url: http://nvidia-host:8080
  iluvatar:
    url: http://iluvatar-host:8080

platforms:
  nvidia:
    image:               # Image definition
      dockerfile: .ci/images/nvidia/
      build_args:
        BASE_IMAGE: nvcr.io/nvidia/pytorch:24.10-py3
    setup: pip install .[dev] --no-build-isolation
    jobs:
      gpu:               # Flattened as nvidia_gpu
        resources:
          ngpus: 1       # Scheduler auto-picks this many free GPUs
          memory: 32GB
          shm_size: 16g
          timeout: 3600
        stages:
          - name: test
            run: pytest tests/ -n 8 -v --tb=short --junitxml=/workspace/results/test-results.xml

  iluvatar:
    image:
      dockerfile: .ci/images/iluvatar/
      build_args:
        BASE_IMAGE: corex:qs_pj20250825
        APT_MIRROR: http://archive.ubuntu.com/ubuntu
        PIP_INDEX_URL: https://pypi.org/simple
    docker_args:         # Platform-level docker args, inherited by all jobs
      - "--privileged"
      - "--cap-add=ALL"
      - "--pid=host"
      - "--ipc=host"
    volumes:
      - /dev:/dev
      - /lib/firmware:/lib/firmware
      - /usr/src:/usr/src
      - /lib/modules:/lib/modules
    setup: pip install .[dev] --no-build-isolation
    jobs:
      gpu:               # Flattened as iluvatar_gpu
        resources:
          gpu_ids: "0"
          gpu_style: none   # CoreX: passthrough via --privileged + /dev mount
          memory: 32GB
          shm_size: 16g
          timeout: 3600
        stages:
          - name: test
            run: pytest tests/ -n 8 -v --tb=short --junitxml=/workspace/results/test-results.xml
```
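
The flattening step can be sketched roughly as follows. This is a minimal illustration, not the actual `normalize_config` in `.ci/utils.py`; the dict shapes are inferred from the YAML above:

```python
# Sketch of job flattening: platform-level defaults are merged into each
# job, and jobs are renamed to "{platform}_{job}". The real implementation
# in .ci/utils.py may differ in details.

def flatten_jobs(platforms: dict) -> dict:
    jobs = {}
    for platform, pconf in platforms.items():
        # Everything except the job list is a platform-level default.
        defaults = {k: v for k, v in pconf.items() if k != "jobs"}
        for job_name, jconf in pconf.get("jobs", {}).items():
            merged = {**defaults, **jconf}  # job fields override platform defaults
            jobs[f"{platform}_{job_name}"] = merged
    return jobs

platforms = {
    "nvidia": {
        "image": {"dockerfile": ".ci/images/nvidia/"},
        "setup": "pip install .[dev] --no-build-isolation",
        "jobs": {"gpu": {"resources": {"ngpus": 1}}},
    }
}
flat = flatten_jobs(platforms)
print(sorted(flat))  # ['nvidia_gpu']
```

Note how the flattened `nvidia_gpu` job inherits the platform's `image` and `setup` fields alongside its own `resources`.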

### Config hierarchy

| Level | Field | Description |
|---|---|---|
| **Platform** | `image` | Image definition (dockerfile, build_args) |
| | `image_tag` | Default image tag (defaults to `latest`) |
| | `docker_args` | Extra `docker run` args (e.g., `--privileged`) |
| | `volumes` | Extra volume mounts |
| | `setup` | In-container setup command |
| | `env` | Injected container env vars |
| **Job** | `resources.ngpus` | Number of GPUs; scheduler auto-picks free ones (NVIDIA only) |
| | `resources.gpu_ids` | Static GPU device IDs (e.g., `"0"`, `"0,2"`) |
| | `resources.gpu_style` | GPU passthrough: `nvidia` (default), `none`, or `mlu` |
| | `resources.memory` | Container memory limit |
| | `resources.shm_size` | Shared memory size |
| | `resources.timeout` | Max run time in seconds |
| | `stages` | Execution stage list |
| | Any platform field | Jobs can override any platform-level default |

---

## Image builder `build.py`

| Flag | Description |
|---|---|
| `--platform nvidia\|iluvatar\|metax\|moore\|ascend\|all` | Target platform (default: `all`) |
| `--commit` | Use a specific commit ref as the image tag (default: HEAD) |
| `--force` | Skip Dockerfile change detection |
| `--dry-run` | Print commands without executing |

```bash
# Build with change detection (skips if no Dockerfile changes)
python .ci/build.py --platform nvidia

# Force-build the Iluvatar image
python .ci/build.py --platform iluvatar --force

# Force-build all platforms
python .ci/build.py --force
```

Build artifacts are stored as local Docker image tags: `infiniops-ci/<platform>:<commit-hash>` and `:latest`.
Proxy and `no_proxy` env vars are forwarded from the host to `docker build` automatically.

> `--push` is reserved for future use; it requires a `registry` section in `config.yaml`.
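
One plausible shape for the change detection and tag naming, offered as a hypothetical sketch (the real `build.py` may implement this differently): build only when the platform's Dockerfile directory changed between the previously built ref and HEAD.

```python
# Hypothetical sketch of build.py's change detection: skip the build when
# `git diff` reports no changes under the platform's Dockerfile directory;
# --force would bypass this check entirely.
import subprocess

def dockerfile_changed(platform: str, since_ref: str, to_ref: str = "HEAD") -> bool:
    """Return True if .ci/images/<platform>/ changed between two git refs."""
    out = subprocess.run(
        ["git", "diff", "--name-only", f"{since_ref}..{to_ref}",
         "--", f".ci/images/{platform}/"],
        capture_output=True, text=True, check=True,
    ).stdout
    return bool(out.strip())

def image_tag(platform: str, commit: str) -> str:
    # Matches the documented naming: infiniops-ci/<platform>:<commit-hash>
    return f"infiniops-ci/{platform}:{commit}"
```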
---

## Pipeline runner `run.py`

The platform is auto-detected (via `nvidia-smi`/`ixsmi`/`mx-smi`/`mthreads-gmi`/`cnmon` on PATH); no manual specification is needed.

| Flag | Description |
|---|---|
| `--config` | Config file path (default: `.ci/config.yaml`) |
| `--job` | Job name: short (`gpu`) or full (`nvidia_gpu`). Defaults to all jobs for the current platform |
| `--branch` | Override the clone branch (default: config `repo.branch`) |
| `--stage` | Run only the specified stage |
| `--image-tag` | Override the image tag |
| `--gpu-id` | Override GPU device IDs (nvidia via `--gpus`, others via `CUDA_VISIBLE_DEVICES`) |
| `--test` | Override the pytest test path (e.g., `tests/test_gemm.py::test_gemm`) |
| `--results-dir` | Host directory mounted to `/workspace/results` inside the container |
| `--local` | Mount the current directory (read-only) instead of cloning from git |
| `--dry-run` | Print the docker command without executing |

```bash
# Simplest usage: auto-detect platform, run all jobs, use the config default branch
python .ci/run.py

# Short job name
python .ci/run.py --job gpu

# Full job name (backward compatible)
python .ci/run.py --job nvidia_gpu

# Run only the test stage, preview mode
python .ci/run.py --job gpu --stage test --dry-run

# Test local uncommitted changes without pushing
python .ci/run.py --local
```

Container execution flow: `git clone` → `checkout` → `setup` → stages.
With `--local`, the current directory is mounted read-only at `/workspace/repo` and copied to a writable temp directory inside the container before setup runs, so host files are never modified.
Proxy vars are forwarded from the host. Test results are written to `--results-dir`. Each run uses a clean environment (no host pip cache is mounted).
---

## Platform differences

| Platform | GPU passthrough | `gpu_style` | Base image | Detection tool |
|---|---|---|---|---|
| NVIDIA | `--gpus` (NVIDIA Container Toolkit) | `nvidia` (default) | `nvcr.io/nvidia/pytorch:24.10-py3` | `nvidia-smi` |
| Iluvatar | `--privileged` + `/dev` mount | `none` | `corex:qs_pj20250825` | `ixsmi` |
| MetaX | `--privileged` | `none` | `maca-pytorch:3.2.1.4-...` | `mx-smi` |
| Moore | `--privileged` | `none` | `vllm_musa:20251112_hygon` | `mthreads-gmi` |
| Cambricon | `--privileged` | `mlu` | `cambricon/pytorch:v1.25.3` | `cnmon` |
| Ascend | TODO | — | `ascend-pytorch:24.0.0` | — |

`gpu_style` controls the Docker device injection mechanism: `nvidia` uses `--gpus`, `none` uses `CUDA_VISIBLE_DEVICES` (or skips injection for Moore), `mlu` uses `MLU_VISIBLE_DEVICES`.
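
That mapping could look roughly like this (an illustrative sketch; `run.py`'s actual argument building may differ, and the Moore skip-injection case is not modeled):

```python
# Sketch of gpu_style -> docker-run device injection arguments.
def gpu_injection_args(gpu_style: str, gpu_ids: str) -> list[str]:
    if gpu_style == "nvidia":
        # NVIDIA Container Toolkit: e.g. --gpus '"device=0,2"'
        return ["--gpus", f'"device={gpu_ids}"']
    if gpu_style == "mlu":
        # Cambricon: select MLUs via env var inside the container
        return ["-e", f"MLU_VISIBLE_DEVICES={gpu_ids}"]
    if gpu_style == "none":
        # Privileged passthrough: restrict visibility via CUDA_VISIBLE_DEVICES
        return ["-e", f"CUDA_VISIBLE_DEVICES={gpu_ids}"]
    raise ValueError(f"unknown gpu_style: {gpu_style}")
```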
---

## Runner Agent `agent.py`

The Runner Agent supports CLI manual dispatch, GitHub webhook triggers, resource-aware dynamic scheduling, and cross-machine remote dispatch.

### CLI manual execution

```bash
# Run all jobs (dispatched to remote agents, using the config default branch)
python .ci/agent.py run

# Specify a branch
python .ci/agent.py run --branch feat/xxx

# Run a specific job
python .ci/agent.py run --job nvidia_gpu

# Filter by platform
python .ci/agent.py run --platform nvidia

# Preview mode
python .ci/agent.py run --dry-run
```

| Flag | Description |
|---|---|
| `--branch` | Test branch (default: config `repo.branch`) |
| `--job` | Specific job name |
| `--platform` | Filter jobs by platform |
| `--commit` | Override the commit SHA used for GitHub status reporting |
| `--image-tag` | Override the image tag |
| `--dry-run` | Preview mode |
### Webhook server

Deploy one Agent instance per platform machine (the platform is auto-detected). On each machine:

```bash
python .ci/agent.py serve --port 8080
```

Additional `serve` flags:

| Flag | Description |
|---|---|
| `--port` | Listen port (default: 8080) |
| `--host` | Listen address (default: `0.0.0.0`) |
| `--webhook-secret` | GitHub webhook signing secret (or `WEBHOOK_SECRET` env var) |
| `--api-token` | `/api/run` Bearer auth token (or `AGENT_API_TOKEN` env var) |
| `--results-dir` | Results directory (default: `ci-results`) |
| `--utilization-threshold` | GPU idle threshold percentage (default: 10) |

| Endpoint | Method | Description |
|---|---|---|
| `/webhook` | POST | GitHub webhook (push/pull_request) |
| `/api/run` | POST | Remote job trigger |
| `/api/job/{id}` | GET | Query job status |
| `/health` | GET | Health check |
| `/status` | GET | Queue + resource status |

The webhook supports `X-Hub-Signature-256` signature verification via `--webhook-secret` or the `WEBHOOK_SECRET` env var.
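
GitHub's `X-Hub-Signature-256` scheme is an HMAC-SHA256 of the raw request body keyed with the shared secret; a verifier boils down to a few lines (a generic sketch of the standard scheme, not necessarily the exact code in `agent.py`):

```python
# Verify GitHub's X-Hub-Signature-256 header: HMAC-SHA256 over the raw
# request body, keyed with the webhook secret, compared in constant time.
import hashlib
import hmac

def verify_signature(secret: str, body: bytes, signature_header: str) -> bool:
    expected = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_header)
```

`hmac.compare_digest` avoids leaking where the comparison diverges, which matters when the endpoint is reachable from untrusted networks.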
### Remote agent configuration

Configure agent URLs in `config.yaml`; the CLI automatically dispatches remote jobs to the corresponding agents:

```yaml
agents:
  nvidia:
    url: http://<nvidia-ip>:8080
  iluvatar:
    url: http://<iluvatar-ip>:8080
  metax:
    url: http://<metax-ip>:8080
  moore:
    url: http://<moore-ip>:8080
```
### Resource scheduling

The Agent auto-detects GPU utilization and system memory to dynamically determine parallelism:

- A GPU counts as available when its utilization is below the threshold (default 10%) and it has not already been allocated by the Agent.
- When resources are insufficient, jobs are queued automatically; completed jobs release their resources and trigger scheduling of queued tasks.
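
The availability rule can be sketched as a filter over `nvidia-smi`'s CSV query output (`nvidia-smi --query-gpu=index,utilization.gpu --format=csv,noheader,nounits`); the real parsing in `ci_resource.py` may differ:

```python
# Sketch: given the CSV output "index, utilization" per line, return GPUs
# that are idle (utilization below threshold) and not already allocated.
def free_gpus(smi_csv: str, allocated: set[int], threshold: int = 10) -> list[int]:
    free = []
    for line in smi_csv.strip().splitlines():
        index_s, util_s = (f.strip() for f in line.split(","))
        index, util = int(index_s), int(util_s)
        if util < threshold and index not in allocated:
            free.append(index)
    return free

sample = "0, 3\n1, 87\n2, 0\n"
print(free_gpus(sample, allocated={2}))  # [0]
```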
### GitHub Status

Set the `GITHUB_TOKEN` env var and the Agent will automatically report commit status:

- `pending` — job started
- `success` / `failure` — job completed

Status context format: `ci/infiniops/{job_name}`
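
Reporting goes through GitHub's commit status REST endpoint (`POST /repos/{owner}/{repo}/statuses/{sha}`). A sketch of building such a request, which may differ from the actual `github_status.py`:

```python
# Sketch of GitHub Commit Status reporting via the REST API.
import json
import urllib.request

def status_request(token: str, repo: str, sha: str,
                   job_name: str, state: str) -> urllib.request.Request:
    payload = {
        "state": state,                         # pending | success | failure
        "context": f"ci/infiniops/{job_name}",  # matches the documented context format
        "description": f"CI job {job_name}: {state}",
    }
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/statuses/{sha}",
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Accept": "application/vnd.github+json"},
        method="POST",
    )
# The caller would send it with urllib.request.urlopen(req).
```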
---

## Multi-machine deployment guide

### Per-platform setup

Each machine needs Docker installed, the platform runtime, and the base CI image built.

| Platform | Runtime check | Base image | Build command |
|---|---|---|---|
| NVIDIA | `nvidia-smi` (+ [Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html)) | `nvcr.io/nvidia/pytorch:24.10-py3` (public) | `python .ci/build.py --platform nvidia` |
| Iluvatar | `ixsmi` | `corex:qs_pj20250825` (import in advance) | `python .ci/build.py --platform iluvatar` |
| MetaX | `mx-smi` | `maca-pytorch:3.2.1.4-...` (import in advance) | `python .ci/build.py --platform metax` |
| Moore | `mthreads-gmi` | `vllm_musa:20251112_hygon` (import in advance) | `python .ci/build.py --platform moore` |
### Start Agent services

On each machine (the platform is auto-detected):

```bash
python .ci/agent.py serve --port 8080
```

### Configure remote agent URLs

On the trigger machine, add the `agents` section to `config.yaml` (see [Remote agent configuration](#remote-agent-configuration) above for the format).

### Trigger cross-platform tests

```bash
# Run all platform jobs at once (using the config default branch)
python .ci/agent.py run

# Preview mode (no actual execution)
python .ci/agent.py run --dry-run

# Run only a specific platform
python .ci/agent.py run --platform nvidia
```
### Optional configuration

#### GitHub Status reporting

Set the env var on all machines so each reports its own platform's test status:

```bash
export GITHUB_TOKEN=ghp_xxxxxxxxxxxx
```

#### API Token authentication

When agents are exposed on untrusted networks, enable token auth:

```bash
python .ci/agent.py serve --port 8080 --api-token <secret>
# Or: export AGENT_API_TOKEN=<secret>
```

#### GitHub Webhook auto-trigger

In the GitHub repo → Settings → Webhooks, add a webhook for each machine:

| Field | Value |
|---|---|
| Payload URL | `http://<machine-ip>:8080/webhook` |
| Content type | `application/json` |
| Secret | Must match `--webhook-secret` |
| Events | `push` and `pull_request` |

```bash
python .ci/agent.py serve --port 8080 --webhook-secret <github-secret>
# Or: export WEBHOOK_SECRET=<github-secret>
```
### Verification checklist

```bash
# 1. Dry-run each machine individually
for platform in nvidia iluvatar metax moore; do
  python .ci/agent.py run --platform $platform --dry-run
done

# 2. Health and resource checks
for ip in <nvidia-ip> <iluvatar-ip> <metax-ip> <moore-ip>; do
  curl http://$ip:8080/health
  curl http://$ip:8080/status
done

# 3. Cross-platform test
python .ci/agent.py run --branch master
```
