Commit 7f08499

benjamin-747 authored and benjamin.747 committed
feat(orion-scheduler): migrate into mega workspace with configurable paths
- Add orion-scheduler as a Cargo workspace member
- Move hardcoded paths (orion_source_dir, orion_binary_path, ssh_public_key_path) to target_config.json
- Add target_config.json.template as a configuration reference
- Remove custom_images/default_image config in favor of API-driven image parameters
- Add build-custom-image.sh with configurable OUTPUT_DIR via env var
- Update documentation (README, DESIGN, TESTING, ARTIFACT)
1 parent 73ed780 commit 7f08499

18 files changed

Lines changed: 3934 additions & 7 deletions

.gitignore

Lines changed: 4 additions & 1 deletion
@@ -66,4 +66,7 @@ orion-server/docker-compose.override.yml
 tools/**

 .libra
-.libraignore
+.libraignore
+
+# orion-scheduler local config (use target_config.json.template as base)
+orion-scheduler/target_config.json

Cargo.lock

Lines changed: 21 additions & 4 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 2 deletions
@@ -12,11 +12,12 @@ members = [
     "jupiter/callisto",
     "mono",
     "orion",
+    "orion-scheduler",
     "orion-server",
     "saturn",
     "vault",
 ]
-default-members = ["mono", "orion", "orion-server"]
+default-members = ["mono", "orion", "orion-server", "orion-scheduler"]
 exclude = ["tools/artifacts-compose-e2e"]
 resolver = "3"

@@ -72,7 +73,7 @@ lettre = { version = "0.11", default-features = false, features = [
     "tokio1",
     "tokio1-rustls",
     "ring",
-    "rustls-platform-verifier"
+    "rustls-platform-verifier",
 ] }
 #====
 sea-orm = "1.1.20"

orion-scheduler/ARTIFACT.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
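# Orion Build Artifact Distribution Design

Change how orion build artifacts are distributed: instead of copying them with `cp` from a local mega checkout, a GitHub Action in the mega repository builds them and pushes them to S3, and orion-scheduler pulls them from S3 on demand at each webhook. This reuses the OIDC + S3 channel of the existing [`.github/workflows/build-custom-image.yml`](.github/workflows/build-custom-image.yml). The pull model avoids exposing an inbound orion-scheduler endpoint to CI.

## Background: problems with the current setup

[`src/orion_deployer.rs`](src/orion_deployer.rs) reads two local paths directly:

```rust
// Paths are read from target_config.json
let orion_source_dir = config.orion_source_dir();
let orion_binary_path = config.orion_binary_path();
const ORION_TARGET_DIR: &str = "/home/orion/orion-runner";
```

This imposes three implicit constraints:

1. orion-scheduler must run on the **same machine** as the mega source tree
2. The orion binary is whatever a developer produced locally with `cargo build`, in **debug mode**: not reproducible, and about 500 MB
3. There is no notion of a version; every webhook uses whatever happens to be in the target directory at that moment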

## Overall architecture

```mermaid
flowchart LR
    Dev["mega push to main"] --> MegaAction["mega .github/workflows/build-orion.yml"]
    MegaAction -->|"cargo build --release"| Bundle["orion-bundle.tar.gz<br/>(binary + runner-config + systemd)"]
    Bundle -->|"aws s3 cp"| S3Releases["s3 releases/sha-XXX/"]
    MegaAction -->|"update manifest"| Latest["s3 orion/latest.json"]

    Webhook["POST /webhook"] --> Scheduler["orion-scheduler"]
    Scheduler -->|"1. GET latest.json"| Latest
    Scheduler -->|"2. compare sha vs cache"| Cache["/var/cache/orion-scheduler/artifacts/sha-XXX/"]
    Latest -.->|"3. download if changed"| Bundle
    Cache -->|"4. SFTP into VM"| VM["microVM"]
```

## 1. S3 layout (shared contract between the mega Action and orion-scheduler)

```
s3://${S3_BUCKET}/orion-scheduler/
├── debian-13-buck2.qcow2              # already exists
└── orion/
    ├── latest.json                    # mutable pointer
    └── releases/
        └── sha-<short8>/              # immutable, one per commit
            ├── orion-bundle.tar.gz    # binary + runner-config + systemd
            └── orion-bundle.tar.gz.sha256
```

`latest.json` schema:

```json
{
  "version": "0.1.0",
  "commit_sha": "abc123def456...",
  "commit_short": "abc123de",
  "built_at": "2026-05-20T08:00:00Z",
  "bundle_url": "https://${S3_BUCKET}.s3.${AWS_REGION}.amazonaws.com/orion-scheduler/orion/releases/sha-abc123de/orion-bundle.tar.gz",
  "bundle_sha256": "..."
}
```
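
The layout and the `bundle_url` field above imply a fixed mapping from a commit SHA to an object key and URL. The following is a minimal sketch of that mapping, assuming virtual-hosted-style URLs as in the schema. The helper names `short_sha`, `bundle_key`, and `object_url` are hypothetical and do not come from the codebase.

```rust
/// Length of the short commit id used in release prefixes (`sha-<short8>`).
const SHORT_SHA_LEN: usize = 8;

/// Shortens a full commit SHA to the 8-character form used in the layout.
fn short_sha(commit_sha: &str) -> &str {
    // `get` returns None when the input is shorter than 8 bytes;
    // in that case the whole string is used unchanged.
    commit_sha.get(..SHORT_SHA_LEN).unwrap_or(commit_sha)
}

/// Object key of a release bundle, relative to the bucket root.
fn bundle_key(commit_sha: &str) -> String {
    format!(
        "orion-scheduler/orion/releases/sha-{}/orion-bundle.tar.gz",
        short_sha(commit_sha)
    )
}

/// Virtual-hosted-style HTTPS URL for an object key.
fn object_url(bucket: &str, region: &str, key: &str) -> String {
    format!("https://{bucket}.s3.{region}.amazonaws.com/{key}")
}
```

Because release prefixes are immutable and keyed by commit, both the Action and orion-scheduler could derive the same key independently; only `latest.json` has to be re-read on each webhook.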

Internal layout of `orion-bundle.tar.gz` (matching the relative paths the deployer expects):

```
orion-bundle/
├── orion                      # release binary (about 50-100 MB, far smaller than the 500 MB debug build)
├── runner-config/
│   ├── run.sh
│   ├── scorpio.toml
│   ├── preflight.sh
│   ├── cleanup.sh
│   └── .env.prod
└── systemd/
    └── orion-runner.service
```
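
As an illustration of how this layout maps to paths on the host, here is a small sketch that derives the expected paths from an extraction root and reports which files are missing. It assumes the root passed in already points at the directory containing `orion`, `runner-config/`, and `systemd/`; `BundlePaths`, `bundle_paths`, and `missing_files` are hypothetical names.

```rust
use std::path::{Path, PathBuf};

/// Paths inside one extracted bundle (hypothetical helper type).
#[derive(Debug, PartialEq)]
struct BundlePaths {
    orion_binary: PathBuf,
    runner_config_dir: PathBuf,
    systemd_service: PathBuf,
}

/// Files expected under `runner-config/`, taken from the layout above.
const RUNNER_CONFIG_FILES: [&str; 5] =
    ["run.sh", "scorpio.toml", "preflight.sh", "cleanup.sh", ".env.prod"];

/// Maps an extraction root onto the bundle layout.
fn bundle_paths(root: &Path) -> BundlePaths {
    BundlePaths {
        orion_binary: root.join("orion"),
        runner_config_dir: root.join("runner-config"),
        systemd_service: root.join("systemd").join("orion-runner.service"),
    }
}

/// Returns every expected bundle file that does not exist under `root`.
fn missing_files(root: &Path) -> Vec<PathBuf> {
    let paths = bundle_paths(root);
    let mut expected = vec![paths.orion_binary.clone(), paths.systemd_service.clone()];
    for file in RUNNER_CONFIG_FILES {
        expected.push(paths.runner_config_dir.join(file));
    }
    expected.into_iter().filter(|p| !p.is_file()).collect()
}
```

A check like `missing_files` could run right after extraction, so an incomplete bundle is rejected before anything is copied into a VM.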

## 2. New Action in the mega repository (no change in this repository)

New file `.github/workflows/build-orion.yml` (in the mega repository). Its structure follows the OIDC setup in this repository's [`.github/workflows/build-custom-image.yml`](.github/workflows/build-custom-image.yml) and reuses the same `secrets.AWS_ROLE_ARN` / `secrets.AWS_REGION` / `secrets.S3_BUCKET`.

Triggers:

- `push` to `main` (automatic)
- `workflow_dispatch` (manual)

Main steps:

1. Check out mega
2. `cargo build --release -p orion`
3. `tar czf orion-bundle.tar.gz orion runner-config systemd`
4. `sha256sum orion-bundle.tar.gz > orion-bundle.tar.gz.sha256`
5. Configure AWS credentials via OIDC
6. `aws s3 cp` to `releases/sha-${SHORT_SHA}/`
7. Generate `latest.json` and `aws s3 cp` it to `orion/latest.json` (with `--cache-control "max-age=60"` to avoid long-lived caching)
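
Step 4 writes the checksum in the standard `sha256sum` line format: the hex digest, a space, a mode character (a space or `*`), then the file name. If orion-scheduler ever reads the `.sha256` file rather than `bundle_sha256` from the manifest, a parser along these lines could be used. This is a sketch with hypothetical function names, and it does not handle the backslash-escaped form `sha256sum` emits for unusual file names.

```rust
/// Parses one line of `sha256sum` output.
/// Returns the lowercase digest and the file name, or None if the line is malformed.
fn parse_sha256sum_line(line: &str) -> Option<(String, &str)> {
    // Drop the trailing line ending, if any.
    let line = line.trim_end_matches(|c: char| c == '\r' || c == '\n');
    let (digest, rest) = line.split_once(' ')?;
    // A SHA-256 digest is exactly 64 hexadecimal characters.
    if digest.len() != 64 || !digest.chars().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    // After the separating space comes the mode character: ' ' (text) or '*' (binary).
    let name = rest.strip_prefix(' ').or_else(|| rest.strip_prefix('*'))?;
    if name.is_empty() {
        return None;
    }
    Some((digest.to_ascii_lowercase(), name))
}

/// Compares two hex digests, ignoring letter case.
fn digest_matches(expected: &str, actual: &str) -> bool {
    expected.eq_ignore_ascii_case(actual)
}
```

Comparing case-insensitively keeps the check robust if one side produces uppercase hex.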

## 3. Changes on the orion-scheduler side

### 3.1 New module `src/artifact_fetcher.rs`

Responsibilities:

```rust
pub struct ArtifactManifest {
    pub commit_short: String,
    pub bundle_url: String,
    pub bundle_sha256: String,
    pub built_at: String,
}

pub struct ResolvedArtifact {
    pub root_dir: PathBuf,          // e.g. /var/cache/.../sha-abc123de/
    pub orion_binary: PathBuf,      // root_dir/orion
    pub runner_config_dir: PathBuf, // root_dir/runner-config
    pub systemd_service: PathBuf,   // root_dir/systemd/orion-runner.service
    pub manifest: ArtifactManifest,
}

pub async fn ensure_latest(config: &ArtifactConfig) -> Result<ResolvedArtifact>;
```

Behavior:

1. GET the `latest.json` URL with `reqwest`
2. Under `cache_dir`, check whether the `sha-<short>/` directory already exists and contains the `.complete` marker file
3. If not, download atomically (temp file → verify SHA256 → extract with `tar -xzf` → write the `.complete` marker)
4. Return a `ResolvedArtifact` to the caller
5. GC: keep the most recent N=3 versions and delete the rest
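
Steps 2 and 5 can be sketched with the standard library alone. The example below assumes versions are ordered by directory modification time, which the design does not specify; `version_dir`, `is_cached`, and `versions_to_evict` are hypothetical names.

```rust
use std::path::{Path, PathBuf};

/// Directory that would hold one cached version, e.g. `<cache_dir>/sha-abc123de`.
fn version_dir(cache_dir: &Path, commit_short: &str) -> PathBuf {
    cache_dir.join(format!("sha-{commit_short}"))
}

/// A version counts as cached only if its `.complete` marker exists, so a
/// download interrupted halfway is never mistaken for a usable artifact.
fn is_cached(cache_dir: &Path, commit_short: &str) -> bool {
    version_dir(cache_dir, commit_short).join(".complete").is_file()
}

/// Given `(directory name, modification time)` pairs, returns the names to
/// delete so that only the `keep` most recent versions remain.
fn versions_to_evict(mut versions: Vec<(String, u64)>, keep: usize) -> Vec<String> {
    // Newest first; ties are broken by name so the result is deterministic.
    versions.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    versions.into_iter().skip(keep).map(|(name, _)| name).collect()
}
```

Keeping the eviction decision a pure function over a listing makes it easy to test without touching the real cache; the caller would gather the listing and perform the deletions.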
130+
131+
### 3.2 `orion_deployer.rs` 改造
132+
133+
将硬编码常量 `ORION_SOURCE_DIR` 和二进制路径全部去掉,`deploy_orion_in_vm` 函数签名增加 `artifact: &ResolvedArtifact` 参数。内部所有路径替换为:
134+
135+
- `PathBuf::from(orion_source_dir).join("runner-config").join(file)``artifact.runner_config_dir.join(file)`
136+
- `PathBuf::from(orion_source_dir).join("systemd").join(...)``artifact.systemd_service.clone()`
137+
- `PathBuf::from(orion_binary_path)``artifact.orion_binary.clone()`
138+
139+
### 3.3 `handle_update` 加 Step 0
140+
141+
[`src/orion_deployer.rs`](src/orion_deployer.rs):在 Step 1(读 config)之后、Step 3(创建 VM)之前插入:
142+
143+
```rust
144+
let artifact = artifact_fetcher::ensure_latest(&config.artifact()).await?;
145+
info!("[orion-deploy] Using orion version: {} ({})",
146+
artifact.manifest.commit_short, artifact.manifest.built_at);
147+
```
148+
149+
然后传给 `deploy_orion_in_vm(&machine, &artifact)`
150+
151+
### 3.4 `target_config.json` 加配置块
152+
153+
```json
154+
{
155+
"log_dir": "...",
156+
"default_image": "buck2",
157+
"orion_artifact": {
158+
"manifest_url": "https://${S3_BUCKET}.s3.${REGION}.amazonaws.com/orion-scheduler/orion/latest.json",
159+
"cache_dir": "/var/cache/orion-scheduler/artifacts",
160+
"keep_versions": 3
161+
},
162+
"custom_images": { ... },
163+
"targets": { ... }
164+
}
165+
```
166+
167+
`config.rs` 增加 `ArtifactConfig` struct 和 `Config::artifact() -> &ArtifactConfig` 方法。
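
One possible shape for `ArtifactConfig` is sketched below. The field names follow the JSON block, and the defaults mirror the values shown there. The serde derives a real implementation would likely carry are left out to keep the example dependency-free, and `effective_keep_versions` is a hypothetical guard that the design does not mention.

```rust
use std::path::PathBuf;

/// Hypothetical shape of the `orion_artifact` block.
#[derive(Debug, Clone, PartialEq)]
struct ArtifactConfig {
    /// URL of `latest.json`; `None` when the block or the field is absent.
    manifest_url: Option<String>,
    /// Root directory for cached, extracted bundles.
    cache_dir: PathBuf,
    /// Number of cached versions GC should retain.
    keep_versions: usize,
}

impl Default for ArtifactConfig {
    fn default() -> Self {
        Self {
            manifest_url: None,
            cache_dir: PathBuf::from("/var/cache/orion-scheduler/artifacts"),
            keep_versions: 3,
        }
    }
}

impl ArtifactConfig {
    /// Retains at least one version, so GC can never delete the artifact in use
    /// even if `keep_versions` is configured as 0.
    fn effective_keep_versions(&self) -> usize {
        self.keep_versions.max(1)
    }
}
```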
168+
169+
### 3.5 `state.rs` 增加版本暴露
170+
171+
[`src/state.rs`](src/state.rs)`VmInfo` 增加:
172+
173+
```rust
174+
pub orion_version: Option<String>, // e.g. "sha-abc123de"
175+
```
176+
177+
[`src/handlers.rs`](src/handlers.rs)`/status` 响应里把这个字段透出,方便排障。
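
A small sketch of how the field might be filled, assuming the label is the manifest's `commit_short` prefixed with `sha-` (as in the comment above) and that the local fallback leaves it unset. The `VmInfo` here is a stand-in with a single field, and `with_artifact` is a hypothetical constructor.

```rust
/// Stand-in for the real `VmInfo`; only the new field is shown.
#[derive(Debug, Default, PartialEq)]
struct VmInfo {
    orion_version: Option<String>,
}

/// Formats the version label used in cache directories and `/status`.
fn version_label(commit_short: &str) -> String {
    format!("sha-{commit_short}")
}

impl VmInfo {
    /// `commit_short` is `None` when no manifest was involved,
    /// for example when deploying from local paths.
    fn with_artifact(commit_short: Option<&str>) -> Self {
        Self {
            orion_version: commit_short.map(version_label),
        }
    }
}
```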
178+
179+
### 3.6 Cargo 依赖
180+
181+
[`Cargo.toml`](Cargo.toml) 增加:
182+
183+
- `reqwest = { version = "0.12", features = ["rustls-tls", "stream"] }`(HTTP 下载,rustls 避免 OpenSSL 依赖)
184+
- `tar = "0.4"`(解压)
185+
- `flate2 = "1.0"`(gzip)
186+
187+
`sha2` 已有,不动。
188+
189+
### 3.7 向后兼容(escape hatch)
190+
191+
`orion_artifact` 字段缺失或 `manifest_url` 为空 → 回退到旧行为,从 `target_config.json` 指定的本地路径读,保留开发者本地迭代体验。`artifact_fetcher::ensure_latest` 在该情况下返回一个指向本地路径的 `ResolvedArtifact`,下游代码无差别。
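
The fallback rule reduces to a single decision on `manifest_url`. A minimal sketch follows, with a hypothetical `ArtifactSource` enum and `select_source` function; treating a whitespace-only URL as empty is an assumption that goes slightly beyond the text.

```rust
/// Where `ensure_latest` would take the artifact from (hypothetical type).
#[derive(Debug, PartialEq)]
enum ArtifactSource<'a> {
    /// Fetch `latest.json` from this URL and use the S3 bundle.
    Remote { manifest_url: &'a str },
    /// Use the local paths from `target_config.json`.
    LocalPaths,
}

/// Picks the remote path only when a non-empty manifest URL is configured.
fn select_source(manifest_url: Option<&str>) -> ArtifactSource<'_> {
    match manifest_url.map(str::trim) {
        Some(url) if !url.is_empty() => ArtifactSource::Remote { manifest_url: url },
        _ => ArtifactSource::LocalPaths,
    }
}
```

Making the choice in one place means `deploy_orion_in_vm` only ever sees a `ResolvedArtifact`, whichever branch produced it.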
192+
193+
## 4. S3 访问凭证(host 侧)
194+
195+
orion-scheduler 拉 S3 有三种方案,按从简到复杂排:
196+
197+
- **A. orion 这个 prefix 设为 public-read**:host 用纯 HTTPS GET,零凭证;用 `bundle_sha256` 校验完整性。**推荐这个**,因为 orion 是开源项目,binary 不敏感。
198+
- B. 给 host 一个只读 IAM key 写到 `~/.aws/credentials`
199+
- C. CI 生成 pre-signed URL 写进 `latest.json`(TTL 7 天,到期需重发)
200+
201+
实施时先按 A 走,bucket policy 给 `orion-scheduler/orion/*``s3:GetObject` Allow `Principal: *`
202+
203+
## 5. 实施顺序建议
204+
205+
1. 先在 orion-scheduler 这边加 `artifact_fetcher` 模块和配置开关,**带本地 FS fallback** —— 可以单独 merge、零行为变更
206+
2. 在 mega 仓库加 build-orion Action,跑通一次 S3 上传,生成第一个 `latest.json`
207+
3. 在某个 target 的 target_config 里配上 `manifest_url`,触发一次 webhook 验证
208+
4. 全量切换:所有 target 都用 manifest_url,删掉本地 FS fallback(可选,建议保留 fallback 长期支持本地开发)
209+
210+
## 6. 待确认的小决策
211+
212+
| 决策点 | 默认选择 | 备选 |
213+
| --- | --- | --- |
214+
| Action 编译模式 | `--release`(二进制小一个数量级) | 保留 debug 选项给本地 fallback |
215+
| 缓存目录权限 | `/var/cache/orion-scheduler/`(root 跑无问题) | 改 systemd user 时配 tmpfiles.d |
216+
| GC 策略 | 保留最近 N=3 版本 | 按时间维度(如 7 天) |
217+
| manifest 拉取失败 | fail-fast 返回 webhook 错误 | 降级用最后一次 cache |
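
The last row of the table can be modeled as a small policy switch. The sketch below is illustrative only: `OnManifestFailure` and `resolve_after_failure` are hypothetical names, and the cached version is represented by its directory name.

```rust
/// What to do when `latest.json` cannot be fetched (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum OnManifestFailure {
    /// Default in the table: surface the error to the webhook caller.
    FailFast,
    /// Alternative: deploy the most recent cached version instead.
    UseLastCached,
}

/// Decides which cached version, if any, to deploy after a failed manifest fetch.
fn resolve_after_failure<'a>(
    policy: OnManifestFailure,
    last_cached: Option<&'a str>,
    error: &str,
) -> Result<&'a str, String> {
    match (policy, last_cached) {
        (OnManifestFailure::UseLastCached, Some(version)) => Ok(version),
        (OnManifestFailure::UseLastCached, None) => {
            Err(format!("manifest fetch failed and the cache is empty: {error}"))
        }
        (OnManifestFailure::FailFast, _) => Err(format!("manifest fetch failed: {error}")),
    }
}
```

Even with the degraded policy, an empty cache still has to fail, so the webhook caller always needs to handle the error case.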
218+
219+
## 7. 实施 TODO
220+
221+
- [ ] 在 mega 仓库新增 `.github/workflows/build-orion.yml`:release 编译 + 打包 + S3 上传 + 写 `latest.json`
222+
- [ ] 确定 S3 布局:`orion-scheduler/orion/{latest.json, releases/sha-XXX/orion-bundle.tar.gz}`
223+
- [ ] [`target_config.json`](target_config.json) 增加 `orion_artifact` 配置块,并在 `config.rs` 中加对应的 `ArtifactConfig` + accessor
224+
- [ ] 新增 `src/artifact_fetcher.rs`:GET `latest.json` → 比对 `cache_dir` 里的 sha → 按需下载解压 → SHA256 校验 → 返回 `ResolvedArtifact`
225+
- [ ] [`src/orion_deployer.rs`](src/orion_deployer.rs) 去掉硬编码常量,`deploy_orion_in_vm` 接受 `ResolvedArtifact` 参数
226+
- [ ] `handle_update` 在 Step 1 后插入 `artifact_fetcher::ensure_latest`,传给下游
227+
- [ ] [`src/state.rs`](src/state.rs)`VmInfo` 增加 `orion_version` 字段,`/status` 端点透出
228+
- [ ] [`Cargo.toml`](Cargo.toml) 加入 `reqwest` (rustls), `tar`, `flate2`
229+
- [ ] `manifest_url` 缺失时 fallback 到 `target_config.json` 指定的本地路径,保留本地开发体验
230+
- [ ] 给 S3 bucket 的 `orion-scheduler/orion/*` 加 public-read 策略(或选取其他凭证方案)
231+
- [ ] 更新 [`TESTING.md`](TESTING.md) / [`DESIGN.md`](DESIGN.md) 说明新的 artifact 流和调试方法(如何查看当前版本、如何手动刷缓存)

orion-scheduler/Cargo.toml

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
[package]
name = "orion-scheduler"
version = "0.1.0"
edition = "2024"

[dependencies]
anyhow = { workspace = true }
qlean = { workspace = true }
tokio = { workspace = true }
axum = { workspace = true }
tower-http = { workspace = true, features = ["cors", "trace"] }
serde = { workspace = true }
serde_json = { workspace = true }
tracing = { workspace = true }
tracing-subscriber = { workspace = true, features = ["env-filter"] }
futures-util = { workspace = true }
async-stream = { workspace = true }

[dev-dependencies]
anyhow = { workspace = true }
