diff --git a/docs/proposals/job-metrics-reporting-config.md b/docs/proposals/job-metrics-reporting-config.md
new file mode 100644
index 0000000000..01fd693037
--- /dev/null
+++ b/docs/proposals/job-metrics-reporting-config.md
@@ -0,0 +1,300 @@
+## Job-level metrics reporting configuration: integrating Harbor ml_tracker
+
+### Background
+
+ROCK's Job system needs to report runtime metrics in both **Bench evaluation** and **RL training** scenarios. The Harbor framework already ships an `ml_tracker` module that can report the following key metrics:
+
+| Category | Metrics |
+|------|------|
+| **Reward** | `reward/*` (each reward key emitted by the verifier) |
+| **Duration** | `total_duration_sec`, `agent_duration_sec` |
+| **Token** | `input_tokens`, `output_tokens`, `cache_tokens`, `cost_usd` |
+| **RL training** | `logprobs_mean`, `entropy`, `loss`, `kl_divergence`, `advantage`, `grad_norm`, `clip_fraction`, `value_loss`, `explained_variance` |
+| **Running** | `pass_rate`, `avg_reward`, `error_rate` |
+| **Summary** | `final_pass_rate`, `final_avg_reward`, `final_error_rate`, `total_trials`, `total_errors`, `total_duration_sec` |
+
+However, ml_tracker is currently enabled only by the presence of the **environment variable** `ROCK_API_KEY` (a hard-coded check), so users cannot use the Job config to declare whether it is enabled or to pass in hyperparameters.
+
+**Before the change** (Harbor `job.py`):
+
+```python
+# Hard-coded environment variable check; no configuration entry point
+if os.environ.get("ROCK_API_KEY"):
+    self._tracker = MLTrackerFactory.create(...)
+```
+
+---
+
+### Goal
+
+Add a **`tracking`** field to `EnvironmentConfig` so that users can declaratively enable Harbor's built-in ml_tracker in the `environment` section of the YAML and report Bench/RL training metrics.
+
+**Design principles**:
+- **The field name is not tied to a specific SDK**: use `tracking` (rather than `ml_tracker`) so the config field is not coupled to a package name
+- **Reuse Harbor's existing capability**: no parallel implementation; the underlying calls still go to Harbor's `ml_tracker` module
+- All fields are optional, and zero configuration stays backward compatible (disabled by default, preserving current behavior)
+- Does not intrude on `HarborJobConfig.metrics: list[MetricConfig]` (that is the aggregation strategy for evaluation results and has different semantics)
+
+---
+
+### Design (implemented)
+
+#### Model definitions
+
+**ROCK side** (`rock/sdk/envhub/config.py`):
+
+`TrackingConfig` is defined alongside `EnvironmentConfig` and is used as a second-level field of `EnvironmentConfig`:
+
+```python
+class TrackingConfig(BaseModel):
+    """Experiment tracking configuration.
+
+    When present and enabled, activates Harbor's built-in ml_tracker to report
+    per-trial metrics (reward, duration, token usage, RL training signals)
+    and a final job-level summary.
+    """
+
+    enabled: bool = Field(
+        default=True,
+        description="Whether to enable experiment tracking for this job.",
+    )
+    params: dict[str, Any] = Field(
+        default_factory=dict,
+        description=(
+            "User-defined hyperparameters merged into ml_tracker.init(config=...). "
+            "Combined with auto-collected job metadata (agents, datasets, etc.)."
+        ),
+    )
+
+class EnvironmentConfig(SandboxConfig):
+    uploads: list[tuple[str, str]] = Field(default_factory=list)
+    env: dict[str, str] = Field(default_factory=dict)
+    oss_mirror: OssMirrorConfig | None = None
+    tracking: TrackingConfig | None = Field(
+        default=None,
+        description="Experiment tracking configuration. None = disabled (default).",
+    )
+```
+
+**Harbor side** (`harbor/tracker/config.py`; the module directory has been renamed from `ml_tracker/` to `tracker/`):
+
+```python
+class TrackingConfig(BaseModel):
+    enabled: bool = Field(default=True)
+    params: dict[str, Any] = Field(
+        default_factory=dict,
+        description="User-defined hyperparameters merged into tracker init config.",
+    )
+```
+
+On the Harbor side, `EnvironmentConfig` (`harbor/models/trial/config.py`) likewise carries it as a second-level field:
+
+```python
+class EnvironmentConfig(BaseModel):
+    ...
+    tracking: TrackingConfig | None = Field(
+        default=None,
+        description="Experiment tracking configuration. None = disabled (default).",
+    )
+```
+
+> **Naming decisions**:
+> - User-facing config field name = `tracking` (not tied to a specific SDK; can be extended to other trackers later)
+> - Sub-field = `params` (rather than `config`, to avoid the redundant-sounding `tracking.config`)
+> - The config class is named `TrackingConfig` on both the ROCK and Harbor sides
+> - Harbor's internal module directory is renamed from `ml_tracker/` to `tracker/`; the implementation class `MLTrackerImpl` keeps its name (because it still uses the `ml_tracker` SDK underneath)
+
+#### Position in the config hierarchy
+
+`tracking` lives under `EnvironmentConfig` as a second-level field rather than as a top-level field of `JobConfig`. Reasons:
+
+- **Same level as `oss_mirror`**: `tracking` and `oss_mirror` are both environment-level capability settings, so keeping them under environment is more cohesive
+- **Harbor's `EnvironmentConfig` naturally holds this kind of setting**: the environment section of the Harbor YAML is the natural home for tracking information
+- **Simpler serialization**: when `to_harbor_yaml()` serializes environment via `to_harbor_environment()`, tracking is carried along automatically
+
+```python
+# JobConfig does not expose tracking directly; it is reached through environment
+class JobConfig(BaseModel):
+    environment: EnvironmentConfig = Field(default_factory=EnvironmentConfig)
+    job_name: str | None = None
+    namespace: str | None = None
+    experiment_id: str | None = None
+    labels: dict[str, str] = Field(default_factory=dict)
+    timeout: int = 7200
+```
+
+> **Defaults to `None`**: omitting `tracking` behaves the same as before the change (disabled). Writing `environment.tracking: {}` explicitly is enough to enable it.
+
+---
+
+#### YAML configuration examples
+
+**Minimal enablement** (all defaults; agent/dataset information is collected automatically):
+
+```yaml
+experiment_id: exp-rl-001
+job_name: qwen-72b-swe-bench
+environment:
+  tracking: {}
+```
+
+**Recording extra hyperparameters** (RL training scenario):
+
+```yaml
+experiment_id: exp-rl-002
+job_name: rl-grpo-run-3
+environment:
+  tracking:
+    params:
+      model: qwen-72b-instruct
+      algorithm: GRPO
+      learning_rate: 1.0e-5
+      batch_size: 64
+      kl_coeff: 0.05
+      num_rollouts: 4
+```
+
+**Explicitly disabled** (overrides a team default config):
+
+```yaml
+environment:
+  tracking:
+    enabled: false
+```
+
+**Omitting `tracking`** (default behavior, equivalent to disabled):
+
+```yaml
+experiment_id: exp-001
+job_name: my-job
+environment: {}
+# tracking absent → None → disabled
+```
+
+---
+
+### Field reference
+
+| Field path | Type | Default | Description |
+|----------|------|--------|------|
+| **`environment.tracking`** | `TrackingConfig \| None` | `None` | Master switch. `None` = disabled (backward compatible); `{}` = enabled. |
+| **`environment.tracking.enabled`** | `bool` | `True` | Fine-grained switch. Use `tracking: { enabled: false }` to disable explicitly. |
+| **`environment.tracking.params`** | `dict[str, Any]` | `{}` | User-defined hyperparameters, merged with the auto-collected job metadata and passed to `ml_tracker.init(config=...)`. |
+
+Intent behind the two-level switch:
+- `tracking` omitted / `null` → disabled (backward compatible, the default path)
+- `tracking: {}` → enabled (`enabled` defaults to `True`)
+- `tracking: { enabled: false }` → explicitly disabled (a team config template can keep a `tracking` section in place but switched off for now)
+
+---
+
+### Reported metrics in detail
+
+Once tracking is enabled, the Harbor framework reports automatically at the following points:
+
+**At the end of each Trial** (`TrialEvent.END` hook):
+
+```
+reward/*            - reward values emitted by the verifier (each key reported separately)
+total_duration_sec  - total Trial duration
+agent_duration_sec  - Agent execution time
+input_tokens        - number of input tokens
+output_tokens       - number of output tokens
+cache_tokens        - number of cached tokens
+cost_usd            - inference cost (USD)
+logprobs_mean       - mean of rollout log probabilities (RL)
+entropy             - policy entropy, reported as -logprobs_mean (RL)
+loss                - training loss (RL, from agent metadata)
+kl_divergence       - KL divergence (RL)
+advantage           - advantage (RL)
+grad_norm           - gradient norm (RL)
+clip_fraction       - PPO clip fraction (RL)
+value_loss          - value function loss (RL)
+explained_variance  - explained variance (RL)
+pass_rate           - pass rate so far (running)
+avg_reward          - average reward so far (running)
+error_rate          - error rate so far (running)
+```
+
+**At the end of the Job** (`report_job_summary`):
+
+```
+final_pass_rate     - final pass rate
+final_avg_reward    - final average reward
+final_error_rate    - final error rate
+total_trials        - total number of trials
+total_errors        - total number of errors
+total_duration_sec  - total Job duration
+```
+
+---
+
+### Relationship to the existing configuration
+
+```
+JobConfig
+├── environment: EnvironmentConfig
+│   ├── uploads, env, ...                  ← existing: environment-level config
+│   ├── oss_mirror: OssMirrorConfig        ← existing: OSS mirror config
+│   └── tracking: TrackingConfig | None    ← NEW: experiment tracking config
+├── labels: dict[str, str]                 ← existing: Job-level labels
+└── ...
+
+HarborJobConfig(JobConfig)
+├── environment.tracking (inherited)       ← NEW: inherited through environment
+├── metrics: list[MetricConfig]            ← existing: how evaluation results are aggregated (sum/mean/max)
+└── ...
+
+BashJobConfig(JobConfig)
+├── environment.tracking (inherited)       ← NEW: inherited through environment
+└── ...
+```
+
+**Key distinction**:
+- **`environment.tracking`** (new) = "experiment tracking: **how each Trial's business metrics are recorded**" (reward/token/RL signals → ml_tracker SDK)
+- **`metrics`** (already on HarborJobConfig) = "evaluation aggregation: **how the results of multiple Trials are aggregated into a final score**" (mean/sum/max)
+
+The two are semantically orthogonal and do not conflict. `tracking` sits at the same level as `oss_mirror`; both are environment-level capability settings.
+
+---
+
+### Changed files
+
+#### ROCK side
+
+| File | Change |
+|------|------|
+| `rock/sdk/envhub/config.py` | Add the `TrackingConfig` class and the `EnvironmentConfig.tracking` field |
+
+#### Harbor side
+
+| File | Change |
+|------|------|
+| `harbor/tracker/config.py` | New module, `TrackingConfig` (`enabled` + `params`) |
+| `harbor/tracker/base.py` | `BaseMLTracker` → `BaseTracker` |
+| `harbor/tracker/tracker.py` | `MLTrackerImpl` now inherits from `BaseTracker`; logic unchanged |
+| `harbor/tracker/factory.py` | `MLTrackerFactory` gains a `tracker_config` parameter and merges the user's `params` |
+| `harbor/models/trial/config.py` | `EnvironmentConfig` gains the `tracking: TrackingConfig \| None` field |
+| `harbor/job.py` | Reads the config from `self.config.environment.tracking`; `tracking.enabled` controls enablement (no more hard-coded env var check) |
+| ~~`harbor/ml_tracker/`~~ | Whole directory renamed to `harbor/tracker/` |
+
+#### Config propagation path
+
+```
+User YAML
+  → rock HarborJobConfig.environment.tracking (parse + validate)
+    → to_harbor_yaml() → environment section carries tracking
+      → harbor EnvironmentConfig.tracking (deserialize)
+        → harbor Job.__init__ reads from self.config.environment.tracking
+          → MLTrackerFactory.create(tracker_config=...)
+```
+
+---
+
+### Backward compatibility
+
+- `EnvironmentConfig.tracking` defaults to `None`; omitting it is equivalent to the behavior before the change (disabled).
+- The `SandboxConfig` base class is unaffected (`tracking` is added only at the `EnvironmentConfig` layer).
+- `BashJobConfig` / `HarborJobConfig`: tracking is reached indirectly through `environment`, so the `extra="forbid"` issue does not arise.
+- `_HarborJobFields`: when `tracking` in environment is `None`, it is filtered out during serialization and does not appear in the Harbor YAML.
+- `ROCK_API_KEY` environment variable: no longer used to control enablement (now controlled by `tracking.enabled`). `ROCK_API_KEY` is used only in `MLTrackerImpl.__init__` as the credential for `ml_tracker.login(key=...)`; when it is unset, `None` is passed (the SDK handles the auth fallback itself).
diff --git a/rock/sdk/bench/models/trial/config.py b/rock/sdk/bench/models/trial/config.py
index 05f26ce20d..d4a6982859 100644
--- a/rock/sdk/bench/models/trial/config.py
+++ b/rock/sdk/bench/models/trial/config.py
@@ -7,7 +7,7 @@
 
 from rock.sdk.bench.models.environment_type import EnvironmentType
 from rock.sdk.envhub import EnvironmentConfig as _EnvConfig
-from rock.sdk.envhub.config import OssMirrorConfig
+from rock.sdk.envhub.config import OssMirrorConfig, TrackingConfig
 
 
 class AgentConfig(BaseModel):
@@ -33,6 +33,7 @@ class EnvironmentConfig(BaseModel):
     suppress_override_warnings: bool = False
     mounts_json: list[dict[str, Any]] | None = None
     oss_mirror: OssMirrorConfig | None = None
+    tracking: TrackingConfig | None = None
     oss_deps: dict[str, str] = Field(default_factory=dict)
     env: dict[str, str] = Field(default_factory=dict)
     kwargs: dict[str, Any] = Field(default_factory=dict)
diff --git a/rock/sdk/envhub/__init__.py b/rock/sdk/envhub/__init__.py
index 115ee11588..5a528d5c34 100644
--- a/rock/sdk/envhub/__init__.py
+++ b/rock/sdk/envhub/__init__.py
@@ -1,3 +1,3 @@
-from rock.sdk.envhub.config import EnvironmentConfig, OssMirrorConfig
+from rock.sdk.envhub.config import EnvironmentConfig, OssMirrorConfig, TrackingConfig
 
-__all__ = ["EnvironmentConfig", "OssMirrorConfig"]
+__all__ = ["EnvironmentConfig", "OssMirrorConfig", "TrackingConfig"]
diff --git a/rock/sdk/envhub/config.py b/rock/sdk/envhub/config.py
index d0330edc68..e8cb579e57 100644
--- a/rock/sdk/envhub/config.py
+++ b/rock/sdk/envhub/config.py
@@ -6,6 +6,8 @@
 
 from __future__ import annotations
 
+from typing import Any
+
 from pydantic import BaseModel, Field
 
 from rock.sdk.sandbox.config import SandboxConfig
@@ -29,6 +31,31 @@ class OssMirrorConfig(BaseModel):
     oss_endpoint: str | None = None
 
 
+class TrackingConfig(BaseModel):
+    """Experiment tracking configuration.
+
+    When present and enabled, activates Harbor's built-in ml_tracker to report
+    per-trial metrics (reward, duration, token usage, RL training signals)
+    and a final job-level summary.
+    """
+
+    enabled: bool = Field(
+        default=True,
+        description="Whether to enable experiment tracking for this job.",
+    )
+    api_key: str | None = Field(
+        default=None,
+        description="API key for the tracking platform. Falls back to ROCK_API_KEY env var if not set.",
+    )
+    params: dict[str, Any] = Field(
+        default_factory=dict,
+        description=(
+            "User-defined hyperparameters merged into ml_tracker.init(config=...). "
+            "Combined with auto-collected job metadata (agents, datasets, etc.)."
+        ),
+    )
+
+
 class EnvironmentConfig(SandboxConfig):
     """General environment config — sandbox base fields + environment-level fields."""
 
@@ -39,3 +66,7 @@ class EnvironmentConfig(SandboxConfig):
     )
     env: dict[str, str] = Field(default_factory=dict)
     oss_mirror: OssMirrorConfig | None = None
+    tracking: TrackingConfig | None = Field(
+        default=None,
+        description="Experiment tracking configuration. None = disabled (default).",
+    )
diff --git a/tests/unit/sdk/job/test_config.py b/tests/unit/sdk/job/test_config.py
index 8c678ab337..631d576eef 100644
--- a/tests/unit/sdk/job/test_config.py
+++ b/tests/unit/sdk/job/test_config.py
@@ -20,6 +20,7 @@
     VerifierConfig,
 )
 from rock.sdk.envhub import EnvironmentConfig
+from rock.sdk.envhub.config import TrackingConfig
 from rock.sdk.job.config import BashJobConfig, JobConfig
 
 # ---------------------------------------------------------------------------
@@ -292,12 +293,31 @@ def test_returns_valid_yaml_string(self):
         parsed = yaml.safe_load(yaml_str)
         assert isinstance(parsed, dict)
 
+    def test_tracking_config_preserved_in_harbor_yaml(self):
+        """tracking config on environment must survive to_harbor_yaml() serialization."""
+        tracking = TrackingConfig(enabled=True, api_key="sk-test-123", params={"lr": 0.01})
+        env = RockEnvironmentConfig(tracking=tracking)
+        cfg = HarborJobConfig(experiment_id="test-exp", environment=env)
+        yaml_str = cfg.to_harbor_yaml()
+        data = yaml.safe_load(yaml_str)
+        assert "environment" in data
+        assert "tracking" in data["environment"], "tracking must not be stripped by to_harbor_yaml()"
+        assert data["environment"]["tracking"]["enabled"] is True
+        assert data["environment"]["tracking"]["api_key"] == "sk-test-123"
+        assert data["environment"]["tracking"]["params"] == {"lr": 0.01}
+
+    def test_tracking_config_none_omitted_in_harbor_yaml(self):
+        """When tracking is None (default), it should not appear in harbor YAML."""
+        cfg = HarborJobConfig(experiment_id="test-exp")
+        yaml_str = cfg.to_harbor_yaml()
+        data = yaml.safe_load(yaml_str)
+        env_data = data.get("environment", {})
+        assert "tracking" not in env_data
+
 
 # ---------------------------------------------------------------------------
 # HarborJobConfig.from_yaml
 # ---------------------------------------------------------------------------
-
-
 class TestHarborJobConfigFromYaml:
     def test_round_trip(self, tmp_path):
         """Write a YAML config, read it back, verify fields."""
@@ -684,3 +704,160 @@ def test_defaults_to_datetime_string(self):
 
     def test_explicit_name_preserved(self):
         assert BashJobConfig(job_name="x").job_name == "x"
+
+
+# ---------------------------------------------------------------------------
+# TrackingConfig
+# ---------------------------------------------------------------------------
+
+
+class TestTrackingConfig:
+    def test_default_values(self):
+        config = TrackingConfig()
+        assert config.enabled is True
+        assert config.api_key is None
+        assert config.params == {}
+
+    def test_disabled(self):
+        config = TrackingConfig(enabled=False)
+        assert config.enabled is False
+        assert config.api_key is None
+        assert config.params == {}
+
+    def test_custom_params(self):
+        config = TrackingConfig(params={"learning_rate": 0.01, "epochs": 10, "model": "qwen-72b"})
+        assert config.params["learning_rate"] == 0.01
+        assert config.params["epochs"] == 10
+        assert config.params["model"] == "qwen-72b"
+
+    def test_api_key(self):
+        config = TrackingConfig(api_key="sk-test-key-123")
+        assert config.api_key == "sk-test-key-123"
+        assert config.enabled is True
+
+    def test_api_key_with_disabled(self):
+        config = TrackingConfig(enabled=False, api_key="sk-key")
+        assert config.enabled is False
+        assert config.api_key == "sk-key"
+
+    def test_api_key_none_by_default(self):
+        config = TrackingConfig()
+        assert config.api_key is None
+
+    def test_from_dict(self):
+        data = {"enabled": True, "params": {"batch_size": 32}}
+        config = TrackingConfig.model_validate(data)
+        assert config.enabled is True
+        assert config.api_key is None
+        assert config.params["batch_size"] == 32
+
+    def test_from_dict_with_api_key(self):
+        data = {"api_key": "sk-from-dict", "params": {"lr": 0.01}}
+        config = TrackingConfig.model_validate(data)
+        assert config.api_key == "sk-from-dict"
+        assert config.params["lr"] == 0.01
+
+    def test_from_dict_minimal(self):
+        config = TrackingConfig.model_validate({})
+        assert config.enabled is True
+        assert config.api_key is None
+        assert config.params == {}
+
+    def test_serialization_roundtrip(self):
+        config = TrackingConfig(enabled=True, api_key="sk-round", params={"lr": 0.001, "algo": "GRPO"})
+        json_str = config.model_dump_json()
+        restored = TrackingConfig.model_validate_json(json_str)
+        assert restored == config
+        assert restored.api_key == "sk-round"
+
+    def test_exclude_none_omits_api_key_when_not_set(self):
+        config = TrackingConfig(params={"lr": 0.01})
+        data = config.model_dump(mode="json", exclude_none=True)
+        assert "api_key" not in data
+        assert data["params"] == {"lr": 0.01}
+
+    def test_exclude_none_includes_api_key_when_set(self):
+        config = TrackingConfig(api_key="sk-present")
+        data = config.model_dump(mode="json", exclude_none=True)
+        assert data["api_key"] == "sk-present"
+
+
+# ---------------------------------------------------------------------------
+# TrackingConfig on base EnvironmentConfig
+# ---------------------------------------------------------------------------
+
+
+class TestTrackingConfigOnBaseEnvironment:
+    def test_tracking_default_none(self):
+        env = EnvironmentConfig()
+        assert env.tracking is None
+
+    def test_tracking_enabled(self):
+        env = EnvironmentConfig(tracking=TrackingConfig())
+        assert env.tracking is not None
+        assert env.tracking.enabled is True
+        assert env.tracking.params == {}
+
+    def test_tracking_disabled(self):
+        env = EnvironmentConfig(tracking=TrackingConfig(enabled=False))
+        assert env.tracking is not None
+        assert env.tracking.enabled is False
+
+    def test_tracking_with_params(self):
+        env = EnvironmentConfig(tracking=TrackingConfig(params={"model": "qwen-72b", "lr": 1e-5}))
+        assert env.tracking.params["model"] == "qwen-72b"
+        assert env.tracking.params["lr"] == 1e-5
+
+    def test_tracking_with_api_key(self):
+        env = EnvironmentConfig(tracking=TrackingConfig(api_key="sk-env-key", params={"model": "test"}))
+        assert env.tracking.api_key == "sk-env-key"
+        assert env.tracking.params["model"] == "test"
+
+    def test_tracking_from_dict(self):
+        data = {"tracking": {"enabled": True, "params": {"batch_size": 64}}}
+        env = EnvironmentConfig.model_validate(data)
+        assert env.tracking is not None
+        assert env.tracking.params["batch_size"] == 64
+
+    def test_tracking_from_dict_with_api_key(self):
+        data = {"tracking": {"api_key": "sk-yaml-key", "params": {"lr": 0.01}}}
+        env = EnvironmentConfig.model_validate(data)
+        assert env.tracking.api_key == "sk-yaml-key"
+        assert env.tracking.params["lr"] == 0.01
+
+    def test_tracking_none_from_dict(self):
+        data = {"tracking": None}
+        env = EnvironmentConfig.model_validate(data)
+        assert env.tracking is None
+
+    def test_tracking_empty_dict_from_yaml(self):
+        """Simulates YAML `tracking: {}` — should enable with defaults."""
+        data = {"tracking": {}}
+        env = EnvironmentConfig.model_validate(data)
+        assert env.tracking is not None
+        assert env.tracking.enabled is True
+        assert env.tracking.params == {}
+
+    def test_tracking_coexists_with_other_fields(self):
+        env = EnvironmentConfig(
+            image="python:3.11",
+            env={"MY_VAR": "hello"},
+            tracking=TrackingConfig(params={"model": "test"}),
+        )
+        assert env.image == "python:3.11"
+        assert env.env == {"MY_VAR": "hello"}
+        assert env.tracking.params["model"] == "test"
+
+    def test_serialization_roundtrip_with_tracking(self):
+        env = EnvironmentConfig(tracking=TrackingConfig(api_key="sk-rt", params={"lr": 0.01}))
+        json_str = env.model_dump_json()
+        restored = EnvironmentConfig.model_validate_json(json_str)
+        assert restored.tracking is not None
+        assert restored.tracking.api_key == "sk-rt"
+        assert restored.tracking.params["lr"] == 0.01
+
+    def test_serialization_roundtrip_without_tracking(self):
+        env = EnvironmentConfig()
+        json_str = env.model_dump_json()
+        restored = EnvironmentConfig.model_validate_json(json_str)
+        assert restored.tracking is None
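
The tests in this diff cover three states of `environment.tracking` (absent, `{}`, and `enabled: false`) and the `api_key` field, whose description says it falls back to the `ROCK_API_KEY` env var. The Harbor code that consumes these values is not part of this diff, so the following is only a minimal sketch of how a consumer might resolve them. It uses a stdlib dataclass as a stand-in for the pydantic `TrackingConfig`, and the names `TrackingSketch`, `tracking_is_active`, and `resolve_api_key` are hypothetical, not ROCK or Harbor APIs.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TrackingSketch:
    """Hypothetical stand-in for TrackingConfig (dataclass instead of pydantic)."""

    enabled: bool = True
    api_key: str | None = None
    params: dict[str, Any] = field(default_factory=dict)


def tracking_is_active(tracking: TrackingSketch | None) -> bool:
    """Two-level switch: None means never configured; otherwise honour `enabled`."""
    return tracking is not None and tracking.enabled


def resolve_api_key(
    tracking: TrackingSketch,
    environ: Mapping[str, str] | None = None,
) -> str | None:
    """Prefer the explicit api_key; otherwise fall back to ROCK_API_KEY.

    Assumption: this mirrors the fallback described in the field's
    `description`; the real resolution happens in Harbor and is not shown here.
    """
    env = os.environ if environ is None else environ
    if tracking.api_key is not None:
        return tracking.api_key
    return env.get("ROCK_API_KEY")
```

Passing `environ` explicitly keeps the fallback testable without touching the process environment.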