Commit 4c39cb8

earayu and claude committed
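docs(observability): ApeRAG observability requirements alignment doc v1

Per earayu2's Feishu directive (via @飞书 msg=d3042add): consolidate the recent observability-related requirements, then discuss and align them with the others. Implementation is out of scope for now; open a PR and merge once the discussion is done.

## Scope (NOT implementation)
- Full-path 4-layer capability map (API / worker / business execution / LLM call)
- Per-layer capability fields + correlation dimensions + priority + driving scenarios
- LLM call ledger as an independent event family (per Weston msg=94df61b3 + 太钢 token follow-up + private LLM cost)
- Privacy hard gate as the shared baseline across phases
- Implementation phase split (P0 already specified in PR #1951 / P1+P2 trigger conditions)
- Cross-team collaborator alignment confirmation table

## Driving scenarios
- 太钢 token consumption follow-up (谭怀远, Feishu): PG has no usage table / LiteLLM callback disabled / the question cannot be answered
- Private LLM cost concern (ou_1d75b5..., Feishu): for privately deployed customers, cost is an externally visible operational necessity
- Singapore large PDF incident (Planetegg msg=1314ac59): no queryable data for per-window duration / entity-relation counts
- Finance cost allocation + capacity forecasting

## Priority (per the 4 guardrails + real demand)
- P0: graph extraction, 2 metrics (window + document); earayu2 has confirmed an immediate start
- P1: worker lane lifecycle + vector/fulltext indexing
- P2: retrieval + LLM ledger (the LLM ledger is the top-priority P2 item and starts at almost the same time as P1)
- P3: HTTP request producer (deferred)

## Privacy hard gate
- Never allowed in the attrs payload: chunk_text / query_text / prompt_text / completion_text / error_message / traceback
- AST data-flow gate (NOT full-text grep), to avoid false positives on legitimate extraction paths
- Errors: only the error_code/error_type whitelist enum
- DISABLE_TELEMETRY deployment-level opt-out

Format: modeled on the task #31 Phase A plain-Chinese onboarding doc; aimed at decision makers and cross-team collaborators aligning on capability/scope/trigger/privacy boundary.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>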
1 parent eb4c4f3 commit 4c39cb8

1 file changed

Lines changed: 268 additions & 0 deletions

@@ -0,0 +1,268 @@
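---
title: ApeRAG Observability Requirements Alignment
description: Full-path instrumentation + LLM ledger + token/cost fields - cross-team alignment of metric capability/scope/fields/dimensions/measurement definitions (NO implementation)
---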
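
# ApeRAG Observability Requirements Alignment

> earayu2 directive (Feishu + `#indexing优化` msg=1331d5e7): consolidate the recent observability-related requirements into a detailed document, then discuss and align with the others. Implementation is out of scope for now; open a PR and merge once the discussion is done.
>
> This document aligns only the **metric capability + scope + fields + dimensions + measurement definitions**, NOT implementation details. For the concrete implementation spec, see [`task-89-telemetry-spec-v1.md`](./task-89-telemetry-spec-v1.md) (P0 is a spec-lock candidate).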
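
## 0. Why now (driving scenarios)

### 0.1 Real user problems

- **太钢 token consumption follow-up** (谭怀远, Feishu group): the customer asked "how many tokens did this run burn?". The production PG currently has **no usage table**, the LiteLLM callback is disabled, and worker calls do not record tokens, so the question cannot be answered at all.
- **Private LLM cost concern** (`ou_1d75b5...`, Feishu): customers all run privately deployed LLMs, and excessive consumption would be a problem for them. Token/cost data is an **externally visible operational necessity** (NOT internal-only debug data).
- **Finance + capacity forecasting**: without cost attribution along per-collection / per-doc / per-LLM-call dimensions, finance cannot allocate costs and capacity planning is guesswork.
- **Singapore large-document debugging** (Planetegg msg=1314ac59): during a live graph extraction RUNNING incident on large PDFs with 413+1316 chunks, there was **no window-level duration or entity/relation count**. The only option was to infer from `tail -f` logs; no SQL existed for mean / p50 / p95 / failure rate.

### 0.2 The underlying gap

ApeRAG's current visibility surface is `logger.info` / `logger.warning` text logs: **no structured instrumentation, no queryable data, and no SaaS telemetry (private deployments cannot use Datadog/NewRelic)**.

The problems raised by different users share one root cause: **there is no machine-readable, structured, queryable, privacy-safe telemetry foundation**.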
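
## 1. Document positioning (per earayu2 directive)

### 1.1 What this document does

- **Align the scope, fields, dimensions, and measurement definitions of metric capabilities**, so cross-team collaborators discuss requirements in a shared language
- **Lock priorities, trigger conditions, and the privacy boundary**, so decision makers can decide when to start P0/P1/P2
- **Link driving scenarios explicitly**: every capability is tied to a concrete business problem, user feedback, or decision rationale

### 1.2 What this document does not do

- **No implementation**: integration points / SQL schema / Pydantic / Helm / boundary tests, etc. are covered in [`task-89-telemetry-spec-v1.md`](./task-89-telemetry-spec-v1.md) (P0) or in future P1/P2 specs
- **No dashboard/UI**: acceptance path = SQL/query + tests (per Weston msg=22e6df03 P0 lock)
- **No binding to specific backend infra**: reuse the existing PG (per the 4 guardrails: simple and stable + maintenance-free for private deployments)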

## 2. Full-path observability capability map

### 2.1 Full-path layering

ApeRAG has 4 major execution layers end to end, each with its own metric surface:

```
┌──────────────────────────────────────────────────────────────────────────┐
│ Layer A: API layer (FastAPI)                                             │
│   - HTTP request latency / status code / endpoint                        │
│   - user / tenant / collection dimensions                                │
├──────────────────────────────────────────────────────────────────────────┤
│ Layer B: indexing worker layer (indexing-worker)                         │
│   - lane lifecycle (claim/release/heartbeat)                             │
│   - queue depth / concurrent_tasks                                       │
│   - per-modality task duration / failure rate                            │
├──────────────────────────────────────────────────────────────────────────┤
│ Layer C: business execution layer (graph extraction / vector / fulltext) │
│   - per-window / per-chunk duration + node/edge count                    │
│   - per-document rollup (windows_total/success/failed/...)               │
│   - vector / fulltext indexing duration + dimension count                │
│   - retrieval search top-K / hit rate / latency                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Layer D: LLM call layer (cross-cutting concern)                          │
│   - per LLM call: model / provider / token / cost / latency              │
│   - call_purpose: chunking / extraction / embedding / answer / rerank    │
│   - links to Layers A/B/C: collection_id / doc_id / window_id / run_id   │
└──────────────────────────────────────────────────────────────────────────┘
```

### 2.2 Per-layer capability scope + priority

| Layer | Capability | Fields | Correlation dimensions | Driving scenarios | Priority |
|-------|-----------|--------|------------------------|-------------------|----------|
| C | **graph extraction window** | `chunk_ids` / `duration_ms` / `entity_count` / `relation_count` / `llm_call_count` / `llm_token_count` / `model_id` / `provider` / `timeout_seconds` / `chunks_truncated` / `status` / `error_type` | collection_id / document_id / parse_version / document_index_id | Singapore large PDF + per-chunk timing (Planetegg) + per-chunk node/edge (earayu2) | **P0** |
| C | **graph extraction document** | `chunks_total` / `windows_total` / `windows_success` / `windows_failed` / `windows_timeout` / `entities_total` / `relations_total` / `wall_time_ms` | collection_id / document_id / parse_version / document_index_id | 太钢 token follow-up (per-doc rollup) + large-document capacity diagnosis | **P0** |
| B | **worker lane lifecycle** | `lane: str` / `event: claim\|release\|heartbeat` / `queue_depth` / `concurrent_tasks` / `duration_ms` | (not bound to a collection, because the worker is process-level) | indexing-worker health + queue backlog visibility | **P1** |
| C | **vector indexing** | `chunk_count` / `embedding_count` / `dimension` / `model_id` / `provider` / `duration_ms` / `status` | collection_id / document_id / parse_version | Full-path indexing and retrieval coverage (earayu2 scope) | **P1** |
| C | **fulltext indexing** | `chunk_count` / `index_size_bytes` / `duration_ms` / `status` | collection_id / document_id / parse_version | Same as above | **P1** |
| C | **retrieval search** | `top_k` / `embedding_count` / `candidate_count` / `hit_count` / `latency_ms` / `search_mode: vector\|fulltext\|hybrid` / `status` | collection_id / query_hash (NOT query text) | Retrieval performance visibility + hit rate + slow-query diagnosis | **P2** |
| D | **LLM call ledger** (independent event family per Weston msg=94df61b3) | `model_name` / `provider` / `prompt_tokens` / `completion_tokens` / `total_tokens` / `cost_usd` (or local currency) / `latency_ms` / `status` / `error_code` / `call_purpose: Literal['chunking','extraction','embedding','answer','rerank','summarization']` | collection_id / document_id / window_id / run_id / call_id | **太钢 token follow-up + private LLM cost concern + finance cost allocation** | **P2** (top-priority P2 item; start condition almost simultaneous with the P1 worker lane) |
| A | **HTTP request** | `method` / `endpoint` / `status_code` / `latency_ms` / `tenant_id` | user_id / collection_id (when applicable) | Full-path coverage of indexing, retrieval, and worker (earayu2 scope), API entry side | **P3** (deferred; the existing access log already covers this, not a priority for now) |
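
For illustration only (implementation stays out of scope for this document), the relationship between the two P0 graph extraction capabilities can be sketched as a rollup from window-level records to the document-level counters. The function name `rollup_document` and the window `status` values `success` / `failed` / `timeout` are assumptions made for this sketch; the table above names the fields but does not define the status enum.

```python
from collections import Counter
from typing import Any, Dict, Iterable


def rollup_document(windows: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Aggregate window-level records into document-level counters.

    Hypothetical helper: field names follow the capability table, the
    status values are assumed.
    """
    rows = list(windows)
    by_status = Counter(row["status"] for row in rows)
    return {
        "windows_total": len(rows),
        "windows_success": by_status["success"],
        "windows_failed": by_status["failed"],
        "windows_timeout": by_status["timeout"],
        "entities_total": sum(row["entity_count"] for row in rows),
        "relations_total": sum(row["relation_count"] for row in rows),
    }


# Made-up window records for one document.
sample_windows = [
    {"status": "success", "entity_count": 12, "relation_count": 7},
    {"status": "timeout", "entity_count": 0, "relation_count": 0},
    {"status": "success", "entity_count": 5, "relation_count": 3},
]
summary = rollup_document(sample_windows)
```

Whether the document event is computed this way or emitted independently by the producer is a decision for the implementation spec, not for this alignment.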
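
### 2.3 Priority rationale (per the 4 guardrails + real demand)

- **P0 (graph extraction, 2 metrics)**: surfaced directly by the earayu2 directive + evidenced by the Singapore large PDF incident + 太钢 token follow-up (per-window/document). This is a production-blocking gap and cannot wait any longer.
- **P1 (worker lane + vector/fulltext indexing)**: the first 50% of full-path coverage. Start condition: after ≥ 1 week of P0 production data has been collected (per the "no unbounded scope growth" guardrail among the 4).
- **P2 (retrieval + LLM ledger)**: the LLM ledger is the **top-priority P2 item** (太钢 asked directly + private LLM cost is a necessity), with a start condition almost simultaneous with P1. It is not a traditional "do last" item.
- **P3 (API HTTP)**: partially covered by the existing access log, not a priority for now. Timing is earayu2's call.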

## 3. LLM call ledger (Layer D): specified separately (per Weston msg=94df61b3)

### 3.1 Why the LLM ledger is specified separately

- **Higher business priority than the other P2 items** (太钢 + private LLM cost concern)
- **Scope spans the 4 layers**: API answer LLM calls + indexing-worker chunking/extraction/embedding LLM calls + attribution across collection/doc/run dimensions
- **Independent event family / ledger table** (per Weston): `llm_call_event` is independent and does NOT share a table with the generic `telemetry_event` schema, so that generic attrs JSON does not slow down cost computation and token summation

### 3.2 Field alignment (version for cross-team collaborator alignment)

```
event_type = 'llm.call'
ts = call timestamp
collection_id (optional, applicable when the LLM call is tied to a specific collection)
document_id (optional, applicable when tied to a specific doc, e.g. graph extraction)
window_id (optional, graph extraction context)
run_id (optional, run-level correlation such as GraphCurationRun / IndexingRun)
call_id = unique LLM call ID (UUID)
call_purpose: Literal['chunking', 'extraction', 'embedding', 'answer', 'rerank', 'summarization']
model_name: str (e.g. 'gpt-4o-mini', 'qwen2.5-72b', 'text-embedding-3-small')
provider: str (e.g. 'openai', 'qwen', 'anthropic', 'private_llm')
prompt_tokens: int
completion_tokens: int
total_tokens: int (= prompt + completion; redundant, but convenient for SQL aggregation)
cost_usd: Optional[Decimal] (computed when a price config exists, NULL otherwise)
latency_ms: int
status: Literal['success', 'failed', 'timeout', 'rate_limited']
error_code: Optional[str] (whitelist enum from classify_error(exc), e.g. 'context_overflow' / 'rate_limit' / 'timeout' / 'invalid_response')
```
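
To make the field list easier to discuss, here is one possible typed shape for a single record, written as a plain dataclass. This is a sketch under stated assumptions, not the spec: the class name `LlmCallEvent`, the use of a dataclass rather than Pydantic, and deriving `total_tokens` as a property are all choices made for this illustration, and the sample IDs are made up.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from decimal import Decimal
from typing import Literal, Optional
from uuid import uuid4

CallPurpose = Literal[
    "chunking", "extraction", "embedding", "answer", "rerank", "summarization"
]
CallStatus = Literal["success", "failed", "timeout", "rate_limited"]


@dataclass(frozen=True)
class LlmCallEvent:
    """Hypothetical in-memory shape of one 'llm.call' ledger record."""

    call_purpose: CallPurpose
    model_name: str
    provider: str
    prompt_tokens: int
    completion_tokens: int
    latency_ms: int
    status: CallStatus
    # Optional correlation IDs; None when the call is not tied to that scope.
    collection_id: Optional[str] = None
    document_id: Optional[str] = None
    window_id: Optional[str] = None
    run_id: Optional[str] = None
    cost_usd: Optional[Decimal] = None  # None maps to SQL NULL (no price config)
    error_code: Optional[str] = None  # whitelist enum value only, never raw text
    call_id: str = field(default_factory=lambda: str(uuid4()))
    ts: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    event_type: str = "llm.call"

    @property
    def total_tokens(self) -> int:
        # Derived here; the field list stores it redundantly for SQL aggregation.
        return self.prompt_tokens + self.completion_tokens


event = LlmCallEvent(
    call_purpose="extraction",
    model_name="gpt-4o-mini",
    provider="openai",
    prompt_tokens=1200,
    completion_tokens=300,
    latency_ms=850,
    status="success",
    document_id="doc-123",  # made-up ID for illustration
)
```

Note that `Literal` annotations are checked only by static type checkers, so a real producer would still need runtime validation of `call_purpose` and `status`.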

### 3.3 Dual entry-point coverage (per Weston msg=94df61b3)

- **API-process LLM calls**: answer (chat) / rerank. These cannot rely solely on the LiteLLM callback (per Planetegg, current state: callback disabled)
- **indexing-worker LLM calls**: chunking (LLM-aware splitter) / extraction (graph entity/relation) / embedding (vector indexing) / summarization
- At implementation time, **both sides must emit explicitly**; do not assume the LiteLLM callback is always available
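
As a rough illustration of what "emit explicitly" could mean at a call site, the sketch below wraps an arbitrary LLM call and writes one ledger record whether the call succeeds or fails, without depending on any client-side callback. Everything here is hypothetical: `record_llm_call`, the in-memory `LEDGER` list, and the assumption that the wrapped callable returns a dict with `prompt_tokens` and `completion_tokens`.

```python
import time
from typing import Any, Callable, Dict, List

# Hypothetical in-memory sink; a real ledger writer is out of scope here.
LEDGER: List[Dict[str, Any]] = []


def record_llm_call(
    call_purpose: str,
    model_name: str,
    provider: str,
    call: Callable[[], Dict[str, int]],
) -> Dict[str, int]:
    """Run `call`, then emit one ledger record on success or failure.

    `call` is assumed to return a dict with 'prompt_tokens' and
    'completion_tokens'; real client responses will look different.
    """
    started = time.monotonic()
    record: Dict[str, Any] = {
        "event_type": "llm.call",
        "call_purpose": call_purpose,
        "model_name": model_name,
        "provider": provider,
        "prompt_tokens": 0,
        "completion_tokens": 0,
        "status": "failed",  # overwritten below on success or timeout
    }
    try:
        usage = call()
        record["prompt_tokens"] = usage["prompt_tokens"]
        record["completion_tokens"] = usage["completion_tokens"]
        record["status"] = "success"
        return usage
    except TimeoutError:
        record["status"] = "timeout"
        raise
    finally:
        # Runs on every path, so the ledger sees failed calls too.
        record["total_tokens"] = (
            record["prompt_tokens"] + record["completion_tokens"]
        )
        record["latency_ms"] = int((time.monotonic() - started) * 1000)
        LEDGER.append(record)


# Usage with a stubbed call standing in for a real client request.
record_llm_call(
    "answer", "gpt-4o-mini", "openai",
    lambda: {"prompt_tokens": 40, "completion_tokens": 12},
)
```

The same wrapper shape would apply in both the API process and the indexing-worker; only the `call_purpose` values differ.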

### 3.4 Privacy hard gate (inherits the P0 boundary)

The ledger event `attrs` payload **must not contain**:
- `prompt_text` / `completion_text` / `messages` / `chunk_text` / `query_text`
- raw `error_message` / `traceback` / `repr(exc)` (per huangzhangshu msg=171acb55); only the `error_code` whitelist enum
- the raw `model_response_metadata` dict (extract only the token count / cost fields)

Allowed only: structured metadata (model name / provider / token count / cost / latency / status / call_purpose / correlation IDs)
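
One way to read the gate above is as an allow-list rather than a deny-list: anything not explicitly permitted is dropped, so a newly introduced raw-text key cannot slip through. The sketch below is illustrative only. The key names in `ALLOWED_KEYS`, the `"unknown"` fallback code, and this body of `classify_error` are assumptions; the source names `classify_error(exc)` but does not define it.

```python
from typing import Any, Dict

# Allow-list derived from the "allowed only" line; exact key names are assumed.
ALLOWED_KEYS = frozenset({
    "model_name", "provider", "prompt_tokens", "completion_tokens",
    "total_tokens", "cost_usd", "latency_ms", "status", "call_purpose",
    "error_code", "collection_id", "document_id", "window_id", "run_id",
    "call_id",
})

# Example codes from the field list; the real whitelist enum may differ.
ERROR_CODES = frozenset(
    {"context_overflow", "rate_limit", "timeout", "invalid_response"}
)


def classify_error(exc: BaseException) -> str:
    """Hypothetical stand-in: map an exception to a code by type only."""
    if isinstance(exc, TimeoutError):
        return "timeout"
    return "unknown"  # assumed fallback; never str(exc) or repr(exc)


def sanitize_attrs(attrs: Dict[str, Any]) -> Dict[str, Any]:
    """Keep allow-listed keys only, so raw text can never ride along."""
    clean = {key: value for key, value in attrs.items() if key in ALLOWED_KEYS}
    code = clean.get("error_code")
    if code is not None and code not in ERROR_CODES | {"unknown"}:
        # A free-form string in error_code is treated as a leak and replaced.
        clean["error_code"] = "unknown"
    return clean


cleaned = sanitize_attrs({
    "model_name": "gpt-4o-mini",
    "prompt_text": "should never be stored",
    "error_code": "boom: some raw message",
})
```

A runtime filter like this complements, and does not replace, the static AST data-flow gate described for P0.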

### 3.5 Cost computation (declarative, NOT runtime probe)

- `cost_usd` comes from a deployment-time **price config table** (per-model / per-token rate)
- Private LLMs → cost may be set to 0 or NULL (customer-defined)
- Public models (OpenAI / Anthropic / Azure) → prices consistent with the model card
- **NOT a runtime probe**: a static declaration, consistent with the task #61 P1-D3 capability declaration pattern (Lesson #17 backend convergence contract)
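
The arithmetic implied by a static price table can be sketched as follows. This is an assumption-laden illustration: the table layout (USD per one million tokens, keyed by provider and model), the name `compute_cost_usd`, and the prices themselves are placeholders and not real vendor rates.

```python
from decimal import Decimal
from typing import Dict, Optional, Tuple

# Hypothetical price config:
# (provider, model_name) -> USD per 1M (prompt, completion) tokens.
# All numbers are placeholders, not real prices.
PRICE_TABLE: Dict[Tuple[str, str], Tuple[Decimal, Decimal]] = {
    ("example_provider", "example-model"): (Decimal("1.00"), Decimal("2.00")),
    ("private_llm", "qwen2.5-72b"): (Decimal("0"), Decimal("0")),
}

PER_MILLION = Decimal(1_000_000)


def compute_cost_usd(
    provider: str,
    model_name: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[Decimal]:
    """Return the declared cost, or None (SQL NULL) when no price is configured."""
    rates = PRICE_TABLE.get((provider, model_name))
    if rates is None:
        return None
    prompt_rate, completion_rate = rates
    total = prompt_rate * prompt_tokens + completion_rate * completion_tokens
    return total / PER_MILLION


cost = compute_cost_usd("example_provider", "example-model", 500_000, 250_000)
```

Using `Decimal` rather than `float` keeps per-call amounts exact when they are later summed in SQL; the lookup is a pure table read, with no call to the provider at runtime.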

## 4. Cross-layer correlation dimensions (consistent schema across capabilities)

Regardless of layer, telemetry events share the correlation dimensions below (friendly to cross-capability joins):

| Dimension | Definition | Applicable layers | Privacy sensitivity |
|-----------|------------|-------------------|---------------------|
| `collection_id` | business collection ID | A/B/C/D | Low (internal ID) |
| `document_id` | business doc ID | A/C/D | |
| `parse_version` | doc parse version (links retries) | C/D | |
| `document_index_id` | internal indexing retry linkage | B/C | |
| `run_id` | GraphCurationRun / IndexingRun ID | B/C/D | |
| `window_id` | graph extraction window context | C/D | |
| `call_id` | LLM call UUID | D | |
| `tenant_id` / `user_id` | user / tenant identity | A | **High**: hash or pseudonymize |
| `query_hash` | hash of the user query (NOT raw text) | A/C | **High**: hash only |

**Privacy hard gate (inherits the P0 boundary and the 4 guardrails across layers)**:
- ID-only, NO raw text (chunk / query / prompt / completion / error message / entity description)
- High-sensitivity fields (tenant_id / query_hash): hash or pseudonymize
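
The document does not say which hash or pseudonymization scheme is intended. As one possible reading, the sketch below uses a keyed hash (HMAC-SHA256) from the Python standard library; the function name `pseudonymize`, the choice of HMAC over a plain digest, and the idea of a per-deployment secret are all assumptions of this example.

```python
import hashlib
import hmac


def pseudonymize(value: str, secret: bytes) -> str:
    """Return a stable, non-reversible token for a sensitive identifier.

    A keyed hash is used so that short or guessable inputs (tenant IDs,
    common queries) cannot be recovered by hashing a dictionary of
    candidates without the secret.
    """
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Made-up secret and inputs for illustration only.
deployment_secret = b"example-only-secret"
tenant_token = pseudonymize("tenant-42", deployment_secret)
query_hash = pseudonymize("how do I reset my password", deployment_secret)
```

The same input and secret always give the same token, so joins and GROUP BY on the pseudonymized column still work, while the raw value never reaches the telemetry tables.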

## 5. Acceptance path (consistent across layers)

Per Weston msg=22e6df03 P0 lock + earayu2's 4 guardrails:

- **P0 acceptance = SQL/query + tests**, NO admin UI / metrics endpoint
- **P1 acceptance = SQL/query + admin metrics endpoint** (typed schema with three distinct tiers: `RawTelemetryEvent` debug-only / `TelemetryAggregateBucket` dashboard primary / `TelemetryTimeRangeSummary` dashboard ready, per dongdong msg=076bfaec)
- **P2 acceptance = SQL/query + extended admin metrics endpoint + Grafana datasource docs** (per Planetegg P1-T3)

Example cross-layer SQL aggregations (once P0 and P2 LLM ledger data are both available):
```sql
-- Per-doc graph extraction total duration + token cost
SELECT
  te.collection_id,
  te.document_id,
  te.attrs->>'wall_time_ms' AS extraction_wall_time_ms,
  te.attrs->>'entities_total' AS entities_total,
  COUNT(llc.call_id) AS llm_calls,
  SUM((llc.attrs->>'total_tokens')::int) AS total_tokens,
  SUM((llc.attrs->>'cost_usd')::decimal) AS total_cost_usd
FROM telemetry_event te
LEFT JOIN llm_call_event llc
  ON llc.collection_id = te.collection_id AND llc.document_id = te.document_id
  AND llc.attrs->>'call_purpose' = 'extraction'
WHERE te.event_type = 'graph_extraction.document'
  AND te.ts > now() - interval '7 days'
GROUP BY te.collection_id, te.document_id, te.attrs->>'wall_time_ms', te.attrs->>'entities_total'
-- NULLS LAST: documents with no matching ledger rows sort last, not first
ORDER BY total_cost_usd DESC NULLS LAST LIMIT 50;

-- Per-collection / per-call_purpose token consumption + cost attribution
SELECT
  collection_id,
  attrs->>'call_purpose' AS purpose,
  attrs->>'model_name' AS model,
  COUNT(*) AS call_count,
  SUM((attrs->>'prompt_tokens')::int) AS prompt_tokens_total,
  SUM((attrs->>'completion_tokens')::int) AS completion_tokens_total,
  SUM((attrs->>'cost_usd')::decimal) AS cost_usd_total
FROM llm_call_event
WHERE ts > now() - interval '30 days'
GROUP BY collection_id, attrs->>'call_purpose', attrs->>'model_name'
-- NULLS LAST: groups whose cost_usd is entirely NULL (no price config) sort last
ORDER BY cost_usd_total DESC NULLS LAST LIMIT 100;
```

## 6. Implementation phase split

| Phase | Includes | Start condition | Implementation spec |
|-------|----------|-----------------|---------------------|
| **P0** | Layer C graph extraction, 2 metrics (window + document) + Layer 0 data model + Layer 2 ingestion (DISABLE_TELEMETRY + fail-safe) + Layer 4 boundary | Confirmed by earayu2 (msg=5dbbe60a "可以,做吧", i.e. "OK, go ahead"); start immediately | [`task-89-telemetry-spec-v1.md`](./task-89-telemetry-spec-v1.md) PR #1951 |
| **P1** | Layer B worker lane lifecycle + Layer C vector/fulltext indexing producer + Layer 3 admin metrics endpoint (three-tier typed schema) + Helm Grafana docs | After ≥ 1 week of P0 production data has been collected, decided by earayu2 + PM | spec v2 (not yet started) |
| **P2** | Layer C retrieval search producer + **Layer D LLM call ledger as an independent event family / ledger table** + cost price config table | 太钢 / private LLM cost customer questions persist + LLM ledger priority is almost simultaneous with P1 | separate LLM ledger spec (not yet started) |
| **P3** | Layer A HTTP request producer | earayu2 decision (partially covered by the existing access log) | deferred |

## 7. Privacy hard gate (shared baseline across phases)

Per huangzhangshu msg=171acb55 + Weston msg=94df61b3 + P0 spec § 3.1.5/§5.2 lock:

- **Never allowed in the attrs payload**:
  - chunk_text / chunk_content / query_text / user_query / prompt_text / completion_text / messages
  - entity_description / description_text
  - raw error_message / traceback / repr(exc)
- **AST data-flow gate** (NOT full-text grep), to avoid false positives on legitimate extraction paths
- **Error information**: only the `error_code` / `error_type` whitelist enum (output of classify_error(exc)), never the raw exception
- **High-sensitivity IDs**: tenant_id / query, etc. are hashed or pseudonymized
- **Disable switch**: `DISABLE_TELEMETRY=true` deployment-level opt-out (per Planetegg msg=db130d5e P0 directive)
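
How the disable switch and a fail-safe emit path might fit together can be sketched as below. This is not the P0 spec: `telemetry_disabled`, `emit_event`, the accepted truthy spellings beyond `true`, and the choice to swallow every sink error are assumptions made for this illustration (the source only shows `DISABLE_TELEMETRY=true` and mentions fail-safe behaviour without defining it).

```python
import os
from typing import Any, Callable, Dict


def telemetry_disabled() -> bool:
    # Only `DISABLE_TELEMETRY=true` appears in the source; also accepting
    # "1" and "yes" is an assumption of this sketch.
    raw = os.environ.get("DISABLE_TELEMETRY", "")
    return raw.strip().lower() in {"1", "true", "yes"}


def emit_event(
    event: Dict[str, Any],
    sink: Callable[[Dict[str, Any]], None],
) -> bool:
    """Try to hand `event` to `sink`; never raise into business code.

    Returns True when the sink accepted the event, False when telemetry
    is disabled or the sink failed.
    """
    if telemetry_disabled():
        return False
    try:
        sink(event)
        return True
    except Exception:
        # Fail-safe: a broken telemetry path must not break indexing or chat.
        return False
```

Reading the environment variable on every call keeps the sketch simple; a real producer would more likely resolve the setting once at startup.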

## 8. Out of scope (shared YAGNI across phases)

- No OpenTelemetry / Jaeger / Datadog SDK (cross-process span tracing; triggered only when evidence calls for it)
- No real-time alerting / SLO computation (the admin metrics endpoint can be queried manually)
- No sampled archiving of chunk text / query text (the privacy hard gate is a permanent NO)
- No dashboard builder UI (P1 has fixed indicator dimensions only)
- No new storage (reuse PG; maintenance-free for private deployments)

## 9. Cross-team collaborator alignment confirmation table

(Once this document has gone through PR and merge, an ack left by each collaborator in PR review counts as alignment)

| Collaborator | Scope | Ack form |
|--------------|-------|----------|
| @符炫炜 (architect) | overall capability map + phase priorities + privacy gate design | spec author |
| @不穷 (PM) | implementation timing + task split granularity + dispatch | PR review LGTM |
| @earayu2 (decision maker) | P0/P1/P2 trigger conditions + start timing of the cost config table | PR ratify |
| @Planetegg (SRE) | deployment visualization + retention/cardinality capacity estimates + Helm integration | PR review LGTM |
| @Weston (architect, cross-CR) | LLM ledger independent event family design + cross-layer field consistency | PR review LGTM |
| @ziang (index/worker) | indexing-worker producer integration points + worker lane lifecycle naming | PR review LGTM |
| @huangzhangshu (testing) | privacy boundary scope + the two fail-safe classes + acceptance criteria | PR review LGTM |
| @dongdong (FE) | three-tier dashboard typed schema + raw events kept out of the dashboard | PR review LGTM |
| @cuiwenbo (FE typed schema consumer) | typed schema sync consistency + capability declaration mirror | PR review LGTM |

## 10. Related documents

- earayu2 directive: `#indexing优化` msg=1331d5e7 + Feishu DM (msg=d3042add)
- 太钢 token follow-up (driver): 谭怀远, Feishu group
- Private LLM cost concern: `ou_1d75b5...`, Feishu
- Singapore large PDF incident: `#indexing优化` Planetegg msg=1314ac59
- P0 implementation spec: [`task-89-telemetry-spec-v1.md`](./task-89-telemetry-spec-v1.md) (PR #1951)
- task #17 task system invariants: [`task-system-invariants.md`](./task-system-invariants.md)
- task #61 capability declaration pattern (Lesson #17 backend convergence contract): [`task-61-db-adapter-compat-spec-v1.md`](./task-61-db-adapter-compat-spec-v1.md)
- ci-flake-policy: [`ci-flake-policy.md`](./ci-flake-policy.md)

---

**Drafted by**: @符炫炜 (chief architect)
**Date**: 2026-04-30
**Version**: v1 (cross-team requirements alignment draft; after PR review collects all collaborator acks, earayu2 ratifies → merge)
**Mode**: per earayu2's Feishu directive ("open a PR and merge once the discussion is done"): no implementation, only alignment of metric capability/fields/dimensions/measurement definitions
