Commit c768200

refactor(finance): Refactor FinanceCompositionEvaluator and Enable Independent Model Configuration
- Added a new OpenJudge-based `FinanceCompositionEvaluator` to replace the legacy implementation.
- Implemented domain-based routing that directs each request to the matching set of graders, covering domains such as stock analysis and industry research.
- Implemented an asynchronous pairwise evaluation interface that returns scores in the 0–1 range.
- Enabled independent configuration of `finance_llm`; if it is not explicitly configured, the general `openjudge_llm` model is reused.
- Cleaned up redundant imports and deprecated code in `DeepFinanceJudgeByOpenJudge`.
- Updated `deep_finance_openjudge_template.yaml` to document the `finance_llm` option.
- Refined the "evidence traceability" section of `deep_finance.md`, renaming it to "Reference Logic Audit" and expanding the workflow and verdict criteria.
1 parent ffb1f80 commit c768200

6 files changed

Lines changed: 272 additions & 185 deletions

tutorial/example_deep_finance/deep_finance.md

Lines changed: 20 additions & 26 deletions
Original file line number | Diff line number | Diff line change
@@ -232,44 +232,38 @@ final_score = 0.5 × coverage + 0.5 × grounding # composite score
232232

233233
------
234234

235-
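### 4) Evidence Traceability (EBTU - Evidence-Backed Trace Units)
235+
### 4) Reference Logic Audit (AUDIT - Citation Integrity Audit)
236236

237-
**Goal**: Audit the evidence anchoring of every atomic claim in the report, answering: can every number and every fact be traced back to the raw data returned by the tools?
237+
**Goal**: Audit whether every citation marker `[n]` in the AI research report strictly satisfies the principle of logical entailment, answering: is every citation strictly supported by the original evidence?
238238

239-
**Core principle: evidence first.** The auditor must give the evidence anchor (step + quote) before issuing a verdict; reaching a conclusion first and then looking for evidence is strictly forbidden.
239+
**Core principle: evidence first.** Like a judge deciding a case, the auditor must first lay out the evidence, then reason through the logic, and only then issue a verdict; reaching a conclusion first and then looking for evidence is strictly forbidden.
240240

241-
**Audit workflow**
241+
**Three-step verification workflow**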
242242

243-
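1. Extract all atomic claims (Trace Units) from the report and tag each with a type (numeric/temporal/event/comparison/causal, etc.)
244-
2. Tag hardness: `hard` (definite fact) / `soft` (explicitly marked as speculation or assumption)
245-
3. For each claim, find anchors in the Evidence, subject to these requirements:
243+
1. **Extract**: Identify the statement fragment (Claim) in the report that is supported by `[n]`
244+
2. **Trace**: Find the original text for `[n]` in the Reference list and excerpt the core evidence sentence (Source Quote)
245+
3. **Compare**: Analyze whether the Claim is strictly supported by the Source Quote
246+
- Check: Are the numbers and facts consistent?
247+
- Check: Is the tone consistent (was "possibly" changed to "certainly")?
248+
- Check: Does the claimed causal relationship actually appear in the source?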
246249

247-
- - Precise to the step number and a verbatim quote (quote ≤ 120 characters)
248-
- Numbers and dates must have a match in the original Evidence text
250+
**Verdict Criteria**
249251

250-
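1. Give a verdict:
251-
252-
| Verdict | Meaning |
253-
| ---------------- | ----------------------------------------- |
254-
| `supported` | The anchor directly supports the claim |
255-
| `contradicted` | The anchor clearly conflicts with the claim |
256-
| `no_evidence` | No support found in the Evidence, and the claim is stated as definite |
257-
| `speculative_ok` | The claim is explicitly speculative or hypothetical and is not disguised as fact |
258-
| `unclear` | The Evidence is relevant but insufficient to support or refute the claim |
259-
260-
1. Tag the issue type (issue): `entity_mismatch` / `time_mismatch` / `value_mismatch` / `scope_mismatch` / `logic_leap` / `over_precision` / `missing_anchor`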
252+
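| Verdict | Meaning |
253+
| -------------- | ------------------------------------------------------------ |
254+
| `Supported` | Sufficient evidence with a closed logical chain. Reasonable summarization is allowed, but adding details is not |
255+
| `Overstated` | Exaggerated. The evidence only says A, but the report writes A+ (e.g., dropping qualifiers or imposing causation) |
256+
| `Contradicted` | Factual conflict. The report states the opposite of the evidence |
257+
| `Hallucinated` | Fabricated. Key details cannot be found in the evidence, or the citation number does not exist |
258+
| `Irrelevant` | Invalid citation. The evidence is genuine but unrelated to the topic the report describes |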
261259

262260
**Score calculation** (deterministic scoring, computed by Python code rather than produced by the LLM):
263261

264262
```plain
265-
base = (supported - 1.4×contradicted - 0.9×no_evidence - 0.4×unclear) / hard_units
266-
misattrib_factor = max(0, 1 - 0.7 × misattrib_rate) # misattribution penalty
267-
selection_factor = min(1, extracted_units / expected) # coverage factor
268-
cov_factor = 0.65 + 0.35 × digit_coverage # number/date coverage
269-
score = base × misattrib_factor × selection_factor × cov_factor
263+
integrity_score = Supported count / total citation count
270264
```
271265

272-
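Key design: the LLM is responsible only for structured output (claim extraction + anchor annotation + verdict); the score is computed entirely and deterministically by code, which avoids the instability of LLM self-scoring.
266+
Key design: the LLM is responsible only for structured output (Claim extraction + evidence tracing + logical analysis + verdict); the score is computed entirely and deterministically by code, which avoids the instability of LLM self-scoring.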
273267

274268
------
275269
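
The new `integrity_score` formula in the hunk above can be sketched in a few lines of Python. This is a minimal illustration, not the repository's grader: the function name, the `VERDICTS` set, and the choice to return `0.0` for a report with no citations are assumptions made here, since the diff only states `Supported count / total citation count`.

```python
from collections import Counter
from typing import Iterable

# Verdict labels copied from the table in deep_finance.md.
VERDICTS = {"Supported", "Overstated", "Contradicted", "Hallucinated", "Irrelevant"}


def integrity_score(verdicts: Iterable[str]) -> float:
    """Return the Supported count divided by the total citation count.

    Returning 0.0 when there are no citations is an assumption of this
    sketch; the document does not say how that case is handled.
    """
    verdicts = list(verdicts)
    unknown = set(verdicts) - VERDICTS
    if unknown:
        raise ValueError(f"unknown verdict labels: {sorted(unknown)}")
    if not verdicts:
        return 0.0
    return Counter(verdicts)["Supported"] / len(verdicts)


# Four audited citations, two of them Supported.
example = ["Supported", "Supported", "Overstated", "Hallucinated"]
print(integrity_score(example))  # 0.5
```

Because the LLM only emits the per-citation labels, the same list of verdicts always yields the same score, which is the stability property the "key design" note describes.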

tutorial/example_deep_finance/deep_finance_judge.py

Lines changed: 33 additions & 155 deletions
@@ -8,34 +8,20 @@
88
import time
99
import logging
1010
from datetime import datetime
11-
from typing import Dict, Any, Optional, Tuple, List, Type
11+
from typing import Dict, Any, Optional, Tuple, List
1212

1313
from ajet.task_judge.base_judge import BaseJudge
1414
from ajet.workflow import WorkflowOutput, WorkflowTask
1515

1616
from openjudge.models.openai_chat_model import OpenAIChatModel
1717
from openjudge.runner.grading_runner import GraderConfig, GradingRunner
18-
from openjudge.graders.base_grader import BaseGrader
19-
from tutorial.example_deep_finance.judge import PresentationQualityGrader, GroundingGrader, AuditGrader, EBTUTraceabilityGrader
20-
21-
# Finance Graders from OpenJudge cookbooks
22-
from cookbooks.finance_grader.stock_analysis.valuation_analysis import ValuationAnalysisGrader
23-
from cookbooks.finance_grader.stock_analysis.fundamental_analysis import FundamentalAnalysisGrader
24-
from cookbooks.finance_grader.stock_analysis.overall_logic import OverallLogicGrader
25-
from cookbooks.finance_grader.stock_analysis.stock_risk_analysis import StockRiskAnalysisGrader
26-
from cookbooks.finance_grader.macro_analysis.macro_analysis import MacroAnalysisGrader
27-
from cookbooks.finance_grader.macro_analysis.concept_explanation import ConceptExplanationGrader
28-
from cookbooks.finance_grader.industry_research.characteristics_analysis import CharacteristicsAnalysisGrader
29-
from cookbooks.finance_grader.industry_research.risk_analysis import RiskAnalysisGrader
30-
from cookbooks.finance_grader.industry_research.underlying_comparison import UnderlyingComparisonGrader
31-
from cookbooks.finance_grader.event_interpretation.event_analysis import EventAnalysisGrader
32-
from cookbooks.finance_grader.event_interpretation.event_identification import EventIdentificationGrader
33-
from cookbooks.finance_grader.stock_search.search_relevance import SearchRelevanceGrader
34-
from cookbooks.finance_grader.stock_search.search_integrity import SearchIntegrityGrader
35-
from cookbooks.finance_grader.stock_search.search_timeliness import SearchTimelinessGrader
36-
37-
38-
# OpenJudge imports
18+
from tutorial.example_deep_finance.judge import (
19+
PresentationQualityGrader,
20+
GroundingGrader,
21+
AuditGrader,
22+
EBTUTraceabilityGrader,
23+
FinanceCompositionEvaluator,
24+
)
3925
# =============================================================================
4026
# Global helper functions
4127
# =============================================================================
@@ -76,135 +62,6 @@ def load_reference_answers_from_file(file_path: str) -> Tuple[Dict[str, str], Di
7662
raise ValueError(f"Error loading reference answers: {e}")
7763

7864

79-
# =============================================================================
80-
# FinanceCompositionEvaluator - OpenJudge-based Finance evaluator
81-
# =============================================================================
82-
83-
class FinanceCompositionEvaluator:
84-
"""
85-
OpenJudge-based composite Finance evaluator (replaces rm_gallery.FinanceComposition)
86-
87-
Features:
88-
- Routes to the matching set of graders based on domain
89-
- Runs a pairwise evaluation (compares the training answer with the reference answer)
90-
- Returns a score in the 0-1 range
91-
92-
Supported domains:
93-
- stock_analysis: stock analysis
94-
- industry_research: industry research
95-
- macro_analysis: macro analysis
96-
- event_interpretation: event interpretation
97-
- stock_search: stock search
98-
"""
99-
100-
# Mapping from domain to Grader classes (kept consistent with RM-Gallery)
101-
DOMAIN_GRADERS: Dict[str, List[Type[BaseGrader]]] = {
102-
"stock_analysis": [
103-
ValuationAnalysisGrader,
104-
# FundamentalAnalysisGrader,
105-
# OverallLogicGrader,
106-
# StockRiskAnalysisGrader,
107-
],
108-
"industry_research": [
109-
CharacteristicsAnalysisGrader,
110-
# RiskAnalysisGrader,
111-
# UnderlyingComparisonGrader,
112-
],
113-
"macro_analysis": [
114-
MacroAnalysisGrader,
115-
# ConceptExplanationGrader,
116-
],
117-
"event_interpretation": [
118-
EventAnalysisGrader,
119-
# EventIdentificationGrader,
120-
],
121-
"stock_search": [
122-
SearchRelevanceGrader,
123-
# SearchIntegrityGrader,
124-
# SearchTimelinessGrader,
125-
],
126-
}
127-
128-
def __init__(self, model: OpenAIChatModel, params: Dict[str, Any] = None):
129-
"""
130-
Initialize FinanceCompositionEvaluator
131-
132-
Args:
133-
model: OpenAIChatModel instance
134-
params: extra parameters (kept for compatibility)
135-
"""
136-
self.model = model
137-
self.params = params or {}
138-
self._grader_cache: Dict[str, List[BaseGrader]] = {}
139-
140-
def _get_graders_for_domain(self, domain: str) -> List[BaseGrader]:
141-
"""
142-
Get the list of grader instances for the given domain (cached)
143-
"""
144-
if domain not in self._grader_cache:
145-
grader_classes = self.DOMAIN_GRADERS.get(domain, [])
146-
self._grader_cache[domain] = [
147-
grader_cls(model=self.model) for grader_cls in grader_classes
148-
]
149-
return self._grader_cache[domain]
150-
151-
async def aevaluate(self, query: str, current: str, reference: str, domain: str) -> float:
152-
"""
153-
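Run a pairwise evaluation (async version, avoids repeatedly creating an event loop)
154-
155-
Args:
156-
query: user query
157-
current: answer generated by the current model (training)
158-
reference: reference answer
159-
domain: task domain (used to route to the matching graders)
160-
161-
Returns:
162-
float: score in the 0-1 range
163-
- 1.0: current is better than reference
164-
- 0.0: reference is better than current
165-
- 0.5: could not be evaluated, or an error occurred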
166-
"""
167-
if not domain or domain not in self.DOMAIN_GRADERS:
168-
print(f"⚠️ FinanceCompositionEvaluator: Unknown domain '{domain}', returning 0.5")
169-
return 0.5
170-
171-
graders = self._get_graders_for_domain(domain)
172-
if not graders:
173-
print(f"⚠️ FinanceCompositionEvaluator: No graders for domain '{domain}', returning 0.5")
174-
return 0.5
175-
176-
# Run all graders
177-
scores = []
178-
for grader in graders:
179-
try:
180-
result = await grader.aevaluate(
181-
query=query,
182-
answer_1=current, # training model output
183-
answer_2=reference, # reference answer
184-
)
185-
186-
# Parse the GraderRank result
187-
if hasattr(result, 'rank') and isinstance(result.rank, list):
188-
# rank = [1, 2] means answer_1 (current) is better -> score = 1.0
189-
# rank = [2, 1] means answer_2 (reference) is better -> score = 0.0
190-
if result.rank[0] == 1:
191-
scores.append(1.0)
192-
else:
193-
scores.append(0.0)
194-
else:
195-
scores.append(0.5) # cannot be parsed; use the midpoint
196-
197-
except Exception as e:
198-
grader_name = getattr(grader, 'name', grader.__class__.__name__)
199-
print(f"⚠️ FinanceCompositionEvaluator: Grader {grader_name} failed: {e}")
200-
scores.append(0.5)
201-
202-
# Compute the average score
203-
if scores:
204-
return sum(scores) / len(scores)
205-
return 0.5
206-
207-
20865
# =============================================================================
20966
# DeepFinanceJudgeByOpenJudge class
21067
# =============================================================================
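
The pairwise scoring loop removed in the hunk above (rank parsing, a neutral 0.5 on failure, then averaging) can be sketched in isolation as follows. `FakeRankResult` and `FakeGrader` are hypothetical stand-ins invented for this example so it runs without OpenJudge; only the `aevaluate(query=..., answer_1=..., answer_2=...)` call shape and the `.rank` attribute are taken from the diff. Whether the relocated class in `judge/finance/grader.py` keeps exactly this logic is not shown in this commit view.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class FakeRankResult:
    """Hypothetical stand-in for a grader result; only `.rank` is used."""
    rank: List[int]


class FakeGrader:
    """Hypothetical grader that returns a fixed ranking or raises."""

    def __init__(self, rank: Optional[List[int]] = None, fail: bool = False):
        self._rank = rank
        self._fail = fail

    async def aevaluate(self, query: str, answer_1: str, answer_2: str):
        if self._fail:
            raise RuntimeError("grader unavailable")
        return FakeRankResult(rank=self._rank)


def rank_to_score(result) -> float:
    """rank[0] == 1 means answer_1 (current) won; unparsable results map to 0.5."""
    rank = getattr(result, "rank", None)
    if isinstance(rank, list) and rank:
        return 1.0 if rank[0] == 1 else 0.0
    return 0.5


async def pairwise_score(graders: Sequence, query: str, current: str, reference: str) -> float:
    scores = []
    for grader in graders:
        try:
            result = await grader.aevaluate(query=query, answer_1=current, answer_2=reference)
            scores.append(rank_to_score(result))
        except Exception:
            scores.append(0.5)  # a failed grader contributes the neutral score
    # No graders at all also yields the neutral score.
    return sum(scores) / len(scores) if scores else 0.5


graders = [FakeGrader([1, 2]), FakeGrader([2, 1]), FakeGrader(fail=True)]
demo_score = asyncio.run(pairwise_score(graders, "q", "current answer", "reference answer"))
print(demo_score)  # 0.5
```

One consequence of this design is that 0.5 is ambiguous: it can mean an even split between graders, a grader failure, or an unknown domain, so callers that need to tell these apart would have to log or return the per-grader outcomes as well.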
@@ -287,6 +144,7 @@ def _init_finance_evaluator(self):
287144
Initialize FinanceCompositionEvaluator (only when finance_weight > 0)
288145
289146
Uses OpenJudge's finance graders in place of the original rm_gallery implementation
147+
Supports a separate finance_llm setting; if it is not configured, openjudge_llm is reused
290148
"""
291149
self._finance_enabled = (self.w.get("finance", 0) > 0)
292150
if self._finance_enabled:
@@ -302,15 +160,35 @@ def _create_finance_evaluator(self):
302160
"""
303161
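Create a FinanceCompositionEvaluator instance (based on OpenJudge)
304162
305-
Reuses the already initialized OpenJudge model; no separate configuration is needed
163+
Supports a separate finance_llm setting:
164+
- If config.ajet.judge.finance_llm has a value, a dedicated model is used
165+
- If it is not configured or is empty, the already initialized OpenJudge model is reused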
306166
"""
307167
try:
308-
# Reuse the OpenJudge model (already initialized in _init_openjudge_model)
168+
# Check whether a separate finance_llm is configured
169+
finance_llm_name = getattr(self.config.ajet.judge, "finance_llm", None)
170+
171+
if finance_llm_name and finance_llm_name.strip():
172+
# Use a dedicated finance model
173+
finance_base_url = os.environ.get("FINANCE_BASE_URL") or os.environ.get("OPENJUDGE_BASE_URL")
174+
finance_api_key = os.environ.get("FINANCE_API_KEY") or os.environ.get("OPENJUDGE_API_KEY")
175+
176+
finance_model = OpenAIChatModel(
177+
model=finance_llm_name,
178+
base_url=finance_base_url,
179+
api_key=finance_api_key,
180+
)
181+
print(f"[Init FinanceCompositionEvaluator] Using dedicated finance model: {finance_llm_name}")
182+
else:
183+
# Reuse the OpenJudge model (already initialized in _init_openjudge_model)
184+
finance_model = self.model
185+
print(f"[Init FinanceCompositionEvaluator] Reusing OpenJudge model")
186+
309187
self.finance_evaluator = FinanceCompositionEvaluator(
310-
model=self.model,
188+
model=finance_model,
311189
params={"is_parallel": True}
312190
)
313-
print(f"[Init FinanceCompositionEvaluator] Using OpenJudge model, domains={list(FinanceCompositionEvaluator.DOMAIN_GRADERS.keys())}")
191+
print(f"[Init FinanceCompositionEvaluator] domains={list(FinanceCompositionEvaluator.DOMAIN_GRADERS.keys())}")
314192
except Exception as e:
315193
print(f"✗ Failed to initialize FinanceCompositionEvaluator: {e}")
316194
import traceback
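
The fallback order added to `_create_finance_evaluator` above can be isolated into a small pure function, which makes it easy to check. This is a sketch under stated assumptions: `resolve_finance_model` is a hypothetical helper that does not exist in the repository, and the model name, URL, and keys below are made-up placeholder values. The environment variable names and the `or` fallback order are the ones shown in the diff.

```python
from typing import Mapping, Optional, Tuple


def resolve_finance_model(
    finance_llm: Optional[str],
    env: Mapping[str, str],
) -> Optional[Tuple[str, Optional[str], Optional[str]]]:
    """Return (model, base_url, api_key) for a dedicated finance model,
    or None when the shared OpenJudge model should be reused."""
    # An unset, empty, or whitespace-only name means "reuse the shared model".
    if not finance_llm or not finance_llm.strip():
        return None
    # FINANCE_* takes precedence; OPENJUDGE_* is the fallback, as in the diff.
    base_url = env.get("FINANCE_BASE_URL") or env.get("OPENJUDGE_BASE_URL")
    api_key = env.get("FINANCE_API_KEY") or env.get("OPENJUDGE_API_KEY")
    return finance_llm, base_url, api_key


# Placeholder environment: only the API key is overridden for finance.
demo_env = {
    "OPENJUDGE_BASE_URL": "https://judge.example/v1",
    "OPENJUDGE_API_KEY": "shared-key",
    "FINANCE_API_KEY": "finance-key",
}
print(resolve_finance_model("finance-model-x", demo_env))
print(resolve_finance_model("   ", demo_env))
```

In real use the `env` argument would be `os.environ`. Note that because the fallback uses `or`, an empty `FINANCE_BASE_URL` or `FINANCE_API_KEY` is treated the same as an unset one.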

tutorial/example_deep_finance/judge/__init__.py

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -3,5 +3,12 @@
33
from .presentation_quality.grader import PresentationQualityGrader
44
from .audit.grader import AuditGrader
55
from .ebtu.grader import EBTUTraceabilityGrader
6+
from .finance.grader import FinanceCompositionEvaluator
67

7-
__all__ = ["PresentationQualityGrader", "GroundingGrader", "AuditGrader", "EBTUTraceabilityGrader"]
8+
__all__ = [
9+
"PresentationQualityGrader",
10+
"GroundingGrader",
11+
"AuditGrader",
12+
"EBTUTraceabilityGrader",
13+
"FinanceCompositionEvaluator",
14+
]
Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,4 @@
1+
"""Finance Composition Evaluator - OpenJudge-based composite Finance evaluator"""
2+
from .grader import FinanceCompositionEvaluator
3+
4+
__all__ = ["FinanceCompositionEvaluator"]
