
feat: integrate FinWorldJudge with OpenJudge support & add project blogs #21

Open
TaoShuchang wants to merge 11 commits into main from dev/shuchang_newjudge


@TaoShuchang
Collaborator

Description

This Pull Request introduces the FinWorldJudgeByOpenJudge protocol to enhance the automated evaluation capabilities of AgentJet in financial scenarios. Additionally, it includes comprehensive documentation updates, including bilingual blogs and an improved README to better guide users and contributors.

Key Changes

1. Core Logic & Evaluation

  • New Judge Protocol: Implemented FinWorldJudgeByOpenJudge, leveraging the openjudge framework to provide more nuanced and reliable scoring for financial agent tasks.
  • Environment Integration: Seamlessly integrated the new judge into the evaluation pipeline, ensuring compatibility with existing financial benchmarks (FinWorld).
  • Dependency Update: Added openjudge to the project requirements to support the new evaluation backend.

2. Documentation & Community

  • Bilingual Blogs: Added both English and Chinese technical blogs detailing the design philosophy behind AgentJet and the implementation of the new judging mechanism.
  • README Enhancements:
    • Updated the main README.md with clearer setup instructions.
    • Added a dedicated section for the new FinWorldJudge protocol.
    • Improved the overall project structure description for better developer onboarding.

3. Git Maintenance

  • Resolved merge conflicts between dev/shuchang_newjudge and the main branch to ensure a clean merge.

Type of Change

  • New Feature: Integration of FinWorldJudgeByOpenJudge.
  • Documentation: Added CN/EN blogs and updated README.
  • Refactoring: Conflict resolution and dependency management.

…on OpenJudge

- Refactored reward_metric_helper, optimizing the data structure and statistical logic of OpenJudge and Finance Evaluator

- Added the DeepFinanceJudgeByOpenJudge class to achieve unified calls and weighted fusion across multiple Graders

- Supports both RM Gallery and Finance Evaluator as evaluation sources, enhancing evaluation dimensions

- Asynchronously calls OpenJudge Runner, adding retry and error handling mechanisms

- Implements cached loading of reference answers, improving RM Gallery evaluation efficiency

- Added tool call penalty calculation, fusing step_reward and scores from each Grader

- Added automatic saving of debug information when the OpenJudge scores from all Graders are zero

- Logging and timing statistics cover the entire evaluation process, facilitating performance monitoring and troubleshooting
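
For readers of the commit notes above, the weighted fusion could look roughly like the sketch below. It is an illustration only: the helper name `fuse_scores`, the normalization by the sum of active weights, and the way `step_reward` and the tool call penalty are combined are assumptions, not the logic in `DeepFinanceJudgeByOpenJudge`.

```python
from typing import Dict


def fuse_scores(
    scores: Dict[str, float],
    weights: Dict[str, float],
    step_reward: float = 0.0,
    tool_penalty: float = 0.0,
) -> float:
    """Hypothetical fusion: weighted mean of grader scores, plus step reward, minus penalty."""
    # Graders with a zero (or negative) weight are treated as disabled.
    active = {name: w for name, w in weights.items() if w > 0.0}
    total_weight = sum(active.values())
    if total_weight == 0.0:
        fused = 0.0
    else:
        # A grader that returned no score contributes 0.0 (assumption).
        fused = sum(w * scores.get(name, 0.0) for name, w in active.items()) / total_weight
    return fused + step_reward - tool_penalty
```

Normalizing by the active weights keeps the fused score on the same scale as the individual grader scores when some graders are switched off with a weight of 0.0; a plain weighted sum would be an equally plausible reading of the commit note.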
…dependent Model Configuration

- Added a new OpenJudge-based `FinanceCompositionEvaluator` to replace the legacy implementation.
- Implemented domain-based routing to direct requests to the appropriate set of graders, supporting multiple fields such as stock analysis and industry research.
- Implemented an asynchronous pairwise evaluation interface that returns scores within the 0–1 range.
- Enabled independent configuration for `finance_llm`; if not explicitly configured, the general `openjudge_llm` model is reused.
- Cleaned up redundant imports and deprecated code within `DeepFinanceJudgeByOpenJudge`.
- Updated `deep_finance_openjudge_template.yaml` to include documentation for the `finance_llm` option.
- Refined the description of "evidence traceability" in `deep_finance.md`, renaming it to "Reference Logic Audit" and enhancing the details regarding the workflow and judgment criteria.
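
The `finance_llm` fallback described above can be sketched as follows. The function name `resolve_finance_llm` and the plain-mapping config are hypothetical; only the two key names and the "reuse `openjudge_llm` when unset" rule come from the commit notes.

```python
from typing import Mapping, Optional


def resolve_finance_llm(judge_cfg: Mapping[str, Optional[str]]) -> str:
    """Return the model name to use for Finance evaluation."""
    # An empty or missing finance_llm means "not explicitly configured".
    finance = (judge_cfg.get("finance_llm") or "").strip()
    if finance:
        return finance
    general = (judge_cfg.get("openjudge_llm") or "").strip()
    if not general:
        raise ValueError("neither finance_llm nor openjudge_llm is configured")
    # Fall back to the general OpenJudge model.
    return general
```

Treating an empty string the same as a missing key matches the template comment that says the option may be left empty.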

@gemini-code-assist bot left a comment

Code Review

This pull request refactors the reward calculation and evaluation framework for the Finance Deep Research Agent, transitioning from the RM Gallery implementation to a more flexible OpenJudge-based FinanceCompositionEvaluator. It also updates the training infrastructure, configuration templates, and documentation to support this new evaluation approach. My feedback focuses on improving the robustness of the training scripts by removing hardcoded paths in favor of environment variables, fixing documentation errors, and cleaning up unused configuration templates.

Comment on lines +53 to +54
export TRAIN_DATA_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/train_merged_all.json"
export TRAIN_REF_ANS_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/Reference_merged_all.json"

critical

The script hardcodes user-specific paths for TRAIN_DATA_PATH and TRAIN_REF_ANS_PATH. These paths will not work on other machines. The script already sources the .env file, which is the correct place for these configurations. Please remove these export lines to allow the values from the .env file to be used.
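
Once the exports are removed, a missing `.env` entry would otherwise only surface later in training. A small preflight check on the Python side could catch it early. This is a sketch under assumptions: `require_env_path` is a hypothetical helper, and nothing in this PR indicates that such a check exists in AgentJet.

```python
import os
from pathlib import Path


def require_env_path(name: str) -> Path:
    """Return the file path stored in environment variable `name`, or fail loudly."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; define it in .env rather than in the script")
    path = Path(value)
    if not path.is_file():
        raise FileNotFoundError(f"{name} points to a missing file: {path}")
    return path
```

It would be called once at startup, for example `require_env_path("TRAIN_DATA_PATH")` and `require_env_path("TRAIN_REF_ANS_PATH")`.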

export RAY_CLUSTER_MODE="multi_node"
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}" # AgentJet may use this path internally
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}"
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"

critical

The DEEPFINANCE_SCRIPT variable contains a hardcoded, user-specific path to conda.sh (/mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh). This will fail on any other developer's machine. The .env_sample file already defines a CONDA_PATH variable for this purpose. Please use that variable here.

Suggested change
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
export DEEPFINANCE_SCRIPT="source ${CONDA_PATH} && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"

Comment on lines +54 to +55
export TRAIN_DATA_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/train_merged_all.json"
export TRAIN_REF_ANS_PATH="/mnt/data_cpfs/taoshuchang.tsc/deepresearch/AgentJet_new/tutorial/example_deep_finance/data/Reference_merged_all.json"

critical

The script hardcodes user-specific paths for TRAIN_DATA_PATH and TRAIN_REF_ANS_PATH. These paths will not work on other machines. The script already sources the .env file, which is the correct place for these configurations. Please remove these export lines to allow the values from the .env file to be used.


export PYTHONPATH="${AJET_ROOT}:${OPENJUDGE_ROOT}:${PYTHONPATH}"
export DEEPFINANCE_PATH="${ENV_SERVICE_ROOT}"
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"

critical

The DEEPFINANCE_SCRIPT variable contains a hardcoded, user-specific path to conda.sh (/mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh). This will fail on any other developer's machine. The .env_sample file already defines a CONDA_PATH variable for this purpose. Please use that variable here.

Suggested change
export DEEPFINANCE_SCRIPT="source /mnt/data/taoshuchang.tsc/anaconda3/etc/profile.d/conda.sh && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"
export DEEPFINANCE_SCRIPT="source ${CONDA_PATH} && conda activate finworld_1209 && cd ${ENV_SERVICE_ROOT} && DEEPFINANCE_TOOL_RESULT_MAX_CHARS=${DEEPFINANCE_TOOL_RESULT_MAX_CHARS} DEEPFINANCE_MCP_CONFIG=${DEEPFINANCE_MCP_CONFIG} CACHE_TYPE=${CACHE_TYPE} MONGO_URI=${MONGO_URI} MONGO_DB_NAME=${MONGO_DB_NAME} MONGO_COLLECTION_NAME=${MONGO_COLLECTION_NAME} python -m env_service.env_service --env finworld --portal 0.0.0.0 --port 8080"

Comment on lines +201 to +207
| **model** | **finance** | **others** | **overall** | | | | | | | | | | | | |
| ------------------------------- | ----------------- | ---------- | --------------------- | ----------- | ----------------- | ----------------- | ------- | --------------------- | ----------- | ----------------- | ----------------- | ------- | --------------------- | ----------- | ----------------- |
| | comprehensiveness | insight | instruction_following | readability | **overall_score** | comprehensiveness | insight | instruction_following | readability | **overall_score** | comprehensiveness | insight | instruction_following | readability | **overall_score** |
| **Qwen3-30B-A3B-Instruct-2507** | 0.181 | 0.169 | 0.191 | 0.211 | 0.184 | 0.112 | 0.111 | 0.117 | 0.137 | 0.118 | 0.122 | 0.119 | 0.128 | 0.148 | 0.127 |
| **Tongyi DeepResearch** | 0.291 | 0.282 | 0.316 | 0.313 | 0.296 | 0.270 | 0.260 | 0.289 | 0.290 | 0.274 | 0.273 | 0.263 | 0.293 | 0.293 | 0.277 |
| **Claude 3.7** | 0.404 | 0.398 | 0.465 | 0.416 | 0.417 | 0.412 | 0.406 | 0.462 | 0.417 | 0.423 | 0.411 | 0.405 | 0.462 | 0.417 | 0.422 |
| **Ours** | 0.476 | 0.472 | 0.488 | 0.487 | 0.479 | 0.470 | 0.470 | 0.485 | 0.484 | 0.475 | 0.471 | 0.471 | 0.485 | 0.484 | **0.476** |

medium

The current markdown table is very wide and difficult to read due to the attempt to simulate colspan for headers. This is not standard in markdown and may render poorly in some viewers. For better readability and correctness, I suggest restructuring the table into a 'long' format.

Here is an example of a more conventional and readable structure:

| Model                           | Category | Comprehensiveness | Insight | Instruction Following | Readability | Overall Score |
| ------------------------------- | -------- | ----------------- | ------- | --------------------- | ----------- | ------------- |
| **Qwen3-30B-A3B-Instruct-2507** | finance  | 0.181             | 0.169   | 0.191                 | 0.211       | 0.184         |
|                                 | others   | 0.112             | 0.111   | 0.117                 | 0.137       | 0.118         |
|                                 | overall  | 0.122             | 0.119   | 0.128                 | 0.148       | 0.127         |
| **Tongyi DeepResearch**         | finance  | 0.291             | 0.282   | 0.316                 | 0.313       | 0.296         |
|                                 | others   | 0.270             | 0.260   | 0.289                 | 0.290       | 0.274         |
|                                 | overall  | 0.273             | 0.263   | 0.293                 | 0.293       | 0.277         |
| **Claude 3.7**                  | finance  | 0.404             | 0.398   | 0.465                 | 0.416       | 0.417         |
|                                 | others   | 0.412             | 0.406   | 0.462                 | 0.417       | 0.423         |
|                                 | overall  | 0.411             | 0.405   | 0.462                 | 0.417       | 0.422         |
| **Ours**                        | finance  | 0.476             | 0.472   | 0.488                 | 0.487       | 0.479         |
|                                 | others   | 0.470             | 0.470   | 0.485                 | 0.484       | 0.475         |
|                                 | overall  | 0.471             | 0.471   | 0.485                 | 0.484       | **0.476**     |
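
If the wide table is generated from data, the reshaping into the long layout can be scripted. The sketch below uses only the standard library; `wide_to_long` is a hypothetical helper, and it assumes each wide row lists the five metrics for finance, then others, then overall, as in the quoted table.

```python
from typing import Dict, List, Sequence, Union

CATEGORIES = ["finance", "others", "overall"]
METRICS = [
    "comprehensiveness",
    "insight",
    "instruction_following",
    "readability",
    "overall_score",
]


def wide_to_long(model: str, values: Sequence[float]) -> List[Dict[str, Union[str, float]]]:
    """Split one wide row (3 categories x 5 metrics) into three long rows."""
    width = len(METRICS)
    if len(values) != len(CATEGORIES) * width:
        raise ValueError(f"expected {len(CATEGORIES) * width} values, got {len(values)}")
    rows: List[Dict[str, Union[str, float]]] = []
    for index, category in enumerate(CATEGORIES):
        # Each category owns a contiguous slice of five metric values.
        chunk = values[index * width:(index + 1) * width]
        rows.append({"model": model, "category": category, **dict(zip(METRICS, chunk))})
    return rows
```

Each returned dictionary corresponds to one line of the long table, so the markdown can be emitted with a simple loop over the rows.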


1. Xie, Q., et al. (2024). *FinBen: A Holistic Financial Benchmark for Large Language Models*. arXiv:2402.12659.
2. Du, M., et al. (2025). *DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents*. arXiv:2506.11763.
3. FInance Tool API:[https://basic.10jqka.com.cn/](https://basic.10jqka.com.cn/601899/equity.html#stockpage)

medium

There is a typo in "FInance". It should be "Finance".

Suggested change
3. FInance Tool API:[https://basic.10jqka.com.cn/](https://basic.10jqka.com.cn/601899/equity.html#stockpage)
3. Finance Tool API:[https://basic.10jqka.com.cn/](https://basic.10jqka.com.cn/601899/equity.html#stockpage)

cd /path/to/AgentJet
bash install.sh # TODO: reduce this part to a single install: https://yuque.alibaba-inc.com/bayotg/wxz7sb/qdesuu33621x2yhi
# Please install ajet with uv
git clone -b dev/shuchang_newjudge https://github.com/modelscope/AgentJet.git

medium

The tutorial instructs users to clone a specific development branch (dev/shuchang_newjudge). This is not ideal for documentation, as development branches can be temporary, rebased, or deleted, which would break the instructions for future users. It's better to point to the main branch or a stable release tag.

Suggested change
git clone -b dev/shuchang_newjudge https://github.com/modelscope/AgentJet.git
git clone https://github.com/modelscope/AgentJet.git

| `EBTU_WEIGHT` | 0.0 | Evidence traceability weight (optional) |
| `AUDIT_WEIGHT` | 0.0 | Reference logic audit weight (optional) |
```bash
bash AgentJet/tutorial/example_deep_finance/deep_finance.sh

medium

The path in this command is incorrect. The preceding instructions have the user cd into the AgentJet directory. Therefore, the AgentJet/ prefix in the path is redundant and will cause the command to fail.

Suggested change
bash AgentJet/tutorial/example_deep_finance/deep_finance.sh
bash tutorial/example_deep_finance/deep_finance.sh

Comment on lines +1 to +22
# ------------------ OpenJudge Finance configuration ------------------
# Note: Finance evaluation now uses the OpenJudge FinanceCompositionEvaluator
# finance_llm can be configured separately; if unset, openjudge_llm is reused
ajet:
project_name: "{{PREFIX}}"
experiment_name: "{{SUFFIX}}"
# Judge configuration (nested structure, maps to self.config.ajet.judge.*)
judge:
openjudge_llm: {{OPENJUDGE_LLM}} # OpenJudge model (for general evaluation)
finance_llm: {{FINANCE_LLM}} # Model dedicated to Finance evaluation (optional; leave empty to reuse openjudge_llm)
concurrency: {{JUDGE_CONCURRENCY}} # Judge concurrency
train_ref_ans_path: {{TRAIN_REF_ANS_PATH}} # Path to the training-set reference answers
val_ref_ans_path: {{VAL_REF_ANS_PATH}} # Path to the validation-set reference answers
# Weight configuration
# rm_weight: Finance evaluation weight (uses FinanceCompositionEvaluator; supports stock_analysis/industry/macro/event/search)
rm_weight: {{RM_WEIGHT}}
presentation_quality_weight: {{PRESENTATION_QUALITY_WEIGHT}} # Report presentation quality evaluation
grounding_weight: {{GROUNDING_WEIGHT}} # Citation conformance evaluation
cgcv_weight: {{CGCV_WEIGHT}} # Citation-Grounded Claim Verification
audit_weight: {{AUDIT_WEIGHT}} # Reference logic audit
traceability_weight: {{TRACEABILITY_WEIGHT}} # Traceability/verifiability audit (TVR)
ebtu_weight: {{EBTU_WEIGHT}} # EBTU evidence-first traceability audit

medium

This YAML template file appears to be unused. The training scripts (deep_finance.sh and deep_finance_single.sh) use deepfinance_template.yaml instead. Furthermore, this file contains placeholders like {{CGCV_WEIGHT}}, {{TRACEABILITY_WEIGHT}}, and {{EBTU_WEIGHT}} which are no longer defined or substituted in the shell scripts. If this file were to be used, it would cause a configuration parsing error. To avoid confusion and prevent future errors, it's best to remove this file from the repository.
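
A lightweight guard against the failure mode described here is to scan a rendered config for leftover `{{NAME}}` tokens before it is parsed. The sketch below assumes placeholders are upper-case identifiers wrapped in double braces, as in the quoted template; `find_unresolved` is a hypothetical helper name, not part of the repository.

```python
import re
from typing import List

# Matches tokens such as {{CGCV_WEIGHT}}, with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Z0-9_]+)\s*\}\}")


def find_unresolved(rendered_text: str) -> List[str]:
    """Return the sorted, de-duplicated names of placeholders still present in the text."""
    return sorted(set(PLACEHOLDER.findall(rendered_text)))
```

A launcher could call this on the substituted file and abort with the list of names if it is non-empty, which gives a clearer message than a downstream YAML parsing error.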

- Revise AgentJet installation steps with detailed commands and environment setup
- Add installation and startup guide for Finance MCP service with API key notes
- Organize README sections: dependencies, service startup, environment variables, training
- Add and standardize MAX_RESPONSE_LENGTH variable in deep_finance.sh and deep_finance_single.sh
- Improve script root directory detection and default variable settings
- Modify YAML template to use dynamic max_response_length configuration value