
Commit caf13cb

docs: update README
1 parent 7a27bad commit caf13cb

2 files changed

Lines changed: 211 additions & 0 deletions


README.md

Lines changed: 2 additions & 0 deletions
@@ -19,6 +19,8 @@

 GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

+[English](README.md) | [中文](README_ZH.md)
+
 <details close>
 <summary><b>📚 Table of Contents</b></summary>

README_ZH.md

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
<p align="center">
<img src="resources/images/logo.png"/>
</p>

<!-- icon -->

[![stars](https://img.shields.io/github/stars/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)
[![forks](https://img.shields.io/github/forks/open-sciencelab/GraphGen.svg)](https://github.com/open-sciencelab/GraphGen)
[![open issues](https://img.shields.io/github/issues-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)
[![issue resolution](https://img.shields.io/github/issues-closed-raw/open-sciencelab/GraphGen)](https://github.com/open-sciencelab/GraphGen/issues)
[![documentation](https://img.shields.io/badge/docs-latest-blue)](https://graphgen-cookbook.readthedocs.io/en/latest/)
[![wechat](https://img.shields.io/badge/wechat-brightgreen?logo=wechat&logoColor=white)](https://cdn.vansin.top/internlm/dou.jpg)
[![arXiv](https://img.shields.io/badge/Paper-arXiv-white)](https://arxiv.org/abs/2505.20416)
[![Hugging Face](https://img.shields.io/badge/Paper-on%20HF-white?logo=huggingface&logoColor=yellow)](https://huggingface.co/papers/2505.20416)

[![Hugging Face](https://img.shields.io/badge/Demo-on%20HF-blue?logo=huggingface&logoColor=yellow)](https://huggingface.co/spaces/chenzihong/GraphGen)
[![OpenXLab](https://img.shields.io/badge/Demo-on%20OpenXLab-blue?logo=openxlab&logoColor=yellow)](https://g-app-center-000704-6802-aerppvq.openxlab.space)

GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation

[English](README.md) | [中文](README_ZH.md)

<details close>
<summary><b>📚 Table of Contents</b></summary>

- 📝 [What is GraphGen?](#-什么是-graphgen)
- 📌 [Latest Updates](#最新更新)
- 🚀 [Quick Start](#快速开始)
- 🏗️ [System Architecture](#系统架构)
- 🍀 [Acknowledgements](#致谢)
- 📚 [Citation](#引用)
- 📜 [License](#许可证)

[//]: # (- 🌟 [Key Features](#主要特性))
[//]: # (- 📅 [Roadmap](#路线图))
[//]: # (- 💰 [Cost Analysis](#成本分析))
[//]: # (- ⚙️ [Configuration](#配置说明))

</details>


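## 📝 What is GraphGen?

GraphGen is a synthetic data generation framework guided by knowledge graphs. See the [**paper**](https://arxiv.org/abs/2505.20416) and the [best practices](https://github.com/open-sciencelab/GraphGen/issues/17).

The table below shows post-training results when more than 50% of the SFT data comes from GraphGen and our data-cleaning pipeline:

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
| :-: | :-: | :-: | :-: |
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| General knowledge | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |

GraphGen first builds a fine-grained knowledge graph from the source text, then uses the expected calibration error (ECE) metric to identify knowledge gaps in the LLM, prioritizing QA pairs that target high-value, long-tail knowledge.
In addition, GraphGen uses multi-hop neighborhood sampling to capture complex relational information and style-controlled generation to diversify the resulting QA data.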
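
As a rough illustration of the expected calibration error mentioned above, the sketch below computes the standard binned ECE from per-sample confidences and correctness labels. This is the generic textbook formulation, not GraphGen's implementation: the function name `expected_calibration_error`, the bin count, and the sample values are assumptions made for illustration, and the way GraphGen applies the metric to knowledge-graph statements is described in the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Binned ECE: sample-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        return 0.0

    # Assign each sample to an equal-width confidence bin over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that conf == 1.0 (or slightly out-of-range input) stays in range.
        idx = min(max(int(conf * n_bins), 0), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Hypothetical values: high confidence, but only half of the answers are right.
    confs = [0.9, 0.9, 0.9, 0.9]
    labels = [True, True, False, False]
    print(round(expected_calibration_error(confs, labels), 3))
```

A large value indicates that the model's stated confidence diverges from how often it is actually right, which is the kind of signal a gap-finding step can rank on.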

## 📌 Latest Updates

- **2025.07.31**: Added Google, Bing, Wikipedia, and UniProt as search back-ends to help fill data gaps.
- **2025.04.21**: Released the initial version of GraphGen.

## 🚀 Quick Start

Try GraphGen through the [Web](https://g-app-center-000704-6802-aerppvq.openxlab.space) or the [backup Web entrance](https://openxlab.org.cn/apps/detail/tpoisonooo/GraphGen).

If you have any questions, check the [FAQ](https://github.com/open-sciencelab/GraphGen/issues/10), open a new [issue](https://github.com/open-sciencelab/GraphGen/issues), or ask in our [WeChat group](https://cdn.vansin.top/internlm/dou.jpg).

### Preparation

1. Install [uv](https://docs.astral.sh/uv/reference/installer/)

```bash
# If you run into network issues, try installing uv with pipx or pip instead; see the uv documentation
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Clone the repository

```bash
git clone https://github.com/open-sciencelab/GraphGen
cd GraphGen
```
3. Create a new uv environment

```bash
uv venv --python 3.10
```
4. Install the dependencies

```bash
uv pip install -r requirements.txt
```

### Run the Gradio Demo

```bash
uv run webui/app.py
```

![ui](https://github.com/user-attachments/assets/3024e9bc-5d45-45f8-a4e6-b57bd2350d84)

### Run from PyPI

1. Install GraphGen
```bash
uv pip install graphg
```

2. Run from the CLI
```bash
SYNTHESIZER_MODEL=your_synthesizer_model_name \
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model \
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model \
TRAINEE_MODEL=your_trainee_model_name \
TRAINEE_BASE_URL=your_base_url_for_trainee_model \
TRAINEE_API_KEY=your_api_key_for_trainee_model \
graphg --output_dir cache
```
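
Before launching the CLI, it can help to confirm that all six variables used in the command above are set. The following is a minimal sketch and not part of GraphGen: the helper name `missing_graphgen_vars` is hypothetical, and only the variable names are taken from the command.

```python
import os
from typing import List, Mapping, Optional

# Variable names come from the CLI example; everything else is illustrative.
REQUIRED_VARS = (
    "SYNTHESIZER_MODEL",
    "SYNTHESIZER_BASE_URL",
    "SYNTHESIZER_API_KEY",
    "TRAINEE_MODEL",
    "TRAINEE_BASE_URL",
    "TRAINEE_API_KEY",
)


def missing_graphgen_vars(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the required variables that are unset or blank in `env`."""
    source = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not source.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_graphgen_vars()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All six GraphGen environment variables are set.")
```

Passing a mapping explicitly, instead of reading `os.environ`, keeps the check easy to exercise in isolation.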

### Run from Source

1. Configure the environment
- Create a `.env` file in the project root
```bash
cp .env.example .env
```
- Set the following environment variables:
```bash
# Synthesizer is the model used to build the knowledge graph and generate data
SYNTHESIZER_MODEL=your_synthesizer_model_name
SYNTHESIZER_BASE_URL=your_base_url_for_synthesizer_model
SYNTHESIZER_API_KEY=your_api_key_for_synthesizer_model
# Trainee is the model to be trained on the generated data
TRAINEE_MODEL=your_trainee_model_name
TRAINEE_BASE_URL=your_base_url_for_trainee_model
TRAINEE_API_KEY=your_api_key_for_trainee_model
```
2. (Optional) To change the default generation settings, edit the YAML files in the `graphgen/configs/` folder.

For example:

```yaml
# configs/cot_config.yaml
input_data_type: raw
input_file: resources/input_examples/raw_demo.jsonl
output_data_type: cot
tokenizer: cl100k_base
# additional settings...
```

3. Generate data

Pick the format you need and run the matching script:

| Format       | Script to run                                  | Description |
|--------------|------------------------------------------------|--------------|
| `cot`        | `bash scripts/generate/generate_cot.sh`        | Chain-of-thought QA pairs |
| `atomic`     | `bash scripts/generate/generate_atomic.sh`     | Atomic QA pairs covering basic knowledge |
| `aggregated` | `bash scripts/generate/generate_aggregated.sh` | Aggregated QA pairs that integrate complex knowledge |
| `multi-hop`  | `bash scripts/generate/generate_multihop.sh`   | Multi-hop reasoning QA pairs |


4. View the generated results
```bash
ls cache/data/graphgen
```
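
To inspect the output programmatically rather than with `ls`, a small sketch like the following lists every file under the output directory together with its size. The directory path comes from the command above; the helper name `list_outputs` is hypothetical, and no assumption is made about the file names or formats GraphGen writes.

```python
from pathlib import Path
from typing import List, Tuple


def list_outputs(root: str = "cache/data/graphgen") -> List[Tuple[str, int]]:
    """Return (relative path, size in bytes) for every file under `root`, sorted by path."""
    base = Path(root)
    if not base.is_dir():
        # Nothing has been generated yet, or the path is wrong.
        return []
    return sorted(
        (p.relative_to(base).as_posix(), p.stat().st_size)
        for p in base.rglob("*")
        if p.is_file()
    )


if __name__ == "__main__":
    for rel_path, size in list_outputs():
        print(f"{size:>10}  {rel_path}")
```

An empty result means the directory does not exist or holds no files, which is a quick way to notice that a generation script did not write where expected.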

### Run with Docker
1. Build the image
```bash
docker build -t graphgen .
```
2. Start the container
```bash
docker run -p 7860:7860 graphgen
```


## 🏗️ System Architecture
See the [analysis](https://deepwiki.com/open-sciencelab/GraphGen) on deepwiki for a technical overview of the GraphGen system, its architecture, and its core features.


### Workflow
![workflow](resources/images/flow.png)


## 🍀 Acknowledgements
- [SiliconFlow](https://siliconflow.cn): a rich set of LLM APIs, with some models free of charge
- [LightRAG](https://github.com/HKUDS/LightRAG): a simple and efficient graph retrieval solution
- [ROGRAG](https://github.com/tpoisonooo/ROGRAG): a robustly optimized GraphRAG framework
- [DB-GPT](https://github.com/eosphoros-ai/DB-GPT): an AI-native data application development framework


## 📚 Citation
If you find this project helpful, please consider citing our work:
```bibtex
@misc{chen2025graphgenenhancingsupervisedfinetuning,
  title={GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation},
  author={Zihong Chen and Wanli Jiang and Jinzhe Li and Zhonghang Yuan and Huanjun Kong and Wanli Ouyang and Nanqing Dong},
  year={2025},
  eprint={2505.20416},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.20416},
}
```

## 📜 License
This project is licensed under the [Apache License 2.0](LICENSE).
