Commit d6955a3

docs: update README
1 parent d68fcdd commit d6955a3

2 files changed: +55 −19 lines
README.md

Lines changed: 28 additions & 10 deletions
@@ -45,16 +45,6 @@ GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthe
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
*(Removed: the post-training results table, relocated to the new "Effectiveness of GraphGen" → "SFT" section.)*
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
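The calibration idea above can be made concrete with a minimal sketch of expected calibration error: bucket a model's answers by confidence and compare each bucket's accuracy against its mean confidence. The bin count and the toy inputs are illustrative assumptions, not GraphGen's actual implementation.

```python
# Minimal ECE sketch (illustrative, not GraphGen's exact code):
# ECE = sum over bins of (bin_size / N) * |bin accuracy - bin mean confidence|.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: matching booleans."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# A model that is confident but usually wrong is badly calibrated,
# which flags the underlying facts as high-value targets for QA generation:
overconfident = expected_calibration_error([0.9, 0.95, 0.9, 0.92],
                                           [True, False, False, False])
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5],
                                        [True, False, True, False])
```

A high ECE on a slice of the graph suggests the model's confidence is untrustworthy there, so those entities are prioritized when generating QA pairs.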
Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
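Multi-hop neighborhood sampling can be sketched as a bounded breadth-first walk from a seed entity that collects the induced triples, so one QA pair can span several related entities. The adjacency format, the hop limit, and the toy plant-domain graph are illustrative assumptions rather than GraphGen's real sampler.

```python
# Hedged sketch of multi-hop neighborhood sampling over a knowledge graph.
from collections import deque

def sample_neighborhood(graph, seed, max_hops=2):
    """Return (entities, triples) reachable within max_hops of seed.

    graph: dict mapping entity -> list of (relation, neighbor) pairs.
    """
    visited = {seed: 0}          # entity -> hop depth at which it was reached
    triples = []
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        depth = visited[node]
        if depth == max_hops:    # do not expand beyond the hop budget
            continue
        for relation, neighbor in graph.get(node, []):
            triples.append((node, relation, neighbor))
            if neighbor not in visited:
                visited[neighbor] = depth + 1
                queue.append(neighbor)
    return set(visited), triples

# Toy plant-domain graph (hypothetical example data):
kg = {
    "rice": [("grown_in", "paddy field"), ("has_gene", "OsSWEET14")],
    "OsSWEET14": [("targeted_by", "Xanthomonas oryzae")],
    "Xanthomonas oryzae": [("causes", "bacterial blight")],
}
entities, triples = sample_neighborhood(kg, "rice", max_hops=2)
```

With `max_hops=2`, the sample reaches the pathogen two hops from "rice" but stops before "bacterial blight", keeping each QA pair's context bounded.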

@@ -82,6 +72,34 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
</details>
## Effectiveness of GraphGen

### Pretrain

Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (the MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen adds a **rephrase pipeline**: LLM-driven reformulation generates diverse variants of the same corpus instead of redundant repetition.
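The pipeline's shape can be sketched as expanding each corpus chunk into one variant per rephrasing style. The style names and the `rephrase_fn` hook (which would wrap a real LLM call in practice) are assumptions for illustration, not GraphGen's actual interface; a stub stands in for the LLM so the structure is testable offline.

```python
# Hedged sketch of a rephrase pipeline: instead of repeating a chunk
# verbatim, emit one stylistic variant per rephrasing strategy.

STYLES = ["executive summary", "cross-domain analogy"]  # illustrative names

def rephrase_corpus(chunks, rephrase_fn, styles=STYLES):
    """Expand each chunk into one variant per style via rephrase_fn(text, style)."""
    variants = []
    for chunk in chunks:
        for style in styles:
            variants.append({"style": style, "text": rephrase_fn(chunk, style)})
    return variants

# Stub standing in for an LLM call:
def fake_llm(text, style):
    return f"[{style}] {text}"

dataset = rephrase_corpus(["Rice is a staple crop."], fake_llm)
```

The token budget stays tied to the original corpus size times the number of styles, which matches the "zero additional data" framing: no new documents, only new phrasings.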
**Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).

| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | **Average** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B (baseline) | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| Executive-Summary Rephrase | 26.43 | **22.70** | **24.75** | **1.36** | **26.19** | **51.90** | **25.56** (1.32↑) |
| Cross-Domain Rephrase | **28.79** | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (0.90↑) |

Both rephrase strategies lift the average by roughly one point over the baseline with **zero additional data**: all gains come from how the same knowledge is expressed.
### SFT

Here are post-training results where **over 50% of the SFT data** comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:---------:|:---------------------------------------------------------:|:--------:|:------------------------------:|
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| Common | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |
## ⚙️ Support List

README_zh.md

Lines changed: 27 additions & 9 deletions
@@ -46,15 +46,6 @@ GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthe
GraphGen is a data synthesis framework based on knowledge graphs. Please see the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
*(Removed: the post-training results table, relocated to the new "Effectiveness of GraphGen" → "SFT" section.)*

GraphGen first constructs a fine-grained knowledge graph from the source text, then uses the expected calibration error metric to identify knowledge gaps in LLMs, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
In addition, GraphGen uses multi-hop neighborhood sampling to capture complex relational information, and applies style-controlled generation to enrich the diversity of the QA data.
@@ -84,6 +75,33 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
</details>
## Effectiveness of GraphGen

### Pretrain

Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (the MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen introduces a **rephrase pipeline**: an LLM rewrites the corpus to produce multiple expressive variants of the same knowledge, replacing plain repeated training.

**Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).
| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | **Average** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B (baseline) | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| Executive-Summary Rephrase | 26.43 | **22.70** | **24.75** | **1.36** | **26.19** | **51.90** | **25.56** (1.32↑) |
| Cross-Domain Rephrase | **28.79** | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (0.90↑) |

Both rephrase strategies improve the average over the baseline by roughly one point with **zero additional data**: all gains come from expressing the same knowledge in different ways.
### SFT

Here are post-training results where over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline:

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:--:|:---------------------------------------------------------:|:--------:|:-----------------------:|
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| Common | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |
## ⚙️ Support List

We support a variety of LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
