Commit d6955a3

docs: update README
1 parent d68fcdd commit d6955a3

2 files changed: +55 −19 lines
README.md

Lines changed: 28 additions & 10 deletions
@@ -45,16 +45,6 @@ GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthe
GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
*(Removed: the post-training results table, relocated to the new "Effectiveness of GraphGen" → "SFT" section.)*
It begins by constructing a fine-grained knowledge graph from the source text, then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
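The calibration idea above can be made concrete with a minimal sketch of expected calibration error: bucket a model's answers by confidence and compare each bucket's accuracy against its mean confidence. The bin count and the toy inputs are illustrative assumptions, not GraphGen's actual implementation.

```python
# Minimal ECE sketch (illustrative, not GraphGen's exact code):
# ECE = sum over bins of (bin_size / N) * |bin accuracy - bin mean confidence|.

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: floats in [0, 1]; correct: matching booleans."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(accuracy - avg_conf)
    return ece

# A model that is confident but usually wrong is badly calibrated,
# which flags the underlying facts as high-value targets for QA generation:
overconfident = expected_calibration_error([0.9, 0.95, 0.9, 0.92],
                                           [True, False, False, False])
calibrated = expected_calibration_error([0.5, 0.5, 0.5, 0.5],
                                        [True, False, True, False])
```

A high ECE on a slice of the graph suggests the model's confidence is untrustworthy there, so those entities are prioritized when generating QA pairs.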
Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
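Multi-hop neighborhood sampling can be sketched as a bounded breadth-first walk from a seed entity that collects the induced triples, so one QA pair can span several related entities. The adjacency format, the hop limit, and the toy plant-domain graph are illustrative assumptions rather than GraphGen's real sampler.

```python
# Hedged sketch of multi-hop neighborhood sampling over a knowledge graph.
from collections import deque

def sample_neighborhood(graph, seed, max_hops=2):
    """Return (entities, triples) reachable within max_hops of seed.

    graph: dict mapping entity -> list of (relation, neighbor) pairs.
    """
    visited = {seed: 0}          # entity -> hop depth at which it was reached
    triples = []
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        depth = visited[node]
        if depth == max_hops:    # do not expand beyond the hop budget
            continue
        for relation, neighbor in graph.get(node, []):
            triples.append((node, relation, neighbor))
            if neighbor not in visited:
                visited[neighbor] = depth + 1
                queue.append(neighbor)
    return set(visited), triples

# Toy plant-domain graph (hypothetical example data):
kg = {
    "rice": [("grown_in", "paddy field"), ("has_gene", "OsSWEET14")],
    "OsSWEET14": [("targeted_by", "Xanthomonas oryzae")],
    "Xanthomonas oryzae": [("causes", "bacterial blight")],
}
entities, triples = sample_neighborhood(kg, "rice", max_hops=2)
```

With `max_hops=2`, the sample reaches the pathogen two hops from "rice" but stops before "bacterial blight", keeping each QA pair's context bounded.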

@@ -82,6 +72,34 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
</details>
## Effectiveness of GraphGen

### Pretrain

Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (the MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen adds a **rephrase pipeline**: LLM-driven reformulation generates diverse variants of the same corpus instead of redundant repetition.
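The pipeline's shape can be sketched as expanding each corpus chunk into one variant per rephrasing style. The style names and the `rephrase_fn` hook (which would wrap a real LLM call in practice) are assumptions for illustration, not GraphGen's actual interface; a stub stands in for the LLM so the structure is testable offline.

```python
# Hedged sketch of a rephrase pipeline: instead of repeating a chunk
# verbatim, emit one stylistic variant per rephrasing strategy.

STYLES = ["executive summary", "cross-domain analogy"]  # illustrative names

def rephrase_corpus(chunks, rephrase_fn, styles=STYLES):
    """Expand each chunk into one variant per style via rephrase_fn(text, style)."""
    variants = []
    for chunk in chunks:
        for style in styles:
            variants.append({"style": style, "text": rephrase_fn(chunk, style)})
    return variants

# Stub standing in for an LLM call:
def fake_llm(text, style):
    return f"[{style}] {text}"

dataset = rephrase_corpus(["Rice is a staple crop."], fake_llm)
```

The token budget stays tied to the original corpus size times the number of styles, which matches the "zero additional data" framing: no new documents, only new phrasings.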
**Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).

| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | **Average** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B (baseline) | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| Executive-Summary Rephrase | 26.43 | **22.70** | **24.75** | **1.36** | **26.19** | **51.90** | **25.56** (1.32↑) |
| Cross-Domain Rephrase | **28.79** | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (0.90↑) |

Both rephrase strategies lift the average by roughly one point over the baseline with **zero additional data**: all gains come from how the same knowledge is expressed.
### SFT

Here are post-training results where **over 50% of the SFT data** comes from GraphGen and our data-cleaning pipeline.

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:---------:|:---------------------------------------------------------:|:--------:|:------------------------------:|
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| Common | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |
## ⚙️ Support List

README_zh.md

Lines changed: 27 additions & 9 deletions
@@ -46,15 +46,6 @@ GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthe
GraphGen is a data synthesis framework based on knowledge graphs. Please see the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
*(Removed: the post-training results table, relocated to the new "Effectiveness of GraphGen" → "SFT" section.)*

GraphGen first constructs a fine-grained knowledge graph from the source text, then uses the expected calibration error metric to identify knowledge gaps in LLMs, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
In addition, GraphGen uses multi-hop neighborhood sampling to capture complex relational information, and applies style-controlled generation to enrich the diversity of the QA data.
@@ -84,6 +75,33 @@ GraphGen 首先根据源文本构建细粒度的知识图谱,然后利用期
</details>
## Effectiveness of GraphGen

### Pretrain

Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (the MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen introduces a **rephrase pipeline**: an LLM rewrites the corpus to produce multiple expressive variants of the same knowledge, replacing plain repeated training.

**Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).
| Method | ARC-E | ARC-C | HellaSwag | GSM8K | TruthfulQA-MC1 | TruthfulQA-MC2 | **Average** |
|:---|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| SlimPajama-6B (baseline) | 25.55 | 21.08 | 24.48 | 0.08 | 24.36 | 49.90 | 24.24 |
| Executive-Summary Rephrase | 26.43 | **22.70** | **24.75** | **1.36** | **26.19** | **51.90** | **25.56** (1.32↑) |
| Cross-Domain Rephrase | **28.79** | 20.22 | 24.46 | 0.00 | 24.97 | 52.41 | 25.14 (0.90↑) |

Both rephrase strategies improve the average over the baseline by roughly one point with **zero additional data**: all gains come from expressing the same knowledge in different ways.
### SFT

Here are post-training results where over 50% of the SFT data comes from GraphGen and our data-cleaning pipeline:

| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
|:--:|:---------------------------------------------------------:|:--------:|:-----------------------:|
| Plant | [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
| Common | CMMLU | 73.6 | **75.8** |
| Knowledge | GPQA-Diamond | **40.0** | 33.3 |
| Math | AIME24 | **20.6** | 16.7 |
| | AIME25 | **22.7** | 7.2 |
## ⚙️ Support List

We support a variety of LLM inference servers, API servers, inference clients, input file formats, data modalities, output data formats, and output data types.
