Commit 72dcd69

docs: update README
1 parent d6955a3 commit 72dcd69

2 files changed: 2 additions & 2 deletions

README.md

Lines changed: 1 addition & 1 deletion
@@ -75,7 +75,7 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 ## Effectiveness of GraphGen
 ### Pretrain
 
-Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen added a **rephrase pipeline** — using LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.
+Inspired by Kimi-K2's [technical report](https://arxiv.org/pdf/2507.20534) (Improving Token Utility with Rephrasing) and ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework), GraphGen added a **rephrase pipeline** — using LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.
 
 **Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).
 

README_zh.md

Lines changed: 1 addition & 1 deletion
@@ -78,7 +78,7 @@ GraphGen first builds a fine-grained knowledge graph from the source text, then uses the expected
 ## Effectiveness of GraphGen
 ### Pretrain
 
-Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen introduced a **rephrase pipeline**: it uses a large language model to rewrite the corpus, generating multiple expression variants of the same knowledge content in place of traditional simple repeated training.
+Inspired by Kimi-K2's technical report (https://arxiv.org/pdf/2507.20534) (Improving Token Utility with Rephrasing) and ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework), GraphGen introduced a **rephrase pipeline**: it uses a large language model to rewrite the corpus, generating multiple expression variants of the same knowledge content in place of traditional simple repeated training.
 
 **Setup:** Qwen3-0.6B trained from scratch on the [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B) dataset.
 