docs: update README

ChenZiHong-Gavin · ChenZiHong-Gavin · commit 72dcd69dd13c · 2026-03-09T23:24:28.000+08:00
diff --git a/README.md b/README.md
@@ -75,7 +75,7 @@ After data generation, you can use [LLaMA-Factory](https://github.com/hiyouga/LL
 ## Effectiveness of GraphGen
 ### Pretrain
 
-Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen added a **rephrase pipeline** — using LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.
+Inspired by Kimi-K2's [technical report](https://arxiv.org/pdf/2507.20534) (Improving Token Utility with Rephrasing) and ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework), GraphGen added a **rephrase pipeline** that uses LLM-driven reformulation to generate diverse variants of the same corpus instead of redundant repetition.
 
 **Setup:** Qwen3-0.6B trained from scratch on [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B).
 
diff --git a/README_zh.md b/README_zh.md
@@ -78,7 +78,7 @@ GraphGen 首先根据源文本构建细粒度的知识图谱，然后利用期
 ## Effectiveness of GraphGen
 ### Pretrain
 
-Inspired by ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework) and Kimi-K2's [Improving Token Utility with Rephrasing](https://arxiv.org/pdf/2507.20534), GraphGen introduced a **rephrase pipeline**: it uses a large language model to rewrite the corpus, generating multiple phrasings of the same knowledge content in place of traditional simple repeated training.
+Inspired by Kimi-K2's technical report (https://arxiv.org/pdf/2507.20534) (Improving Token Utility with Rephrasing) and ByteDance Seed's [Reformulation for Pretraining Data Augmentation](https://arxiv.org/abs/2507.15752) (MGA framework), GraphGen introduced a **rephrase pipeline**: it uses a large language model to rewrite the corpus, generating multiple phrasings of the same knowledge content in place of traditional simple repeated training.
 
 **Setup:** Qwen3-0.6B trained from scratch on the [SlimPajama-6B](https://huggingface.co/datasets/DKYoon/SlimPajama-6B) dataset.