Commit 1319629

Update README.md
1 parent 67a5085 commit 1319629

1 file changed

Lines changed: 12 additions & 2 deletions

README.md

@@ -19,7 +19,7 @@
 
 GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthetic Data Generation
 
-<details open>
+<details close>
 <summary><b>📚 Table of Contents</b></summary>
 
 - 📝 [What is GraphGen?](#-what-is-graphgen)
@@ -39,7 +39,17 @@ GraphGen: Enhancing Supervised Fine-Tuning for LLMs with Knowledge-Driven Synthe
 
 ## 📝 What is GraphGen?
 
-GraphGen is a framework for synthetic data generation guided by knowledge graphs. Here is our [**paper**](https://arxiv.org/abs/2505.20416), [best practice and LLM precision📊](https://github.com/open-sciencelab/GraphGen/issues/17).
+GraphGen is a framework for synthetic data generation guided by knowledge graphs. Please check the [**paper**](https://arxiv.org/abs/2505.20416) and [best practice](https://github.com/open-sciencelab/GraphGen/issues/17).
+
+Here is post-training result which **over 50% SFT data** comes from GraphGen and our data clean pipeline.
+
+| Domain | Dataset | Ours | Qwen2.5-7B-Instruct (baseline) |
+| :-: | :-: | :-: | :-: |
+| Plant| [SeedBench](https://github.com/open-sciencelab/SeedBench) | **65.9** | 51.5 |
+| Common | CMMLU | 73.6 | **75.8** |
+| Logic | GPQA-Diamond | **40.0** | 33.3 |
+| Math | AIME24 | **20.6** | 16.7 |
+| | AIME25 | **22.7** | 7.2 |
 
 It begins by constructing a fine-grained knowledge graph from the source text,then identifies knowledge gaps in LLMs using the expected calibration error metric, prioritizing the generation of QA pairs that target high-value, long-tail knowledge.
 Furthermore, GraphGen incorporates multi-hop neighborhood sampling to capture complex relational information and employs style-controlled generation to diversify the resulting QA data.
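
The README text in this diff says GraphGen identifies knowledge gaps using the expected calibration error (ECE) metric. As a rough illustration of that metric only, here is a minimal sketch of the standard equal-width binned ECE in Python. The function name, the `n_bins` parameter, and the toy inputs are assumptions made for this example; they are not taken from the GraphGen codebase, which may compute calibration differently.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Standard binned ECE: weighted mean of |accuracy - confidence| per bin.

    Hypothetical helper for illustration; not GraphGen's own implementation.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if len(confidences) == 0:
        raise ValueError("at least one prediction is required")

    total = len(confidences)
    # Per bin: [number of samples, sum of confidences, number of correct answers]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]

    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index][0] += 1
        bins[index][1] += conf
        bins[index][2] += int(is_correct)

    ece = 0.0
    for count, conf_sum, hits in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        accuracy = hits / count
        mean_confidence = conf_sum / count
        ece += (count / total) * abs(accuracy - mean_confidence)
    return ece
```

A larger value means the model's stated confidence diverges more from its actual accuracy. Under the README's description, such poorly calibrated knowledge would be a candidate for targeted QA generation.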
