
Commit 2e77f1e

update examples readme
1 parent d9eb8eb commit 2e77f1e

2 files changed

Lines changed: 282 additions & 0 deletions


Lines changed: 143 additions & 0 deletions
@@ -0,0 +1,143 @@
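# KAG NetOperatorQA

[English](./README_en.md) |
[简体中文](./README.md)

NetOperatorQA is a knowledge question-answering dataset focused on the telecommunications operator domain. In an era of rapid digital transformation and intelligent communication services, operators' business scope keeps expanding, covering 5G network construction, cloud computing services, big data applications, IoT services, digital government, and other fields. A knowledge graph (KG), as a structured way of managing information, uses semantic associations and contextual understanding to provide comprehensive factual and data support for operator business. When working with complex technical documents, annual reports, business specifications, and operational data, fast and accurate information retrieval is especially important.

This dataset targets operator knowledge Q&A: retrieving business facts from a large operator-related knowledge graph to answer questions. The input is a specific operator business question, and the output is the set of facts extracted from the knowledge graph to answer it. Data sources are Markdown documents of several kinds, including operator annual reports, quarterly reports, technical white papers, and business introduction documents.

In this example, we build a knowledge graph for the NetOperatorQA dataset, use [KAG](https://arxiv.org/abs/2409.13731) to generate answers to the evaluation questions, and compare them with the reference answers to compute the EM (exact match) and F1 metrics.
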
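## Implementation

For the NetOperatorQA dataset, our implementation follows the KAG (Knowledge-Augmented Generation) framework. The steps are as follows:

### Data Preprocessing and Schema Definition
We first preprocess the raw operator data provided by NetOperatorQA. At this stage, we process Markdown documents such as operator annual reports, technical documents, and business introductions, and perform a preliminary classification or structuring of the information to support later graph construction.

Based on the KAG framework and OpenSPG's schema modeling conventions, we abstract the core concepts of the operator domain (such as network technologies, business products, financial data, technical indicators, and partners) and define the corresponding entity types, relation types, and their properties. These definitions live in the `schema/NetOperatorQA.schema` file, which is committed to the OpenSPG graph database as the structural blueprint of the knowledge graph.
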
## 1. Prerequisites

Refer to the [Quick Start](https://openspg.yuque.com/ndx6g9/0.6/quzq24g4esal7q17) guide to install KAG and the OpenSPG server it depends on, and to learn the workflow for using KAG in developer mode.

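## Configuration

### Index Construction Configuration
In the configuration file, you can enable different combinations of extractors as needed:

**Complete Index Construction (Recommended for Production)**
- chunk_extractor - Basic text chunks
- outline_extractor - Outline structure
- summary_extractor - Semantic summaries
- table_extractor - Table data
- atomic_query_extractor - Atomic queries

**Fast Construction (Suitable for Testing and Development)**
- Uses only the basic text chunk extractor
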
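### Retrieval Configuration Options
The system provides two main retrieval configurations:

**Simple Retrieval Mode**: Suitable for direct fact queries
- Uses only the vector retriever, for fast responses

**Complete Retrieval Mode**: Suitable for complex reasoning queries
- Atomic query retriever
- Outline retriever
- Summary retriever
- Vector retriever
- Table retriever
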
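### Pipeline Selection
- **kag_solver_pipeline**: Iterative pipeline that supports complex reasoning and multi-turn dialogue
- **kag_solver_pipeline_tc**: Static pipeline optimized for the telecommunications operator domain, with faster responses
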
## 2. Reproduction Steps

### Step 1: Enter the Example Directory

```bash
cd kag/examples/NetOperatorQA
```

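### Step 2: Configure Models

Update the generation model configurations `openie_llm` and `chat_llm` and the representation model configuration `vectorize_model` in [kag_config.yaml](./kag_config.yaml).

You need to set a valid `api_key`. If your model provider or model name differs from the defaults, also update `base_url` and `model`.
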
### Step 3: Initialize the Project

Initialize the project first.

```bash
knext project restore --host_addr http://127.0.0.1:8887 --proj_path .
```

### Step 4: Commit Schema

Run the following command to commit the schema [NetOperatorQA.schema](./schema/NetOperatorQA.schema).

```bash
knext schema commit
```

### Step 5: Build the Knowledge Graph

Run [indexer.py](./builder/indexer.py) in the [builder](./builder) directory to build the knowledge graph.

```bash
cd builder && python indexer.py && cd ..
```

### Step 6: Run the Q&A Task

Run [eval.py](./solver/eval.py) in the [solver](./solver) directory to generate answers and compute the EM and F1 metrics.

```bash
cd solver && python eval.py && cd ..
```

### Step 7: (Optional) Cleanup

To delete the checkpoints, run the following commands.

```bash
rm -rf ./builder/ckpt
rm -rf ./solver/ckpt
```

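## Dataset and Technical Features

### Dataset Features
The NetOperatorQA dataset has the following characteristics:

1. **Multiple Document Types**: Includes annual reports (AY series), technical white papers (BZ series), business introductions (BY series), network construction documents (BW series), technical standards (BT series), financial data (BF series), and other document types

2. **Rich Business Domains**: Covers core operator business areas, including 5G networks, cloud computing, big data, IoT, industrial internet, digital government, and smart cities

3. **Structured Information**: Contains multi-dimensional structured information such as financial indicators, technical parameters, business data, and partnerships

4. **Temporal Coverage**: Reports and data span multiple time periods, which supports trend analysis and historical comparison

5. **Domain Expertise**: Involves telecommunications terminology, business models, financial indicators, and other specialized content, which places high demands on a model's domain understanding
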
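### Technical Architecture Features

1. **Multi-Index System**: Builds multiple index types, including text chunk, outline, summary, table, and atomic query indexes

2. **Hybrid Retrieval Strategy**: Combines vector retrieval, structured retrieval, semantic retrieval, and other retrieval methods

3. **Adaptive Reasoning**: Automatically selects a simple or complex reasoner based on question complexity

4. **Domain Optimization**: Uses planners and prompt templates customized for the telecommunications operator domain

5. **Configurable Architecture**: Supports flexible combinations of extractors, retrievers, and reasoners

6. **Performance Optimization**: Provides a fast mode and a complete mode to balance quality and efficiency
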
Lines changed: 139 additions & 0 deletions
@@ -0,0 +1,139 @@
# KAG NetOperatorQA

[English](./README_en.md) |
[简体中文](./README.md)

NetOperatorQA is a knowledge question-answering dataset focused on the telecommunications operator domain. In an era of rapid digital transformation and intelligent communication services, operators' business scope keeps expanding, covering 5G network construction, cloud computing services, big data applications, IoT services, digital government, and other fields. A knowledge graph (KG), as a structured way of managing information, uses semantic associations and contextual understanding to provide comprehensive factual and data support for operator business. When working with complex technical documents, annual reports, business specifications, and operational data, fast and accurate information retrieval is especially important.

This dataset targets operator knowledge Q&A: retrieving business facts from a large operator-related knowledge graph to answer questions. The input is a specific operator business question, and the output is the set of facts extracted from the knowledge graph to answer it. Data sources are Markdown documents of several kinds, including operator annual reports, quarterly reports, technical white papers, and business introduction documents.

In this example, we build a knowledge graph for the NetOperatorQA dataset, use [KAG](https://arxiv.org/abs/2409.13731) to generate answers to the evaluation questions, and compare them with the reference answers to compute the EM (exact match) and F1 metrics.

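
For readers unfamiliar with the two metrics, the following is a minimal sketch of how EM and token-level F1 are commonly computed for QA. It assumes SQuAD-style normalization and whitespace tokenization; the function names are illustrative, and the actual scoring in `solver/eval.py` may normalize or tokenize differently.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Token-level F1 over whitespace-separated tokens (an assumption)."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Whitespace tokenization does not split Chinese text into words, so answers in Chinese would need character-level or segmenter-based tokens before the overlap is counted.
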
## Implementation

For the NetOperatorQA dataset, our implementation follows the KAG (Knowledge-Augmented Generation) framework. The steps are as follows:

### Data Preprocessing and Schema Definition
We first preprocess the raw operator data provided by NetOperatorQA. At this stage, we process Markdown documents such as operator annual reports, technical documents, and business introductions, and perform a preliminary classification or structuring of the information to support later graph construction.

Based on the KAG framework and OpenSPG's schema modeling conventions, we abstract the core concepts of the operator domain (such as network technologies, business products, financial data, technical indicators, and partners) and define the corresponding entity types, relation types, and their properties. These definitions live in the `schema/NetOperatorQA.schema` file, which is committed to the OpenSPG graph database as the structural blueprint of the knowledge graph.

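
To make the idea of entity types, relation types, and properties concrete, here is a small Python model of such a schema. It is only an illustration: it is not OpenSPG's schema syntax, and every type and property name in it (`BusinessProduct`, `NetworkTechnology`, `builtOn`, and so on) is hypothetical rather than taken from `NetOperatorQA.schema`.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class EntityType:
    name: str
    properties: Tuple[str, ...] = ()


@dataclass(frozen=True)
class RelationType:
    name: str
    subject: str  # name of the source entity type
    object: str   # name of the target entity type


@dataclass
class Schema:
    entities: Dict[str, EntityType] = field(default_factory=dict)
    relations: List[RelationType] = field(default_factory=list)

    def add_entity(self, entity: EntityType) -> None:
        self.entities[entity.name] = entity

    def add_relation(self, relation: RelationType) -> None:
        # A relation may only connect entity types that are already declared.
        for endpoint in (relation.subject, relation.object):
            if endpoint not in self.entities:
                raise ValueError(f"unknown entity type: {endpoint}")
        self.relations.append(relation)


# Hypothetical operator-domain types, for illustration only.
schema = Schema()
schema.add_entity(EntityType("BusinessProduct", ("name", "launchYear")))
schema.add_entity(EntityType("NetworkTechnology", ("name", "standard")))
schema.add_relation(RelationType("builtOn", "BusinessProduct", "NetworkTechnology"))
```

Declaring entity types before the relations that use them mirrors the role of a schema as a blueprint: the graph can only contain edges between types the schema knows about.
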
## 1. Prerequisites

Refer to the [Quick Start](https://openspg.yuque.com/ndx6g9/0.6/quzq24g4esal7q17) guide to install KAG and the OpenSPG server it depends on, and to learn the workflow for using KAG in developer mode.

## Configuration

### Index Construction Configuration
In the configuration file, you can enable different combinations of extractors as needed:

**Complete Index Construction (Recommended for Production)**:
- chunk_extractor - Basic text chunks
- outline_extractor - Outline structure
- summary_extractor - Semantic summaries
- table_extractor - Table data
- atomic_query_extractor - Atomic queries

**Fast Construction (Suitable for Testing and Development)**:
- Uses only the basic text chunk extractor

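
The two build profiles above can be summarized as a simple lookup. The sketch below uses the extractor names listed in this section; the profile names `full` and `fast` and the helper `select_extractors` are hypothetical and do not correspond to keys in `kag_config.yaml`.

```python
from typing import Dict, List

# Extractor names come from the lists above; the profile names are made up.
EXTRACTOR_PROFILES: Dict[str, List[str]] = {
    "full": [
        "chunk_extractor",
        "outline_extractor",
        "summary_extractor",
        "table_extractor",
        "atomic_query_extractor",
    ],
    "fast": ["chunk_extractor"],
}


def select_extractors(profile: str) -> List[str]:
    """Return a copy of the extractor list for the given profile."""
    try:
        return list(EXTRACTOR_PROFILES[profile])
    except KeyError:
        known = ", ".join(sorted(EXTRACTOR_PROFILES))
        raise ValueError(f"unknown profile {profile!r}; expected one of: {known}") from None
```

For example, `select_extractors("fast")` yields only the chunk extractor, which is the quicker option while iterating on prompts or schema changes.
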
### Retrieval Configuration Options
The system provides two main retrieval configurations:

**Simple Retrieval Mode**: Suitable for direct fact queries
- Uses only the vector retriever, for fast responses

**Complete Retrieval Mode**: Suitable for complex reasoning queries
- Atomic query retriever
- Outline retriever
- Summary retriever
- Vector retriever
- Table retriever

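
The difference between the two modes can be sketched as follows. This README does not say how KAG merges results from several retrievers, so the sketch swaps in reciprocal rank fusion (RRF), a common generic technique, purely as an assumption. The `Retriever` type, the `retrieve` function, and the toy retrievers are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence

# A retriever maps a query to a ranked list of document ids (hypothetical type).
Retriever = Callable[[str], List[str]]


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Merge ranked lists: each hit contributes 1 / (k + rank) to its document."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def retrieve(query: str, retrievers: Dict[str, Retriever], mode: str = "simple") -> List[str]:
    if mode == "simple":
        # Simple mode: the vector retriever alone.
        return retrievers["vector"](query)
    if mode == "complete":
        # Complete mode: query every retriever and fuse the ranked lists.
        return reciprocal_rank_fusion([r(query) for r in retrievers.values()])
    raise ValueError(f"unknown mode: {mode!r}")


# Toy retrievers returning fixed rankings, for illustration only.
toy_retrievers: Dict[str, Retriever] = {
    "vector": lambda q: ["c1", "c2", "c3"],
    "table": lambda q: ["t1", "c2"],
    "summary": lambda q: ["c2", "s1"],
}
fused = retrieve("5G base station count", toy_retrievers, mode="complete")
```

In this toy run the chunk `c2` ranks first after fusion because all three retrievers return it, which is the intuition behind querying several indexes for complex questions.
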
### Pipeline Selection
- **kag_solver_pipeline**: Iterative pipeline that supports complex reasoning and multi-turn dialogue
- **kag_solver_pipeline_tc**: Static pipeline optimized for the telecommunications operator domain, with faster responses

## 2. Reproduction Steps

### Step 1: Enter the Example Directory

```bash
cd kag/examples/NetOperatorQA
```

### Step 2: Configure Models

Update the generation model configurations `openie_llm` and `chat_llm` and the representation model configuration `vectorize_model` in [kag_config.yaml](./kag_config.yaml).

You need to set a valid `api_key`. If your model provider or model name differs from the defaults, also update `base_url` and `model`.

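
Before starting a long build, it can help to confirm that every model section has a key. The sketch below assumes the configuration has already been loaded into a Python dict (for example with a YAML parser) and that `openie_llm`, `chat_llm`, and `vectorize_model` are top-level mappings; the actual layout of `kag_config.yaml` may differ, and the helper name is hypothetical.

```python
from typing import Any, Dict, List

# Section names are taken from the step above.
MODEL_SECTIONS = ("openie_llm", "chat_llm", "vectorize_model")


def sections_missing_api_key(config: Dict[str, Any]) -> List[str]:
    """Return the model sections whose api_key is absent or blank."""
    missing = []
    for name in MODEL_SECTIONS:
        section = config.get(name) or {}
        if not str(section.get("api_key", "")).strip():
            missing.append(name)
    return missing


# Example with made-up values; only openie_llm has a key set.
example_config = {
    "openie_llm": {"api_key": "sk-example", "model": "example-model"},
    "chat_llm": {"api_key": ""},
}
print(sections_missing_api_key(example_config))
```

The check only detects empty values. It cannot tell whether a key is actually accepted by the provider.
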
### Step 3: Initialize the Project

Initialize the project first.

```bash
knext project restore --host_addr http://127.0.0.1:8887 --proj_path .
```

### Step 4: Commit Schema

Run the following command to commit the schema [NetOperatorQA.schema](./schema/NetOperatorQA.schema).

```bash
knext schema commit
```

### Step 5: Build the Knowledge Graph

Run [indexer.py](./builder/indexer.py) in the [builder](./builder) directory to build the knowledge graph.

```bash
cd builder && python indexer.py && cd ..
```

### Step 6: Run the Q&A Task

Run [eval.py](./solver/eval.py) in the [solver](./solver) directory to generate answers and compute the EM and F1 metrics.

```bash
cd solver && python eval.py && cd ..
```

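
Steps 3 to 6 can also be chained from a single Python script, which stops at the first failing command. The commands are the ones shown in the steps above; the `run_steps` helper and its `dry_run` flag are hypothetical conveniences, and the script uses the current interpreter (`sys.executable`) in place of `python`.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Tuple

# (subdirectory, command) pairs mirroring Steps 3 to 6.
STEPS: List[Tuple[str, List[str]]] = [
    (".", ["knext", "project", "restore",
           "--host_addr", "http://127.0.0.1:8887", "--proj_path", "."]),
    (".", ["knext", "schema", "commit"]),
    ("builder", [sys.executable, "indexer.py"]),
    ("solver", [sys.executable, "eval.py"]),
]


def run_steps(example_dir: str, dry_run: bool = True) -> List[str]:
    """Run each step in order; with dry_run=True only report what would run."""
    root = Path(example_dir)
    log = []
    for subdir, command in STEPS:
        cwd = root / subdir
        log.append(f"[{cwd}] {' '.join(command)}")
        if not dry_run:
            # check=True raises CalledProcessError if a step fails.
            subprocess.run(command, cwd=cwd, check=True)
    return log


if __name__ == "__main__":
    for line in run_steps("kag/examples/NetOperatorQA", dry_run=True):
        print(line)
```

Calling `run_steps(..., dry_run=False)` executes the commands for real, so it needs a running OpenSPG server and a configured `kag_config.yaml`.
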
### Step 7: (Optional) Cleanup

To delete the checkpoints, run the following commands.

```bash
rm -rf ./builder/ckpt
rm -rf ./solver/ckpt
```

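
On systems without `rm`, or inside a Python workflow, the same cleanup can be done with the standard library. This is a minimal sketch; the function name `remove_checkpoints` is hypothetical, and the `builder/ckpt` and `solver/ckpt` paths are the ones from the commands above.

```python
import shutil
from pathlib import Path
from typing import List


def remove_checkpoints(example_dir: str = ".") -> List[Path]:
    """Delete builder/ckpt and solver/ckpt if present; return what was removed."""
    removed = []
    for sub in ("builder", "solver"):
        ckpt = Path(example_dir) / sub / "ckpt"
        if ckpt.is_dir():
            shutil.rmtree(ckpt)
            removed.append(ckpt)
    return removed
```

Unlike `rm -rf`, this version reports which directories were actually deleted, which makes it easy to confirm that a rebuild will start from scratch.
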
## Dataset and Technical Features

### Dataset Features
The NetOperatorQA dataset has the following characteristics:

1. **Multiple Document Types**: Includes annual reports (AY series), technical white papers (BZ series), business introductions (BY series), network construction documents (BW series), technical standards (BT series), financial data (BF series), and other document types

2. **Rich Business Domains**: Covers core operator business areas, including 5G networks, cloud computing, big data, IoT, industrial internet, digital government, and smart cities

3. **Structured Information**: Contains multi-dimensional structured information such as financial indicators, technical parameters, business data, and partnerships

4. **Temporal Coverage**: Reports and data span multiple time periods, which supports trend analysis and historical comparison

5. **Domain Expertise**: Involves telecommunications terminology, business models, financial indicators, and other specialized content, which places high demands on a model's domain understanding

### Technical Architecture Features

1. **Multi-Index System**: Builds multiple index types, including text chunk, outline, summary, table, and atomic query indexes

2. **Hybrid Retrieval Strategy**: Combines vector retrieval, structured retrieval, semantic retrieval, and other retrieval methods

3. **Adaptive Reasoning**: Automatically selects a simple or complex reasoner based on question complexity

4. **Domain Optimization**: Uses planners and prompt templates customized for the telecommunications operator domain

5. **Configurable Architecture**: Supports flexible combinations of extractors, retrievers, and reasoners

6. **Performance Optimization**: Provides a fast mode and a complete mode to balance quality and efficiency

