| 1 | +# KAG NetOperatorQA |
| 2 | + |
| 3 | +[English](./README_en.md) | |
| 4 | +[简体中文](./README.md) |
| 5 | + |
| 6 | +NetOperatorQA is a knowledge question-answering dataset for the telecommunications operator domain. As digital transformation accelerates, operators' business scope keeps expanding and now covers 5G network construction, cloud computing services, big data applications, IoT, digital government, and other fields. A knowledge graph (KG) organizes this information in structured form and supports operator business with facts linked through semantic associations and context. Fast, accurate retrieval is particularly important when working with complex technical documents, annual reports, business specifications, and operational data. |
| 7 | + |
| 8 | +This dataset targets knowledge Q&A for telecommunications operators: the goal is to retrieve business facts from a large-scale operator knowledge graph to answer questions. The input is a specific operator business question, and the output is the set of facts extracted from the knowledge graph that answer it. Data sources are Markdown documents of several kinds, including operator annual reports, quarterly reports, technical white papers, and business introduction documents. |
| 9 | + |
| 10 | +In this example, we build a knowledge graph for the NetOperatorQA dataset, then use [KAG](https://arxiv.org/abs/2409.13731) to generate answers for the evaluation questions and calculate EM (exact match) and F1 metrics by comparing them with the reference answers. |
| 11 | + |
| 12 | +## Implementation |
| 13 | + |
| 14 | +For the NetOperatorQA telecommunications operator knowledge Q&A dataset, our implementation follows the KAG (Knowledge-Augmented Generation) framework with the following specific steps: |
| 15 | + |
| 16 | +### Data Preprocessing and Schema Definition |
| 17 | +We first preprocess the raw operator data provided by NetOperatorQA. At this stage, we process the Markdown source documents (operator annual reports, technical documents, and business introductions) and perform a preliminary classification or structured organization of their content to prepare for graph construction. |
| 18 | + |
| 19 | +Following the KAG framework and OpenSPG's schema modeling specifications, we abstract the core concepts of the telecommunications operator domain (such as network technology, business products, financial data, technical indicators, and partners) and define the corresponding Entity Types, Relation Types, and their Properties. This definition lives in the `schema/NetOperatorQA.schema` file and is submitted to the OpenSPG graph database as the structural blueprint for the knowledge graph. |
| 20 | + |
| 21 | +## 1. Prerequisites |
| 22 | + |
| 23 | +Follow the [Quick Start](https://openspg.yuque.com/ndx6g9/0.6/quzq24g4esal7q17) guide to install KAG and the OpenSPG server it depends on, and to learn how to use KAG in developer mode. |
| 24 | + |
| 25 | +## Configuration Instructions |
| 26 | + |
| 27 | +### Index Construction Configuration |
| 28 | +In the configuration file, you can enable different combinations of extractors based on your needs: |
| 29 | + |
| 30 | +**Complete Index Construction (Recommended for Production)**: |
| 31 | +- chunk_extractor - Basic text chunks |
| 32 | +- outline_extractor - Outline structure |
| 33 | +- summary_extractor - Semantic summaries |
| 34 | +- table_extractor - Table data |
| 35 | +- atomic_query_extractor - Atomic queries |
| 36 | + |
| 37 | +**Fast Construction (Suitable for Testing and Development)**: |
| 38 | +- Use only the basic text chunk extractor (chunk_extractor) |
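
The two construction modes above can be pictured as a choice between two extractor lists. The following is a minimal sketch only: the mode names `complete` and `fast` and the helper `select_extractors` are hypothetical and are not part of KAG. Only the extractor names come from the list above, and the actual selection is made in the configuration file.

```python
from typing import List

# Extractor names are taken from the README; everything else here is
# a hypothetical illustration, not KAG's configuration format.
COMPLETE_EXTRACTORS = [
    "chunk_extractor",         # basic text chunks
    "outline_extractor",       # outline structure
    "summary_extractor",       # semantic summaries
    "table_extractor",         # table data
    "atomic_query_extractor",  # atomic queries
]
FAST_EXTRACTORS = ["chunk_extractor"]


def select_extractors(mode: str) -> List[str]:
    """Return the extractor names to enable for a given (hypothetical) mode."""
    if mode == "complete":
        return list(COMPLETE_EXTRACTORS)
    if mode == "fast":
        return list(FAST_EXTRACTORS)
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("fast", "complete"):
        print(mode, "->", ", ".join(select_extractors(mode)))
```

Returning a copy of each list keeps callers from accidentally mutating the shared defaults.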
| 39 | + |
| 40 | +### Retrieval Configuration Options |
| 41 | +The system provides two main retrieval configurations: |
| 42 | + |
| 43 | +**Simple Retrieval Mode**: Suitable for direct fact queries |
| 44 | +- Uses only the vector retriever, for fast responses |
| 45 | + |
| 46 | +**Complete Retrieval Mode**: Suitable for complex reasoning queries |
| 47 | +- Atomic query retriever |
| 48 | +- Outline retriever |
| 49 | +- Summary retriever |
| 50 | +- Vector retriever |
| 51 | +- Table retriever |
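
To illustrate what combining several retrievers can look like, here is a minimal sketch that merges the results of multiple retrievers by keeping the best score per document. It is an assumption for illustration, not KAG's merging logic: the `Retriever` type, the `merge_results` helper, the document IDs, and the scores are all hypothetical, and scores from different retrievers are assumed to be on a comparable scale.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical retriever interface: query -> list of (doc_id, score).
Retriever = Callable[[str], List[Tuple[str, float]]]


def merge_results(
    query: str, retrievers: Dict[str, Retriever], top_k: int = 5
) -> List[Tuple[str, float]]:
    """Merge results from several retrievers, keeping the best score per doc."""
    best: Dict[str, float] = {}
    for retrieve in retrievers.values():
        for doc_id, score in retrieve(query):
            if score > best.get(doc_id, float("-inf")):
                best[doc_id] = score
    # Sort by descending score; break ties by doc_id for determinism.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Stand-in retrievers with made-up results, for demonstration only.
def demo_vector_retriever(query: str) -> List[Tuple[str, float]]:
    return [("chunk-1", 0.9), ("chunk-2", 0.4)]


def demo_table_retriever(query: str) -> List[Tuple[str, float]]:
    return [("table-7", 0.8), ("chunk-2", 0.6)]


if __name__ == "__main__":
    merged = merge_results(
        "example question",
        {"vector": demo_vector_retriever, "table": demo_table_retriever},
    )
    print(merged)
```

In this picture, the simple mode corresponds to passing a single retriever, while the complete mode passes all five.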
| 52 | + |
| 53 | +### Pipeline Selection |
| 54 | +- **kag_solver_pipeline**: Iterative pipeline supporting complex reasoning and multi-turn dialogue |
| 55 | +- **kag_solver_pipeline_tc**: Static pipeline optimized for telecommunications operator domain with faster response |
| 56 | + |
| 57 | +## 2. Reproduction Steps |
| 58 | + |
| 59 | +### Step 1: Enter Example Directory |
| 60 | + |
| 61 | +```bash |
| 62 | +cd kag/examples/NetOperatorQA |
| 63 | +``` |
| 64 | + |
| 65 | +### Step 2: Configure Models |
| 66 | + |
| 67 | +Update the generation model configurations `openie_llm` and `chat_llm` and the representation model configuration `vectorize_model` in [kag_config.yaml](./kag_config.yaml). |
| 68 | + |
| 69 | +You need to set the correct `api_key`. If the model provider and model name differ from the default values, you also need to update `base_url` and `model`. |
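
Before running the later steps, it can help to check that no model section was left without a key. The sketch below works on an already-loaded configuration dictionary and assumes that `openie_llm`, `chat_llm`, and `vectorize_model` are top-level mappings, each with an `api_key` field. That layout is an assumption about kag_config.yaml, and `find_missing_api_keys` is a hypothetical helper, not part of KAG.

```python
from typing import Any, Dict, List

# Section names come from the README; their top-level placement is assumed.
MODEL_SECTIONS = ("openie_llm", "chat_llm", "vectorize_model")


def find_missing_api_keys(config: Dict[str, Any]) -> List[str]:
    """Return the model sections whose api_key is absent or blank."""
    missing = []
    for section in MODEL_SECTIONS:
        settings = config.get(section)
        if not isinstance(settings, dict):
            missing.append(section)
            continue
        api_key = str(settings.get("api_key") or "").strip()
        if not api_key:
            missing.append(section)
    return missing


if __name__ == "__main__":
    # Made-up values for demonstration; load the real file with a YAML
    # parser of your choice before calling the helper.
    example = {
        "openie_llm": {"api_key": "sk-example", "model": "example-model"},
        "chat_llm": {"api_key": "", "model": "example-model"},
    }
    print(find_missing_api_keys(example))
```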
| 70 | + |
| 71 | +### Step 3: Initialize Project |
| 72 | + |
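| 73 | +Restore the project on the OpenSPG server first. The command below targets a server at `http://127.0.0.1:8887`; adjust `--host_addr` if yours runs elsewhere. |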
| 74 | + |
| 75 | +```bash |
| 76 | +knext project restore --host_addr http://127.0.0.1:8887 --proj_path . |
| 77 | +``` |
| 78 | + |
| 79 | +### Step 4: Submit Schema |
| 80 | + |
| 81 | +Execute the following command to submit the schema [NetOperatorQA.schema](./schema/NetOperatorQA.schema). |
| 82 | + |
| 83 | +```bash |
| 84 | +knext schema commit |
| 85 | +``` |
| 86 | + |
| 87 | +### Step 5: Build Knowledge Graph |
| 88 | + |
| 89 | +Execute [indexer.py](./builder/indexer.py) in the [builder](./builder) directory to build the knowledge graph. |
| 90 | + |
| 91 | +```bash |
| 92 | +cd builder && python indexer.py && cd .. |
| 93 | +``` |
| 94 | + |
| 95 | +### Step 6: Execute Q&A Task |
| 96 | + |
| 97 | +Execute [eval.py](./solver/eval.py) in the [solver](./solver) directory to generate answers and calculate EM and F1 metrics. |
| 98 | + |
| 99 | +```bash |
| 100 | +cd solver && python eval.py && cd .. |
| 101 | +``` |
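
For readers unfamiliar with the two metrics, the sketch below shows one common way to compute EM and token-level F1 for a single prediction. It is an illustration under stated assumptions, not the code in eval.py: the normalization (lowercasing, stripping ASCII punctuation and English articles) and the whitespace tokenization are assumptions that suit English text, and Chinese answers would need character-level or word-level segmentation instead.

```python
import re
import string
from collections import Counter


def normalize_answer(text: str) -> str:
    """Lowercase, drop ASCII punctuation and English articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def f1_score(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("The 5G network.", "5g network"))
    print(f1_score("5G network construction", "5G network"))
```

Dataset-level scores are then typically the average of these per-question values.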
| 102 | + |
| 103 | +### Step 7: (Optional) Cleanup |
| 104 | + |
| 105 | +To delete checkpoints, execute the following commands. |
| 106 | + |
| 107 | +```bash |
| 108 | +rm -rf ./builder/ckpt |
| 109 | +rm -rf ./solver/ckpt |
| 110 | +``` |
| 111 | + |
| 112 | +## Dataset and Technical Features |
| 113 | + |
| 114 | +### Dataset Features |
| 115 | +The NetOperatorQA dataset has the following characteristics: |
| 116 | + |
| 117 | +1. **Multiple Document Types**: Contains annual reports (AY series), technical white papers (BZ series), business introductions (BY series), network construction (BW series), technical standards (BT series), financial data (BF series), and other types of documents |
| 118 | + |
| 119 | +2. **Rich Business Domains**: Covers telecommunications operator core business areas including 5G networks, cloud computing, big data, IoT, industrial internet, digital government, smart cities, etc. |
| 120 | + |
| 121 | +3. **Structured Information**: Contains multi-dimensional structured information including financial indicators, technical parameters, business data, partnership relationships, etc. |
| 122 | + |
| 123 | +4. **Temporal Coverage**: The documents span multiple reporting periods, which supports trend analysis and historical comparison |
| 124 | + |
| 125 | +5. **Professional Content**: Involves telecommunications technical terminology, business models, financial indicators, and other professional content, placing high demands on the model's domain understanding capabilities |
| 126 | + |
| 127 | +### Technical Architecture Features |
| 128 | + |
| 129 | +1. **Multi-Index System**: Constructs multiple index types including text chunk index, outline index, summary index, table index, atomic query index, etc. |
| 130 | + |
| 131 | +2. **Hybrid Retrieval Strategy**: Combines vector retrieval, structured retrieval, semantic retrieval, and other retrieval methods |
| 132 | + |
| 133 | +3. **Adaptive Reasoning**: Automatically selects a simple or a complex reasoner based on question complexity |
| 134 | + |
| 135 | +4. **Domain Optimization**: Uses planners and prompt templates customized for the telecommunications operator domain |
| 136 | + |
| 137 | +5. **Configurable Architecture**: Supports flexible configuration of different extractor, retriever, and reasoner combinations |
| 138 | + |
| 139 | +6. **Performance Optimization**: Provides fast mode and complete mode to balance effectiveness and efficiency |