An LLM-based Multi-Agent System for Bilingual Legal Term Extraction

English | 中文 | 日本語

A powerful AI-driven tool for extracting, normalizing, and standardizing bilingual terminology from parallel texts, designed for legal, technical, and professional documents.

โœจ Key Features

  • ๐Ÿค– AI-Powered Extraction: Intelligently identifies bilingual term pairs using advanced LLMs (GPT-4, Claude, DeepSeek, etc.)
  • ๐Ÿ” Quality Control: Automatically evaluates term alignment quality and filters low-quality results
  • ๐Ÿ“ Intelligent Normalization:
    • Chinese: Traditional/Simplified unification, structural markers (็ฌฌXXๆก)
    • English: Singular/plural normalization, verb tense unification, structural markers (Article XX)
    • Japanese: Notation unification, okurigana standardization, structural markers
  • ๐ŸŽฏ Deduplication & Standardization: Intelligently merges synonym variants and selects best translations
  • โšก High Performance: Supports concurrent processing for large-scale documents
  • ๐ŸŒ Multilingual Support: Chinese, English, Japanese, and more

Installation

Requirements

  • Python 3.8+
  • OpenAI API key (or OpenAI-compatible API)

Quick Install

# Clone the repository
git clone https://github.com/wang-h/bilingual_term_extractor.git
cd bilingual_term_extractor

# Install dependencies
pip install -r requirements.txt

# Configure environment variables
cp .env.example .env
# Edit .env and set your API keys:
# OPENAI_API_KEY=your-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1  # Optional: for OpenAI-compatible APIs
# OPENAI_API_MODEL=gpt-4o-mini  # Optional: default model
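
The same variables can be read from Python once they are exported or loaded from `.env`. The following is a minimal sketch, assuming only the three variable names shown above; `load_llm_settings` is a hypothetical helper and is not taken from the project's `llm_client.py`, whose actual defaults may differ.

```python
import os
from typing import Dict, Mapping, Optional


def load_llm_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Collect LLM settings from environment variables (hypothetical helper)."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is not set; edit .env or export it first")
    return {
        "api_key": api_key,
        # Fallbacks reuse the example values shown above (an assumption).
        "base_url": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("OPENAI_API_MODEL", "gpt-4o-mini"),
    }
```

Failing early on a missing key gives a clearer error than a rejected API request later in the pipeline.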

Quick Start

Basic Usage

import asyncio
from src.agents.bilingual_term_extract import BilingualTermExtractAgent
from src.agents.bilingual_term_quality_check import BilingualTermQualityCheckAgent
from src.agents.bilingual_term_normalization import TermNormalizationAgent
from src.agents.bilingual_term_standardization import BilingualTermStandardizationAgent

async def extract_terms():
    # Source text (Chinese)
    source_text = """
    第三条 劳动者享有平等就业和选择职业的权利、取得劳动报酬的权利...
    """
    
    # Target text (English)
    target_text = """
    Article 3: Workers shall have the right to employment on an equal basis...
    """
    
    # Stage 1: Extract terms
    extract_agent = BilingualTermExtractAgent(locale='zh')
    extracted = await extract_agent.run({
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 2: Quality check
    quality_agent = BilingualTermQualityCheckAgent(locale='zh')
    filtered = await quality_agent.run({
        'terms': [t.__dict__ for t in extracted],
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 3: Normalize
    norm_agent = TermNormalizationAgent(locale='zh')
    normalized = await norm_agent.run({
        'terms': [t.__dict__ for t in filtered],
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)
    
    # Stage 4: Standardize
    std_agent = BilingualTermStandardizationAgent(locale='zh')
    final_terms = await std_agent.execute({
        'terms': [t.__dict__ for t in normalized]
    }, None)
    
    return final_terms

# Run the pipeline and print the standardized terms
final_terms = asyncio.run(extract_terms())
print(final_terms)

Run Example

python term_extract.py test_data/sample_zh_en_100.json -o outputs --checkpoint outputs/checkpoint.json

๐Ÿ“Š Processing Pipeline

Raw Bilingual Texts
    โ†“
[Stage 1] Term Extraction (BilingualTermExtractAgent)
    โ”œโ”€ AI identifies term pairs
    โ”œโ”€ Extracts context information
    โ””โ”€ Confidence scoring
    โ†“
[Stage 2] Quality Check (BilingualTermQualityCheckAgent)
    โ”œโ”€ Semantic consistency validation
    โ”œโ”€ Term accuracy evaluation
    โ””โ”€ Filters low-quality results
    โ†“
[Stage 3] Term Normalization (TermNormalizationAgent)
    โ”œโ”€ Format standardization
    โ”‚   โ”œโ”€ Chinese: Traditional/Simplified, "็ฌฌ36ๆก"โ†’"็ฌฌXXๆก"
    โ”‚   โ”œโ”€ English: Singular/plural, "Article 36"โ†’"Article XX"
    โ”‚   โ””โ”€ Japanese: Notation, "็ฌฌ36ๆก"โ†’"็ฌฌXXๆก"
    โ”œโ”€ Tense unification (English)
    โ””โ”€ Abbreviation standardization
    โ†“
[Stage 4] Deduplication & Standardization (BilingualTermStandardizationAgent)
    โ”œโ”€ Deduplicate by normalized forms
    โ”œโ”€ Merge synonym variants
    โ””โ”€ Select best translations
    โ†“
Final Standardized Terminology
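
To make the Stage 4 idea concrete, here is a rough sketch of deduplication by normalized form followed by best-translation selection. It is an illustration only, not the code of `BilingualTermStandardizationAgent`: the function name `standardize` is hypothetical, the field names are borrowed from the Output Format section, and synonym merging (which the project delegates to the LLM) is left out.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def standardize(terms: List[dict], max_targets_per_source: int = 3) -> List[dict]:
    """Merge duplicate pairs, then keep the best-scoring targets per source term."""
    merged: Dict[Tuple[str, str], dict] = {}
    for term in terms:
        key = (term["source_term"], term["target_term"])
        if key not in merged:
            merged[key] = {**term, "occurrence_count": 1}
        else:
            kept = merged[key]
            kept["occurrence_count"] += 1
            # Keep the highest score seen for this exact pair.
            kept["combined_score"] = max(kept["combined_score"], term["combined_score"])

    by_source: Dict[str, List[dict]] = defaultdict(list)
    for entry in merged.values():
        by_source[entry["source_term"]].append(entry)

    result: List[dict] = []
    for candidates in by_source.values():
        # Rank by score first, then by how often the pair was seen.
        candidates.sort(
            key=lambda e: (e["combined_score"], e["occurrence_count"]),
            reverse=True,
        )
        result.extend(candidates[:max_targets_per_source])
    return result
```

Ranking by score and breaking ties on occurrence count is one plausible policy; the actual agent may weigh these signals differently.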

๐ŸŽฏ Normalization Rules

Chinese Normalization

  1. Traditional/Simplified: ๅ”่ญฐ โ†’ ๅ่ฎฎ
  2. Abbreviation: ๆœ‰้™ๅ…ฌๅธ โ†’ ๆœ‰้™่ดฃไปปๅ…ฌๅธ
  3. Structural Markers:
    • ็ฌฌ36ๆก โ†’ ็ฌฌXXๆก
    • ็ฌฌไธ‰ๅๅ…ญๆก โ†’ ็ฌฌXXๆก
    • ็ฌฌ40ๆก็ฌฌไธ€้กน โ†’ ็ฌฌXXๆก็ฌฌXX้กน
    • ็ฌฌไบŒ็ซ  โ†’ ็ฌฌXX็ซ 
    • ๏ผˆไธ€๏ผ‰ โ†’ ๏ผˆXX๏ผ‰

English Normalization

  1. Singular/Plural: contracts → contract/contracts
  2. Verb Tense: terminated → terminate
  3. Structural Markers:
    • Article 36 → Article XX
    • Section 5 → Section XX
    • Chapter 3 → Chapter XX
    • Paragraph 2 → Paragraph XX
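
For the English structural markers, a single regular expression covers the four keywords listed. This is a sketch of the rule itself, not the agent's implementation (which is prompt-based); `mask_en_markers` is a hypothetical name, and plural or abbreviated forms such as "Art." are deliberately not handled.

```python
import re

# Keyword, whitespace, then an Arabic number, e.g. "Article 36".
_EN_MARKER = re.compile(r"\b(Article|Section|Chapter|Paragraph)\s+\d+\b")


def mask_en_markers(text: str) -> str:
    """Replace the number after a structural keyword with the XX placeholder."""
    return _EN_MARKER.sub(r"\1 XX", text)
```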

Japanese Normalization

  1. Notation: ใ‘ใ„ใ‚„ใ โ†’ ๅฅ‘็ด„
  2. Okurigana: Following Cabinet Notice standards
  3. Structural Markers:
    • ็ฌฌ36ๆก โ†’ ็ฌฌXXๆก
    • ็ฌฌไธ‰ๅๅ…ญๆก โ†’ ็ฌฌXXๆก
    • ็ฌฌ2็ซ  โ†’ ็ฌฌXX็ซ 

Project Structure

bilingual_term_extractor/
├── src/
│   ├── agents/              # Core Agent modules
│   │   ├── base.py         # Base Agent class
│   │   ├── bilingual_term_extract.py
│   │   ├── bilingual_term_quality_check.py
│   │   ├── bilingual_term_normalization.py
│   │   └── bilingual_term_standardization.py
│   ├── lib/                 # Utility libraries
│   │   └── llm_client.py   # LLM client
│   └── workflows/           # Workflows
│       └── bilingual_term_extract.py
├── examples/                # Example scripts
│   ├── simple_extract.py   # Simple example
│   └── concurrent_bilingual_term_extract_v2.py  # Concurrent processing
├── outputs/                 # Output directory
├── requirements.txt
├── README.md               # English documentation
├── README_zh.md            # Chinese documentation
└── README_ja.md            # Japanese documentation

Configuration

LLM Configuration

Supports the following LLM providers:

  • OpenAI (GPT-4, GPT-4-turbo, GPT-3.5)
  • Anthropic (Claude-3.5-sonnet, Claude-3-opus)
  • Other OpenAI-compatible APIs

Term Extraction Configuration

# Quality check batch size
batch_size = 10

# Maximum target terms per source term
max_targets_per_source = 3

# Scoring weights
confidence_weight = 0.4
quality_weight = 0.6

Output Format

Example of standardized term output:

{
  "source_term": "劳动报酬",
  "target_term": "remuneration for work",
  "original_source_term": "劳动报酬",
  "original_target_term": "remuneration for work",
  "category": "Legal Concept",
  "confidence": 0.95,
  "quality_score": 0.92,
  "combined_score": 0.93,
  "law": "Labor Law",
  "domain": "LaborLaw",
  "year": 1995,
  "occurrence_count": 3
}
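
The `combined_score` in this record is consistent with a weighted average of `confidence` and `quality_score` using the weights from the Term Extraction Configuration (0.4 and 0.6). The formula below is an assumption inferred from those numbers, not something the source states, and `combined_score` as a function name is hypothetical.

```python
def combined_score(
    confidence: float,
    quality_score: float,
    confidence_weight: float = 0.4,
    quality_weight: float = 0.6,
) -> float:
    """Weighted average of the two scores (assumed formula)."""
    total = confidence_weight + quality_weight
    return (confidence_weight * confidence + quality_weight * quality_score) / total


# Values from the sample record: 0.4 * 0.95 + 0.6 * 0.92 = 0.932
print(round(combined_score(0.95, 0.92), 2))  # 0.93
```

Dividing by the sum of the weights keeps the result in the 0 to 1 range even if the two weights are changed so that they no longer add up to 1.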

Advanced Usage

Concurrent Batch Processing

Use concurrent_bilingual_term_extract_v2.py for large-scale document processing:

python examples/concurrent_bilingual_term_extract_v2.py \
    --input data/parallel_texts.json \
    --output outputs/ \
    --max-workers 5
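
The internals of the script are not shown here, but the effect of a worker limit can be illustrated with an `asyncio.Semaphore`. This is a generic sketch of bounded concurrency, assuming an async per-document function; `process_all` and `worker` are hypothetical names and do not come from the script.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def process_all(
    items: Sequence[T],
    worker: Callable[[T], Awaitable[R]],
    max_workers: int = 5,
) -> List[R]:
    """Run `worker` over all items with at most `max_workers` in flight."""
    semaphore = asyncio.Semaphore(max_workers)

    async def bounded(item: T) -> R:
        # Wait for a free slot before starting work on this item.
        async with semaphore:
            return await worker(item)

    # gather returns results in the same order as the inputs.
    return list(await asyncio.gather(*(bounded(item) for item in items)))
```

Capping the number of in-flight requests is mainly useful for staying under the rate limits of the LLM provider.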

Custom Normalization Rules

Customize normalization behavior by modifying prompt templates in Agents:

# Add custom rules in TermNormalizationAgent
custom_rules = """
7. **Custom Rules**: Your domain-specific rules
   - Example: Specialized terminology handling
"""

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

This project is licensed under the MIT License. See the LICENSE file for details.

Contact

Acknowledgments

  • OpenAI for GPT models
  • Anthropic for Claude models
  • All contributors

Note: Using this tool requires valid LLM API keys. Please ensure compliance with relevant terms of service.
