A powerful AI-driven tool for extracting, normalizing, and standardizing bilingual terminology from parallel texts, designed for legal, technical, and professional documents.
- ๐ค AI-Powered Extraction: Intelligently identifies bilingual term pairs using advanced LLMs (GPT-4, Claude, DeepSeek, etc.)
- ๐ Quality Control: Automatically evaluates term alignment quality and filters low-quality results
- ๐ Intelligent Normalization:
- Chinese: Traditional/Simplified unification, structural markers (็ฌฌXXๆก)
- English: Singular/plural normalization, verb tense unification, structural markers (Article XX)
- Japanese: Notation unification, okurigana standardization, structural markers
- ๐ฏ Deduplication & Standardization: Intelligently merges synonym variants and selects best translations
- โก High Performance: Supports concurrent processing for large-scale documents
- ๐ Multilingual Support: Chinese, English, Japanese, and more
- Python 3.8+
- OpenAI API key (or OpenAI-compatible API)
# Clone the repository
git clone https://github.com/wang-h/bilingual_term_extractor.git
cd bilingual_term_extractor
# Install dependencies
pip install -r requirements.txt
# Configure environment variables
cp .env.example .env
# Edit .env and set your API keys:
# OPENAI_API_KEY=your-api-key-here
# OPENAI_BASE_URL=https://api.openai.com/v1 # Optional: for OpenAI-compatible APIs
# OPENAI_API_MODEL=gpt-4o-mini # Optional: default model

import asyncio
from src.agents.bilingual_term_extract import BilingualTermExtractAgent
from src.agents.bilingual_term_quality_check import BilingualTermQualityCheckAgent
from src.agents.bilingual_term_normalization import TermNormalizationAgent
from src.agents.bilingual_term_standardization import BilingualTermStandardizationAgent
async def extract_terms():
    # Source text (Chinese)
    source_text = """
    第三条 劳动者享有平等就业和选择职业的权利、取得劳动报酬的权利...
    """
    # Target text (English)
    target_text = """
    Article 3: Workers shall have the right to employment on an equal basis...
    """

    # Stage 1: Extract terms
    extract_agent = BilingualTermExtractAgent(locale='zh')
    extracted = await extract_agent.run({
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 2: Quality check
    quality_agent = BilingualTermQualityCheckAgent(locale='zh')
    filtered = await quality_agent.run({
        'terms': [t.__dict__ for t in extracted],
        'source_text': source_text,
        'target_text': target_text,
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 3: Normalize
    norm_agent = TermNormalizationAgent(locale='zh')
    normalized = await norm_agent.run({
        'terms': [t.__dict__ for t in filtered],
        'src_lang': 'zh',
        'tgt_lang': 'en'
    }, None)

    # Stage 4: Standardize
    std_agent = BilingualTermStandardizationAgent(locale='zh')
    final_terms = await std_agent.execute({
        'terms': [t.__dict__ for t in normalized]
    }, None)

    return final_terms
# Run
asyncio.run(extract_terms())

python term_extract.py test_data/sample_zh_en_100.json -o outputs --checkpoint outputs/checkpoint.json

Raw Bilingual Texts
โ
[Stage 1] Term Extraction (BilingualTermExtractAgent)
โโ AI identifies term pairs
โโ Extracts context information
โโ Confidence scoring
โ
[Stage 2] Quality Check (BilingualTermQualityCheckAgent)
โโ Semantic consistency validation
โโ Term accuracy evaluation
โโ Filters low-quality results
โ
[Stage 3] Term Normalization (TermNormalizationAgent)
โโ Format standardization
โ โโ Chinese: Traditional/Simplified, "็ฌฌ36ๆก"โ"็ฌฌXXๆก"
โ โโ English: Singular/plural, "Article 36"โ"Article XX"
โ โโ Japanese: Notation, "็ฌฌ36ๆก"โ"็ฌฌXXๆก"
โโ Tense unification (English)
โโ Abbreviation standardization
โ
[Stage 4] Deduplication & Standardization (BilingualTermStandardizationAgent)
โโ Deduplicate by normalized forms
โโ Merge synonym variants
โโ Select best translations
โ
Final Standardized Terminology
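
Stage 4 can be pictured as a group-and-rank step. The sketch below is an illustration under stated assumptions, not the code inside BilingualTermStandardizationAgent. It assumes each term is a dict carrying the source_term, target_term, confidence, and quality_score fields that appear in the sample output, and it assumes the combined score is a weighted sum using the configured weights (0.4 and 0.6). The names combined_score and standardize are hypothetical.

```python
from collections import defaultdict

CONFIDENCE_WEIGHT = 0.4      # mirrors confidence_weight in the configuration
QUALITY_WEIGHT = 0.6         # mirrors quality_weight in the configuration
MAX_TARGETS_PER_SOURCE = 3   # mirrors max_targets_per_source


def combined_score(term):
    """Weighted sum of the two scores (assumed formula)."""
    return (CONFIDENCE_WEIGHT * term["confidence"]
            + QUALITY_WEIGHT * term["quality_score"])


def standardize(terms):
    """Merge duplicate pairs, then keep the best targets per source term."""
    # Deduplicate by (source, lowercased target), keeping the
    # highest-scoring record and counting occurrences of each pair.
    best = {}
    counts = defaultdict(int)
    for term in terms:
        key = (term["source_term"], term["target_term"].lower())
        counts[key] += 1
        if key not in best or combined_score(term) > combined_score(best[key]):
            best[key] = term

    # Group the surviving pairs by source term.
    by_source = defaultdict(list)
    for key, term in best.items():
        record = dict(term,
                      combined_score=round(combined_score(term), 2),
                      occurrence_count=counts[key])
        by_source[key[0]].append(record)

    # Rank each group and keep at most MAX_TARGETS_PER_SOURCE translations.
    result = []
    for group in by_source.values():
        group.sort(key=lambda t: t["combined_score"], reverse=True)
        result.extend(group[:MAX_TARGETS_PER_SOURCE])
    return result
```

Under this assumed formula, a term with confidence 0.95 and quality_score 0.92 gets 0.4 × 0.95 + 0.6 × 0.92 = 0.932, which rounds to 0.93.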
- Traditional/Simplified: ๅ่ญฐ โ ๅ่ฎฎ
- Abbreviation: ๆ้ๅ ฌๅธ โ ๆ้่ดฃไปปๅ ฌๅธ
- Structural Markers:
- ็ฌฌ36ๆก โ ็ฌฌXXๆก
- ็ฌฌไธๅๅ ญๆก โ ็ฌฌXXๆก
- ็ฌฌ40ๆก็ฌฌไธ้กน โ ็ฌฌXXๆก็ฌฌXX้กน
- ็ฌฌไบ็ซ โ ็ฌฌXX็ซ
- ๏ผไธ๏ผ โ ๏ผXX๏ผ
English:
- Singular/Plural: contracts → contract/contracts
- Verb Tense: terminated → terminate
- Structural Markers:
  - Article 36 → Article XX
  - Section 5 → Section XX
  - Chapter 3 → Chapter XX
  - Paragraph 2 → Paragraph XX
- Notation: ใใใใ โ ๅฅ็ด
- Okurigana: Following Cabinet Notice standards
- Structural Markers:
- ็ฌฌ36ๆก โ ็ฌฌXXๆก
- ็ฌฌไธๅๅ ญๆก โ ็ฌฌXXๆก
- ็ฌฌ2็ซ โ ็ฌฌXX็ซ
bilingual_term_extractor/
├── src/
│   ├── agents/                  # Core Agent modules
│   │   ├── base.py              # Base Agent class
│   │   ├── bilingual_term_extract.py
│   │   ├── bilingual_term_quality_check.py
│   │   ├── bilingual_term_normalization.py
│   │   └── bilingual_term_standardization.py
│   ├── lib/                     # Utility libraries
│   │   └── llm_client.py        # LLM client
│   └── workflows/               # Workflows
│       └── bilingual_term_extract.py
├── examples/                    # Example scripts
│   ├── simple_extract.py        # Simple example
│   └── concurrent_bilingual_term_extract_v2.py  # Concurrent processing
├── outputs/                     # Output directory
├── requirements.txt
├── README.md                    # English documentation
├── README_zh.md                 # Chinese documentation
└── README_ja.md                 # Japanese documentation

Supports the following LLM providers:
- OpenAI (GPT-4, GPT-4-turbo, GPT-3.5)
- Anthropic (Claude-3.5-sonnet, Claude-3-opus)
- Other OpenAI-compatible APIs
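
For OpenAI-compatible endpoints, the provider choice is carried by the three environment variables set during installation. The sketch below shows one way to read them with the standard library only. It assumes the variables are already exported into the process environment (it does not parse the .env file itself), and the names LLMSettings and load_llm_settings are hypothetical rather than part of src/lib/llm_client.py.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class LLMSettings:
    api_key: str
    base_url: str
    model: str


def load_llm_settings(env=None):
    """Read the documented variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        # The key is the only required setting.
        raise RuntimeError("OPENAI_API_KEY is not set")
    return LLMSettings(
        api_key=api_key,
        base_url=env.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
        model=env.get("OPENAI_API_MODEL", "gpt-4o-mini"),
    )
```

Passing a plain dict as env makes the function easy to exercise without touching the real environment, for example load_llm_settings({"OPENAI_API_KEY": "your-api-key-here"}).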
# Quality check batch size
batch_size = 10
# Maximum target terms per source term
max_targets_per_source = 3
# Scoring weights
confidence_weight = 0.4
quality_weight = 0.6

Example of standardized term output:
{
  "source_term": "劳动报酬",
  "target_term": "remuneration for work",
  "original_source_term": "劳动报酬",
  "original_target_term": "remuneration for work",
  "category": "Legal Concept",
  "confidence": 0.95,
  "quality_score": 0.92,
  "combined_score": 0.93,
  "law": "Labor Law",
  "domain": "LaborLaw",
  "year": 1995,
  "occurrence_count": 3
}

Use concurrent_bilingual_term_extract_v2.py for large-scale document processing:
python examples/concurrent_bilingual_term_extract_v2.py \
    --input data/parallel_texts.json \
    --output outputs/ \
    --max-workers 5

Customize normalization behavior by modifying prompt templates in Agents:
# Add custom rules in TermNormalizationAgent
custom_rules = """
7. **Custom Rules**: Your domain-specific rules
- Example: Specialized terminology handling
"""

Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License; see the LICENSE file for details.
- Project Homepage: https://github.com/wang-h/bilingual_term_extractor
- Issue Tracker: Issues
- OpenAI for GPT models
- Anthropic for Claude models
- All contributors
Note: Using this tool requires valid LLM API keys. Please ensure compliance with relevant terms of service.