A tool for extracting structured information from large unstructured text datasets using semantic regular expressions with pre-computed indexes to reduce token usage during querying.
```bash
# Install dependencies
pip install -r requirements.txt

# Install the MongoDB Atlas CLI
# Follow the instructions at: https://www.mongodb.com/docs/atlas/cli/current/install-atlas-cli/

# Set up a local Atlas deployment
atlas deployments setup myLocalAtlas --type local --port 27017
atlas deployments connect myLocalAtlas

# Install Ollama for the embedding model
# https://ollama.com/download
ollama pull nomic-embed-text

# Set your Gemini API key
export GOOGLE_API_KEY=your_gemini_api_key_here
```

This tool uses semantic regular expressions (SRE) to extract structured information from large volumes of unstructured text data. The key innovation is the use of pre-computed vector indexes for entities and text chunks, which significantly reduces token usage during queries while maintaining high accuracy.
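The "semantic" part of an SRE slot amounts to matching by embedding similarity rather than by literal string comparison, against embeddings that were computed once at indexing time. A minimal, self-contained sketch of that idea — the toy vectors, threshold, and function names here are illustrative, not the tool's actual API:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def semantic_match(query_vec, entity_index, threshold=0.8):
    """Return entities whose pre-computed embedding is close to the query.

    entity_index maps entity name -> embedding, built once at indexing
    time, so no LLM tokens are spent on matching at query time.
    """
    return [name for name, vec in entity_index.items()
            if cosine_similarity(query_vec, vec) >= threshold]

# Toy 3-dimensional index (real embeddings come from nomic-embed-text).
index = {"liver": [1.0, 0.1, 0.0], "hepatic": [0.9, 0.2, 0.0], "wheel": [0.0, 0.0, 1.0]}
print(semantic_match([1.0, 0.0, 0.0], index))  # matches "liver" and "hepatic"
```

This is why a slot like `<ORGAN:Liver>` can still match a chunk mentioning "hepatic": closeness in embedding space, not string equality, decides the match.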
- `outputs/`: Evaluation results
- `logs/`: Evaluation logs
- `visualizations/`: Evaluation results visualization
- `services/`: Core service modules for document processing, entity extraction, and relation extraction
  - `DocumentProcessingService.py`: Handles document ingestion and chunking
  - `EntityExtractionService.py`: Extracts entities from text chunks
  - `RelationExtractorService.py`: Executes information extraction based on semantic regular expressions
  - `EmbeddingProcessorService.py`: Computes and manages entity and text embeddings
  - `DBService.py`: The MongoDB interface; centralizes database operations
  - `QueryEntityService.py`: Provides optimized entity search capabilities
  - `SREGeneratorService.py`: Generates semantic regular expressions
- `entity/`: Entity-related modules
  - `EntitySRE.py`: Core implementation of semantic regular expressions
- `utils/`: Utility functions and model interfaces; contains model wrappers, data processing utilities, and helper functions
- `baseline/`: Implementation of baseline relation extraction methods
  - `main.py`: Entry point for baseline extraction
  - `pipeline.py`: Baseline extraction pipeline
- `prompts/`: Contains prompt templates
- `datasets/`: Contains sample datasets
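The chunking done by `DocumentProcessingService.py` could, for example, split documents into fixed-size, overlapping windows before embedding. A hypothetical sketch of such a step — the chunk size, overlap, and function name are assumptions, not the service's real interface:

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into fixed-size character windows with overlap.

    Overlap keeps sentences that straddle a boundary visible in two
    chunks, so entity mentions are not cut in half at chunk edges.
    """
    if chunk_size <= overlap:
        raise ValueError("chunk_size must be larger than overlap")
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

# A 500-character document with the defaults yields 3 overlapping chunks.
print(len(chunk_text("x" * 500)))
```

Each resulting chunk is then embedded once during indexing, so later queries can retrieve relevant chunks without re-reading whole documents.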
Processes datasets and creates pre-computed indexes to accelerate queries and reduce token usage.

```bash
python scripts/indexing.py
```

Runs baseline relation extraction on datasets without using pre-computed indexes.

```bash
python scripts/baseline.py
```

Runs an optimized query on datasets using pre-computed indexes to reduce token usage (indexing must be run first).

```bash
python scripts/query.py
```

The `visualizations/` directory contains charts and data comparing the performance of the baseline and optimized approaches, showing improvements in token usage without sacrificing accuracy.

```bash
python scripts/visualization.py
```

A simple web interface is available for interactive querying:

```bash
python webapp.py
```

Accessible at http://localhost:8000 after starting.
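The token savings of the optimized path come from sending the LLM only the chunks whose pre-computed embeddings are near the query, instead of the entire dataset. A rough sketch of that selection step — the names and the 4-characters-per-token heuristic are illustrative only:

```python
def top_k_chunks(query_vec, chunk_index, k=2):
    """Rank chunks by dot product with the query embedding.

    chunk_index: list of (chunk_text, unit-normalized embedding) pairs,
    pre-computed once at indexing time.
    """
    scored = sorted(chunk_index,
                    key=lambda item: sum(q * c for q, c in zip(query_vec, item[1])),
                    reverse=True)
    return [text for text, _ in scored[:k]]

chunks = [
    ("Liver shows inflammation; diagnosis: hepatitis.", [0.9, 0.1]),
    ("Patient prescribed 500mg amoxicillin.",           [0.1, 0.9]),
    ("Annual report of the cafeteria budget.",          [0.0, 0.2]),
]
selected = top_k_chunks([1.0, 0.0], chunks, k=1)

# Only selected chunks reach the LLM; a crude ~4-chars-per-token estimate:
baseline_tokens = sum(len(t) for t, _ in chunks) // 4
optimized_tokens = sum(len(t) for t in selected) // 4
print(selected[0])
print(baseline_tokens, optimized_tokens)
```

The baseline scripts skip this filter and pass all chunks to the model, which is exactly the token gap the `visualizations/` charts measure.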
- `(<ORGAN>)(<SYMPTOM>)(<DIAGNOSIS>)`: Extract organ-symptom-diagnosis relations
- `(<ORGAN:Liver>)(<SYMPTOM>)(<DIAGNOSIS>)`: Extract liver-symptom-diagnosis relations
- `(<DRUG>)(<DOSE>)`: Extract drug dosage information
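A hypothetical parser for this pattern syntax, splitting each `(<TYPE>)` or `(<TYPE:Constraint>)` group into a typed slot. It covers only the syntax the examples above show; the real implementation in `EntitySRE.py` may differ:

```python
import re

# One slot: a parenthesized <TYPE> with an optional :Constraint suffix.
SLOT = re.compile(r"\(<([A-Z]+)(?::([^>]+))?>\)")

def parse_sre(pattern):
    """Parse a semantic regex like (<ORGAN:Liver>)(<SYMPTOM>) into
    (entity_type, constraint) slots; constraint is None when absent."""
    slots = SLOT.findall(pattern)
    # Reject input that is not exactly a sequence of well-formed slots.
    rebuilt = "".join(f"(<{t}{':' + c if c else ''}>)" for t, c in slots)
    if rebuilt != pattern:
        raise ValueError(f"unrecognized pattern: {pattern}")
    return [(t, c or None) for t, c in slots]

print(parse_sre("(<ORGAN:Liver>)(<SYMPTOM>)(<DIAGNOSIS>)"))
# [('ORGAN', 'Liver'), ('SYMPTOM', None), ('DIAGNOSIS', None)]
```

Each slot's type would then be matched against indexed entities semantically (by embedding similarity), with the optional constraint narrowing the slot to entities near a specific concept such as "Liver".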
- LightRAG for the entity extraction prompt