spanning-tree/CSCI2270

Semantic Regular Expression (SRE)

A tool for extracting structured information from large unstructured text datasets using semantic regular expressions with pre-computed indexes to reduce token usage during querying.

Installation and Requirements

# Install dependencies
pip install -r requirements.txt

# Install MongoDB Atlas CLI
# Follow instructions at: https://www.mongodb.com/docs/atlas/cli/current/install-atlas-cli/

# Set up local Atlas deployment
atlas deployments setup myLocalAtlas --type local --port 27017
atlas deployments connect myLocalAtlas

# Install Ollama for the embedding model
# https://ollama.com/download
ollama pull nomic-embed-text

# Set Gemini API key
export GOOGLE_API_KEY=your_gemini_api_key_here

Project Goal

This tool uses semantic regular expressions (SRE) to extract structured information from large volumes of unstructured text data. The key innovation is the use of pre-computed vector indexes for entities and text chunks, which significantly reduces token usage during queries while maintaining high accuracy.
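
As a rough illustration of why a pre-computed vector index can cut token usage, the sketch below ranks stored chunk embeddings against a query embedding, so that only the closest chunks would be passed on to the language model. It is a minimal sketch, not this repository's implementation: the names `cosine`, `top_k_chunks`, and `toy_index` are hypothetical, and the 3-dimensional vectors are made up (real embeddings would come from a model such as nomic-embed-text).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(query_vec, index, k=2):
    """Return the ids of the k chunks whose stored embeddings are closest to the query."""
    ranked = sorted(index, key=lambda cid: cosine(query_vec, index[cid]), reverse=True)
    return ranked[:k]


# Hypothetical pre-computed index: chunk id -> toy embedding (illustration only).
toy_index = {
    "chunk-liver": [0.9, 0.1, 0.0],
    "chunk-heart": [0.1, 0.9, 0.0],
    "chunk-billing": [0.0, 0.1, 0.9],
}

# Only the selected chunk would be sent to the model, instead of all three.
selected = top_k_chunks([1.0, 0.0, 0.0], toy_index, k=1)
print(selected)
```

Because the embeddings are computed once at indexing time, the per-query cost in this sketch is a similarity ranking rather than a pass over every chunk with the model.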

Directory Structure

  • outputs/: Evaluation results

  • logs/: Evaluation logs

  • visualizations/: Evaluation results visualization

  • services/: Core service modules for document processing, entity extraction, and relation extraction

    • DocumentProcessingService.py: Handles document ingestion and chunking
    • EntityExtractionService.py: Extracts entities from text chunks
    • RelationExtractorService.py: Executes information extraction based on semantic regular expressions
    • EmbeddingProcessorService.py: Computes and manages entity and text embeddings
    • DBService.py: MongoDB interface that centralizes database operations
    • QueryEntityService.py: Provides optimized entity search capabilities
    • SREGeneratorService.py: Generates semantic regular expressions
  • entity/: Entity-related modules

    • EntitySRE.py: Core implementation of semantic regular expressions
  • utils/: Utility functions and model interfaces

    • Contains model wrappers, data processing utilities, and helper functions
  • baseline/: Implementation of baseline relation extraction methods

    • main.py: Entry point for baseline extraction
    • pipeline.py: Baseline extraction pipeline
  • prompts/: Contains prompt templates

  • datasets/: Contains sample datasets
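
The listing above says that DocumentProcessingService.py handles document ingestion and chunking, but does not describe the chunking strategy. One common approach, shown here purely as an assumption, is a fixed-size word window with overlap, so that an entity mention near a chunk boundary appears in both neighbouring chunks. The function `chunk_words` and its parameters are hypothetical and are not taken from the repository.

```python
def chunk_words(text, size=5, overlap=2):
    """Split text into chunks of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once the current window has reached the end of the text.
        if start + size >= len(words):
            break
    return chunks


print(chunk_words("a b c d e f g h", size=5, overlap=2))
```

In practice the window would more likely be measured in tokens or sentences than in whitespace-separated words; the word-based split only keeps the sketch dependency-free.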

Key Scripts

Indexing (scripts/indexing.py)

Processes datasets and creates pre-computed indexes to accelerate queries and reduce token usage.

python scripts/indexing.py

Baseline Evaluation (scripts/baseline.py)

Runs baseline relation extraction on datasets without using pre-computed indexes.

python scripts/baseline.py

Optimized Query Evaluation (scripts/query.py)

Runs optimized queries on datasets, using the pre-computed indexes to reduce token usage. Run the indexing script first so that the indexes exist.

python scripts/query.py

Visualization

The visualizations/ directory contains charts and data comparing the performance of the baseline and optimized approaches, showing improvements in token usage without sacrificing accuracy.

python scripts/visualization.py

Web Interface

A simple web interface is available for interactive querying:

python webapp.py

Once started, the interface is available at http://localhost:8000.

Example Semantic Regular Expressions

  • (<ORGAN>)(<SYMPTOM>)(<DIAGNOSIS>): Extract organ-symptom-diagnosis relations
  • (<ORGAN:Liver>)(<SYMPTOM>)(<DIAGNOSIS>): Extract liver-symptom-diagnosis relations
  • (<DRUG>)(<DOSE>): Extract drug dosage information
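
To make the pattern syntax concrete, the sketch below parses expressions of the form shown above into a list of slots, each with an entity type and an optional fixed value (the part after the colon). The grammar is inferred only from these examples; `Slot` and `parse_sre` are hypothetical names, and the actual parser in entity/EntitySRE.py may accept a richer syntax.

```python
import re
from typing import List, NamedTuple, Optional


class Slot(NamedTuple):
    entity_type: str        # e.g. "ORGAN"
    value: Optional[str]    # e.g. "Liver", or None when the slot is unconstrained


# One slot looks like (<TYPE>) or (<TYPE:Value>).
_SLOT = re.compile(r"\(<([A-Za-z_]+)(?::([^<>]+))?>\)")


def parse_sre(pattern: str) -> List[Slot]:
    """Parse a pattern made of consecutive slots; reject anything else."""
    slots = []
    pos = 0
    for match in _SLOT.finditer(pattern):
        if match.start() != pos:
            raise ValueError(f"unexpected text at position {pos}")
        slots.append(Slot(match.group(1), match.group(2)))
        pos = match.end()
    if not slots or pos != len(pattern):
        raise ValueError(f"unexpected text at position {pos}")
    return slots


for slot in parse_sre("(<ORGAN:Liver>)(<SYMPTOM>)(<DIAGNOSIS>)"):
    print(slot)
```

A constrained slot such as `<ORGAN:Liver>` could then be resolved against the entity index first, with the unconstrained slots filled from the chunks that mention that entity.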

Acknowledgements

  • LightRAG for the entity extraction prompt
