SciDaEx Data Service

This folder contains the core data processing and extraction functionalities for the SciDaEx (Scientific Data Extraction) project. These modules work together to process scientific papers, extract relevant information, and provide a question-answering capability.

Key Features

  • PDF processing and information extraction (preprocess.py)
    • Table and figure extraction from scientific papers
    • Meta-information extraction from papers
  • Vector store creation and management (dataService.py)
  • RAG-based question-answering system (dataService.py)
  • LLM-based summarization (summarize.py)
  • Evaluation metrics for QA performance (llm_eval.py)
  • Global configuration and prompt management (globalVariable.py)
  • Utility functions for various processing tasks (utils.py)

Setup

  1. Ensure all required libraries are installed (see requirements.txt in the parent directory).
  2. Create a .env file in the backend directory by copying from .env.example:
    cp .env.example .env
  3. Update the .env file with your actual API keys and credentials:
    # OpenAI Configuration
    OPENAI_API_KEY=your_openai_api_key_here
    
    # Adobe Credentials
    ADOBE_CLIENT_ID=your_adobe_client_id_here
    ADOBE_CLIENT_SECRET=your_adobe_client_secret_here
    • The Adobe credentials are the ADOBE_CLIENT_ID and ADOBE_CLIENT_SECRET pair shown above.
    • Optional: directory paths and model settings can also be configured in the .env file.
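
Before running anything, it can help to confirm that no key in .env is still missing or left at its placeholder value. The following is a minimal sketch, not part of SciDaEx: parse_env_text and find_unset_keys are hypothetical helpers, and the parser only handles plain KEY=VALUE lines. The key names and placeholder strings are the ones from the template above.

```python
from pathlib import Path

# Key names and placeholder values copied from the .env template above.
REQUIRED_KEYS = {
    "OPENAI_API_KEY": "your_openai_api_key_here",
    "ADOBE_CLIENT_ID": "your_adobe_client_id_here",
    "ADOBE_CLIENT_SECRET": "your_adobe_client_secret_here",
}


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def find_unset_keys(values):
    """Return required keys that are missing, empty, or still placeholders."""
    return [
        key
        for key, placeholder in REQUIRED_KEYS.items()
        if values.get(key, "") in ("", placeholder)
    ]


if __name__ == "__main__":
    env_file = Path(".env")  # assumes the script is run from the backend directory
    if not env_file.exists():
        print("No .env file found; copy .env.example to .env first.")
    else:
        unset = find_unset_keys(parse_env_text(env_file.read_text(encoding="utf-8")))
        if unset:
            print("Still unset or left as placeholders:", ", ".join(unset))
        else:
            print("All required keys are set.")
```

This only inspects the file. How SciDaEx itself loads .env is not covered here.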

Usage

Preprocess PDFs

You can use preprocess.py to process either a folder of PDFs or a single PDF file:

  1. Processing a folder of PDFs:

    python preprocess.py \
    --pdf_dir <path_to_pdf_folder> \
    --figure_dir <path_to_figure_output_folder> \
    --table_dir <path_to_table_output_folder> \
    --meta_dir <path_to_meta_output_folder> \
    --openai_key <your_openai_api_key> \
    --vectorstore_dir <path_to_vectorstore_output_folder>
  2. Processing a single PDF file:

    python preprocess.py \
    --pdf_path <path_to_single_pdf_file> \
    --figure_dir <path_to_figure_output_folder> \
    --table_dir <path_to_table_output_folder> \
    --meta_dir <path_to_meta_output_folder> \
    --openai_key <your_openai_api_key> \
    --vectorstore_dir <path_to_vectorstore_output_folder>

Add the --fast flag for faster, non-LLM-based table extraction. For more options, run python preprocess.py --help.
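
If you prefer to drive preprocessing from a Python script, the two invocations above can be assembled programmatically. This is a rough sketch under stated assumptions: build_preprocess_cmd and run_preprocess are hypothetical helpers, not part of SciDaEx, and the only flags used are the ones documented above. It assumes preprocess.py is in the current working directory.

```python
import subprocess
import sys


def build_preprocess_cmd(pdf_source, figure_dir, table_dir, meta_dir,
                         vectorstore_dir, openai_key,
                         single_file=False, fast=False):
    """Build the argument list for preprocess.py.

    Uses --pdf_path for a single file and --pdf_dir for a folder.
    """
    source_flag = "--pdf_path" if single_file else "--pdf_dir"
    cmd = [
        sys.executable, "preprocess.py",
        source_flag, str(pdf_source),
        "--figure_dir", str(figure_dir),
        "--table_dir", str(table_dir),
        "--meta_dir", str(meta_dir),
        "--openai_key", openai_key,
        "--vectorstore_dir", str(vectorstore_dir),
    ]
    if fast:
        cmd.append("--fast")  # faster, non-LLM-based table extraction
    return cmd


def run_preprocess(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Paths and key below are illustrative placeholders.
    cmd = build_preprocess_cmd(
        "papers/", "out/figures", "out/tables", "out/meta",
        "out/vectorstore", openai_key="<your_openai_api_key>", fast=True,
    )
    print(" ".join(cmd))  # inspect the command before calling run_preprocess(cmd)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.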

Using dataService.py

The DataService class in dataService.py provides the main question-answering functionality:

  1. Ensure you have preprocessed your PDF files using preprocess.py as described in the previous section.

  2. To use the DataService, you can refer to the example in the dataService.py file or use the following template:

    from dataService import DataService
    
    # Initialize the DataService
    data_service = DataService()
    
    # Specify the PDF file names you want to query (NOTE: These files should have been preprocessed)
    pdf_files = ["example1.pdf", "example2.pdf"]  # add further file names as needed
    
    # Your question
    question = "Your question here"
    
    # Run the QA system
    summary, results = data_service.run_rag_qa(pdf_files, question)
    
    # Process and use the results as needed
    print(summary)
    for pdf, result in results.items():
        print(f"Results for {pdf}:", result)
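
To keep the answers for later analysis, the summary and per-PDF results can be serialized to JSON. The sketch below is only illustrative: qa_output_to_json is a hypothetical helper, and the exact types returned by run_rag_qa are an assumption (a summary plus a dict keyed by PDF name, as the loop above suggests). Any value that is not natively JSON-serializable falls back to its string form.

```python
import json


def qa_output_to_json(question, summary, results):
    """Serialize one QA run to a JSON string.

    Assumes `results` is a mapping from PDF name to that PDF's result.
    Values json cannot encode are converted with str().
    """
    payload = {
        "question": question,
        "summary": summary,
        "results": {str(pdf): result for pdf, result in results.items()},
    }
    return json.dumps(payload, indent=2, ensure_ascii=False, default=str)


if __name__ == "__main__":
    # Stand-in values; in practice these come from data_service.run_rag_qa(...).
    demo_results = {"example1.pdf": "answer A", "example2.pdf": "answer B"}
    print(qa_output_to_json("Your question here", "combined summary", demo_results))
```

The returned string can be written to disk with pathlib.Path("qa_output.json").write_text(...), where the file name is up to you.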