SciDaEx Data Service

This folder contains the core data processing and extraction modules for the SciDaEx (Scientific Data Extraction) project. Together, these modules process scientific papers, extract relevant information, and provide question answering over the extracted content.

Key Features

- PDF processing and information extraction (preprocess.py)
  - Table and figure extraction from scientific papers
  - Meta-information extraction from papers
- Vector store creation and management (dataService.py)
- RAG-based question-answering system (dataService.py)
- LLM-based summarization (summarize.py)
- Evaluation metrics for QA performance (llm_eval.py)
- Global configuration and prompt management (globalVariable.py)
- Utility functions for various processing tasks (utils.py)

Setup

Ensure all required libraries are installed (see requirements.txt in the parent directory).
Create a .env file in the backend directory by copying from .env.example:
```
cp .env.example .env
```
Update the .env file with your actual API keys and credentials:
```
# OpenAI Configuration
OPENAI_API_KEY=your_openai_api_key_here

# Adobe Credentials
ADOBE_CLIENT_ID=your_adobe_client_id_here
ADOBE_CLIENT_SECRET=your_adobe_client_secret_here
```
- Replace the placeholder values with your actual API keys and credentials.
- The Adobe credentials consist of a client ID (ADOBE_CLIENT_ID) and a client secret (ADOBE_CLIENT_SECRET).
- Optional: You can also configure directory paths and model settings in the .env file.
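Before running the pipeline, it can help to confirm that the settings are actually visible to Python. The following is a minimal sketch, assuming the values from .env have already been loaded into the process environment (how the project itself reads .env is not covered here). `missing_env_vars` and `REQUIRED_VARS` are hypothetical names introduced for illustration, not part of the SciDaEx code.

```python
import os

# Variable names taken from the .env template above.
REQUIRED_VARS = ("OPENAI_API_KEY", "ADOBE_CLIENT_ID", "ADOBE_CLIENT_SECRET")


def missing_env_vars(required=REQUIRED_VARS, env=None):
    """Return the names in `required` that are unset or blank in `env`.

    Hypothetical helper: `env` defaults to the current process environment.
    """
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_env_vars()
if missing:
    print("Missing settings:", ", ".join(missing))
else:
    print("All required settings are present.")
```

Passing a plain dictionary as `env` makes the check easy to try without touching real credentials.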

Usage

Preprocess PDFs

You can use preprocess.py to process either a folder of PDFs or a single PDF file:

Processing a folder of PDFs:

```
python preprocess.py \
  --pdf_dir <path_to_pdf_folder> \
  --figure_dir <path_to_figure_output_folder> \
  --table_dir <path_to_table_output_folder> \
  --meta_dir <path_to_meta_output_folder> \
  --openai_key <your_openai_api_key> \
  --vectorstore_dir <path_to_vectorstore_output_folder>
```

Processing a single PDF file:

```
python preprocess.py \
  --pdf_path <path_to_single_pdf_file> \
  --figure_dir <path_to_figure_output_folder> \
  --table_dir <path_to_table_output_folder> \
  --meta_dir <path_to_meta_output_folder> \
  --openai_key <your_openai_api_key> \
  --vectorstore_dir <path_to_vectorstore_output_folder>
```

Add the --fast flag for faster, non-LLM-based table extraction. For more options, run python preprocess.py --help.
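If preprocessing is launched from another Python script, the command line can be assembled programmatically. This is a rough sketch that uses only the flags listed above. `build_preprocess_command` is a hypothetical helper, and it assumes that exactly one of --pdf_dir or --pdf_path is given and that --fast is a simple on/off switch; check `python preprocess.py --help` for the actual behavior.

```python
import sys


def build_preprocess_command(figure_dir, table_dir, meta_dir, openai_key,
                             vectorstore_dir, pdf_dir=None, pdf_path=None,
                             fast=False):
    """Build the argument list for preprocess.py (hypothetical helper)."""
    # Assumption: the script expects either a folder or a single file, not both.
    if (pdf_dir is None) == (pdf_path is None):
        raise ValueError("Pass exactly one of pdf_dir or pdf_path")

    if pdf_dir is not None:
        source = ["--pdf_dir", str(pdf_dir)]
    else:
        source = ["--pdf_path", str(pdf_path)]

    cmd = [
        sys.executable, "preprocess.py", *source,
        "--figure_dir", str(figure_dir),
        "--table_dir", str(table_dir),
        "--meta_dir", str(meta_dir),
        "--openai_key", str(openai_key),
        "--vectorstore_dir", str(vectorstore_dir),
    ]
    if fast:
        # Assumption: --fast takes no value.
        cmd.append("--fast")
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the directory that contains preprocess.py.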

Using dataService.py

The DataService class in dataService.py provides the main question-answering functionality.

Before using it, make sure you have preprocessed your PDF files with preprocess.py as described in the previous section.

To use DataService, refer to the example in dataService.py or start from the following template:

```
from dataService import DataService

# Initialize the DataService
data_service = DataService()

# Specify the PDF file names you want to query (NOTE: These files should have been preprocessed)
pdf_files = ["example1.pdf", "example2.pdf", ...]

# Your question
question = "Your question here"

# Run the QA system
summary, results = data_service.run_rag_qa(pdf_files, question)

# Process and use the results as needed
print(summary)
for pdf, result in results.items():
    print(f"Results for {pdf}:", result)
```
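To keep the answers for later inspection, the summary and per-file results can be written to disk. The sketch below assumes only what the template shows: `results` behaves like a dictionary keyed by PDF file name. The exact type of each result is not documented here, so anything that is not JSON-serializable is stored as its string form. `save_qa_output` is a hypothetical helper, not part of dataService.py.

```python
import json
from pathlib import Path


def save_qa_output(question, summary, results, out_path):
    """Write a question, its summary, and per-PDF results to a JSON file."""
    payload = {
        "question": question,
        "summary": summary,
        # Keys are converted to strings so they are valid JSON object keys.
        "results": {str(pdf): result for pdf, result in results.items()},
    }
    out_path = Path(out_path)
    # default=str is a fallback for values json cannot encode directly.
    out_path.write_text(
        json.dumps(payload, indent=2, default=str, ensure_ascii=False),
        encoding="utf-8",
    )
    return out_path
```

For example, `save_qa_output(question, summary, results, "qa_output.json")` could follow the `run_rag_qa` call in the template.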
