gabaoun
diff --git a/.env.example b/.env.example
Lines changed: 13 additions & 5 deletions
diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
Lines changed: 28 additions & 33 deletions
diff --git a/README.md b/README.md
Lines changed: 46 additions & 41 deletions
diff --git a/docker-compose.yml b/docker-compose.yml
Lines changed: 2 additions & 16 deletions
diff --git a/docs/ADR-001-Native-Vector-Search-Redis.md b/docs/ADR-001-Native-Vector-Search-Redis.md
Lines changed: 0 additions & 25 deletions
diff --git a/docs/adr/001-Simplified-Semantic-Caching.md b/docs/adr/001-Simplified-Semantic-Caching.md
Lines changed: 22 additions & 0 deletions
diff --git a/main.py b/main.py
Lines changed: 33 additions & 39 deletions
@@ -1,10 +1,18 @@
+# Core API Keys
 OPENAI_API_KEY=your_openai_api_key_here
+
+# Qdrant Vector Store
 QDRANT_URL=http://localhost:6333
 QDRANT_API_KEY=your_qdrant_api_key_here
-POSTGRES_USER=postgres
-POSTGRES_PASSWORD=postgres
-POSTGRES_DB=project_aether
-POSTGRES_HOST=localhost
-POSTGRES_PORT=5432
+
+# Redis Semantic Cache
+REDIS_HOST=localhost
+REDIS_PORT=6379
+SEMANTIC_CACHE_THRESHOLD=0.85
+
+# Observability (Optional - Arize Phoenix)
 PHOENIX_COLLECTOR_ENDPOINT=http://localhost:6006
+
+# Application Settings
 LOG_LEVEL=INFO
+DATA_DIR=./data
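The variables in this hunk could be read at startup roughly as follows. This is a dependency-free sketch, not the project's `src/config/settings.py` (which, per the README, uses Pydantic Settings). The `Settings` dataclass, the `load_settings` function, the defaults, and the 0 to 1 range check on the threshold are all assumptions that mirror `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical container mirroring the keys in .env.example."""
    openai_api_key: str
    qdrant_url: str
    qdrant_api_key: Optional[str]
    redis_host: str
    redis_port: int
    semantic_cache_threshold: float
    phoenix_collector_endpoint: Optional[str]
    log_level: str
    data_dir: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping of variables (defaults to os.environ)."""
    env = os.environ if env is None else env
    threshold = float(env.get("SEMANTIC_CACHE_THRESHOLD", "0.85"))
    # Assumption: the threshold is a similarity score in [0, 1].
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("SEMANTIC_CACHE_THRESHOLD must be between 0 and 1")
    return Settings(
        openai_api_key=env["OPENAI_API_KEY"],  # required: KeyError if missing
        qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),
        qdrant_api_key=env.get("QDRANT_API_KEY") or None,
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        semantic_cache_threshold=threshold,
        phoenix_collector_endpoint=env.get("PHOENIX_COLLECTOR_ENDPOINT") or None,
        log_level=env.get("LOG_LEVEL", "INFO").upper(),
        data_dir=env.get("DATA_DIR", "./data"),
    )
```

Passing a plain dict instead of relying on `os.environ` keeps the loader easy to exercise in tests, for example `load_settings({"OPENAI_API_KEY": "k"})`.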
@@ -1,40 +1,35 @@
-name: Continuous Integration
+name: Project Aether CI
 
-on: [push, pull_request]
+on:
+  push:
+    branches: [ main ]
+  pull_request:
+    branches: [ main ]
 
 jobs:
   test:
     runs-on: ubuntu-latest
 
-    services:
-      redis:
-        image: redis/redis-stack-server:latest
-        ports:
-          - 6379:6379
-      qdrant:
-        image: qdrant/qdrant:latest
-        ports:
-          - 6333:6333
-
     steps:
-      - uses: actions/checkout@v4
-      
-      - name: Set up Python
-        uses: actions/setup-python@v5
-        with:
-          python-version: '3.11'
-          cache: 'pip'
-
-      - name: Install dependencies
-        run: |
-          python -m pip install --upgrade pip
-          pip install -r requirements.txt
-          pip install pytest pytest-asyncio pytest-mock
-          pip install -e .
-
-      - name: Run Tests
-        env:
-          OPENAI_API_KEY: sk-fake-key-for-testing
-          REDIS_HOST: localhost
-          QDRANT_HOST: localhost
-        run: pytest tests/
+    - uses: actions/checkout@v3
+    
+    - name: Set up Python 3.11
+      uses: actions/setup-python@v4
+      with:
+        python-version: "3.11"
+        
+    - name: Install dependencies
+      run: |
+        python -m pip install --upgrade pip
+        pip install -r requirements.txt
+        
+    - name: Run tests with pytest
+      env:
+        OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+        REDIS_HOST: localhost
+        REDIS_PORT: 6379
+        QDRANT_URL: http://localhost:6333
+        LOG_LEVEL: INFO
+        DATA_DIR: ./tests/data
+      run: |
+        pytest tests/
@@ -1,75 +1,80 @@
-# Project Aether: Event-Driven RAG Engine
+# Project Aether: RAG Pipeline with Event-Driven Workflows
 
-![Python](https://img.shields.io/badge/python-3.11+-blue.svg)
-![LlamaIndex](https://img.shields.io/badge/framework-LlamaIndex-orange.svg)
-![License](https://img.shields.io/badge/license-Apache--2.0-green.svg)
+## Overview
+Project Aether is a Retrieval-Augmented Generation (RAG) system built with Python and LlamaIndex. It implements a document ingestion and retrieval pipeline using an event-driven architecture (Workflows) to handle complex tasks like query transformation, metadata enrichment, and semantic caching.
 
-**Author:** Gabriel (Gabaoun) Penha
+The project is designed as a modular reference for building RAG applications that require more than simple linear processing, incorporating retries, asynchronous operations, and a clear separation of concerns.
 
-> *A highly resilient, event-driven Retrieval-Augmented Generation (RAG) engine optimizing semantic search latency by 80% while ensuring robust PII masking and enterprise-grade reliability.*
+## Features
+- **Event-Driven Ingestion:** Processes documents through a series of discrete steps (Loading -> PII Masking -> Semantic Splitting -> Enrichment -> Indexing).
+- **Advanced Retrieval:** Implements HyDE (Hypothetical Document Embeddings), query refinement loops, and relevance judgment (Chain-of-Thought) before generating answers.
+- **Semantic Caching:** Uses Redis to store previously generated answers and return them for repeated queries, reducing LLM latency and cost. Matching is currently by exact query string (see Limitations).
+- **PII Masking:** Basic regex-based masking of emails and phone numbers during the ingestion phase.
+- **Resiliency:** Uses `tenacity` for exponential backoff retries on LLM and database operations.
+- **Memory Efficiency:** Uses Python generators during document splitting to handle larger datasets without high memory consumption.
 
-Project Aether is a world-class reference implementation of a complex RAG system. By shifting from standard linear pipelines to LlamaIndex Workflows, it introduces cycles, streaming, and robust failure recovery natively into the ingestion and retrieval processes.
+## Tech Stack
+- **Language:** Python 3.11+
+- **Orchestration:** LlamaIndex (Workflows)
+- **Vector Database:** Qdrant
+- **Cache:** Redis
+- **LLM:** OpenAI (GPT-4o, GPT-4o-mini)
+- **Embeddings:** HuggingFace (BGE models)
+- **Configuration:** Pydantic Settings
 
-## 🌟 Key Features
+## Key Technical Points
+- **Modular Refactoring:** Logic is split into `core` (business logic), `services` (external integrations), `pipeline` (workflow orchestration), and `models` (data structures).
+- **Asynchronous Execution:** Heavy use of `asyncio` for non-blocking I/O, particularly in PII masking and LLM calls.
+- **Custom Splitter:** Implements a `SemanticDoubleMergingSplitter` which performs an initial semantic split and then merges small chunks that fall below a minimum size threshold.
 
-* **Event-Driven Workflows:** Employs LlamaIndex `Workflow` and `Event` classes to orchestrate query decomposition, HyDE, and Chain-of-Thought (CoT) relevance judgments with self-correction loops.
-* **Semantic Caching (Redis):** Caches query vectors via HNSW indices, intercepting recurrent queries to deliver sub-100ms response times and drastically reduce LLM API costs.
-* **Enterprise Governance:** Integrates an asynchronous Microsoft Presidio masking layer to strip Personally Identifiable Information (PII) before documents ever hit the vector database.
-* **Resilient Infrastructure:** Bulletproofed with `tenacity` for exponential backoff on all critical third-party I/O (LLMs, Qdrant).
-* **Memory-Optimized Ingestion:** Implements a custom `SemanticDoubleMergingSplitter` leveraging Python Generators to process massive document sets without memory bloat.
+## Design Decisions
+- **LlamaIndex Workflows over Pipelines:** Chosen to allow non-linear logic, such as the query refinement loop in the retrieval workflow, which can re-run if the initial results are judged irrelevant.
+- **BGE-Reranker:** Integrated to improve precision by re-evaluating the top retrieved nodes using a cross-encoder model.
+- **Strict Typing:** All major functions and classes use Python type hints for better maintainability and error detection.
 
-## 📈 Benchmarks
+## Limitations
+- **Regex-based PII:** The current PII masker uses basic regular expressions and is not a substitute for a production-grade NER (Named Entity Recognition) system.
+- **Simplified Semantic Cache:** The current implementation uses exact string matching in Redis for the cache keys rather than true vector-based similarity search.
+- **Single Collection:** Currently hardcoded to use a single Qdrant collection for all documents.
 
-| Metric | Basic RAG | Project Aether | Impact |
-|--------|-----------|----------------|--------|
-| **Faithfulness (Hallucination Rate)** | 62% | **88%** | ⬇️ HyDE & CoT Evaluation |
-| **Answer Relevance** | 70% | **92%** | ⬆️ BGE-Reranker & Reordering |
-| **Context Precision** | 55% | **85%** | ⬆️ Semantic Chunking Generators |
-| **Avg. Latency (P95)** | 5.2s | **0.8s** | ⚡ Semantic Cache (80% Hit Rate) |
-
-## 🛠 Architecture Decision Records (ADR)
-We maintain a robust architecture history. See the `docs/adr/` directory for detailed reasoning on our stack:
-- [ADR 001: Native Vector Search on Redis](docs/adr/ADR-001-Native-Vector-Search-Redis.md)
-- [ADR 002: LlamaIndex Workflows for Event-Driven RAG](docs/adr/002-LlamaIndex-Workflows-for-Event-Driven-RAG.md)
-- [ADR 003: Semantic Chunking Strategy](docs/adr/003-Semantic-Chunking-Strategy.md)
-
-## 🚀 Getting Started
+## Getting Started
 
 ### Prerequisites
-- Docker & Docker Compose
-- Python 3.11+ (Uses `async/await` heavily)
+- Docker and Docker Compose
+- Python 3.11+
 - OpenAI API Key
 
 ### Installation
-1. Clone the repository and navigate to the directory:
+1. Clone the repository:
    ```bash
-    git clone https://github.com/gabaoun/Project-Aether.git
-    cd Project-Aether
+   git clone https://github.com/your-username/Project-Aether.git
+   cd Project-Aether
    ```
 2. Install dependencies:
    ```bash
    pip install -r requirements.txt
    ```
-3. Environment Setup:
+3. Set up environment variables:
    ```bash
    cp .env.example .env
-   # Add your OPENAI_API_KEY to .env
+   # Edit .env with your OpenAI API Key and other settings
    ```
-4. Start the infrastructure (Qdrant, Postgres, Redis):
+4. Start infrastructure:
    ```bash
    docker-compose up -d
    ```
 
 ### Usage
-Execute the main application to start ingestion (if `./data` is populated) and the interactive retrieval loop:
+Ensure the directory named by `DATA_DIR` in your `.env` (`./data` by default) contains documents, then run:
 ```bash
 python main.py
 ```
 
-### Testing
-Run the comprehensive test suite:
+## Testing
+Run the test suite using pytest:
 ```bash
 pytest tests/
 ```
-## ⚖️ License
-Distributed under the Apache 2.0 License. See `LICENSE` for more information.
+
+## Purpose
+This project was developed to demonstrate a technically sound approach to building RAG systems. It focuses on clean architecture, error handling, and implementing advanced RAG patterns in a way that is maintainable and extensible.
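The "PII Masking" feature and the matching limitation in the README describe basic regex-based masking of emails and phone numbers. A minimal sketch of that idea is shown here. The patterns, the placeholder tokens, and the `mask_pii` name are illustrative assumptions, not the project's actual masker, and the phone pattern in particular will miss or over-match some real-world formats.

```python
import re

# Illustrative patterns only (assumptions, not taken from the repository).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# A digit, then at least seven digits/separators, then a closing digit.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace email addresses and phone-like digit runs with placeholders."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


if __name__ == "__main__":
    print(mask_pii("Contact jane.doe@example.com or +1 (555) 123-4567."))
```

Emails are masked first so that digits inside an address cannot be picked up by the phone pattern. As the Limitations section notes, this kind of masking is not a substitute for an NER-based system.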
@@ -8,21 +8,7 @@ services:
       - "6333:6333"
       - "6334:6334"
     volumes:
-      - ./qdrant_data:/qdrant/storage
-    networks:
-      - aether_network
-
-  postgres:
-    image: ankane/pgvector:latest
-    container_name: postgres
-    environment:
-      POSTGRES_USER: postgres
-      POSTGRES_PASSWORD: postgres
-      POSTGRES_DB: project_aether
-    ports:
-      - "5432:5432"
-    volumes:
-      - postgres_data:/var/lib/postgresql/data
+      - qdrant_data:/qdrant/storage
     networks:
       - aether_network
 
@@ -42,5 +28,5 @@ networks:
     driver: bridge
 
 volumes:
-  postgres_data:
+  qdrant_data:
   redis_data:
@@ -0,0 +1,22 @@
+# ADR 001: Simplified Semantic Caching with Redis
+
+## Status
+Accepted
+
+## Context
+RAG systems can be slow and expensive if the same queries are sent to the LLM repeatedly. We need a way to cache responses for identical or highly similar queries.
+
+## Decision
+We decided to implement a simplified caching layer using Redis as a key-value store.
+
+## Rationale
+- **Speed:** Redis provides sub-millisecond lookups for cached answers.
+- **Cost Reduction:** Intercepting recurrent queries prevents redundant LLM API calls.
+- **Implementation Clarity:** For the current project scope, a key-based lookup (using query strings) provides immediate value without the overhead of managing a separate vector index within Redis.
+
+## Consequences
+- **Positive:** Immediate performance gain for repeated queries and reduced costs.
+- **Negative:** The current implementation only hits on exact query matches (or on queries that become identical after pre-processing) and does not yet use vector-based similarity search in Redis.
+
+## Future Work
+In a production-grade system, this could be migrated to utilize Redis Stack's native vector search capabilities (HNSW) to allow for a true semantic cache.
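The key-based lookup described in this ADR can be sketched as follows. This is an assumption-laden illustration rather than the project's cache service: the `aether:answer:` key prefix, the lowercase-and-collapse-whitespace normalization, and the `cached_answer` helper are hypothetical, and an in-memory stand-in replaces Redis so the example runs without a server.

```python
import hashlib
from typing import Callable, Optional, Protocol, Tuple


class KeyValueStore(Protocol):
    """Minimal get/set surface this sketch needs from a Redis-like client."""
    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Stand-in for Redis so the example is self-contained."""
    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def cache_key(query: str) -> str:
    """Normalize case and whitespace, then hash to a fixed-length key."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return "aether:answer:" + digest  # hypothetical key prefix


def cached_answer(
    store: KeyValueStore, query: str, generate: Callable[[str], str]
) -> Tuple[str, bool]:
    """Return (answer, from_cache); call `generate` only on a cache miss."""
    key = cache_key(query)
    hit = store.get(key)
    if hit is not None:
        return hit, True
    answer = generate(query)
    store.set(key, answer)
    return answer, False
```

With this normalization, "What is RAG?" and "  what   is rag? " share one key, while a paraphrase such as "Explain RAG" still misses. That gap is what the vector-based approach under Future Work would close. A real Redis client would also need to return strings rather than bytes for the hit to be usable as written.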
@@ -1,25 +1,18 @@
-"""
-Project Aether - Entry Point
-Author: Gabriel (Gabaoun) Penha
-"""
 import os
 import asyncio
-import logging
 from qdrant_client import QdrantClient
-from src.processing.ingestion_workflow import IngestionWorkflow
-from src.retrieval.retrieval_workflow import RetrievalWorkflow, StreamingStatusEvent
-from src.utils.config import settings
+from src.pipeline.ingestion import IngestionWorkflow
+from src.pipeline.retrieval import RetrievalWorkflow, StreamingStatusEvent
+from src.config.settings import settings
+from src.utils.logger import logger
 from llama_index.core import set_global_handler
 
-# Setup logging
-logging.basicConfig(level=getattr(logging, settings.log_level))
-
-# Setup observability
+# Optional Observability
 try:
     set_global_handler("arize_phoenix", endpoint=settings.phoenix_collector_endpoint)
-    print(f"Observability enabled via Arize Phoenix at {settings.phoenix_collector_endpoint}")
+    logger.info(f"Observability enabled at {settings.phoenix_collector_endpoint}")
 except ImportError:
-    print("Arize Phoenix not installed. Skipping observability setup.")
+    logger.warning("Arize Phoenix not installed. Skipping observability.")
 
 async def main():
     qdrant_client = QdrantClient(url=settings.qdrant_url, api_key=settings.qdrant_api_key)
@@ -29,38 +22,39 @@ async def main():
 
     data_dir = settings.data_dir
     if os.path.exists(data_dir) and os.listdir(data_dir):
-        print(f"Starting ingestion for documents in {data_dir}...")
+        logger.info(f"Starting ingestion from {data_dir}...")
         index = await ingestion_wf.run(input_dir=data_dir)
     else:
-        print("Data directory is empty or does not exist. Skipping ingestion.")
+        logger.error(f"Data directory '{data_dir}' is empty or missing.")
         return
 
     retrieval_wf = RetrievalWorkflow(index=index)
 
     while True:
-        query = input("\nEnter your query (or 'exit' to quit): ")
-        if query.lower() == 'exit':
-            break
-            
-        print(f"\nProcessing query: {query}")
-        
-        # Stream workflow events
-        handler = retrieval_wf.run(query=query)
-        
-        async for event in handler.stream_events():
-            if isinstance(event, StreamingStatusEvent):
-                print(f"[Streaming] ⏳ {event.status}")
+        try:
+            query = input("\nEnter query (or 'exit'): ")
+            if query.lower() == 'exit':
+                break
 
-        result = await handler
-        
-        print("\n--- Answer ---")
-        if result.get("from_cache"):
-            print("[✅ RETURNED FROM REDIS CACHE]")
-        print(result["answer"])
-        print("\n--- Sources ---")
-        for node in result.get("source_nodes", []):
-            file_name = node.metadata.get('file_name', 'Unknown file')
-            print(f"- {file_name}")
+            handler = retrieval_wf.run(query=query)
+            
+            async for event in handler.stream_events():
+                if isinstance(event, StreamingStatusEvent):
+                    print(f"⏳ {event.status}")
+                    
+            result = await handler
+            
+            print("\n--- Answer ---")
+            if result.get("from_cache"):
+                print("[CACHED]")
+            print(result["answer"])
+            print("\n--- Sources ---")
+            for node in result.get("source_nodes", []):
+                print(f"- {node.metadata.get('file_name', 'Unknown')}")
+        except KeyboardInterrupt:
+            break
+        except Exception as e:
+            logger.error(f"Error: {e}")
 
 if __name__ == "__main__":
-    asyncio.run(main())
+    asyncio.run(main())
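One possible follow-up for the loop in `main()`: the built-in `input()` is a blocking call, so the event loop cannot make progress while it waits for the user. If background tasks are ever added, the prompt could be moved to a worker thread. A minimal sketch, where `read_query` and its `reader` parameter are hypothetical names introduced for illustration:

```python
import asyncio
from typing import Callable


async def read_query(prompt: str, reader: Callable[[str], str] = input) -> str:
    """Run a blocking prompt function in a worker thread; return its stripped result."""
    # asyncio.to_thread (Python 3.9+) keeps the event loop free while waiting.
    raw = await asyncio.to_thread(reader, prompt)
    return raw.strip()
```

Inside the loop this would replace the direct call, for example `query = await read_query("\nEnter query (or 'exit'): ")`. The injectable `reader` exists only so the function can be tested without a terminal.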