This document provides a detailed, step-by-step explanation of the internal workflows of the SCALAR 2.0 application. It traces the journey of a request from the moment it is sent until a response is returned, covering both the document ingestion and search processes.
This workflow is initiated when a user uploads a PDF document to be added to the vector database.
- File: `api/main.py`
- Function: `upload_pdf()`
- A client (such as the Streamlit UI or a `curl` command) sends an `HTTP POST` request with a `multipart/form-data` payload to the `/upload` endpoint.
- The FastAPI server receives the request. It validates that the uploaded file's `Content-Type` is `application/pdf`. If not, it rejects the request with a `400 Bad Request` error.
- The file is saved to a temporary directory (`./temp_uploads`) on the server's local disk. This is necessary so that the `PyMuPDF` library can access it via a file path.
- The API then calls the core processing method, `vector_service.process_and_store_pdf()`, passing the path to the temporary file and the original filename.
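
The validation and temporary-save steps above can be sketched without FastAPI, using only the standard library. This is a minimal illustration, not the application's code: `stage_upload` and `UnsupportedFileType` are hypothetical names, and the real endpoint would return a `400 Bad Request` response where this sketch raises an exception.

```python
from pathlib import Path

TEMP_DIR = Path("./temp_uploads")  # directory name taken from the description above


class UnsupportedFileType(Exception):
    """Hypothetical error; an API layer would translate it to 400 Bad Request."""


def stage_upload(filename: str, content_type: str, data: bytes,
                 temp_dir: Path = TEMP_DIR) -> Path:
    """Check the declared content type, then write the bytes to a temp path."""
    if content_type != "application/pdf":
        raise UnsupportedFileType(
            f"expected application/pdf, got {content_type!r}"
        )
    temp_dir.mkdir(parents=True, exist_ok=True)
    # Keep only the final path component so a crafted filename
    # cannot point outside temp_dir (an extra precaution, assumed here).
    temp_path = temp_dir / Path(filename).name
    temp_path.write_bytes(data)
    return temp_path
```

The returned path is what would be handed to the processing method together with the original filename.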
- File: `core/vector_service.py`
- Function: `process_and_store_pdf()`
- Duplicate Check: The first action is to check whether a document with the same filename has already been indexed. The method iterates through the `self.metadata` list and checks whether any entry's `source` key matches the incoming `filename`. If a match is found, it raises a `ValueError`, which is caught by the API layer and returned as a `422 Unprocessable Entity` error.
- Text Extraction: The `fitz` (PyMuPDF) library opens the PDF file. The method iterates through each page, extracts the raw text, and checks whether the page contains any text.
- Data Structuring: For each page with text, a LangChain `Document` object is created. This object holds the `page_content` (the text) and its associated `metadata` (the source filename and page number), which preserves the context of each piece of text. If no text is extracted from the entire PDF, a `ValueError` is raised.
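
The duplicate check and per-page structuring can be sketched as follows. This is a simplified stand-in under stated assumptions: plain dictionaries replace LangChain `Document` objects, the page texts are passed in directly instead of being read with PyMuPDF, and the `page` key and 1-based page numbering are illustrative choices. Only the `source` key comes from the description above.

```python
def build_page_records(pages: list[str], filename: str,
                       existing_metadata: list[dict]) -> list[dict]:
    """Return one record per non-empty page, or raise ValueError."""
    # Duplicate check: scan the stored metadata for the same source filename.
    if any(m.get("source") == filename for m in existing_metadata):
        raise ValueError(f"'{filename}' has already been indexed")

    records = []
    for page_number, text in enumerate(pages, start=1):
        if not text.strip():
            continue  # skip pages with no extractable text
        records.append({
            "page_content": text,
            "metadata": {"source": filename, "page": page_number},
        })

    # An image-only or empty PDF yields no records at all.
    if not records:
        raise ValueError(f"no text could be extracted from '{filename}'")
    return records
```

Both failure cases raise `ValueError`, which matches the single exception type the API layer is described as catching.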
- File: `core/vector_service.py`
- Function: `process_and_store_pdf()`
- Text Chunking: The list of `Document` objects is passed to LangChain's `RecursiveCharacterTextSplitter`, which splits the text from all pages into smaller, overlapping chunks (at most 1000 characters, with a 200-character overlap). It prefers to split at paragraph breaks, then line breaks, then sentences, so that each chunk stays as semantically coherent as possible. The output is a new list of smaller `Document` objects, each representing one chunk.
- Data Separation: The content and metadata from the chunks are separated into two lists: `chunk_texts` (a list of strings) and `chunk_metadata` (a list of dictionaries).
- Vector Embedding: The `chunk_texts` list is passed to the `SentenceTransformer` model's `encode()` method. This model transforms each text chunk into a 384-dimensional numerical vector (embedding). The result is a single NumPy array in which each row is one vector.
- Normalization: The vectors in the NumPy array are normalized to unit length (L2 normalization). For unit-length vectors, the inner product of two vectors equals their cosine similarity, so this step is required for the inner-product metric to rank results by cosine similarity.
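
Two ideas from this step, overlapping chunks and L2 normalization, can be illustrated in pure Python. This is a sketch under stated assumptions: `sliding_chunks` is a fixed-window simplification and does not reproduce the separator-aware recursion of `RecursiveCharacterTextSplitter`, and plain lists stand in for the NumPy array. All function names are hypothetical.

```python
import math


def sliding_chunks(text: str, chunk_size: int = 1000,
                   overlap: int = 200) -> list[str]:
    """Fixed-window stand-in: each chunk starts `overlap` chars before
    the previous chunk ended."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def l2_normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length; an all-zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))
```

After normalization, `dot(l2_normalize(a), l2_normalize(b))` gives the cosine similarity of `a` and `b`, which is why an inner-product index can be used for cosine ranking.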
- File: `core/vector_service.py`
- Functions: `process_and_store_pdf()`, `_save_data()`
- Add to Index: The NumPy array of vectors is added to the in-memory FAISS index (`self.index.add()`). FAISS automatically assigns an implicit, sequential ID (its position) to each vector.
- Add to Lists: The `chunk_metadata` and `chunk_texts` lists are appended to the service's master lists, `self.metadata` and `self.content`. The position of each item in these lists corresponds directly to the position of its vector in the FAISS index.
- Save to Disk: The `_save_data()` method is called. It uses `faiss.write_index()` to save the vector index to `index.faiss`, and `pickle.dump()` to serialize the `metadata` and `content` lists to `metadata.pkl` and `content.pkl`.
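
The positional link between the index and the two lists can be made concrete with a toy store. This is an illustrative sketch, not the service's implementation: `ParallelStore` is a hypothetical class, and a plain list of vectors stands in for the FAISS index. The file names match those described above.

```python
import pickle
from pathlib import Path


class ParallelStore:
    """Toy store: vectors[i], metadata[i] and content[i] describe one chunk."""

    def __init__(self) -> None:
        self.vectors: list[list[float]] = []  # stands in for the FAISS index
        self.metadata: list[dict] = []
        self.content: list[str] = []

    def add(self, vectors, chunk_metadata, chunk_texts) -> list[int]:
        """Append a batch and return the implicit sequential IDs it received."""
        if not (len(vectors) == len(chunk_metadata) == len(chunk_texts)):
            raise ValueError("vectors, metadata and texts must be equal in length")
        first_id = len(self.vectors)
        self.vectors.extend(vectors)
        self.metadata.extend(chunk_metadata)
        self.content.extend(chunk_texts)
        return list(range(first_id, len(self.vectors)))

    def save(self, directory: Path) -> None:
        """Persist the two lists; a real service would also write the index,
        e.g. with faiss.write_index(index, "index.faiss")."""
        directory.mkdir(parents=True, exist_ok=True)
        with open(directory / "metadata.pkl", "wb") as f:
            pickle.dump(self.metadata, f)
        with open(directory / "content.pkl", "wb") as f:
            pickle.dump(self.content, f)
```

Because the IDs are only positions, all three structures must always be extended together and in the same order. The length check in `add` guards that invariant.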
- File: `api/main.py`
- Function: `upload_pdf()`
- Control returns to the API layer after the `vector_service` method completes successfully.
- The `finally` block of the surrounding `try` statement executes, deleting the file from the temporary directory. Because it is a `finally` block, this cleanup also runs when processing fails.
- The API sends a `200 OK` response to the client with a JSON body confirming the successful ingestion.
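
The cleanup behavior can be sketched with a plain `try`/`finally`. This is a minimal illustration under stated assumptions: `process_then_cleanup` is a hypothetical helper, `process` stands in for `vector_service.process_and_store_pdf()`, and the `{"status": "success"}` body is an assumed shape for the confirmation message.

```python
import os
from typing import Callable


def process_then_cleanup(temp_path: str,
                         process: Callable[[str], None]) -> dict:
    """Run `process` on the temp file and always delete the file afterwards."""
    try:
        process(temp_path)
        return {"status": "success"}  # assumed response body
    finally:
        # Runs on success and on failure, so the temp file never lingers.
        if os.path.exists(temp_path):
            os.remove(temp_path)
```

If `process` raises, the exception still propagates to the caller after the file is removed, so the API layer can map it to an error response.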
This workflow is initiated when a user submits a query to find relevant information within the indexed documents.
- File: `api/main.py`
- Function: `search()`
- A client sends an `HTTP POST` request to the `/search` endpoint with a JSON body containing the `query_text` and the desired number of results, `k`.
- FastAPI uses the `SearchQuery` Pydantic model to automatically validate the request body. If the data is invalid (e.g., `query_text` is missing), it returns a `422 Unprocessable Entity` error.
- The API calls the core search method, `vector_service.search()`, passing the query text and `k`.
- File: `core/vector_service.py`
- Function: `search()`
- Database Check: The function first checks whether the database contains any vectors. If the index is empty (`self.index.ntotal == 0`), it returns an error dictionary.
- Query Embedding: The user's `query_text` (a single string) is passed to the same `SentenceTransformer` model, which converts it into a 384-dimensional vector. Using the same model places the query and the documents in the same vector space. The query vector is L2-normalized in the same way as the document vectors.
- FAISS Search: The `self.index.search()` method is called. This is the core retrieval step. The FAISS HNSW algorithm navigates its internal graph structure to find the `k` vectors in the index that are most similar to the query vector. HNSW is an approximate nearest-neighbor method, so it trades a small amount of recall for speed.
- Search Output: The search returns two NumPy arrays:
  - `distances`: An array of similarity scores (higher is more similar).
  - `indices`: An array containing the integer positions (the implicit IDs) of the `k` most similar vectors in the index.
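
To show what the search step returns, the sketch below performs an exact, brute-force inner-product search in pure Python. It is a stand-in for illustration only: it is not the HNSW algorithm, `top_k_inner_product` is a hypothetical name, and it returns flat lists, whereas FAISS returns 2D arrays with one row per query.

```python
def top_k_inner_product(query: list[float],
                        vectors: list[list[float]],
                        k: int) -> tuple[list[float], list[int]]:
    """Return (distances, indices) of the k highest inner products."""
    scored = [
        (sum(q * v for q, v in zip(query, vec)), position)
        for position, vec in enumerate(vectors)
    ]
    # Higher inner product means more similar, so sort in descending order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:k]
    distances = [score for score, _ in top]
    indices = [position for _, position in top]
    return distances, indices
```

When the index holds fewer than `k` vectors, this sketch simply returns shorter lists. FAISS instead pads the missing slots of `indices` with `-1`, which the lookup step needs to skip.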
- File: `core/vector_service.py`
- Function: `search()`
- Data Lookup: The function iterates through the `indices` array returned by FAISS.
- For each index `idx`, it retrieves the corresponding metadata and content from the in-memory lists: `self.metadata[idx]` and `self.content[idx]`.
- Response Packaging: The retrieved metadata and content are combined into a structured dictionary for each result. The similarity score from the `distances` array is also included.
- All result dictionaries are collected into a final list, which is then wrapped in a parent dictionary: `{"results": [...]}`. This dictionary is returned to the API layer.
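
The lookup and packaging steps can be sketched as a single function. This is a minimal illustration with assumed names: `package_results` is hypothetical, and the per-result keys `score`, `metadata` and `content` are illustrative. Only the outer `{"results": [...]}` shape comes from the description above.

```python
def package_results(distances: list[float], indices: list[int],
                    metadata: list[dict], content: list[str]) -> dict:
    """Map index positions back to stored metadata and text."""
    results = []
    for score, idx in zip(distances, indices):
        if idx < 0:
            continue  # FAISS uses -1 for slots it could not fill
        results.append({
            "score": float(score),      # plain float, so it serializes to JSON
            "metadata": metadata[idx],
            "content": content[idx],
        })
    return {"results": results}
```

The lookup works only because each list position matches the vector's position in the index, so the lists and the index must be loaded from the same saved state.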
- File: `api/main.py`
- Function: `search()`
- The API layer receives the dictionary of results from the `vector_service`.
- If the dictionary contains an `"error"` key, it raises an `HTTPException`.
- Otherwise, it sends a `200 OK` response to the client, with the structured list of search results as the JSON body.