[Feature] Improve retrieval quality by adding reranking layer to HybridRetriever #115

@GovindhKishore

Problem

The current HybridRetriever in csv_chroma.py retrieves documents
from multiple subdirectories using BM25 + SelfQuery + MultiQuery
expansion, resulting in ~90 documents being passed to
create_stuff_documents_chain.

All retrieved documents are stuffed directly into the LLM prompt
with no relevance filtering across subdirectories. This causes:

  • Responses grow longer as more data sources are added
  • There is no cross-subdirectory relevance ranking: a low-relevance
    document from one subdirectory is treated the same as a
    high-relevance document from another
  • The LLM receives noisy context, which reduces answer precision

Proposed Solution

Add a reranking layer after weighted_reciprocal_rank and before
returning documents in both retrieve_documents() and
aretrieve_documents().

The reranker scores all retrieved documents against the original user
query using a cross-encoder model and returns only the top N most
relevant documents regardless of which subdirectory they came from.
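
A minimal sketch of the rerank step described above, covering both the sync and async paths. The names `Doc`, `rerank_top_n`, `arerank_top_n` and `token_overlap` are hypothetical; `Doc` stands in for a LangChain Document, and `token_overlap` is a toy scorer used only so the example runs without a model. A cross-encoder would be plugged in as `score_fn` instead.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Callable, Sequence


@dataclass
class Doc:
    """Stand-in for a LangChain Document (page_content + metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# A scorer maps (query, text) to a relevance score; higher means more relevant.
ScoreFn = Callable[[str, str], float]


def token_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query tokens that appear in the text.
    This is a placeholder for a cross-encoder, not a real relevance model."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / len(q) if q else 0.0


def rerank_top_n(
    query: str, docs: Sequence[Doc], score_fn: ScoreFn, top_n: int
) -> list[Doc]:
    """Score every doc against the query and keep the top_n best,
    regardless of which subdirectory each doc came from."""
    scored = [(score_fn(query, d.page_content), i, d) for i, d in enumerate(docs)]
    # Sort by score descending; the original index breaks ties so the
    # order produced by the earlier fusion step is preserved among equals.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [d for _, _, d in scored[:top_n]]


async def arerank_top_n(
    query: str, docs: Sequence[Doc], score_fn: ScoreFn, top_n: int
) -> list[Doc]:
    """Async variant: run the CPU-bound scoring in a worker thread so the
    event loop is not blocked."""
    return await asyncio.to_thread(rerank_top_n, query, docs, score_fn, top_n)
```

Under this sketch, retrieve_documents() would end with `return rerank_top_n(query, fused_docs, score_fn, top_n)` and aretrieve_documents() with the awaited variant, where `fused_docs` is assumed to be the output of weighted_reciprocal_rank.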

Implementation:

  • Add src/retrievers/reranker.py using FlashRank
    (ms-marco-MiniLM-L-12-v2)
  • Modify return statements in both sync and async retrieve methods
    in csv_chroma.py
  • Add reranker configuration to config_default.yml
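
One possible shape for src/retrievers/reranker.py, as a sketch rather than the final implementation. The class name `FlashRankReranker`, the helpers `to_passages` and `pick_top_n`, and the default `top_n=10` are assumptions. The FlashRank calls used (`Ranker`, `RerankRequest`, `Ranker.rerank`, passage dicts with `id`, `text` and `meta`) follow the library's documented usage, but should be checked against the installed version.

```python
from typing import Any, Sequence


def to_passages(docs: Sequence[Any]) -> list[dict]:
    """Convert Document-like objects (.page_content, .metadata) into the
    passage dicts FlashRank expects. The list index is used as the id so
    results can be mapped back to the original objects."""
    return [
        {"id": i, "text": d.page_content, "meta": dict(d.metadata)}
        for i, d in enumerate(docs)
    ]


def pick_top_n(docs: Sequence[Any], results: Sequence[dict], top_n: int) -> list[Any]:
    """Map scored passage dicts back to the original documents,
    highest score first, keeping at most top_n."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [docs[r["id"]] for r in ranked[:top_n]]


class FlashRankReranker:
    """Hypothetical wrapper around FlashRank; name and defaults are assumptions."""

    def __init__(self, model_name: str = "ms-marco-MiniLM-L-12-v2", top_n: int = 10):
        # Imported lazily so this module still loads when flashrank is absent.
        from flashrank import Ranker

        self._ranker = Ranker(model_name=model_name)
        self.top_n = top_n

    def rerank(self, query: str, docs: Sequence[Any]) -> list[Any]:
        """Return the top_n documents for the query, same type as the input."""
        from flashrank import RerankRequest

        if not docs:
            return []
        request = RerankRequest(query=query, passages=to_passages(docs))
        results = self._ranker.rerank(request)
        return pick_top_n(docs, results, self.top_n)
```

Mapping results back by index returns the original objects untouched, so callers keep receiving list[Document] and no metadata is lost in the round trip.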

Why FlashRank:

  • Runs locally - no API key required
  • CPU only - no GPU needed
  • Lightweight (FlashRank's smallest model, ms-marco-TinyBERT-L-2-v2,
    is ~4MB; ms-marco-MiniLM-L-12-v2 is ~34MB)
  • Already compatible with existing list[Document] pipeline

Impact

  • Applies automatically to both Reactome and UniProt retrievers
    since csv_chroma.py is shared
  • Any future database integrations get reranking for free
  • Response length directly controlled via top_n config parameter
  • Zero changes to downstream pipeline - same list[Document] type
    returned throughout
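
As an illustration of how top_n could be driven from configuration, the sketch below reads a hypothetical `reranker` section from the parsed contents of config_default.yml. The key names (`enabled`, `model_name`, `top_n`) and their defaults are assumptions, not part of the existing config.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class RerankerConfig:
    # All field names and defaults here are hypothetical.
    enabled: bool = True
    model_name: str = "ms-marco-MiniLM-L-12-v2"
    top_n: int = 10

    def __post_init__(self) -> None:
        # Reject values that would return no documents at all.
        if self.top_n < 1:
            raise ValueError("reranker.top_n must be at least 1")

    @classmethod
    def from_mapping(cls, cfg: Mapping[str, Any]) -> "RerankerConfig":
        """Build the config from a parsed YAML mapping, falling back to
        the defaults for any key that is missing."""
        section = cfg.get("reranker") or {}
        known = ("enabled", "model_name", "top_n")
        return cls(**{k: section[k] for k in known if k in section})
```

With `enabled` set to false, the retriever could skip the reranker and return the fused list unchanged, which keeps the current behavior available as a fallback.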
