CrazyAngelm/megaquant-rag-compress

MegaQuant RAG Compress

Research PoC | CPU/Python | Stored vectors

Low-bit compression experiments for stored RAG/document vectors.

At a glance

In a small SQuAD exact-search proxy benchmark, the best current document-only method gives:

| What you care about | Result | Compared with |
| --- | --- | --- |
| Stored-vector payload size | 9.77% of float32 | 10.24x compression / 90.23% saving vs float32 |
| Retrieval quality | 98.69% recall@1 retention | 0.440009 vs 0.445840 float32 recall@1 |
| Better than old local baseline | 55.34% smaller and +4.92% recall@1 | vs older blockwise_seven_level_3bit result |

Main method:

doconly_affine3_g64_meta4
3.126250 effective bits/dim
0.440009 recall@1
98.69% recall@1 retention vs float32

Why document-only? In this benchmark, document vectors are the stored index payload, while query vectors are transient. Keeping queries float32 preserves quality better than quantizing both sides.
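The document-only idea can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function names `quantize_affine` and `dequantize_affine` are hypothetical, the data is random, and the per-group scale and offset are kept in float32 here rather than being compressed to a few bits of metadata.

```python
import numpy as np

def quantize_affine(x, bits=3, group=64):
    """Group-wise affine (min/max) quantization of each row of x.

    Hypothetical helper for illustration; assumes x.shape[1] % group == 0.
    """
    n, d = x.shape
    if d % group:
        raise ValueError("dimension must be a multiple of the group size")
    g = x.reshape(n, d // group, group)
    lo = g.min(axis=2, keepdims=True)
    hi = g.max(axis=2, keepdims=True)
    levels = 2 ** bits - 1
    # Guard against constant groups, where hi == lo.
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.rint((g - lo) / scale), 0, levels).astype(np.uint8)
    return codes, scale, lo

def dequantize_affine(codes, scale, lo):
    """Map integer codes back to approximate float vectors."""
    n = codes.shape[0]
    return (codes * scale + lo).reshape(n, -1)

rng = np.random.default_rng(0)
docs = rng.standard_normal((800, 256)).astype(np.float32)
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
queries = rng.standard_normal((5, 256)).astype(np.float32)
queries /= np.linalg.norm(queries, axis=1, keepdims=True)

# Only the stored document vectors are quantized; queries stay float32.
codes, scale, lo = quantize_affine(docs, bits=3, group=64)
docs_hat = dequantize_affine(codes, scale, lo)

scores_fp32 = queries @ docs.T
scores_doconly = queries @ docs_hat.T
```

Quantizing the queries as well would add a second source of rounding error to every dot product, which is consistent with the lower recall reported for the configuration that compresses both sides.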

Scope

This repository is a research proof-of-concept, not a production vector database.

The numbers above are:

  • from a small CPU/Python exact-search proxy benchmark,
  • based on TF-IDF + random projection vectors, not modern embedding models,
  • computed with modeled stored-vector payload accounting,
  • not a measurement of total vector-database memory.

They are not claims about FAISS/Qdrant/Milvus/LanceDB, ANN serving, HNSW/IVF memory, GPU search, or BEIR/MTEB quality.

Related repository

KV-cache companion project:

Benchmark setup

Dataset: SQuAD v1.1 dev paragraphs/questions
Docs: 800
Queries: 4460
Embedding proxy: TF-IDF + GaussianRandomProjection -> 256d + L2 normalize
Search: exact dense matrix search on CPU/Python

This is a micro-scale proxy benchmark. TF-IDF + random projection is not a modern semantic embedding model such as BGE, E5, GTE, or OpenAI embeddings. Results may change on larger corpora, denser candidate sets, real embedding models, or ANN indexes.
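For readers unfamiliar with this kind of embedding proxy, the pipeline can be approximated as follows. This is a NumPy-only stand-in under stated assumptions, not the benchmark's code: the whitespace tokenizer, the smoothed IDF formula, the toy texts, and the helper names `tfidf_matrix` and `project_and_normalize` are all illustrative choices.

```python
import numpy as np

def tfidf_matrix(texts):
    """Dense TF-IDF with smoothed IDF over whitespace tokens (toy tokenizer)."""
    tokenized = [t.lower().split() for t in texts]
    vocab = {w: i for i, w in enumerate(sorted({w for doc in tokenized for w in doc}))}
    tf = np.zeros((len(texts), len(vocab)))
    for row, doc in enumerate(tokenized):
        for w in doc:
            tf[row, vocab[w]] += 1.0
    df = (tf > 0).sum(axis=0)
    idf = np.log((1 + len(texts)) / (1 + df)) + 1.0
    return tf * idf

def project_and_normalize(x, components=256, seed=0):
    """Gaussian random projection to `components` dims, then L2 normalization."""
    rng = np.random.default_rng(seed)
    # Entries drawn from N(0, 1/components), the usual Gaussian projection scaling.
    proj = rng.normal(0.0, 1.0 / np.sqrt(components), size=(x.shape[1], components))
    z = x @ proj
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return (z / np.maximum(norms, 1e-12)).astype(np.float32)

texts = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "vector search compresses stored document vectors",
]
emb = project_and_normalize(tfidf_matrix(texts), components=256)
norms = np.linalg.norm(emb, axis=1)
```

Because the vectors are unit length, the dot product used in exact search equals cosine similarity.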

Headline result

Best method in this benchmark:

doconly_affine3_g64_meta4

Result:

effective_bits_per_dim = 3.126250
stored-vector memory saved = 90.230%
compression vs float32     = 10.236x
recall@1                   = 0.440009
recall@1 retention         = 98.69% of float32
MRR retention              = 98.99% of float32
score correlation          = 0.984499

Float32 baseline:

recall@1 = 0.445840
MRR      = 0.542880
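As a reference for how these metrics relate, here is a small sketch of recall@1, MRR, and retention computed from a dense score matrix. It assumes exactly one relevant document per query and counts only strictly higher scores when ranking; the function names are hypothetical and the score matrix is a toy example.

```python
import numpy as np

def recall_at_1_and_mrr(scores, gold):
    """scores: (num_queries, num_docs); gold: index of the relevant doc per query."""
    gold = np.asarray(gold)
    gold_scores = scores[np.arange(len(gold)), gold][:, None]
    # Rank of the gold doc = 1 + number of docs scoring strictly higher.
    ranks = 1 + (scores > gold_scores).sum(axis=1)
    return float((ranks == 1).mean()), float((1.0 / ranks).mean())

def retention(compressed, baseline):
    """Compressed metric as a percentage of the float32 metric."""
    return 100.0 * compressed / baseline

toy_scores = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.8, 0.5],
    [0.2, 0.6, 0.4],
])
r1, mrr = recall_at_1_and_mrr(toy_scores, gold=[0, 2, 2])

# Retention for the headline method, from the recall@1 values reported above.
headline_retention = retention(0.440009, 0.445840)
```

Rounded to two decimals, `headline_retention` matches the 98.69% figure reported for the headline method.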

Plain-language summary for this benchmark:

The stored vector payload is about 10x smaller, while recall@1 retention remains about 98.7% versus float32.

The memory number refers to compressed vector payload accounting, not total vector database footprint. For simulated low-bit metadata, it includes a small shared metadata-range overhead term. It does not include HNSW/IVF graph structures, IDs, metadata columns, allocator overhead, or packed-kernel layout overhead.
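The payload figures follow directly from the effective bits per dimension. The sketch below shows that arithmetic with float32 (32 bits per dimension) as the baseline; the function names are illustrative, and the byte count is modeled payload only, as described above.

```python
def payload_stats(effective_bits_per_dim, baseline_bits=32.0):
    """Convert effective bits/dim into payload share, saving, and compression ratio."""
    fraction = effective_bits_per_dim / baseline_bits
    return {
        "payload_pct_of_float32": 100.0 * fraction,
        "memory_saved_pct": 100.0 * (1.0 - fraction),
        "compression_ratio": 1.0 / fraction,
    }

def payload_bytes(num_vectors, dims, bits_per_dim):
    """Modeled payload size in bytes (no graph, ID, or allocator overhead)."""
    return num_vectors * dims * bits_per_dim / 8.0

stats = payload_stats(3.126250)
compressed_bytes = payload_bytes(800, 256, 3.126250)
float32_bytes = payload_bytes(800, 256, 32.0)
```

For the 800 x 256 benchmark index this is roughly 80 KB of modeled payload against 819,200 bytes for float32.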

Current frontier table

| Method | Effective bits/dim | Stored-vector memory saved | Recall@1 | Recall@1 retention | MRR retention | Notes |
| --- | --- | --- | --- | --- | --- | --- |
| doconly_affine2_g64_meta4 | 2.126250 | 93.355% | 0.409509 | 91.85% | 93.50% | best ultra-compact point tested here |
| doconly_affine3_g64_meta4 | 3.126250 | 90.230% | 0.440009 | 98.69% | 98.99% | best tradeoff tested here |
| affine3_g64_meta4 | 3.126250 | 90.230% | 0.430366 | 96.53% | 97.47% | compress docs and queries |
| nf3_g64_meta8 | 3.125625 | 90.232% | 0.428347 | 96.08% | 97.06% | nonuniform codebook variant |
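The effective bits/dim values can be decomposed into code bits, per-group metadata, and a small shared term. The accounting below is one reading of the method names that happens to reproduce the table's figures; it is an assumption, not taken from the repository's code. In particular, the number of metadata fields per group (two for affine, one for nf3) and the 128-bit shared term per field are guesses, and `effective_bits_per_dim` is a hypothetical helper.

```python
def effective_bits_per_dim(code_bits, group_size, meta_bits, meta_fields,
                           shared_bits_per_field, total_dims):
    """Hypothetical accounting: code bits + per-group metadata + shared overhead."""
    per_group = meta_fields * meta_bits / group_size
    shared = meta_fields * shared_bits_per_field / total_dims
    return code_bits + per_group + shared

TOTAL_DIMS = 800 * 256  # docs x dimensions from the benchmark setup

# Assumed: affine stores two 4-bit fields per group of 64 (e.g. scale and offset).
affine3 = effective_bits_per_dim(3, 64, 4, 2, 128, TOTAL_DIMS)
affine2 = effective_bits_per_dim(2, 64, 4, 2, 128, TOTAL_DIMS)
# Assumed: nf3 stores one 8-bit field per group of 64.
nf3 = effective_bits_per_dim(3, 64, 8, 1, 128, TOTAL_DIMS)
```

Under these assumptions the per-group metadata costs 0.125 bits/dim in every configuration, and the shared term is what separates 3.126250 from 3.125625.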

Recommended methods

Main method

doconly_affine3_g64_meta4

Use this as the main PoC configuration when you want near-float32 metrics in this small exact-search benchmark with a modeled stored-vector payload about 10.2x smaller than float32.

Ultra-compact method

doconly_affine2_g64_meta4

Result:

effective_bits_per_dim = 2.126250
stored-vector memory saved = 93.355%
recall@1 retention = 91.85%
MRR retention      = 93.50%

Use when stored-vector memory is more important than maximum benchmark recall.
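The choice between the two recommended methods can be framed as picking the smallest payload that still meets a recall@1 retention floor. The sketch below applies that rule to the frontier numbers reported in this README; the `smallest_method` helper and the selection rule itself are illustrative, not part of the repository.

```python
# (method, effective bits/dim, recall@1 retention in %), copied from the frontier table.
FRONTIER = [
    ("doconly_affine2_g64_meta4", 2.126250, 91.85),
    ("doconly_affine3_g64_meta4", 3.126250, 98.69),
    ("affine3_g64_meta4", 3.126250, 96.53),
    ("nf3_g64_meta8", 3.125625, 96.08),
]

def smallest_method(min_recall_retention_pct):
    """Return the lowest-bit method meeting the retention floor, or None."""
    ok = [row for row in FRONTIER if row[2] >= min_recall_retention_pct]
    if not ok:
        return None
    # Prefer fewer bits; break ties by higher retention.
    return min(ok, key=lambda row: (row[1], -row[2]))[0]

compact_choice = smallest_method(90.0)
quality_choice = smallest_method(98.0)
no_choice = smallest_method(99.5)
```

With a 90% floor the ultra-compact method is selected, with a 98% floor the main method is selected, and no tested configuration meets a 99.5% floor.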

Reproduce

Install dependencies:

python -m pip install -r requirements.txt

Place SQuAD files in the repository root as described in DATA.md.

Run the current frontier benchmark from the repository root:

python scripts/run_frontier_benchmark.py \
  --docs 800 \
  --components 256 \
  --output-csv results/frontier_rag_benchmark.csv \
  --output-md reports/frontier_benchmark_report.md

Reports

Current reports:

  • reports/frontier_summary.md
  • reports/frontier_benchmark_report.md

Results

  • results/frontier_rag_benchmark.csv

Changelog

  • CHANGELOG.md

Related prior-work topics

A public-facing version should position this against standard retrieval compression and ANN topics: scalar quantization, product quantization (PQ/OPQ), residual quantization, binary quantization, HNSW/IVF/ScaNN, BEIR/MTEB evaluation, exact search versus ANN serving, and static index compression versus query-time compression.

Honest limitations

This project currently demonstrates only two things: retrieval quality under CPU/Python exact search, and a modeled stored-vector memory saving.

Not yet proven:

  • production ANN/vector database speed,
  • HNSW/IVF/PQ integration,
  • GPU search,
  • behavior on modern embedding models such as OpenAI text embeddings, BGE, E5, or GTE,
  • large-scale BEIR/MTEB retrieval quality,
  • packed integer index implementation,
  • total vector database memory savings including graph/ID/metadata overhead.

Conservative claim:

In this small CPU/Python exact-search proxy benchmark, document-only affine3_g64_meta4 compression gives the best observed stored-vector memory/quality tradeoff among the tested MegaQuant RAG configurations: about 90% modeled stored-vector memory saving while retaining about 98.7% of float32 recall@1.


Repository positioning

This repository is public as a research PoC for compressed RAG/vector indexes. It is not a production vector database engine.

Suggested GitHub topics after public release:

rag vector-search embeddings compression quantization retrieval ai-search
