Commit 71df51f

Merge pull request #54 from BPMSoftwareSolutions/feat/53-rag-upgrade
feat(#53): Phase 1 RAG Upgrade - Real embeddings, FAISS, LLM rewriting, and reranking
2 parents d13b0c7 + cae60d0 commit 71df51f

11 files changed

Lines changed: 1379 additions & 278 deletions

PHASE_1_UPGRADE_PLAN.md

Lines changed: 380 additions & 0 deletions
@@ -0,0 +1,380 @@
# Phase 1 RAG Upgrade Plan: Skeleton → Production-Ready

## Current State (Skeleton)
- ✅ Document chunking: Converts experiences.json to RAG documents
- ✅ JSON vector store: Stores documents with metadata
- ✅ Retrieval pipeline: Returns top-K results
- ❌ **Synthetic embeddings**: Hash-based, not semantic
- ❌ **No LLM integration**: No AI-powered rewriting
- ❌ **No vector DB**: JSON-based, not scalable
- ❌ **No reranking**: No quality filtering

## Feedback from Code Review

> "It's a good RAG skeleton (index → retrieve → use context), but not a full, production RAG."

### What's Missing for "Real RAG"

1. **Real embeddings** - Swap fake hash/sine vectors for actual ML embeddings
2. **A vector store** - Use FAISS/Chroma/pgvector instead of JSON
3. **LLM generation & planning** - Use LLM to rewrite bullets with evidence constraints
4. **Reranking** (optional) - Cross-encoder to improve top-K quality

---

## Upgrade Tasks

### A) Real Embeddings (sentence-transformers)

**Current**: Hash-based embeddings (deterministic but not semantic)
```python
import math
from typing import List

def _generate_embedding(self, text: str) -> List[float]:
    # Character-code sum ignores order and meaning; only the total matters
    hash_val = sum(ord(c) for c in text)
    embedding = []
    for i in range(384):
        embedding.append(math.sin((hash_val + i) * 0.1) * 0.5 + 0.5)
    return embedding
```

**Target**: Real semantic embeddings
```python
from typing import List

from sentence_transformers import SentenceTransformer

class RAGIndexer:
    def __init__(self):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")

    def _embed(self, texts: List[str]) -> List[List[float]]:
        # Unit-length vectors, so inner product equals cosine similarity
        return self.embedder.encode(texts, normalize_embeddings=True).tolist()
```

**Changes**:
- [ ] Add `sentence-transformers` dependency
- [ ] Update `RAGIndexer._embed()` to use SentenceTransformer
- [ ] Update `Retriever._query_embedding()` to use SentenceTransformer
- [ ] Re-index all documents with real embeddings
- [ ] Update tests to validate semantic similarity

**Impact**: Semantic search instead of hash-based similarity
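
To see why the hash-based approach is not semantic, note that summing character codes ignores character order, so any two anagrams receive the same vector. A minimal sketch, assuming a standalone copy of the skeleton's function (the name `hash_embedding` is illustrative and not taken from the codebase):

```python
import math
from typing import List


def hash_embedding(text: str, dim: int = 384) -> List[float]:
    """Standalone copy of the skeleton's hash-based embedding (illustrative name)."""
    hash_val = sum(ord(c) for c in text)
    return [math.sin((hash_val + i) * 0.1) * 0.5 + 0.5 for i in range(dim)]


# Anagrams share the same character-code sum, so their vectors are identical
# even though the words are unrelated in meaning.
same = hash_embedding("listen") == hash_embedding("silent")
print(same)  # prints True
```

A test for the upgraded embedder could assert the opposite property: unrelated texts should score lower than paraphrases.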

---

### B) Vector Database (FAISS)

**Current**: JSON file with linear search
```python
# O(n) search through all documents
for doc in self.documents:
    score = self._cosine_similarity(query_embedding, doc.embedding)
```

**Target**: FAISS for efficient similarity search
```python
import json

import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

class Retriever:
    def __init__(self, vector_store_path):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.index = faiss.IndexFlatIP(384)  # Inner product == cosine on unit vectors

        # Load documents and build index
        # (assumes the store is a JSON list of dicts with an "embedding" key)
        with open(vector_store_path) as f:
            self.documents = json.load(f)
        embeddings = np.array([d["embedding"] for d in self.documents], dtype="float32")
        self.index.add(embeddings)

    def retrieve(self, query: str, top_k: int = 10):
        # Normalize the query too, so scores are true cosine similarities
        q_emb = self.embedder.encode([query], normalize_embeddings=True).astype("float32")
        scores, indices = self.index.search(q_emb, top_k)
        # FAISS pads with -1 when the index holds fewer than top_k vectors
        return [(self.documents[i], float(scores[0][j]))
                for j, i in enumerate(indices[0]) if i != -1]
```

**Changes**:
- [ ] Add `faiss-cpu` dependency (or `faiss-gpu` for production)
- [ ] Update `RAGIndexer` to build FAISS index
- [ ] Update `Retriever` to use FAISS for search
- [ ] Save/load FAISS index alongside JSON metadata
- [ ] Update tests for FAISS integration

**Impact**: Fast, vectorized exact search. Note that `IndexFlatIP` still scans every vector (O(n) per query); sub-linear search over millions of documents requires an approximate index such as `IndexIVFFlat` or `IndexHNSWFlat`
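
For intuition, the search a flat inner-product index performs can be written in a few lines of plain Python. This is only a sketch of the computation, not FAISS's implementation; the function names and the toy 2-D vectors are illustrative assumptions:

```python
import heapq
import math
from typing import List, Sequence, Tuple


def normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def flat_ip_search(query: Sequence[float], vectors: Sequence[Sequence[float]],
                   top_k: int) -> List[Tuple[int, float]]:
    """Exact top-k by inner product: every stored vector is scored once."""
    scored = ((i, sum(q * v for q, v in zip(query, vec)))
              for i, vec in enumerate(vectors))
    return heapq.nlargest(top_k, scored, key=lambda pair: pair[1])


# Toy 2-D vectors standing in for 384-D embeddings (illustrative data only)
store = [normalize(v) for v in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0])]
hits = flat_ip_search(normalize([2.0, 1.0]), store, top_k=2)
print([i for i, _ in hits])  # prints [2, 0]
```

Because both the stored vectors and the query are unit length, each score is the cosine of the angle between them, which is why normalizing at index time and at query time matters.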

---

### C) LLM-Powered Rewriting

**Current**: No LLM integration, just retrieval
```python
# tailor.py just injects retrieved context into prompt
rag_context = retriever.retrieve(requirement)
# Context passed to LLM but no special handling
```

**Target**: Evidence-constrained LLM rewriting
```python
from openai import OpenAI

def rewrite_with_evidence(bullet: str, evidence: str, requirement: str) -> str:
    """Rewrite bullet using retrieved evidence as constraint."""
    client = OpenAI()  # Reads OPENAI_API_KEY from the environment

    prompt = f"""Rewrite this resume bullet to match the job requirement.
    Use ONLY facts from the EVIDENCE. Do not invent metrics or skills.

    REQUIREMENT: {requirement}
    ORIGINAL BULLET: {bullet}
    EVIDENCE: {evidence}

    Rewrite the bullet to:
    1. Use active voice and strong verbs
    2. Include quantified impact (%, $, or other measured improvement) only when the EVIDENCE states it
    3. Highlight relevant skills from the requirement
    4. Keep it under 150 characters

    Return ONLY the rewritten bullet, no explanation."""

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.2,  # Low temperature for consistency
        max_tokens=100
    )

    # content can be None (e.g. on a refusal), so guard before stripping
    return (response.choices[0].message.content or "").strip()
```

**Integration in tailor.py**:
```python
def select_and_rewrite_with_rag(experience, keywords, rag_context=None):
    tailored = []
    for job in experience:
        top_bullets = score_bullets(job["bullets"], keywords)[:3]

        rewritten = []
        for bullet in top_bullets:
            # Get evidence for this bullet
            evidence = retrieve_evidence_for_bullet(bullet, rag_context)

            # Rewrite with LLM using evidence
            improved = rewrite_with_evidence(
                bullet["text"],
                evidence,
                keywords[0]  # Primary requirement
            )
            rewritten.append(improved)

        job_data = {**job, "selected_bullets": rewritten}
        tailored.append(job_data)

    return tailored
```

**Changes**:
- [ ] Create `src/rag/llm_rewriter.py` with `rewrite_with_evidence()`
- [ ] Update `tailor.py` to use LLM rewriter
- [ ] Add OpenAI API key configuration
- [ ] Implement evidence extraction from retrieved context
- [ ] Add error handling for LLM failures
- [ ] Update tests with mock LLM responses

**Impact**: AI-powered bullet rewriting with evidence constraints
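
The error-handling and evidence-constraint items could be combined in a thin wrapper around the rewriter. The following is a sketch under stated assumptions: `safe_rewrite`, `numbers_in`, and the digit-matching heuristic are hypothetical and not part of the plan; the rewriter is passed in as a callable so the wrapper can be tested with stubs instead of real API calls.

```python
import re
from typing import Callable, Set

MAX_BULLET_CHARS = 150  # Mirrors the "under 150 characters" rule in the prompt


def numbers_in(text: str) -> Set[str]:
    """Extract numeric tokens such as '40', '3.5' or '1,200'."""
    return set(re.findall(r"\d+(?:[.,]\d+)*", text))


def safe_rewrite(bullet: str, evidence: str, requirement: str,
                 rewrite_fn: Callable[[str, str, str], str]) -> str:
    """Call the rewriter; fall back to the original bullet on any failure
    or when the output breaks the length or evidence constraints."""
    try:
        candidate = rewrite_fn(bullet, evidence, requirement)
    except Exception:
        return bullet  # API error, timeout, missing key, etc.
    if not candidate or len(candidate) >= MAX_BULLET_CHARS:
        return bullet
    # Reject metrics that appear in neither the evidence nor the original bullet
    if numbers_in(candidate) - numbers_in(evidence + " " + bullet):
        return bullet
    return candidate


# Stub rewriters standing in for rewrite_with_evidence (illustrative only)
bullet = "Automated deployments"
evidence = "Reduced deploy time by 40% using CI/CD"
grounded = safe_rewrite(bullet, evidence, "CI/CD",
                        lambda b, e, r: "Cut deploy time 40% with CI/CD automation")
invented = safe_rewrite(bullet, evidence, "CI/CD",
                        lambda b, e, r: "Cut deploy time 90% with CI/CD automation")
```

Here `grounded` keeps the rewritten text, while `invented` falls back to the original bullet because "90" appears nowhere in the evidence. A digit check only catches invented numbers; invented skills would need a separate check.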

---

### D) Reranking (Optional but Recommended)

**Current**: Top-K by similarity score only
```python
# Just return top-K by cosine similarity
top_results = sorted_docs[:top_k]
```

**Target**: Rerank with cross-encoder for better quality
```python
from sentence_transformers import CrossEncoder, SentenceTransformer

class Retriever:
    def __init__(self, vector_store_path):
        self.embedder = SentenceTransformer("all-MiniLM-L6-v2")
        self.reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

    def retrieve(self, query: str, top_k: int = 10):
        # Step 1: Over-fetch 2x top_k candidates with FAISS
        candidates = self._faiss_search(query, top_k=top_k * 2)

        # Step 2: Rerank with cross-encoder
        pairs = [[query, doc.content] for doc, _ in candidates]
        scores = self.reranker.predict(pairs)

        # Step 3: Return top-K reranked results
        reranked = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )[:top_k]

        return [(doc, float(score)) for (doc, _), score in reranked]
```

**Changes**:
- [ ] Add `sentence-transformers` cross-encoder model
- [ ] Update `Retriever.retrieve()` to use reranking
- [ ] Add reranking configuration (top_k_candidates)
- [ ] Update tests for reranking

**Impact**: Better quality top-K results, more relevant to query
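
The retrieve-then-rerank flow can be exercised without downloading a model by injecting the scoring function. In this sketch, `rerank` and `overlap_score` are hypothetical names; `overlap_score` is a toy word-overlap stand-in for the cross-encoder's `predict`, used only so the example runs offline.

```python
from typing import Callable, List, Sequence, Tuple


def rerank(query: str, candidates: Sequence[Tuple[str, float]], top_k: int,
           score_fn: Callable[[List[Tuple[str, str]]], Sequence[float]]
           ) -> List[Tuple[str, float]]:
    """Re-score (query, text) pairs and keep the top_k by the new score."""
    if not candidates:
        return []
    pairs = [(query, text) for text, _ in candidates]
    scores = score_fn(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [(text, float(score)) for (text, _), score in ranked[:top_k]]


def overlap_score(pairs: List[Tuple[str, str]]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query words found in the text."""
    out = []
    for query, text in pairs:
        q_words = set(query.lower().split())
        out.append(len(q_words & set(text.lower().split())) / max(len(q_words), 1))
    return out


# First-stage results as (text, similarity) pairs; illustrative data only
candidates = [("Managed a retail team", 0.62), ("Built python data pipelines", 0.58)]
top = rerank("python pipelines", candidates, top_k=1, score_fn=overlap_score)
print(top[0][0])  # prints Built python data pipelines
```

The first-stage similarity ranked the retail bullet higher, but the second-stage score promotes the bullet that actually matches the query. The same injection point makes it easy to mock the reranker in unit tests.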

---

## Implementation Order

### Phase 1A: Real Embeddings (2-3 hours)
1. Add `sentence-transformers` dependency
2. Update `RAGIndexer` to use SentenceTransformer
3. Update `Retriever` to use SentenceTransformer
4. Re-index documents
5. Update tests

### Phase 1B: FAISS Integration (2-3 hours)
1. Add `faiss-cpu` dependency
2. Update `RAGIndexer` to build FAISS index
3. Update `Retriever` to use FAISS search
4. Update tests

### Phase 1C: LLM Rewriting (3-4 hours)
1. Create `src/rag/llm_rewriter.py`
2. Update `tailor.py` to use LLM rewriter
3. Add OpenAI configuration
4. Update tests with mock responses

### Phase 1D: Reranking (1-2 hours)
1. Add cross-encoder model
2. Update `Retriever.retrieve()` for reranking
3. Update tests

### Phase 1E: Testing & Documentation (2-3 hours)
1. Update all tests
2. Update documentation
3. Run full test suite
4. Create upgrade guide

---

## Dependencies to Add

```bash
pip install sentence-transformers faiss-cpu openai
```

Or in `requirements.txt`:
```
sentence-transformers>=2.2.0
faiss-cpu>=1.7.4  # or faiss-gpu for production
openai>=1.0.0  # OpenAI client used by the LLM rewriter (Phase 1C)
```

---

## Configuration Changes

### RAG Config (src/rag/config.py)
```python
RAG_CONFIG = {
    # Embedding
    'embedding_model': 'all-MiniLM-L6-v2',
    'embedding_dim': 384,

    # Vector Store
    'vector_store_type': 'faiss',  # Changed from 'local'
    'vector_store_path': 'data/rag/vector_store.faiss',
    'metadata_path': 'data/rag/metadata.json',

    # Retrieval
    'retrieval_top_k': 10,
    'retrieval_top_k_candidates': 20,  # For reranking
    'similarity_threshold': 0.35,

    # Reranking
    'use_reranking': True,
    'reranker_model': 'cross-encoder/ms-marco-MiniLM-L-6-v2',

    # LLM Rewriting
    'use_llm_rewriting': True,
    'llm_model': 'gpt-4o-mini',
    'llm_temperature': 0.2,
}
```
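
Several of these settings constrain one another (for example, the reranker needs at least as many candidates as the final top-K). A small consistency check at startup could catch mistakes early. This is a sketch only: `validate_rag_config` is a hypothetical helper, and the chosen rules are assumptions about what the plan intends.

```python
from typing import Dict, List


def validate_rag_config(cfg: Dict) -> List[str]:
    """Return human-readable problems; an empty list means the config is consistent."""
    problems = []
    if cfg["retrieval_top_k"] < 1:
        problems.append("retrieval_top_k must be at least 1")
    if cfg.get("use_reranking") and cfg["retrieval_top_k_candidates"] < cfg["retrieval_top_k"]:
        problems.append("retrieval_top_k_candidates must be >= retrieval_top_k")
    if not -1.0 <= cfg["similarity_threshold"] <= 1.0:
        problems.append("similarity_threshold must lie in [-1, 1] for cosine scores")
    if not 0.0 <= cfg.get("llm_temperature", 0.0) <= 2.0:
        problems.append("llm_temperature must lie in [0, 2]")
    return problems


# Subset of the plan's values, plus one deliberately broken variant
config = {
    "retrieval_top_k": 10,
    "retrieval_top_k_candidates": 20,
    "similarity_threshold": 0.35,
    "use_reranking": True,
    "llm_temperature": 0.2,
}
broken = dict(config, retrieval_top_k_candidates=5)
print(validate_rag_config(config))  # prints []
print(validate_rag_config(broken))
```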

---

## Testing Strategy

### Unit Tests
- [ ] Test real embeddings produce semantic similarity
- [ ] Test FAISS index creation and search
- [ ] Test LLM rewriter with mock responses
- [ ] Test reranking improves quality

### Integration Tests
- [ ] End-to-end: Parse JD → Retrieve → Rerank → Rewrite
- [ ] Compare skeleton vs production RAG quality
- [ ] Benchmark performance (speed, accuracy)

### Validation
- [ ] Cosine similarity > 0.7 between queries and their relevant documents
- [ ] Reranking improves top-1 accuracy by > 10%
- [ ] LLM rewriting produces valid bullets
- [ ] All 421 existing tests still pass
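
The top-1 accuracy target needs a labeled set of queries and a way to score a ranking. The plan does not say whether "> 10%" is an absolute or a relative gain, so this sketch computes both. The function name, query IDs, and rankings are hypothetical illustrations, not measured results.

```python
from typing import Dict, List


def top1_accuracy(rankings: Dict[str, List[str]], relevant: Dict[str, str]) -> float:
    """Fraction of queries whose first-ranked document ID is the labeled relevant one."""
    if not relevant:
        return 0.0
    hits = sum(1 for q, doc_id in relevant.items()
               if rankings.get(q) and rankings[q][0] == doc_id)
    return hits / len(relevant)


# Hypothetical labels and rankings, for illustration only
relevant = {"q1": "d1", "q2": "d5", "q3": "d9", "q4": "d2"}
before = {"q1": ["d1", "d3"], "q2": ["d4", "d5"], "q3": ["d7", "d9"], "q4": ["d2", "d8"]}
after = {"q1": ["d1", "d3"], "q2": ["d5", "d4"], "q3": ["d7", "d9"], "q4": ["d2", "d8"]}

base = top1_accuracy(before, relevant)
reranked = top1_accuracy(after, relevant)
absolute_gain = reranked - base       # difference in accuracy
relative_gain = absolute_gain / base  # gain relative to the baseline (base must be > 0)
print(base, reranked, absolute_gain, relative_gain)  # prints 0.5 0.75 0.25 0.5
```

Fixing one interpretation in the test suite keeps the success criterion unambiguous.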

---

## Success Criteria

- [ ] Real embeddings: Semantic similarity works correctly
- [ ] FAISS: Search returns exact top-K results (`IndexFlatIP` is O(n) per query; sub-linear search requires an approximate index)
- [ ] LLM Rewriting: Produces evidence-constrained bullets
- [ ] Reranking: Improves top-K quality by > 10%
- [ ] All tests pass (421 + new tests)
- [ ] Performance: Retrieval < 100ms for 1000 documents
- [ ] Documentation: Updated with production RAG details

---

## Rollback Plan

If any component fails:
1. Keep JSON vector store as fallback
2. Keep hash-based embeddings as fallback
3. Keep regex rewriter as fallback
4. Use feature flags to enable/disable each component
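
One way to wire the flags to the fallbacks is a small selector that returns the upgraded component only when its flag is on and it constructs without error. This is a sketch under assumptions: `select_component` and the `use_faiss` flag are hypothetical (`use_llm_rewriting` matches the config above), and strings stand in for the real components.

```python
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def select_component(flags: Dict[str, bool], flag: str,
                     primary_factory: Callable[[], T], fallback: T) -> T:
    """Return the upgraded component if its flag is on and it builds cleanly,
    otherwise the skeleton fallback."""
    if not flags.get(flag, False):
        return fallback
    try:
        return primary_factory()
    except Exception:
        return fallback  # e.g. missing dependency or failed model download


def load_faiss_store() -> str:
    """Simulates a failed upgrade path for the example."""
    raise ImportError("faiss is not installed")


flags = {"use_faiss": True, "use_llm_rewriting": False}
store = select_component(flags, "use_faiss", load_faiss_store, "json-store")
rewriter = select_component(flags, "use_llm_rewriting", lambda: "llm-rewriter", "regex-rewriter")
print(store, rewriter)  # prints json-store regex-rewriter
```

Falling back from real to hash-based embeddings also requires re-indexing, because vectors produced by the two embedders are not comparable.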

---

## Next Steps

1. **Decide**: Do you want to upgrade Phase 1 to production RAG?
2. **Prioritize**: Which components are most important?
   - A) Real embeddings (critical for semantic search)
   - B) FAISS (critical for scalability)
   - C) LLM rewriting (critical for quality)
   - D) Reranking (nice-to-have for quality)
3. **Timeline**: How much time do you want to spend?
4. **Resources**: Do you have OpenAI API access for LLM rewriting?

---

## Recommendation

**Implement in this order**:
1. **A + B** (Real embeddings + FAISS) - 4-6 hours
   - Enables semantic search and scalability
   - Foundation for everything else
2. **C** (LLM rewriting) - 3-4 hours
   - Enables AI-powered bullet improvement
   - Requires OpenAI API
3. **D** (Reranking) - 1-2 hours
   - Optional but recommended for quality
   - Low effort, high impact

**Total**: 8-12 hours to production-ready RAG (10-15 hours including Phase 1E testing and documentation)

This would make Phase 1 a **complete, production-ready RAG system** before moving to Phase 2 (LoRA fine-tuning).
