Check -->|Deleted| Delete[Remove old chunks<br/>and metadata]
+Check -->|New/Modified| Process[Process PDF]
+Check -->|No Changes| Done

-DI->>PDF: GetDeletedDocumentsAsync()
-PDF-->>DI: deleted documents list
+Delete --> Check

-loop For each deleted document
-DI->>ChunkDB: GetAsync(documentId filter)
-ChunkDB-->>DI: chunks to delete
-DI->>ChunkDB: DeleteAsync(chunk keys)
-DI->>DocDB: DeleteAsync(document key)
-end
+Process --> Extract[Extract & Chunk Text<br/>200 char chunks]
+Extract --> Store[Store in Qdrant]
+Store --> Embed[Auto-generate Embeddings<br/>via Azure OpenAI]
+Embed --> Check

-DI->>PDF: GetNewOrModifiedDocumentsAsync()
-PDF-->>DI: new/modified documents list
+Done([Ingestion Complete])

-loop For each new/modified document
-DI->>ChunkDB: GetAsync(documentId filter)
-ChunkDB-->>DI: old chunks
-DI->>ChunkDB: DeleteAsync(old chunk keys)
-
-DI->>DocDB: UpsertAsync(document metadata)
-
-DI->>PDF: CreateChunksForDocumentAsync()
-Note over PDF: 1. Open PDF file<br/>2. Extract text from pages<br/>3. Split into paragraphs<br/>4. Chunk text (200 chars)
-PDF-->>DI: IngestedChunk objects
-
-DI->>ChunkDB: UpsertAsync(chunks)
-Note over ChunkDB,AI: Automatic embedding generation
-ChunkDB->>AI: Generate embeddings for chunk.Text
-AI-->>ChunkDB: Vector embeddings (1536 dims)
-ChunkDB-->>DI: Chunks stored with vectors
-end
-
-DI-->>App: Ingestion complete
+style Start fill:#e8f5e8
+style Init fill:#e1f5fe
+style Check fill:#fff4e6
+style Process fill:#f9d5e5
+style Extract fill:#f9d5e5
+style Store fill:#e1f5fe
+style Embed fill:#d5e8d4
+style Done fill:#e8f5e8
 ```

-This diagram shows the complete ingestion pipeline from PDF files to searchable vector embeddings, including change detection, cleanup, and automatic embedding generation.
+This flowchart shows the main ingestion process: checking for document changes, processing new/modified PDFs by extracting and chunking text, storing in Qdrant, and automatically generating embeddings via Azure OpenAI.
+
+**Key steps:**
+
+1. **Check for Changes**: Compare current PDFs with previously ingested documents
+2. **Process PDF**: For new/modified files, extract text and split into 200-character chunks
+3. **Store in Qdrant**: Save chunks in the vector database
+4. **Auto-generate Embeddings**: Azure OpenAI converts text to 1536-dimensional vectors
+5. **Loop**: Process continues until all changes are handled
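
For reviewers who want to see the control flow of the key steps in code form, here is a minimal, language-agnostic sketch in Python. It is not the project's implementation: `detect_changes`, `chunk_text`, `ingest`, and the `Chunk` type are hypothetical names, the in-memory `store` dict stands in for Qdrant, and the version markers (for example a file hash) are an assumption about how changes might be detected. Fixed-width slicing is a simplification of the 200-character chunking, and embedding generation is left out entirely because the diagram describes it as automatic on store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """Hypothetical stand-in for a stored chunk record."""
    document_id: str
    index: int
    text: str


def detect_changes(current: dict[str, str], ingested: dict[str, str]):
    """Classify documents by comparing version markers (e.g. a file hash).

    Both arguments map a document id to a version marker. Returns two sorted
    lists: ids that were deleted, and ids that are new or modified.
    """
    deleted = sorted(doc_id for doc_id in ingested if doc_id not in current)
    changed = sorted(
        doc_id for doc_id, version in current.items()
        if ingested.get(doc_id) != version
    )
    return deleted, changed


def chunk_text(document_id: str, text: str, size: int = 200) -> list[Chunk]:
    """Split text into consecutive chunks of at most `size` characters."""
    if size <= 0:
        raise ValueError("size must be positive")
    return [
        Chunk(document_id, start // size, text[start:start + size])
        for start in range(0, len(text), size)
    ]


def ingest(current, texts, ingested, store):
    """Run one ingestion pass, mutating `ingested` and `store` in place.

    `texts` maps a document id to its already-extracted text; PDF text
    extraction itself is outside the scope of this sketch.
    """
    deleted, changed = detect_changes(current, ingested)
    for doc_id in deleted:
        store.pop(doc_id, None)   # remove old chunks
        del ingested[doc_id]      # remove document metadata
    for doc_id in changed:
        # Assigning the new list replaces any chunks from an older version.
        store[doc_id] = chunk_text(doc_id, texts[doc_id])
        ingested[doc_id] = current[doc_id]
    return deleted, changed
```

Calling `ingest` a second time with the same inputs returns two empty lists, which corresponds to the "No Changes" branch of the flowchart.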