Commit 2a559e8

Copilot and jongalloway committed
Simplify Data Ingestion Flow diagram - replace complex sequence with flowchart
Co-authored-by: jongalloway <68539+jongalloway@users.noreply.github.com>
1 parent 32422ea commit 2a559e8

1 file changed

Lines changed: 30 additions & 44 deletions

Part 3 - Template Exploration/README.md

@@ -366,60 +366,46 @@ Document ingestion is handled by the `DataIngestor` service working with `IInges

 ### Data Ingestion Flow Diagram

-Here's a detailed view of how PDF documents are processed and stored:
+Here's a simplified view of how PDF documents are processed and stored:

 ```mermaid
 %%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f4f4f4', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#e1f5fe', 'tertiaryColor': '#f3e5f5' }}}%%
-sequenceDiagram
-participant App as Web Application
-participant DI as DataIngestor
-participant PDF as PDFDirectorySource
-participant DocDB as Documents Collection<br/>(Qdrant)
-participant ChunkDB as Chunks Collection<br/>(Qdrant)
-participant AI as Embedding Generator<br/>(Azure OpenAI)
-
-App->>DI: IngestDataAsync(PDFDirectorySource)
-DI->>DocDB: EnsureCollectionExistsAsync()
-DI->>ChunkDB: EnsureCollectionExistsAsync()
+flowchart TD
+Start([Application Starts]) --> Init[DataIngestor.IngestDataAsync]
+Init --> Check{Check for<br/>Changes}

-DI->>DocDB: GetAsync(sourceId filter)
-DocDB-->>DI: existing documents
+Check -->|Deleted| Delete[Remove old chunks<br/>and metadata]
+Check -->|New/Modified| Process[Process PDF]
+Check -->|No Changes| Done

-DI->>PDF: GetDeletedDocumentsAsync()
-PDF-->>DI: deleted documents list
+Delete --> Check

-loop For each deleted document
-DI->>ChunkDB: GetAsync(documentId filter)
-ChunkDB-->>DI: chunks to delete
-DI->>ChunkDB: DeleteAsync(chunk keys)
-DI->>DocDB: DeleteAsync(document key)
-end
+Process --> Extract[Extract & Chunk Text<br/>200 char chunks]
+Extract --> Store[Store in Qdrant]
+Store --> Embed[Auto-generate Embeddings<br/>via Azure OpenAI]
+Embed --> Check

-DI->>PDF: GetNewOrModifiedDocumentsAsync()
-PDF-->>DI: new/modified documents list
+Done([Ingestion Complete])

-loop For each new/modified document
-DI->>ChunkDB: GetAsync(documentId filter)
-ChunkDB-->>DI: old chunks
-DI->>ChunkDB: DeleteAsync(old chunk keys)
-
-DI->>DocDB: UpsertAsync(document metadata)
-
-DI->>PDF: CreateChunksForDocumentAsync()
-Note over PDF: 1. Open PDF file<br/>2. Extract text from pages<br/>3. Split into paragraphs<br/>4. Chunk text (200 chars)
-PDF-->>DI: IngestedChunk objects
-
-DI->>ChunkDB: UpsertAsync(chunks)
-Note over ChunkDB,AI: Automatic embedding generation
-ChunkDB->>AI: Generate embeddings for chunk.Text
-AI-->>ChunkDB: Vector embeddings (1536 dims)
-ChunkDB-->>DI: Chunks stored with vectors
-end
-
-DI-->>App: Ingestion complete
+style Start fill:#e8f5e8
+style Init fill:#e1f5fe
+style Check fill:#fff4e6
+style Process fill:#f9d5e5
+style Extract fill:#f9d5e5
+style Store fill:#e1f5fe
+style Embed fill:#d5e8d4
+style Done fill:#e8f5e8
 ```

-This diagram shows the complete ingestion pipeline from PDF files to searchable vector embeddings, including change detection, cleanup, and automatic embedding generation.
+This flowchart shows the main ingestion process: checking for document changes, processing new/modified PDFs by extracting and chunking text, storing in Qdrant, and automatically generating embeddings via Azure OpenAI.
+
+**Key steps:**
+
+1. **Check for Changes**: Compare current PDFs with previously ingested documents
+2. **Process PDF**: For new/modified files, extract text and split into 200-character chunks
+3. **Store in Qdrant**: Save chunks in the vector database
+4. **Auto-generate Embeddings**: Azure OpenAI converts text to 1536-dimensional vectors
+5. **Loop**: Process continues until all changes are handled

 ### How Ingestion Works

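For readers checking the new flowchart against the behavior it summarizes, the change-detection loop can be sketched roughly as below. This is an illustrative Python sketch, not the template's C# `DataIngestor`. The names `chunk_text`, `plan_ingestion`, and `ingest` are hypothetical, a plain dict stands in for the Qdrant collections, version markers (for example a content hash or timestamp) are an assumption, fixed-width slicing stands in for the paragraph-aware chunking mentioned in the removed sequence diagram, and embedding generation is omitted.

```python
from dataclasses import dataclass, field

CHUNK_SIZE = 200  # characters per chunk, the figure quoted in the flowchart


def chunk_text(text: str, size: int = CHUNK_SIZE) -> list[str]:
    """Split text into consecutive pieces of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


@dataclass
class IngestionPlan:
    deleted: list[str] = field(default_factory=list)
    new_or_modified: list[str] = field(default_factory=list)


def plan_ingestion(current: dict[str, str], ingested: dict[str, str]) -> IngestionPlan:
    """The "Check for Changes" decision.

    Both arguments map a document id to a version marker (an assumption
    made for this sketch; the real source may track changes differently).
    """
    return IngestionPlan(
        # Previously ingested, but no longer present in the source.
        deleted=sorted(d for d in ingested if d not in current),
        # Never ingested, or ingested with a different version marker.
        new_or_modified=sorted(
            d for d, version in current.items() if ingested.get(d) != version
        ),
    )


def ingest(current: dict[str, str], texts: dict[str, str],
           ingested: dict[str, str], store: dict[str, list[str]]) -> IngestionPlan:
    """Apply one ingestion pass; `store` is a dict standing in for Qdrant."""
    plan = plan_ingestion(current, ingested)
    for doc_id in plan.deleted:
        # "Remove old chunks and metadata"
        store.pop(doc_id, None)
        ingested.pop(doc_id, None)
    for doc_id in plan.new_or_modified:
        # Assigning the key replaces any chunks from an older version.
        store[doc_id] = chunk_text(texts[doc_id])
        ingested[doc_id] = current[doc_id]
    return plan
```

When nothing has changed, both lists in the plan are empty and the pass is a no-op, which corresponds to the flowchart's "No Changes" edge.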