Commit 6729773

Merge pull request #168 from dotnet-presentations/copilot/fix-97d90932-35f7-4d7e-bfb2-423213ddb90b
Add comprehensive diagrams to Part 3 explaining data ingestion flow and architecture
2 parents c127bad + 09ef405 commit 6729773

1 file changed

Lines changed: 173 additions & 0 deletions

Part 3 - Template Exploration/README.md

@@ -6,6 +6,57 @@

In this workshop, you'll explore the code structure of the AI Web Chat template. You'll learn about the different services configured in the .NET Aspire AppHost, understand the application configuration in the Web project, explore how `IChatClient` is configured and used, and dive into Microsoft Extensions for Vector Data.

## Architecture Overview

Before diving into the code, let's visualize how the different components work together:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f4f4f4', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#e1f5fe', 'tertiaryColor': '#f3e5f5' }}}%%
graph TB
    subgraph AppHost[".NET Aspire AppHost"]
        AH[AppHost Program.cs]
    end

    subgraph Web["GenAiLab.Web Application"]
        WP[Web Program.cs]
        DI[DataIngestor]
        SS[SemanticSearch]
        CHAT[Chat.razor]
    end

    subgraph External["External Services"]
        OAI[Azure OpenAI<br/>Chat + Embeddings]
        QD[(Qdrant Vector DB<br/>Chunks & Documents)]
    end

    subgraph Data["Data Sources"]
        PDF[PDF Files<br/>wwwroot/Data]
    end

    AH -->|orchestrates| Web
    AH -->|configures| OAI
    AH -->|provisions| QD

    WP -->|registers services| DI
    WP -->|registers services| SS
    WP -->|ingests at startup| PDF

    DI -->|processes PDFs| PDF
    DI -->|stores chunks| QD
    DI -->|generates embeddings via| OAI

    CHAT -->|queries| SS
    CHAT -->|sends messages to| OAI
    SS -->|searches| QD

    style AppHost fill:#e8f5e8
    style Web fill:#e1f5fe
    style External fill:#fff4e6
    style Data fill:#f9d5e5
```

This diagram shows how .NET Aspire orchestrates the web application and its dependencies, with the web app coordinating data ingestion and semantic search using Azure OpenAI and Qdrant.

## Services in .NET Aspire AppHost Program.cs

Let's start by examining the [`AppHost.cs`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.AppHost/AppHost.cs) file in the `GenAiLab.AppHost` project:
@@ -313,6 +364,49 @@ Key features of semantic search:

Document ingestion is handled by the `DataIngestor` service working with `IIngestionSource` implementations. The `PDFDirectorySource` processes PDF files and creates chunks that are stored directly in vector collections.

### Data Ingestion Flow Diagram

Here's a simplified view of how PDF documents are processed and stored:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f4f4f4', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#e1f5fe', 'tertiaryColor': '#f3e5f5' }}}%%
flowchart TD
    Start([Application Starts]) --> Init[DataIngestor.IngestDataAsync]
    Init --> Check{Check for<br/>Changes}

    Check -->|Deleted| Delete[Remove old chunks<br/>and metadata]
    Check -->|New/Modified| Process[Process PDF]
    Check -->|No Changes| Done

    Delete --> Check

    Process --> Extract[Extract & Chunk Text<br/>200 char chunks]
    Extract --> Store[Store in Qdrant]
    Store --> Embed[Auto-generate Embeddings<br/>via Azure OpenAI]
    Embed --> Check

    Done([Ingestion Complete])

    style Start fill:#e8f5e8
    style Init fill:#e1f5fe
    style Check fill:#fff4e6
    style Process fill:#f9d5e5
    style Extract fill:#f9d5e5
    style Store fill:#e1f5fe
    style Embed fill:#d5e8d4
    style Done fill:#e8f5e8
```

This flowchart shows the main ingestion process: checking for document changes, processing new or modified PDFs by extracting and chunking text, storing the chunks in Qdrant, and automatically generating embeddings via Azure OpenAI.

**Key steps:**

1. **Check for Changes**: Compare the current PDFs with previously ingested documents; chunks and metadata for deleted PDFs are removed
2. **Process PDF**: For new or modified files, extract text and split it into 200-character chunks
3. **Store in Qdrant**: Upsert the chunks into the vector collection
4. **Auto-generate Embeddings**: As part of the upsert, Azure OpenAI converts each chunk's text into a 1536-dimensional vector
5. **Loop**: The process repeats until all changes are handled
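
The change check and the chunking step can be sketched in plain C#. This is a minimal illustration, not the template's actual code: `PlanIngestion` and `ChunkText` are hypothetical helpers, the version stamps stand in for whatever the ingestor really compares (for example, a last-modified time), and the fixed-width split is a simplification of the real chunker.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: decide which documents to delete and which to (re)process.
// Keys are document IDs; values are version stamps (assumed, e.g. a last-modified time).
static (List<string> ToDelete, List<string> ToProcess) PlanIngestion(
    IReadOnlyDictionary<string, string> previouslyIngested,
    IReadOnlyDictionary<string, string> currentSource)
{
    // Documents that were ingested before but no longer exist in the source.
    var toDelete = previouslyIngested.Keys
        .Where(id => !currentSource.ContainsKey(id))
        .OrderBy(id => id, StringComparer.Ordinal)
        .ToList();

    // Documents that are new, or whose version stamp has changed.
    var toProcess = currentSource
        .Where(doc => !previouslyIngested.TryGetValue(doc.Key, out var oldVersion)
                      || oldVersion != doc.Value)
        .Select(doc => doc.Key)
        .OrderBy(id => id, StringComparer.Ordinal)
        .ToList();

    return (toDelete, toProcess);
}

// Hypothetical helper: split text into pieces of at most maxChars characters.
static IEnumerable<string> ChunkText(string text, int maxChars = 200)
{
    for (var start = 0; start < text.Length; start += maxChars)
    {
        yield return text.Substring(start, Math.Min(maxChars, text.Length - start));
    }
}

// Usage with made-up file names and version stamps.
var previous = new Dictionary<string, string> { ["a.pdf"] = "v1", ["b.pdf"] = "v1" };
var current = new Dictionary<string, string> { ["b.pdf"] = "v2", ["c.pdf"] = "v1" };

var (toDelete, toProcess) = PlanIngestion(previous, current);
Console.WriteLine($"Delete: {string.Join(", ", toDelete)}");
Console.WriteLine($"Process: {string.Join(", ", toProcess)}");

var chunks = ChunkText(new string('x', 450)).ToList();
Console.WriteLine($"Chunks: {chunks.Count}");
```

Here `a.pdf` lands in the delete list because it is gone from the source, while `b.pdf` (changed) and `c.pdf` (new) are queued for processing. A 450-character string yields three chunks of 200, 200, and 50 characters.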

### How Ingestion Works

When the application starts, it processes documents from the specified source:
@@ -349,12 +443,91 @@ public class IngestedChunk
```

When an `IngestedChunk` is stored via `chunksCollection.UpsertAsync()`, the vector collection automatically:

1. Takes the `Text` property value (returned by the `Vector` property)
2. Generates an embedding using the configured embedding generator
3. Stores both the text and its embedding vector

This approach eliminates the need for manual embedding generation and ensures consistency across all document chunks.

#### Vector Storage Architecture

Here's how the automatic vector generation works when storing chunks:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f4f4f4', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#e1f5fe', 'tertiaryColor': '#f3e5f5' }}}%%
flowchart TB
    Chunk["IngestedChunk Object<br/>---<br/>Key: Guid<br/>DocumentId: string<br/>PageNumber: int<br/>Text: 'Product features...'<br/><br/>VectorStoreVector attribute<br/>Vector property → returns Text"]

    Chunk -->|Call UpsertAsync| VectorCollection[Vector Collection Framework]

    VectorCollection -->|1. Detect VectorStoreVector attribute| Detect{Attribute<br/>Found?}

    Detect -->|Yes| Extract[2. Get value from Vector property<br/>Result: 'Product features...']

    Extract --> Generate[3. Call Azure OpenAI<br/>Embedding Generator<br/>text-embedding-3-small model]

    Generate --> Embed[4. Generate 1536-dimensional<br/>vector embedding]

    Embed --> Store[5. Store in Qdrant]

    Store --> Result["Stored Record<br/>---<br/>Metadata: Key, DocumentId, Text<br/>Vector: float array 1536 dims<br/>Distance: Cosine Similarity"]

    style Chunk fill:#e1f5fe
    style VectorCollection fill:#fff4e6
    style Detect fill:#fff4e6
    style Extract fill:#f9d5e5
    style Generate fill:#d5e8d4
    style Embed fill:#d5e8d4
    style Store fill:#e1f5fe
    style Result fill:#e8f5e8
```

**Key Concept**: The `[VectorStoreVector]` attribute on the `Vector` property enables automatic embedding generation:

1. **Attribute Detection**: The framework detects properties marked with `[VectorStoreVector]`
2. **Text Extraction**: It reads the text value from the `Vector` property
3. **Embedding Generation**: It sends that text to Azure OpenAI's text-embedding-3-small model
4. **Vector Creation**: The model converts the text into a 1536-dimensional vector
5. **Storage**: The original text and metadata are stored alongside the generated vector; the collection uses cosine similarity for distance calculations

This automatic process eliminates manual embedding generation and ensures consistency.
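
To make the upsert flow concrete, here is a toy in-memory sketch. Everything in it is an assumption for illustration: `ToyEmbed` is a hypothetical stand-in for a real embedding model (it just counts letters), `Upsert` is a hypothetical function, not the `UpsertAsync` API, and a dictionary plays the role of Qdrant. The point it shows is that the caller supplies only text, and the vector is produced inside the store operation.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an embedding generator: a 26-slot letter-frequency vector.
// A real generator would call an embedding model instead.
static float[] ToyEmbed(string text)
{
    var vector = new float[26];
    foreach (var c in text.ToLowerInvariant())
    {
        if (c >= 'a' && c <= 'z') vector[c - 'a'] += 1f;
    }
    return vector;
}

// In-memory stand-in for a vector collection: key -> (text, vector).
var store = new Dictionary<Guid, (string Text, float[] Vector)>();

// Hypothetical upsert: the caller passes text only; the embedding is generated here,
// so every record is embedded by the same generator.
void Upsert(Guid key, string text, Func<string, float[]> embed)
{
    store[key] = (text, embed(text));
}

// Usage: store one chunk and inspect what was saved.
var key = Guid.NewGuid();
Upsert(key, "Product features", ToyEmbed);

var saved = store[key];
Console.WriteLine($"Text: {saved.Text}");
Console.WriteLine($"Vector dimensions: {saved.Vector.Length}");
```

Because embedding happens in one place, swapping the generator changes how every future record is embedded without touching the calling code.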

### Semantic Search Flow

Once documents are ingested, the `SemanticSearch` service enables finding relevant content:

```mermaid
%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#f4f4f4', 'primaryTextColor': '#000', 'primaryBorderColor': '#333', 'lineColor': '#333', 'secondaryColor': '#e1f5fe', 'tertiaryColor': '#f3e5f5' }}}%%
sequenceDiagram
    participant User as User/Chat
    participant SS as SemanticSearch
    participant VDB as Chunks Collection<br/>(Qdrant)
    participant AI as Embedding Generator<br/>(Azure OpenAI)

    User->>SS: SearchAsync("What is X?", filter, maxResults)

    SS->>AI: Generate embedding for query text
    Note over AI: Converts "What is X?"<br/>to 1536-dim vector
    AI-->>SS: Query vector

    SS->>VDB: SearchAsync(query vector, options)
    Note over VDB: 1. Compare query vector<br/>with stored vectors<br/>2. Calculate cosine similarity<br/>3. Rank by similarity<br/>4. Apply filters (if any)<br/>5. Return top N results

    VDB-->>SS: List of IngestedChunk records<br/>(most similar first)

    SS-->>User: Relevant text chunks with context
    Note over User: Chunks used for<br/>RAG (Retrieval Augmented<br/>Generation) in chat
```

The semantic search process:

1. **Query Embedding**: The user's search text is automatically converted to a vector
2. **Vector Similarity Search**: Qdrant compares the query vector with the stored chunk vectors using cosine similarity
3. **Ranking**: Results are ranked by similarity score (closest matches first)
4. **Filtering**: An optional `DocumentId` filter can restrict results to specific documents
5. **Results**: The most relevant text chunks are returned for use in RAG in the chat interface
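
The ranking logic in these steps can be sketched with a brute-force search over an in-memory list. This is an illustration under stated assumptions, not how Qdrant or the `SemanticSearch` service is implemented: `CosineSimilarity` and `Search` are hypothetical helpers, the 3-dimensional vectors are made up, and a real vector database would use an index rather than scanning every record.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.Linq;

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for a zero-length vector.
static double CosineSimilarity(float[] a, float[] b)
{
    if (a.Length != b.Length) throw new ArgumentException("Vectors must have the same length.");
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0 || normB == 0) return 0;
    return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical brute-force search: optional document filter, rank by similarity, keep top N.
// Filtering before ranking gives the same top results as filtering a full ranking.
static List<(string DocumentId, string Text, double Score)> Search(
    IEnumerable<(string DocumentId, string Text, float[] Vector)> chunks,
    float[] queryVector,
    string? documentIdFilter,
    int maxResults)
{
    return chunks
        .Where(c => documentIdFilter is null || c.DocumentId == documentIdFilter)
        .Select(c => (c.DocumentId, c.Text, Score: CosineSimilarity(c.Vector, queryVector)))
        .OrderByDescending(r => r.Score)
        .Take(maxResults)
        .ToList();
}

// Usage with made-up chunks and tiny vectors.
var chunks = new List<(string DocumentId, string Text, float[] Vector)>
{
    ("manual.pdf", "Battery care", new float[] { 1f, 0f, 0f }),
    ("manual.pdf", "Safety warnings", new float[] { 0f, 1f, 0f }),
    ("faq.pdf", "Battery replacement", new float[] { 0.9f, 0.1f, 0f }),
};
var query = new float[] { 1f, 0f, 0f };

var all = Search(chunks, query, documentIdFilter: null, maxResults: 2);
var manualOnly = Search(chunks, query, documentIdFilter: "manual.pdf", maxResults: 2);

foreach (var r in all) Console.WriteLine($"{r.Score:F3} {r.DocumentId}: {r.Text}");
```

Without a filter, the two battery-related chunks rank highest because their vectors point in nearly the same direction as the query. With the `manual.pdf` filter, the `faq.pdf` chunk is excluded before ranking.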
530+
358531
## What You've Learned
359532

360533
- How services are configured and orchestrated in .NET Aspire
