Part 3 - Template Exploration/README.md
Lines changed: 96 additions & 99 deletions
@@ -8,7 +8,7 @@ In this workshop, you'll explore the code structure of the AI Web Chat template.
 ## Services in .NET Aspire AppHost Program.cs

-Let's start by examining the `Program.cs` file in the `GenAiLab.AppHost` project:
+Let's start by examining the [`AppHost.cs`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.AppHost/AppHost.cs) file in the `GenAiLab.AppHost` project:
-The `IChatClient` is used in the `Chat.razor` component to handle user messages and generate AI responses:
+The `IChatClient` is used in the [`Chat.razor`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.Web/Components/Pages/Chat/Chat.razor#L58-L84) component to handle user messages and generate AI responses:
-Let's examine how the `DataIngestor.cs` uses vector collections directly:
+Let's examine how the [`DataIngestor.cs`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.Web/Services/Ingestion/DataIngestor.cs#L18-L57) uses vector collections directly:
@@ -233,130 +247,113 @@ The template uses several vector collection methods:
 - `DeleteAsync()`: Remove documents and their associated chunks
 - `EnsureCollectionExistsAsync()`: Create collections if they don't exist

-### SemanticSearchRecord for Vector Storage
+### IngestedChunk for Vector Storage

-The `SemanticSearchRecord.cs` file shows how data is structured for vector storage:
+The [`IngestedChunk.cs`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.Web/Services/IngestedChunk.cs) file shows how data is structured for vector storage:

 ```csharp
 namespace GenAiLab.Web.Services;

-public class SemanticSearchRecord
+public class IngestedChunk
 {
-    [VectorStoreRecordKey]
+    private const int VectorDimensions = 1536; // 1536 is the default vector size for the OpenAI text-embedding-3-small model
 ```
This class represents the data stored in the vector database with specific attributes for vector storage:

-- `Key`: The unique identifier for the record, marked with `[VectorStoreRecordKey]`
-- `FileName`: The source document's name, marked as filterable with `[VectorStoreRecordData(IsFilterable = true)]`
+- `Key`: The unique identifier for the record, marked with `[VectorStoreKey]`
+- `DocumentId`: The source document's identifier, marked as indexed with `[VectorStoreData(IsIndexed = true)]`
 - `PageNumber`: The page number in the source document
 - `Text`: A chunk of text from the document
-- `Vector`: The embedding vector configured for the OpenAI text-embedding-3-small model's 1536 dimensions using cosine similarity
+- `Vector`: The embedding vector configured for the OpenAI text-embedding-3-small model's 1536 dimensions using cosine similarity. The property returns the Text, which will be automatically embedded when stored.

-The `SemanticSearch.cs` file shows how these records are queried:
+The [`SemanticSearch.cs`](../Part%202%20-%20Project%20Creation/GenAiLab/GenAiLab.Web/Services/SemanticSearch.cs) file shows how these records are queried:
+1. **Automatic Embedding**: The text parameter is automatically converted to an embedding vector
+2. **Vector Similarity**: Finds the most similar chunks using the embedding vector
+3. **Optional Filtering**: Can filter results by document ID if specified
+4. **Direct Results**: Returns the actual `IngestedChunk` records with their text content
+
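The four search steps listed above can be sketched without any vector database. What follows is a minimal, self-contained illustration and not the template's `SemanticSearch.cs`: the fixed `vocabulary`, the word-count `ToyEmbed` function, the in-memory `chunks` list, and the `Add` and `Search` helpers are all hypothetical stand-ins for the real embedding model and vector collection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Assumption: a tiny fixed vocabulary replaces a real embedding model.
var vocabulary = new[] { "battery", "charging", "warranty", "replace" };

// Hypothetical toy embedding: one dimension per vocabulary word, holding its count.
float[] ToyEmbed(string text)
{
    var vector = new float[vocabulary.Length];
    foreach (var word in text.ToLowerInvariant().Split(' ', StringSplitOptions.RemoveEmptyEntries))
    {
        var index = Array.IndexOf(vocabulary, word);
        if (index >= 0) vector[index]++;
    }
    return vector;
}

// Cosine similarity of two equal-length vectors (0 when either is all zeros).
double Cosine(float[] a, float[] b)
{
    double dot = 0, normA = 0, normB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return normA == 0 || normB == 0 ? 0 : dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
}

// Hypothetical in-memory "collection": each chunk is stored next to its vector.
var chunks = new List<(string DocumentId, int PageNumber, string Text, float[] Vector)>();

void Add(string documentId, int pageNumber, string text) =>
    chunks.Add((documentId, pageNumber, text, ToyEmbed(text)));

List<(string DocumentId, int PageNumber, string Text)> Search(
    string query, int maxResults, string? documentIdFilter = null)
{
    var queryVector = ToyEmbed(query);                                              // 1. embed the query text
    return chunks
        .Where(c => documentIdFilter is null || c.DocumentId == documentIdFilter)   // 3. optional filter
        .OrderByDescending(c => Cosine(queryVector, c.Vector))                      // 2. rank by similarity
        .Take(maxResults)
        .Select(c => (c.DocumentId, c.PageNumber, c.Text))                          // 4. return the chunks
        .ToList();
}

Add("manual.pdf", 1, "replace the battery before first use");
Add("manual.pdf", 2, "warranty terms and conditions");
Add("guide.pdf", 1, "battery charging takes two hours");

var best = Search("battery charging", 1);
var bestInManual = Search("battery charging", 1, "manual.pdf");
Console.WriteLine(best[0].Text);          // battery charging takes two hours
Console.WriteLine(bestInManual[0].Text);  // replace the battery before first use
```

The filter runs before ranking, so restricting the search to one document changes which chunk wins rather than just trimming the result list.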
 ## Document Ingestion and Embeddings with Vector Collections

-Let's examine how embeddings are generated during document ingestion using the new vector collection approach. The `PDFDirectorySource` creates chunks and the `DataIngestor` processes them:
+Document ingestion is handled by the `DataIngestor` service working with `IIngestionSource` implementations. The `PDFDirectorySource` processes PDF files and creates chunks that are stored directly in vector collections.
+
+### How Ingestion Works
+
+When the application starts, it processes documents from the specified source:
1. Documents are retrieved from a source (like PDFs in the wwwroot/Data directory)
-1. Each document is split into smaller chunks for better search precision
-1. For each chunk, an `IngestedChunk` record is created with the text content
-1. The embedding vectors are generated automatically when the chunks are stored in the vector collection
-1. Both document metadata and chunks are stored directly in vector collections
-1. During search, query text is converted to an embedding, and vector similarity finds relevant chunks
+When an `IngestedChunk` is stored via `chunksCollection.UpsertAsync()`, the vector collection automatically:
+1. Takes the `Text` property value (returned by the `Vector` property)
+2. Generates an embedding using the configured embedding generator
+3. Stores both the text and its embedding vector

-This approach eliminates the need for a separate database to track ingestion state, as the vector collections handle both storage and retrieval of document chunks and their metadata.
+This approach eliminates the need for manual embedding generation and ensures consistency across all document chunks.
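The idea that the store, not the caller, produces the embedding can be shown with a small sketch. Everything here is an assumption made for illustration: the two-number `embeddingGenerator`, the dictionary-based `store`, and the synchronous `Upsert` helper are hypothetical and are not the API behind `chunksCollection.UpsertAsync()`.

```csharp
using System;
using System.Collections.Generic;

// Assumption: a trivial two-dimensional "embedding" (character count, word count)
// stands in for the configured embedding generator.
Func<string, float[]> embeddingGenerator =
    text => new float[] { text.Length, text.Split(' ').Length };

// Hypothetical in-memory store keyed by chunk key; not the real vector collection type.
var store = new Dictionary<string, (string Text, float[] Vector)>();

// Upsert receives only the text. The store derives the vector itself,
// so callers never compute or pass embeddings.
void Upsert(string key, string text) => store[key] = (text, embeddingGenerator(text));

Upsert("doc1_page1_chunk0", "hello vector world");
Upsert("doc1_page1_chunk0", "hello again");   // same key: the entry is replaced, not duplicated

var stored = store["doc1_page1_chunk0"];
Console.WriteLine($"{stored.Text}: [{string.Join(", ", stored.Vector)}]");   // hello again: [11, 2]
```

Because the vector is always derived from the stored text at write time, the two cannot drift apart, which is the consistency benefit the paragraph above describes.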