.NET developers need to efficiently process, chunk, and retrieve information from diverse document formats while preserving semantic meaning and structural context. The Microsoft.Extensions.DataIngestion libraries provide a unified approach for representing document ingestion components.
The Microsoft.Extensions.DataIngestion.Abstractions package provides the core exchange types, including IngestionDocument, IngestionChunker<T>, IngestionChunkProcessor<T>, and IngestionChunkWriter<T>. Any .NET library that provides document processing capabilities can implement these abstractions to enable seamless integration with consuming code.
The Microsoft.Extensions.DataIngestion package has an implicit dependency on the Microsoft.Extensions.DataIngestion.Abstractions package. This package enables you to easily integrate components such as enrichment processors, vector storage writers, and telemetry into your applications using familiar dependency injection and pipeline patterns. For example, it provides the SentimentEnricher, KeywordEnricher, and SummaryEnricher processors that can be chained together in ingestion pipelines.
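As a hedged sketch of how such processors might be chained (the enricher constructor arguments and the pipeline's processors parameter are assumptions not shown in this document; consult the package's API reference for the exact signatures):

```csharp
// Hypothetical sketch only: constructor signatures below are assumed.
IChatClient chatClient = GetChatClient(); // assumed helper returning an Microsoft.Extensions.AI chat client

// Enrichers are chunk processors, so they can run one after another.
IngestionChunkProcessor<string>[] processors =
[
    new KeywordEnricher(chatClient),
    new SentimentEnricher(chatClient),
    new SummaryEnricher(chatClient),
];

// Assumed overload that accepts processors between the chunker and the writer.
using IngestionPipeline<string> pipeline = new(CreateChunker(), processors, CreateWriter());
```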
Libraries that provide implementations of the abstractions typically reference only Microsoft.Extensions.DataIngestion.Abstractions.
To also have access to higher-level utilities for working with document ingestion components, reference the Microsoft.Extensions.DataIngestion package instead (which itself references Microsoft.Extensions.DataIngestion.Abstractions). Most consuming applications and services should reference the Microsoft.Extensions.DataIngestion package along with one or more libraries that provide concrete implementations of the abstractions, such as Microsoft.Extensions.DataIngestion.MarkItDown or Microsoft.Extensions.DataIngestion.Markdig.
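For instance, a library project that implements the abstractions might declare only this reference (the version placeholder is illustrative, matching the install snippet below):

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion.Abstractions" Version="[CURRENTVERSION]" />
</ItemGroup>
```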
From the command-line:

```dotnetcli
dotnet add package Microsoft.Extensions.DataIngestion --prerelease
```

Or directly in the C# project file:
```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
</ItemGroup>
```

The simplest way to store ingestion chunks in a vector store is to use the GetIngestionRecordCollection extension method to create a collection, and then pass it to a VectorStoreWriter:
```csharp
VectorStoreCollection<Guid, IngestionChunkVectorRecord<string>> collection =
    vectorStore.GetIngestionRecordCollection("chunks", dimensionCount: 1536);

using VectorStoreWriter<string, IngestionChunkVectorRecord<string>> writer = new(collection);
await writer.WriteAsync(chunks);
```

To store custom metadata alongside each chunk, create a type derived from IngestionChunkVectorRecord&lt;TChunk&gt; with additional properties, and a VectorStoreWriter subclass that overrides SetMetadata:
```csharp
public class ChunkWithMetadata : IngestionChunkVectorRecord<string>
{
    [VectorStoreVector(1536)]
    public override string? Embedding => Content;

    [VectorStoreData(StorageName = "classification")]
    public string? Classification { get; set; }
}

public class MetadataWriter : VectorStoreWriter<string, ChunkWithMetadata>
{
    public MetadataWriter(VectorStoreCollection<Guid, ChunkWithMetadata> collection)
        : base(collection) { }

    protected override void SetMetadata(ChunkWithMetadata record, string key, object? value)
    {
        switch (key)
        {
            case nameof(ChunkWithMetadata.Classification):
                record.Classification = value as string;
                break;
            default:
                throw new UnreachableException($"Unknown metadata key: {key}");
        }
    }
}
```

To map to a pre-existing collection that uses different storage names, create a VectorStoreCollectionDefinition manually:
```csharp
VectorStoreCollectionDefinition definition = new()
{
    Properties =
    {
        new VectorStoreKeyProperty(nameof(IngestionChunkVectorRecord<string>.Key), typeof(Guid))
            { StorageName = "my_key" },
        new VectorStoreVectorProperty(nameof(IngestionChunkVectorRecord<string>.Embedding), typeof(string), 1536)
            { StorageName = "my_embedding" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.Content), typeof(string))
            { StorageName = "my_content" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.Context), typeof(string))
            { StorageName = "my_context" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.DocumentId), typeof(string))
            { StorageName = "my_documentid", IsIndexed = true },
    },
};

VectorStoreCollection<Guid, IngestionChunkVectorRecord<string>> collection =
    vectorStore.GetCollection<Guid, IngestionChunkVectorRecord<string>>("chunks", definition);

using VectorStoreWriter<string, IngestionChunkVectorRecord<string>> writer = new(collection);
```

The IngestionPipeline&lt;T&gt; orchestrates document reading, chunking, optional processing, and writing. It can accept documents directly or read them from the file system using an IngestionDocumentReader.
Create a pipeline, then call ProcessAsync with an IngestionDocumentReader and a directory or list of files:
```csharp
IngestionDocumentReader reader = new MarkdownReader();
using IngestionPipeline<string> pipeline = new(CreateChunker(), CreateWriter());

await foreach (IngestionResult result in pipeline.ProcessAsync(reader, new DirectoryInfo("docs"), "*.md"))
{
    Console.WriteLine($"Processed '{result.DocumentId}'. Succeeded: {result.Succeeded}");
}
```

You can also supply IngestionDocument instances directly, without any file-system dependency:
```csharp
using IngestionPipeline<string> pipeline = new(CreateChunker(), CreateWriter());

IngestionDocument document = new("my-doc-id");
document.Sections.Add(new IngestionDocumentSection());
document.Sections[0].Elements.Add(new IngestionDocumentHeader("# Hello"));
document.Sections[0].Elements.Add(new IngestionDocumentParagraph("This content was created in memory."));

await foreach (IngestionResult result in pipeline.ProcessAsync(new[] { document }.ToAsyncEnumerable()))
{
    Console.WriteLine($"Processed '{result.DocumentId}'. Succeeded: {result.Succeeded}");
}
```

We welcome feedback and contributions in our GitHub repo.