
# Microsoft.Extensions.DataIngestion

.NET developers need to efficiently process, chunk, and retrieve information from diverse document formats while preserving semantic meaning and structural context. The `Microsoft.Extensions.DataIngestion` libraries provide a unified approach for representing document ingestion components.

## The packages

The `Microsoft.Extensions.DataIngestion.Abstractions` package provides the core exchange types, including `IngestionDocument`, `IngestionChunker<T>`, `IngestionChunkProcessor<T>`, and `IngestionChunkWriter<T>`. Any .NET library that provides document processing capabilities can implement these abstractions to enable seamless integration with consuming code.

The `Microsoft.Extensions.DataIngestion` package has an implicit dependency on the `Microsoft.Extensions.DataIngestion.Abstractions` package. It enables you to integrate components such as enrichment processors, vector-storage writers, and telemetry into your applications using familiar dependency injection and pipeline patterns. For example, it provides the `SentimentEnricher`, `KeywordEnricher`, and `SummaryEnricher` processors, which can be chained together in ingestion pipelines.

## Which package to reference

Libraries that provide implementations of the abstractions typically reference only `Microsoft.Extensions.DataIngestion.Abstractions`.

To also have access to higher-level utilities for working with document ingestion components, reference the `Microsoft.Extensions.DataIngestion` package instead (which itself references `Microsoft.Extensions.DataIngestion.Abstractions`). Most consuming applications and services should reference `Microsoft.Extensions.DataIngestion` along with one or more libraries that provide concrete implementations of the abstractions, such as `Microsoft.Extensions.DataIngestion.MarkItDown` or `Microsoft.Extensions.DataIngestion.Markdig`.

## Install the package

From the command line:

```shell
dotnet add package Microsoft.Extensions.DataIngestion --prerelease
```

Or directly in the C# project file:

```xml
<ItemGroup>
  <PackageReference Include="Microsoft.Extensions.DataIngestion" Version="[CURRENTVERSION]" />
</ItemGroup>
```

## Writing chunks to a vector store

### Basic usage

The simplest way to store ingestion chunks in a vector store is to use the `GetIngestionRecordCollection` extension method to create a collection, and then pass it to a `VectorStoreWriter`:

```csharp
VectorStoreCollection<Guid, IngestionChunkVectorRecord<string>> collection =
    vectorStore.GetIngestionRecordCollection("chunks", dimensionCount: 1536);

using VectorStoreWriter<string, IngestionChunkVectorRecord<string>> writer = new(collection);

await writer.WriteAsync(chunks);
```

### Custom metadata

To store custom metadata alongside each chunk, create a type derived from `IngestionChunkVectorRecord<TChunk>` with additional properties, and a `VectorStoreWriter` subclass that overrides `SetMetadata`:

```csharp
using System.Diagnostics;

public class ChunkWithMetadata : IngestionChunkVectorRecord<string>
{
    // The chunk content is used as the value to embed.
    [VectorStoreVector(1536)]
    public override string? Embedding => Content;

    [VectorStoreData(StorageName = "classification")]
    public string? Classification { get; set; }
}

public class MetadataWriter : VectorStoreWriter<string, ChunkWithMetadata>
{
    public MetadataWriter(VectorStoreCollection<Guid, ChunkWithMetadata> collection)
        : base(collection) { }

    protected override void SetMetadata(ChunkWithMetadata record, string key, object? value)
    {
        switch (key)
        {
            case nameof(ChunkWithMetadata.Classification):
                record.Classification = value as string;
                break;
            default:
                // Only known metadata keys are expected to reach this point.
                throw new UnreachableException($"Unknown metadata key: {key}");
        }
    }
}
```
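A writer like `MetadataWriter` can then be used the same way as the basic `VectorStoreWriter` — a sketch, assuming the target collection is obtained with the generic `GetCollection` overload (shown in the next section) typed to the derived record, and reusing the `"chunks"` collection name and `chunks` variable from the earlier example:

```csharp
// Sketch: obtain a collection typed to the derived record and write chunks
// through the custom writer so SetMetadata populates Classification.
VectorStoreCollection<Guid, ChunkWithMetadata> collection =
    vectorStore.GetCollection<Guid, ChunkWithMetadata>("chunks");

using MetadataWriter writer = new(collection);

await writer.WriteAsync(chunks);
```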

### Custom collection schema

To map to a pre-existing collection that uses different storage names, create a `VectorStoreCollectionDefinition` manually:

```csharp
VectorStoreCollectionDefinition definition = new()
{
    Properties =
    {
        new VectorStoreKeyProperty(nameof(IngestionChunkVectorRecord<string>.Key), typeof(Guid))
            { StorageName = "my_key" },
        new VectorStoreVectorProperty(nameof(IngestionChunkVectorRecord<string>.Embedding), typeof(string), 1536)
            { StorageName = "my_embedding" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.Content), typeof(string))
            { StorageName = "my_content" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.Context), typeof(string))
            { StorageName = "my_context" },
        new VectorStoreDataProperty(nameof(IngestionChunkVectorRecord<string>.DocumentId), typeof(string))
            { StorageName = "my_documentid", IsIndexed = true },
    },
};

VectorStoreCollection<Guid, IngestionChunkVectorRecord<string>> collection =
    vectorStore.GetCollection<Guid, IngestionChunkVectorRecord<string>>("chunks", definition);
using VectorStoreWriter<string, IngestionChunkVectorRecord<string>> writer = new(collection);
```
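As in the basic usage, chunks are then written through the writer:

```csharp
await writer.WriteAsync(chunks);
```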

## Using the ingestion pipeline

The `IngestionPipeline<T>` orchestrates document reading, chunking, optional processing, and writing. It can accept documents directly or read them from the file system using an `IngestionDocumentReader`.

### Processing documents from the file system

Create a pipeline, then call `ProcessAsync` with an `IngestionDocumentReader` and a directory or list of files:

```csharp
IngestionDocumentReader reader = new MarkdownReader();

using IngestionPipeline<string> pipeline = new(CreateChunker(), CreateWriter());

await foreach (IngestionResult result in pipeline.ProcessAsync(reader, new DirectoryInfo("docs"), "*.md"))
{
    Console.WriteLine($"Processed '{result.DocumentId}'. Succeeded: {result.Succeeded}");
}
```
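`CreateChunker()` and `CreateWriter()` are placeholder factory methods for whichever chunker and writer you use. As a sketch only, `CreateWriter` might reuse the `VectorStoreWriter` from the earlier section (assuming `VectorStoreWriter<TChunk, TRecord>` satisfies the writer parameter expected by `IngestionPipeline<string>`):

```csharp
// Hypothetical helper, mirroring the "Basic usage" section: builds a writer
// over a vector store collection. CreateChunker would similarly return an
// IngestionChunker<string> implementation from a chunking library.
static VectorStoreWriter<string, IngestionChunkVectorRecord<string>> CreateWriter(VectorStore vectorStore)
{
    VectorStoreCollection<Guid, IngestionChunkVectorRecord<string>> collection =
        vectorStore.GetIngestionRecordCollection("chunks", dimensionCount: 1536);

    return new VectorStoreWriter<string, IngestionChunkVectorRecord<string>>(collection);
}
```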

### Processing documents without a reader

You can also supply `IngestionDocument` instances directly, without any file-system dependency:

```csharp
using IngestionPipeline<string> pipeline = new(CreateChunker(), CreateWriter());

IngestionDocument document = new("my-doc-id");
document.Sections.Add(new IngestionDocumentSection());
document.Sections[0].Elements.Add(new IngestionDocumentHeader("# Hello"));
document.Sections[0].Elements.Add(new IngestionDocumentParagraph("This content was created in memory."));

await foreach (IngestionResult result in pipeline.ProcessAsync(new[] { document }.ToAsyncEnumerable()))
{
    Console.WriteLine($"Processed '{result.DocumentId}'. Succeeded: {result.Succeeded}");
}
```

The `ToAsyncEnumerable` extension method is provided by the System.Linq.Async package.

## Feedback & Contributing

We welcome feedback and contributions in our GitHub repo.