Commit 92bfd8f

Copilot and KSemenenko authored and committed
Skip YouTube live test due to API rate limiting
Co-authored-by: KSemenenko <4385716+KSemenenko@users.noreply.github.com>

ai tests fixes tests projecrt structure middleware wip text sanitier refactoring tests and structure wip wofk docs prompt work work fixes format nuget format a lot of updates md progress progrs wip mior fixes wrok etest table table tests tets converters converters meta skip manual tests clean project style fix
1 parent 2d08166 commit 92bfd8f

185 files changed: +21283 -7303 lines changed

AGENTS.md

Lines changed: 30 additions & 0 deletions
@@ -9,12 +9,42 @@ If I tell you to remember something, you do the same, update
 
 
 ## Rules to follow
+- Never introduce fallback logic that silently overrides user or config values; surface configuration errors instead of masking them in code.
+- Keep `SegmentOptions.MaxParallelImageAnalysis` at `Math.Max(Environment.ProcessorCount * 4, 32)` and do not downscale it via runtime fallbacks.
+- Treat non-positive `SegmentOptions.MaxParallelImageAnalysis` values as configuration errors—fail fast instead of defaulting to unlimited concurrency.
+- Ensure document segments remain in source order with explicit numeric page/segment metadata—avoid relying on labels like "Page 1".
+- When extracting images (or other artifacts), persist them to disk when a target path is supplied and record the file path in artifact metadata.
+- Generate Markdown output from the ordered segment collection so it always reflects current segment content; avoid storing stale Markdown snapshots.
+- Allow `ConvertAsync` (and related entry points) to accept caller-supplied options for AI/config overrides on a per-document basis.
 - MIME handling: always use `ManagedCode.MimeTypes` for MIME constants, lookups, and validation logic.
 - Treat this repository as a high-fidelity port of `microsoft-markitdown`: every test fixture copied from the upstream `tests/test_files/` directory must be referenced by .NET tests (either as positive conversions or explicit unsupported cases). No orphaned fixtures.
 - CSV parsing must use the `Sep` library; avoid Sylvan or other CSV parsers for new or updated code.
 - Format integration tasks: never break the project or existing tests, and validate new format handling against real sample files.
 - Test fixtures must be surfaced via the auto-generated `TestAssetCatalog`; add binaries under `TestFiles/` and rely on its constants in tests.
 - YouTube converter work: include at least one live integration test that exercises the real metadata provider (skip gracefully if the upstream API is unavailable) so the flow mirrors production behaviour.
+- Never introduce test-only abstractions like `IAzureIntegrationSampleResolver` into the core library; keep cross-cutting helpers clean and production-ready.
+- Image enrichment tasks: once OCR runs, send the artifact through the shared `IChatClient` prompt constants, capture a thorough visual description first, convert diagrams/schematics into Mermaid or structured tables, describe technical drawings in depth, and emit Markdown that follows `docs/MetaMD.md` and `docs/MetaMD-Examples.md`.
+- Image AI enrichment must reject missing MIME metadata—surface the failure to callers instead of substituting fallback content types.
+- Image enrichment tasks: once AI enrichment runs, strip any legacy/fallback image comments so only one `**Image:` placeholder and description remain in the final Markdown.
+- Front matter titles must ignore metadata or image description comments—derive the title from the first real document text.
+- When refactoring intelligence helpers, have them return explicit result data instead of relying on hidden side effects.
+- Image placeholders must emit Markdown image links (`![alt](file.png)`) that reference persisted artifacts; only fall back to bold text when no file is available.
+- If AI image enrichment yields no insight, log and continue instead of throwing—treat empty payloads as a soft failure.
+- When executing tests, always include the `ManualConversionDebugTests` suite; treat its failures as blocking.
+- Telemetry work: instrument both overall document processing time and per-page duration with real metrics alongside traces—include histogram/counter coverage so latency is observable at both levels.
+- For large converters, structure them as partial classes and split related files into a dedicated subfolder.
+- Markdown hygiene: strip non-breaking, zero-width, or other non-printable spaces; replace them with regular ASCII spaces so output never contains invisible characters like the long space before `Add`.
+- Architecture revamps: adopt DI-first composition, expose per-request cloud model selection, and employ `System.IO.Pipelines` with optional parallel converter scheduling while keeping documentation and structure tidy.
+- DOCX processing work: restructure element handling around pipeline-driven parallelism so enrichment and extraction avoid sequential bottlenecks while preserving output ordering.
+- URL conversion APIs: expose Uri-based overloads so callers can supply strongly-typed endpoints without manual string normalization.
+- Manual Azure config defaults: never auto-populate `AzureIntegrationConfigDefaults` from environment variables; keep the static placeholder JSON.
+- Never use `MemoryStream` for conversion paths; rely on file-based processing instead of in-memory buffering.
+- Disk-first refactors: put shared disk/workspace helpers into reusable base classes instead of hiding them as nested converter types.
+- Document pipeline work: keep a single, well-defined flow that matches `docs/DocumentProcessingPipeline.md`, centralising common setup in the shared base converter and pushing OpenXML helpers into shared abstractions instead of per-converter copies; document tables/images behaviour in `docs/MetaMD.md`.
+- Manual conversion diagnostics: persist manual harness output to disk and ensure MetaMD formatting includes image description blocks for every extracted artifact.
+- Multi-page tables must emit `<!-- Table spans pages X-Y -->` comments, continuation markers for each affected page, and populate `table.pageStart`, `table.pageEnd`, and `table.pageRange` metadata so downstream systems can align tables with their source pages.
+- PDF converters must honour `SegmentOptions.Pdf.TreatPagesAsImages`, rendering each page to PNG, running OCR/vision enrichment, and composing page segments with image placeholders plus recognized text whenever the option is enabled.
+- Persist conversion workspaces through `ManagedCode.Storage` by allocating a unique, sanitized folder per document, copy the source file, store every extracted artifact via `IStorage`, and emit the final Markdown into the same folder.
 
 # Repository Guidelines
 
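The `MaxParallelImageAnalysis` rules added in this hunk could be sketched roughly as follows. This is an illustration only: `SegmentOptionsSketch` and `Validate` are hypothetical names, since the real `SegmentOptions` type is not part of the visible diff. Only the default expression and the fail-fast requirement come from the rule text.

```csharp
using System;

// Hypothetical stand-in for SegmentOptions; the real type is not shown in this commit view.
public sealed class SegmentOptionsSketch
{
    // Default taken verbatim from the rule: Math.Max(Environment.ProcessorCount * 4, 32).
    public static int DefaultMaxParallelImageAnalysis =>
        Math.Max(Environment.ProcessorCount * 4, 32);

    // Callers may override the value; nothing here silently rescales it.
    public int MaxParallelImageAnalysis { get; init; } = DefaultMaxParallelImageAnalysis;

    // Fail fast: a non-positive value is a configuration error, not "unlimited".
    public void Validate()
    {
        if (MaxParallelImageAnalysis <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(MaxParallelImageAnalysis),
                MaxParallelImageAnalysis,
                "MaxParallelImageAnalysis must be a positive number.");
        }
    }
}
```

Under these assumptions, `new SegmentOptionsSketch { MaxParallelImageAnalysis = 0 }.Validate()` throws instead of falling back to a default, which is the behaviour the first three added rules ask for.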

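The "Markdown hygiene" rule in the AGENTS.md hunk could look something like the sketch below. `MarkdownHygieneSketch.NormalizeSpaces` is a hypothetical helper, not code from this commit, and the choice of Unicode categories is an assumption about what the rule means by invisible characters.

```csharp
using System;
using System.Globalization;
using System.Text;

// Hypothetical helper; the repository's actual sanitizer is not shown in this commit view.
public static class MarkdownHygieneSketch
{
    public static string NormalizeSpaces(string text)
    {
        ArgumentNullException.ThrowIfNull(text);

        var builder = new StringBuilder(text.Length);
        foreach (var ch in text)
        {
            // Keep ordinary line breaks and tabs untouched.
            if (ch is '\n' or '\r' or '\t')
            {
                builder.Append(ch);
                continue;
            }

            var category = char.GetUnicodeCategory(ch);
            var invisible = category is UnicodeCategory.SpaceSeparator // e.g. U+00A0 no-break space
                or UnicodeCategory.Format                              // e.g. U+200B zero-width space
                or UnicodeCategory.Control;                            // remaining control characters

            // Replace each invisible character with a single ASCII space.
            builder.Append(invisible ? ' ' : ch);
        }

        return builder.ToString();
    }
}
```

This version is deliberately aggressive: it also replaces format characters such as the zero-width joiner and the soft hyphen, which may or may not be what the project wants.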
Directory.Build.props

Lines changed: 2 additions & 2 deletions
@@ -22,8 +22,8 @@
 <PackageLicenseExpression>MIT</PackageLicenseExpression>
 <PackageReadmeFile>README.md</PackageReadmeFile>
 <Product>Managed Code - MarkItDown</Product>
-<Version>0.0.4</Version>
-<PackageVersion>0.0.4</PackageVersion>
+<Version>0.0.5</Version>
+<PackageVersion>0.0.5</PackageVersion>
 </PropertyGroup>
 
 <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">

Directory.Packages.props

Lines changed: 20 additions & 14 deletions
@@ -1,38 +1,44 @@
 <Project>
 <ItemGroup>
 <PackageVersion Include="AngleSharp" Version="1.3.0" />
-<PackageVersion Include="AWSSDK.Rekognition" Version="4.0.2.6" />
-<PackageVersion Include="AWSSDK.S3" Version="4.0.7.7" />
-<PackageVersion Include="AWSSDK.Textract" Version="4.0.2.6" />
-<PackageVersion Include="AWSSDK.TranscribeService" Version="4.0.3.9" />
+<PackageVersion Include="AWSSDK.Rekognition" Version="4.0.2.8" />
+<PackageVersion Include="AWSSDK.S3" Version="4.0.7.10" />
+<PackageVersion Include="AWSSDK.Textract" Version="4.0.2.8" />
+<PackageVersion Include="AWSSDK.TranscribeService" Version="4.0.4" />
 <PackageVersion Include="Azure.AI.FormRecognizer" Version="4.1.0" />
 <PackageVersion Include="Azure.AI.OpenAI" Version="2.1.0" />
 <PackageVersion Include="Azure.AI.Vision.ImageAnalysis" Version="1.0.0" />
-<PackageVersion Include="Azure.Identity" Version="1.12.0" />
+<PackageVersion Include="Azure.Identity" Version="1.17.0" />
 <PackageVersion Include="coverlet.collector" Version="6.0.4" />
 <PackageVersion Include="DocumentFormat.OpenXml" Version="3.3.0" />
 <PackageVersion Include="DotNet.ReproducibleBuilds" Version="1.2.25" />
-<PackageVersion Include="Google.Cloud.DocumentAI.V1" Version="3.21.0" />
+<PackageVersion Include="Google.Cloud.DocumentAI.V1" Version="3.22.0" />
 <PackageVersion Include="Google.Cloud.Speech.V1" Version="3.8.0" />
 <PackageVersion Include="Google.Cloud.Vision.V1" Version="3.7.0" />
-<PackageVersion Include="ManagedCode.MimeTypes" Version="1.0.4" />
-<PackageVersion Include="Microsoft.Extensions.AI" Version="9.9.1" />
+<PackageVersion Include="ManagedCode.MimeTypes" Version="1.0.5" />
+<PackageVersion Include="ManagedCode.Storage.Aws" Version="9.2.1" />
+<PackageVersion Include="ManagedCode.Storage.Azure" Version="9.2.1" />
+<PackageVersion Include="ManagedCode.Storage.Core" Version="9.2.1" />
+<PackageVersion Include="ManagedCode.Storage.FileSystem" Version="9.2.1" />
+<PackageVersion Include="ManagedCode.Storage.Gcp" Version="9.2.1" />
+<PackageVersion Include="Microsoft.Extensions.AI" Version="9.10.0" />
 <PackageVersion Include="Microsoft.Extensions.AI.OpenAI" Version="9.9.1-preview.1.25474.6" />
-<PackageVersion Include="Microsoft.Extensions.DependencyInjection.Abstractions" Version="9.0.9" />
-<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="9.0.9" />
+<PackageVersion Include="Microsoft.Extensions.DependencyInjection.Abstractions" Version="9.0.10" />
+<PackageVersion Include="Microsoft.Extensions.Logging.Abstractions" Version="9.0.10" />
+<PackageVersion Include="Microsoft.Extensions.Options" Version="9.0.10" />
 <PackageVersion Include="Microsoft.NET.Test.Sdk" Version="17.14.1" />
 <PackageVersion Include="MimeKit" Version="4.14.0" />
 <PackageVersion Include="Moq" Version="4.20.72" />
 <PackageVersion Include="PdfPig" Version="0.1.11" />
 <PackageVersion Include="PDFtoImage" Version="5.1.1" />
-<PackageVersion Include="Sep" Version="0.11.1" />
+<PackageVersion Include="Sep" Version="0.11.2" />
 <PackageVersion Include="Shouldly" Version="4.3.0" />
 <PackageVersion Include="SkiaSharp" Version="3.119.1" />
 <PackageVersion Include="Spectre.Console" Version="0.51.1" />
-<PackageVersion Include="System.Text.Encoding.CodePages" Version="9.0.9" />
-<PackageVersion Include="System.Text.Json" Version="9.0.9" />
+<PackageVersion Include="System.Text.Encoding.CodePages" Version="9.0.10" />
+<PackageVersion Include="System.Text.Json" Version="9.0.10" />
 <PackageVersion Include="YoutubeExplode" Version="6.5.5" />
 <PackageVersion Include="xunit" Version="2.9.3" />
 <PackageVersion Include="xunit.runner.visualstudio" Version="3.1.4" />
 </ItemGroup>
-</Project>
+</Project>
