Co-authored-by: KSemenenko <4385716+KSemenenko@users.noreply.github.com>
ai tests
fixes
tests
project structure
middleware
wip
text sanitizer
refactoring
tests and structure
wip
work
docs
prompt
work
work
fixes
format
nuget
format
a lot of updates
md
progress
progress
wip
minor fixes
work
etest
table
table
tests
tests
converters
converters
meta
skip manual tests
clean project
style fix
AGENTS.md
30 additions & 0 deletions
@@ -9,12 +9,42 @@ If I tell you to remember something, you do the same, update
 ## Rules to follow
+- Never introduce fallback logic that silently overrides user or config values; surface configuration errors instead of masking them in code.
+- Keep `SegmentOptions.MaxParallelImageAnalysis` at `Math.Max(Environment.ProcessorCount * 4, 32)` and do not downscale it via runtime fallbacks.
+- Treat non-positive `SegmentOptions.MaxParallelImageAnalysis` values as configuration errors: fail fast instead of defaulting to unlimited concurrency.
+- Ensure document segments remain in source order with explicit numeric page/segment metadata; avoid relying on labels like "Page 1".
+- When extracting images (or other artifacts), persist them to disk when a target path is supplied and record the file path in artifact metadata.
+- Generate Markdown output from the ordered segment collection so it always reflects current segment content; avoid storing stale Markdown snapshots.
+- Allow `ConvertAsync` (and related entry points) to accept caller-supplied options for AI/config overrides on a per-document basis.
 - MIME handling: always use `ManagedCode.MimeTypes` for MIME constants, lookups, and validation logic.
 - Treat this repository as a high-fidelity port of `microsoft-markitdown`: every test fixture copied from the upstream `tests/test_files/` directory must be referenced by .NET tests (either as positive conversions or explicit unsupported cases). No orphaned fixtures.
 - CSV parsing must use the `Sep` library; avoid Sylvan or other CSV parsers for new or updated code.
 - Format integration tasks: never break the project or existing tests, and validate new format handling against real sample files.
 - Test fixtures must be surfaced via the auto-generated `TestAssetCatalog`; add binaries under `TestFiles/` and rely on its constants in tests.
 - YouTube converter work: include at least one live integration test that exercises the real metadata provider (skip gracefully if the upstream API is unavailable) so the flow mirrors production behaviour.
+- Never introduce test-only abstractions like `IAzureIntegrationSampleResolver` into the core library; keep cross-cutting helpers clean and production-ready.
+- Image enrichment tasks: once OCR runs, send the artifact through the shared `IChatClient` prompt constants, capture a thorough visual description first, convert diagrams/schematics into Mermaid or structured tables, describe technical drawings in depth, and emit Markdown that follows `docs/MetaMD.md` and `docs/MetaMD-Examples.md`.
+- Image AI enrichment must reject missing MIME metadata: surface the failure to callers instead of substituting fallback content types.
+- Image enrichment tasks: once AI enrichment runs, strip any legacy/fallback image comments so only one `**Image:` placeholder and description remain in the final Markdown.
+- Front matter titles must ignore metadata or image description comments; derive the title from the first real document text.
+- When refactoring intelligence helpers, have them return explicit result data instead of relying on hidden side effects.
+- Image placeholders must emit Markdown image links (`![alt](path)`) that reference persisted artifacts; only fall back to bold text when no file is available.
+- If AI image enrichment yields no insight, log and continue instead of throwing; treat empty payloads as a soft failure.
+- When executing tests, always include the `ManualConversionDebugTests` suite; treat its failures as blocking.
+- Telemetry work: instrument both overall document processing time and per-page duration with real metrics alongside traces; include histogram/counter coverage so latency is observable at both levels.
+- For large converters, structure them as partial classes and split related files into a dedicated subfolder.
+- Markdown hygiene: strip non-breaking, zero-width, or other non-printable spaces; replace them with regular ASCII spaces so output never contains invisible characters like the long space before `Add`.
+- Architecture revamps: adopt DI-first composition, expose per-request cloud model selection, and employ `System.IO.Pipelines` with optional parallel converter scheduling while keeping documentation and structure tidy.
+- DOCX processing work: restructure element handling around pipeline-driven parallelism so enrichment and extraction avoid sequential bottlenecks while preserving output ordering.
+- URL conversion APIs: expose Uri-based overloads so callers can supply strongly-typed endpoints without manual string normalization.
+- Manual Azure config defaults: never auto-populate `AzureIntegrationConfigDefaults` from environment variables; keep the static placeholder JSON.
+- Never use `MemoryStream` for conversion paths; rely on file-based processing instead of in-memory buffering.
+- Disk-first refactors: put shared disk/workspace helpers into reusable base classes instead of hiding them as nested converter types.
+- Document pipeline work: keep a single, well-defined flow that matches `docs/DocumentProcessingPipeline.md`, centralising common setup in the shared base converter and pushing OpenXML helpers into shared abstractions instead of per-converter copies; document tables/images behaviour in `docs/MetaMD.md`.
+- Manual conversion diagnostics: persist manual harness output to disk and ensure MetaMD formatting includes image description blocks for every extracted artifact.
+- Multi-page tables must emit `<!-- Table spans pages X-Y -->` comments, continuation markers for each affected page, and populate `table.pageStart`, `table.pageEnd`, and `table.pageRange` metadata so downstream systems can align tables with their source pages.
+- PDF converters must honour `SegmentOptions.Pdf.TreatPagesAsImages`, rendering each page to PNG, running OCR/vision enrichment, and composing page segments with image placeholders plus recognized text whenever the option is enabled.
+- Persist conversion workspaces through `ManagedCode.Storage` by allocating a unique, sanitized folder per document, copying the source file, storing every extracted artifact via `IStorage`, and emitting the final Markdown into the same folder.
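As an illustration of the two `SegmentOptions.MaxParallelImageAnalysis` rules, the fail-fast check could look roughly like the sketch below. `SegmentOptionsSketch` and `SegmentOptionsValidator` are hypothetical stand-ins, not the repository's real types; only the property name and the default expression come from the rules.

```csharp
using System;

// Hypothetical stand-in for the repository's SegmentOptions; only the
// property named in the rules is modelled here.
public sealed class SegmentOptionsSketch
{
    // Default taken from the rule text.
    public int MaxParallelImageAnalysis { get; set; } =
        Math.Max(Environment.ProcessorCount * 4, 32);
}

// Hypothetical helper: validates the value instead of clamping or defaulting it.
public static class SegmentOptionsValidator
{
    public static int GetValidatedParallelism(SegmentOptionsSketch options)
    {
        if (options is null)
        {
            throw new ArgumentNullException(nameof(options));
        }

        // Non-positive values are configuration errors, so surface them
        // to the caller rather than falling back to another limit.
        if (options.MaxParallelImageAnalysis <= 0)
        {
            throw new ArgumentOutOfRangeException(
                nameof(options),
                options.MaxParallelImageAnalysis,
                "MaxParallelImageAnalysis must be a positive number.");
        }

        return options.MaxParallelImageAnalysis;
    }
}
```

Throwing at the point where the options are consumed keeps the misconfiguration visible to the caller, which is the behaviour the first rule in the list asks for.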
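The Markdown hygiene rule can be sketched as a small character filter. This is a minimal sketch under assumptions: `InvisibleSpaceSanitizer` is a hypothetical name, and treating every Unicode `Format` and `Control` character (apart from newline, carriage return, and tab) as an invisible space is a simplification that a real implementation may want to narrow, for example to keep zero-width joiners inside emoji sequences.

```csharp
using System.Globalization;
using System.Text;

// Hypothetical helper illustrating the "Markdown hygiene" rule.
public static class InvisibleSpaceSanitizer
{
    public static string Normalize(string text)
    {
        if (string.IsNullOrEmpty(text))
        {
            return text;
        }

        var builder = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            // Keep ordinary line structure and tabs untouched.
            if (c == '\n' || c == '\r' || c == '\t')
            {
                builder.Append(c);
                continue;
            }

            UnicodeCategory category = char.GetUnicodeCategory(c);

            // Non-ASCII space separators (e.g. U+00A0), format characters
            // (e.g. U+200B, U+FEFF), and remaining control characters are
            // replaced with a regular ASCII space.
            bool invisible =
                (category == UnicodeCategory.SpaceSeparator && c != ' ')
                || category == UnicodeCategory.Format
                || category == UnicodeCategory.Control;

            builder.Append(invisible ? ' ' : c);
        }

        return builder.ToString();
    }
}
```

For example, `InvisibleSpaceSanitizer.Normalize("A\u00A0B")` returns `"A B"`.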
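For the multi-page table rule, the comment and metadata keys might be produced along these lines. `TableSpanAnnotator` is a hypothetical helper, and the `"X-Y"` string used for `table.pageRange` is an assumption; the rule names the key but does not specify its value format.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper illustrating the multi-page table rule.
public static class TableSpanAnnotator
{
    // Builds the HTML comment named in the rule.
    public static string BuildSpanComment(int pageStart, int pageEnd)
    {
        Validate(pageStart, pageEnd);
        return "<!-- Table spans pages " + Format(pageStart) + "-" + Format(pageEnd) + " -->";
    }

    // Builds the three metadata entries named in the rule.
    public static Dictionary<string, string> BuildMetadata(int pageStart, int pageEnd)
    {
        Validate(pageStart, pageEnd);
        return new Dictionary<string, string>
        {
            ["table.pageStart"] = Format(pageStart),
            ["table.pageEnd"] = Format(pageEnd),
            // Assumed value format: "start-end".
            ["table.pageRange"] = Format(pageStart) + "-" + Format(pageEnd),
        };
    }

    private static string Format(int page) =>
        page.ToString(CultureInfo.InvariantCulture);

    private static void Validate(int pageStart, int pageEnd)
    {
        if (pageStart < 1 || pageEnd < pageStart)
        {
            throw new ArgumentException("Expected 1 <= pageStart <= pageEnd.");
        }
    }
}
```

Keeping the page numbers numeric until the last step matches the earlier rule about explicit numeric page metadata rather than labels.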