Commit cdb8c40

Merge pull request #2 from managedcode/copilot/fix-1
Convert Python MarkItDown to C# .NET 8 with Comprehensive Testing & Copilot Instructions
2 parents 8a9d8f1 + 806b584 commit cdb8c40

26 files changed

Lines changed: 3626 additions & 280 deletions

.github/copilot-instructions.md

Lines changed: 327 additions & 0 deletions
@@ -0,0 +1,327 @@
# GitHub Copilot Instructions for MarkItDown C# .NET Project

## Project Overview

MarkItDown is a C# .NET 8 library that converts various document formats (HTML, PDF, DOCX, XLSX, etc.) into clean Markdown suitable for Large Language Models (LLMs) and text analysis pipelines. The project is a port of the original Python implementation to C# that maintains API compatibility and adds modern async/await patterns.

## Architecture and Design Principles

### Core Components

```
src/
├── MarkItDown.Core/               # Main library project
│   ├── IDocumentConverter.cs      # Converter interface
│   ├── MarkItDown.cs              # Main orchestration class
│   ├── StreamInfo.cs              # File metadata handling
│   ├── DocumentConverterResult.cs # Conversion results
│   ├── Exceptions/                # Exception hierarchy
│   └── Converters/                # Format-specific converters
└── MarkItDown.Cli/                # Command line tool
tests/
└── MarkItDown.Tests/              # Unit tests with xUnit
```

### Key Design Patterns

1. **Interface-Based Architecture**: All converters implement `IDocumentConverter`
2. **Async/Await Throughout**: Modern C# async patterns for I/O operations
3. **Priority-Based Registration**: Converters are ordered by priority for format detection
4. **Stream-Based Processing**: Avoid temporary files, work with streams
5. **Comprehensive Error Handling**: Specific exception types for different failure modes
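
Priority-based registration (pattern 3) can be sketched roughly as follows. This is a minimal illustration, not the project's actual orchestration code: `ConverterRegistry` and `SampleConverter` are hypothetical names, and `StreamInfo` and `IDocumentConverter` are trimmed-down stand-ins for the real types. It assumes the convention stated later in this document that lower numbers mean higher priority.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-ins for the project's types.
public sealed record StreamInfo(string? MimeType = null, string? Extension = null);

public interface IDocumentConverter
{
    int Priority { get; }
    bool AcceptsInput(StreamInfo streamInfo);
}

// Hypothetical registry: picks the first accepting converter in priority order.
public sealed class ConverterRegistry
{
    private readonly List<IDocumentConverter> _converters = new();

    public void Register(IDocumentConverter converter) => _converters.Add(converter);

    // Lower Priority values are tried first. OrderBy is a stable sort, so
    // converters with equal priority keep their registration order.
    public IDocumentConverter? FindConverter(StreamInfo streamInfo) =>
        _converters
            .OrderBy(c => c.Priority)
            .FirstOrDefault(c => c.AcceptsInput(streamInfo));
}

// Example converter used only to demonstrate ordering.
public sealed record SampleConverter(string Name, int Priority, Func<StreamInfo, bool> Accepts)
    : IDocumentConverter
{
    public bool AcceptsInput(StreamInfo streamInfo) => Accepts(streamInfo);
}
```

With a catch-all converter registered at priority 1000 and an HTML converter at priority 100, `FindConverter` returns the HTML converter for `.html` input and falls back to the catch-all for everything else, regardless of registration order.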

## Code Quality Standards

### C# Coding Conventions

- **Target Framework**: .NET 8.0 (net8.0)
- **Language Version**: C# 12
- **Nullable Reference Types**: Enabled
- **Async Patterns**: Use async/await, with `ConfigureAwait(false)` in library code
- **Exception Handling**: Use specific exception types; never swallow exceptions
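
As a small illustration of the async convention above, the following sketch reads a stream to a string with `ConfigureAwait(false)`. The `StreamText` helper is a hypothetical name and not part of the project; it only shows the shape of an awaited call in library code.

```csharp
using System.IO;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, shown only to illustrate the convention.
public static class StreamText
{
    // Reads the whole stream as UTF-8 text. ConfigureAwait(false) avoids
    // resuming on the caller's synchronization context, which is the usual
    // choice for library code that has no UI affinity.
    public static async Task<string> ReadAllTextAsync(
        Stream stream,
        CancellationToken cancellationToken = default)
    {
        // leaveOpen: true so the caller keeps ownership of the stream.
        using var reader = new StreamReader(
            stream,
            Encoding.UTF8,
            detectEncodingFromByteOrderMarks: true,
            bufferSize: 4096,
            leaveOpen: true);

        return await reader.ReadToEndAsync(cancellationToken).ConfigureAwait(false);
    }
}
```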

### Naming Conventions

- **Classes**: PascalCase (`DocumentConverter`, `StreamInfo`)
- **Methods**: PascalCase (`ConvertAsync`, `AcceptsInput`)
- **Properties**: PascalCase (`Markdown`, `MimeType`)
- **Fields**: _camelCase with underscore prefix (`_logger`, `_converters`)
- **Constants**: PascalCase (`DefaultPriority`)
- **Interfaces**: IPascalCase (`IDocumentConverter`)

### Method Signatures

```csharp
// Async methods should always return Task<T> or Task
// (shown here as interface members, so no async modifier or body)
Task<DocumentConverterResult> ConvertAsync(
    Stream stream,
    StreamInfo streamInfo,
    CancellationToken cancellationToken = default);

// Interface members should be explicit about async:
// a check that does no I/O stays synchronous
bool AcceptsInput(StreamInfo streamInfo);
```

### Error Handling Patterns

```csharp
// Custom exceptions for specific failure modes
public class UnsupportedFormatException : MarkItDownException
{
    public UnsupportedFormatException(string format)
        : base($"Unsupported format: {format}") { }
}

// Proper async exception handling
try
{
    var result = await converter.ConvertAsync(stream, info, cancellationToken);
    return result;
}
catch (UnsupportedFormatException)
{
    throw; // Re-throw specific exceptions
}
catch (Exception ex)
{
    throw new MarkItDownException("Conversion failed", ex);
}
```

### Testing Standards

- **Framework**: xUnit with standard assertions
- **Async Testing**: Proper async test methods
- **Test Naming**: `MethodName_Scenario_ExpectedResult`
- **Coverage**: All public APIs must have tests
- **Edge Cases**: Test null inputs, empty streams, invalid data

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Xunit;

public class HtmlConverterTests
{
    [Fact]
    public async Task ConvertAsync_ValidHtml_ReturnsCorrectMarkdown()
    {
        // Arrange
        var converter = new HtmlConverter();
        var html = "<h1>Test</h1><p>Content</p>";
        var bytes = Encoding.UTF8.GetBytes(html);
        using var stream = new MemoryStream(bytes);
        var streamInfo = new StreamInfo(mimeType: "text/html");

        // Act
        var result = await converter.ConvertAsync(stream, streamInfo);

        // Assert
        Assert.Contains("# Test", result.Markdown);
        Assert.Contains("Content", result.Markdown);
    }
}
```

## Converter Implementation Guidelines

### Creating New Converters

1. **Inherit from Base**: Consider whether a shared base converter class would help
2. **Implement Interface**: All converters must implement `IDocumentConverter`
3. **Priority Assignment**: Lower numbers = higher priority (HTML = 100, Plain Text = 1000)
4. **Format Detection**: Be specific in `AcceptsInput`: check both the MIME type and the file extension
5. **Error Handling**: Wrap third-party exceptions in `MarkItDownException`

### Standard Converter Structure

```csharp
public class YourFormatConverter : IDocumentConverter
{
    public int Priority => 200; // Between HTML(100) and PlainText(1000)

    public bool AcceptsInput(StreamInfo streamInfo)
    {
        // Accept when either the MIME type or the extension matches
        return streamInfo.MimeType?.StartsWith("application/your-format", StringComparison.OrdinalIgnoreCase) == true ||
               string.Equals(streamInfo.Extension, ".your-ext", StringComparison.OrdinalIgnoreCase);
    }

    public async Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default)
    {
        try
        {
            // Reset stream position
            if (stream.CanSeek)
                stream.Position = 0;

            // Your conversion logic here (ConvertToMarkdownAsync and ExtractTitle are placeholders)
            var markdown = await ConvertToMarkdownAsync(stream, cancellationToken).ConfigureAwait(false);

            return new DocumentConverterResult(
                markdown: markdown,
                title: ExtractTitle(markdown) // Optional
            );
        }
        catch (Exception ex) when (ex is not MarkItDownException)
        {
            throw new MarkItDownException($"Failed to convert {streamInfo.Extension} file", ex);
        }
    }
}
```

## Package Management and Dependencies

### NuGet Package References

- **Core Dependencies**: Keep minimal - only what's absolutely needed
- **Version Pinning**: Use specific versions for reproducible builds
- **License Compatibility**: Ensure all dependencies are MIT-compatible
- **Security**: Regularly update packages for security fixes

### Current Key Dependencies

```xml
<PackageReference Include="HtmlAgilityPack" Version="1.11.71" />
<PackageReference Include="System.Text.Json" Version="8.0.5" />
<PackageReference Include="Microsoft.Extensions.Logging.Abstractions" Version="8.0.1" />
```

## Testing Philosophy

### Test Coverage Requirements

- **Every Public Method**: Must have at least basic functionality tests
- **Error Conditions**: Test exception scenarios and edge cases
- **Integration Tests**: Test the full MarkItDown workflow
- **Format-Specific Tests**: Each converter needs comprehensive tests

### Test Data Strategy

```csharp
// Use test data that mirrors the original Python test vectors
public static class TestVectors
{
    public static readonly FileTestVector[] GeneralTestVectors = {
        new FileTestVector(
            filename: "test.html",
            mimeType: "text/html",
            mustInclude: new[] { "# Header", "**bold text**" },
            mustNotInclude: new[] { "<html>", "<script>" }
        )
    };
}
```

### Performance Testing Considerations

- **Large Files**: Test with files larger than 1 MB
- **Memory Usage**: Ensure streaming doesn't load entire files into memory
- **Async Patterns**: Verify proper async/await usage with real I/O
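
One way to check the memory-usage point in a test is to wrap the input in a stream that records how it is read. The sketch below is an assumption about how such a check could look: `ReadTrackingStream` and `LargeInputCheck` are hypothetical helpers, and the chunked read loop merely stands in for a real converter. A large `LargestReadRequested` value would hint that a consumer tried to pull the whole input at once.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

// Hypothetical test helper: records the largest single read a consumer asked
// for, as a rough proxy for "did it try to buffer everything at once?".
public sealed class ReadTrackingStream : Stream
{
    private readonly Stream _inner;

    public ReadTrackingStream(Stream inner) => _inner = inner;

    public int LargestReadRequested { get; private set; }
    public long TotalBytesRead { get; private set; }

    public override int Read(byte[] buffer, int offset, int count)
    {
        LargestReadRequested = Math.Max(LargestReadRequested, count);
        var read = _inner.Read(buffer, offset, count);
        TotalBytesRead += read;
        return read;
    }

    public override bool CanRead => true;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => throw new NotSupportedException();
    public override long Position
    {
        get => throw new NotSupportedException();
        set => throw new NotSupportedException();
    }

    public override void Flush() { }
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }
}

public static class LargeInputCheck
{
    // Stand-in for a converter: consumes the input in fixed-size chunks.
    public static async Task<ReadTrackingStream> DrainAsync(int sizeInBytes, int bufferSize)
    {
        var tracked = new ReadTrackingStream(new MemoryStream(new byte[sizeInBytes]));
        var buffer = new byte[bufferSize];
        while (await tracked.ReadAsync(buffer, 0, buffer.Length).ConfigureAwait(false) > 0)
        {
            // A real converter would process each chunk here.
        }
        return tracked;
    }
}
```

In an xUnit test, the same wrapper could be passed to a converter's `ConvertAsync`, followed by an assertion that `LargestReadRequested` stays well below the input size.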

## CLI Tool Guidelines

### Command Line Interface

- **System.CommandLine**: Use modern .NET CLI framework
- **Error Codes**: Return appropriate exit codes (0 = success, non-zero = error)
- **Logging**: Support verbose output for debugging
- **File Handling**: Support both file paths and stdin/stdout
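
The file-handling guideline could be sketched as below. `CliStreams` is a hypothetical helper, and treating a missing path or `-` as "use the standard stream" is an assumption borrowed from common Unix tools rather than something the project specifies.

```csharp
using System;
using System.IO;

// Hypothetical helper for choosing between files and the standard streams.
public static class CliStreams
{
    // A missing path or "-" selects stdin/stdout (assumed convention).
    public static bool UseStandardStream(string? path) =>
        string.IsNullOrEmpty(path) || path == "-";

    // Opens the input file for reading, or stdin when no path is given.
    public static Stream OpenInput(string? path) =>
        UseStandardStream(path) ? Console.OpenStandardInput() : File.OpenRead(path!);

    // Creates (or overwrites) the output file, or uses stdout when no path is given.
    public static Stream OpenOutput(string? path) =>
        UseStandardStream(path) ? Console.OpenStandardOutput() : File.Create(path!);
}
```

Returning a plain `Stream` in both cases keeps the rest of the CLI unaware of where the data comes from, which fits the stream-based processing principle above.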

### CLI Error Handling

```csharp
try
{
    var result = await markItDown.ConvertAsync(inputStream, streamInfo);
    await Console.Out.WriteAsync(result.Markdown);
    return 0;
}
catch (UnsupportedFormatException ex)
{
    await Console.Error.WriteLineAsync($"Error: {ex.Message}");
    return 1;
}
catch (Exception ex)
{
    await Console.Error.WriteLineAsync($"Unexpected error: {ex.Message}");
    return 2;
}
```

## Future Extension Points

### Adding New Format Support

Priority for new converters:
1. **PDF Support** (iText7 or PdfPig)
2. **Office Documents** (DocumentFormat.OpenXml)
3. **Images with OCR** (ImageSharp + Tesseract)
4. **Audio Transcription** (Azure Speech Services)
5. **CSV/Excel** (EPPlus or ClosedXML)

### Converter Development Workflow

1. **Research Python Implementation**: Understand the original converter
2. **Choose .NET Library**: Find appropriate NuGet packages
3. **Create Test Cases**: Port Python test vectors to C#
4. **Implement Converter**: Follow the patterns above
5. **Integration Testing**: Test with MarkItDown main class
6. **Documentation**: Update README with new format support

## Maintenance and Updates

### Version Compatibility

- **Semantic Versioning**: Follow SemVer for releases
- **API Stability**: Don't break public interfaces without major version bump
- **Backward Compatibility**: Maintain compatibility with existing code

### Documentation Requirements

- **XML Comments**: All public APIs need XML documentation
- **README Updates**: Keep feature matrix current
- **API Examples**: Provide working code examples
- **Migration Guides**: Help users migrate from Python version
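
To illustrate the XML comments requirement, here is a sketch of how the converter interface might be documented. The `StreamInfo` and `DocumentConverterResult` records are hypothetical placeholders so the snippet compiles on its own, and the wording of the comments is only an example.

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical placeholder types so the snippet compiles on its own.
public sealed record StreamInfo(string? MimeType = null, string? Extension = null);
public sealed record DocumentConverterResult(string Markdown, string? Title = null);

/// <summary>
/// Converts documents of one specific format into Markdown.
/// </summary>
public interface IDocumentConverter
{
    /// <summary>
    /// Determines whether this converter can handle the described input.
    /// </summary>
    /// <param name="streamInfo">Metadata such as the MIME type and file extension.</param>
    /// <returns>
    /// <see langword="true"/> if the input is supported; otherwise, <see langword="false"/>.
    /// </returns>
    bool AcceptsInput(StreamInfo streamInfo);

    /// <summary>
    /// Converts the content of <paramref name="stream"/> into Markdown.
    /// </summary>
    /// <param name="stream">The document content to convert.</param>
    /// <param name="streamInfo">Metadata describing the content.</param>
    /// <param name="cancellationToken">Token used to cancel the conversion.</param>
    /// <returns>The conversion result, including the generated Markdown.</returns>
    Task<DocumentConverterResult> ConvertAsync(
        Stream stream,
        StreamInfo streamInfo,
        CancellationToken cancellationToken = default);
}
```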

## Build and Deployment

### Project Configuration

```xml
<PropertyGroup>
  <TargetFramework>net8.0</TargetFramework>
  <Nullable>enable</Nullable>
  <LangVersion>12</LangVersion>
  <GeneratePackageOnBuild>true</GeneratePackageOnBuild>
</PropertyGroup>
```

### NuGet Package Metadata

- **PackageId**: MarkItDown
- **Authors**: ManagedCode
- **Description**: Clear, concise description
- **Tags**: Include relevant keywords for discovery
- **License**: MIT
- **Repository URL**: GitHub repository link

## Development Best Practices

### Code Reviews

- **Interface Design**: Review public APIs carefully
- **Performance**: Check for memory leaks and performance issues
- **Error Handling**: Ensure proper exception handling
- **Tests**: Verify comprehensive test coverage
- **Documentation**: Check XML comments and README updates

### Debugging Guidelines

```csharp
// Use structured logging for debugging
_logger.LogDebug("Converting {FileName} with MIME type {MimeType}",
    streamInfo.FileName, streamInfo.MimeType);

// Add timing for performance analysis
using var activity = MarkItDownActivity.StartActivity("ConvertDocument");
activity?.SetTag("format", streamInfo.Extension);
```
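
The snippet above uses `MarkItDownActivity` without defining it. One possible shape, assuming it is a thin wrapper over `System.Diagnostics.ActivitySource`, is sketched below; the source name `"MarkItDown"` is an assumption.

```csharp
using System.Diagnostics;

// One possible definition of the helper; the source name is an assumption.
public static class MarkItDownActivity
{
    private static readonly ActivitySource Source = new("MarkItDown");

    // Returns null when no ActivityListener is subscribed to the source,
    // which is why callers use the null-conditional operator on the result.
    public static Activity? StartActivity(string name) => Source.StartActivity(name);
}
```

Because `StartActivity` returns `null` when nothing is listening, the tracing calls cost very little unless a listener (for example an OpenTelemetry exporter) is attached.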

This document should guide all development work on the MarkItDown C# project, ensuring consistency, quality, and maintainability as the project grows.
