Commit b63beb7

big renaming
1 parent bc4c0f2 commit b63beb7

49 files changed, +785 / -533 lines changed

AGENTS.md

Lines changed: 36 additions & 0 deletions
@@ -0,0 +1,36 @@
+# Conversations
+Any resulting updates to AGENTS.md should go under the section "## Rules to follow".
+When you see a convincing argument from me on how to solve or do something, add a summary of it to AGENTS.md so you learn what I want over time.
+If I say any of the following, add the context to AGENTS.md and associate it with a specific type of task:
+If I say "never do x" in some way.
+If I say "always do x" in some way.
+If I say "the process is x" in some way.
+If I tell you to remember something, you do the same, update
+
+
+## Rules to follow
+- MIME handling: always use `ManagedCode.MimeTypes` for MIME constants, lookups, and validation logic.
+
+# Repository Guidelines
+
+## Project Structure & Module Organization
+`MarkItDown.slnx` stitches together the core library under `src/MarkItDown` and the CLI scaffold in `src/MarkItDown.Cli`. The `MarkItDown` project hosts converters, options, and MIME helpers; keep new format handlers inside `Converters/` with focused folders. Integration and regression tests live in `tests/MarkItDown.Tests`, using `*Tests.cs` naming. The `microsoft-markitdown` directory mirrors the upstream Python project via submodule; update it only when syncing parity fixtures. Generated `bin/`, `obj/`, and `TestResults/` folders appear locally; avoid committing them.
+
+## Build, Test, and Development Commands
+- `dotnet restore MarkItDown.slnx` – hydrate solution-wide dependencies.
+- `dotnet build MarkItDown.slnx` – compile all projects with analyzers enforced.
+- `dotnet test MarkItDown.slnx` – run xUnit suites; fails on warnings because of solution settings.
+- `dotnet test MarkItDown.slnx --collect:"XPlat Code Coverage"` – emit Cobertura XML under `tests/MarkItDown.Tests/TestResults/`.
+- `dotnet run --project src/MarkItDown.Cli -- sample.pdf` – try the CLI (currently experimental) against a local asset.
+
+## Coding Style & Naming Conventions
+Projects target `net9.0`, `LangVersion` 13, `Nullable` enabled, and treat warnings as errors. Follow standard C# layout: four-space indents, braces on new lines, and `PascalCase` for types/methods, `camelCase` for locals and parameters. Prefer expression-bodied members only when they improve clarity. Use `var` when the right-hand side makes the type obvious. Keep XML documentation on public APIs and log messages actionable.
+
+## Testing Guidelines
+Tests use xUnit with Shouldly helpers; place fixtures alongside the code they cover. Name methods `MethodUnderTest_Scenario_Expectation` to match existing suites. When adding new converters, create integration tests under `tests/MarkItDown.Tests` that ensure round-trip Markdown and negative paths. Collect coverage with the command above and review the generated Cobertura report before submitting.
+
+## Commit & Pull Request Guidelines
+Recent history favors short, lower-case commit subjects (for example, `removecli`). Continue with concise, descriptive imperatives, optionally tagging scopes (`converter: add epub caption support`). Each PR should link related issues, outline behaviour changes, and note test or coverage results. Attach CLI output or screenshots when UX-facing changes occur, and call out any parity updates pulled from the Python submodule.
+
+## Security & Configuration Notes
+Respect `.gitmodules` and `Directory.Build.props`: they embed repository URLs, reproducible build settings, and authorship data. Never check in API keys or document samples that contain customer data. Configuration overrides belong in `MarkItDownOptions`; guard new options with sensible defaults to keep the library safe for unattended execution.
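
The build, test, and coverage commands listed in the AGENTS.md hunk can be strung together into one local check. The sketch below is not part of the commit: the function names, the `DOTNET` override variable, and the `--no-restore` / `--no-build` flags are assumptions made for illustration, and it assumes the Cobertura report's root `<coverage>` element carries a `line-rate` attribute between 0 and 1.

```bash
# Hypothetical helper: restore, build, and test in order, stopping at the
# first failure. Set DOTNET=echo for a dry run that only prints the arguments.
ci_steps() {
  local dotnet="${DOTNET:-dotnet}"
  local solution="${1:-MarkItDown.slnx}"

  "$dotnet" restore "$solution" &&
    "$dotnet" build "$solution" --no-restore &&
    "$dotnet" test "$solution" --no-build --collect:"XPlat Code Coverage"
}

# Hypothetical helper: print the most recently modified Cobertura report
# under DIR. Paths containing newlines are not handled in this sketch.
latest_coverage_report() {
  local dir="${1:-tests/MarkItDown.Tests/TestResults}"
  find "$dir" -type f -name 'coverage.cobertura.xml' -exec ls -t {} + 2>/dev/null | head -n 1
}

# Hypothetical helper: print overall line coverage as a percentage, read
# from the line-rate attribute of the root <coverage> element.
line_coverage_percent() {
  local report="$1" rate
  rate="$(sed -n 's/.*<coverage[^>]*[[:space:]]line-rate="\([0-9.]*\)".*/\1/p' "$report" | head -n 1)"
  [ -n "$rate" ] || return 1
  awk -v r="$rate" 'BEGIN { printf "%.1f%%\n", r * 100 }'
}
```

A typical run would be `ci_steps && line_coverage_percent "$(latest_coverage_report)"`. Keeping the three steps as small functions makes it possible to dry-run the command order or to summarise an existing report without rebuilding.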

Directory.Build.props

Lines changed: 2 additions & 2 deletions
@@ -21,8 +21,8 @@
     <PackageLicenseExpression>MIT</PackageLicenseExpression>
     <PackageReadmeFile>README.md</PackageReadmeFile>
     <Product>Managed Code - MarkItDown</Product>
-    <Version>0.0.1</Version>
-    <PackageVersion>0.0.1</PackageVersion>
+    <Version>0.0.2</Version>
+    <PackageVersion>0.0.2</PackageVersion>
   </PropertyGroup>
 
   <PropertyGroup Condition="'$(GITHUB_ACTIONS)' == 'true'">
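
This hunk bumps `<Version>` and `<PackageVersion>` together. A check along the following lines could catch the two drifting apart. It is only a sketch: the function names `props_value` and `versions_in_sync` are hypothetical, and it assumes each element sits on a single line, as it does in the hunk.

```bash
# Hypothetical helper: print the text of the first <TAG>...</TAG> element in FILE.
props_value() {
  local tag="$1" file="$2"
  sed -n "s:.*<${tag}>\(.*\)</${tag}>.*:\1:p" "$file" | head -n 1
}

# Hypothetical helper: succeed only when Version and PackageVersion are both
# present and identical.
versions_in_sync() {
  local file="${1:-Directory.Build.props}"
  local version package_version
  version="$(props_value Version "$file")"
  package_version="$(props_value PackageVersion "$file")"
  [ -n "$version" ] && [ "$version" = "$package_version" ]
}
```

Calling `versions_in_sync || echo "version mismatch"` from the repository root would flag a bump that touched only one of the two properties.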

MarkItDown.slnx

Lines changed: 2 additions & 2 deletions
@@ -5,9 +5,9 @@
     <Platform Name="x86" />
   </Configurations>
   <Folder Name="/src/">
-    <Project Path="src/MarkItDown.Core/MarkItDown.Core.csproj" />
+    <Project Path="src\MarkItDown\MarkItDown.csproj" />
   </Folder>
   <Folder Name="/tests/">
     <Project Path="tests/MarkItDown.Tests/MarkItDown.Tests.csproj" />
   </Folder>
-</Solution>
+</Solution>
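
The added `<Project>` entry uses backslashes, while the unchanged tests entry uses forward slashes. Whether that matters depends on the tooling that reads the file, which this diff does not show. If a single style is preferred, a sketch like the following lists the entries to review; the function name `backslash_project_paths` is hypothetical.

```bash
# Hypothetical helper: list <Project> entries (with line numbers) whose Path
# attribute contains a backslash. Exits non-zero when there are none.
backslash_project_paths() {
  local solution="${1:-MarkItDown.slnx}"
  grep -n '<Project[^>]*Path="[^"]*\\' "$solution"
}
```

Because `grep` returns a non-zero status when nothing matches, `backslash_project_paths && echo "mixed separators found"` only prints the message when at least one such entry exists.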

README.md

Lines changed: 81 additions & 37 deletions
@@ -169,73 +169,105 @@ dotnet add package ManagedCode.MarkItDown
 
 ## 💻 Usage
 
-### Basic API Usage
+### Convert a local file
 
 ```csharp
-using MarkItDown.Core;
+using MarkItDown;
 
-// Simple conversion
+// Convert a DOCX file and print the Markdown
 var markItDown = new MarkItDown();
-var result = await markItDown.ConvertAsync("document.html");
+DocumentConverterResult result = await markItDown.ConvertAsync("report.docx");
 Console.WriteLine(result.Markdown);
 ```
 
-### Advanced Usage with Logging
+### Convert a stream with metadata overrides
 
 ```csharp
-using MarkItDown.Core;
-using Microsoft.Extensions.Logging;
+using System.IO;
+using System.Text;
+using MarkItDown;
+
+using var stream = File.OpenRead("invoice.html");
+var streamInfo = new StreamInfo(
+    mimeType: "text/html",
+    extension: ".html",
+    charset: Encoding.UTF8,
+    fileName: "invoice.html");
 
-// With logging and HTTP client for web content
-using var loggerFactory = LoggerFactory.Create(builder => builder.AddConsole());
-var logger = loggerFactory.CreateLogger<Program>();
+var markItDown = new MarkItDown();
+var result = await markItDown.ConvertAsync(stream, streamInfo);
+Console.WriteLine(result.Title);
+```
+
+### Convert content from HTTP/HTTPS
+
+```csharp
+using MarkItDown;
+using Microsoft.Extensions.Logging;
 
+using var loggerFactory = LoggerFactory.Create(static builder => builder.AddConsole());
 using var httpClient = new HttpClient();
-var markItDown = new MarkItDown(logger, httpClient);
 
-// Convert from file
-var fileResult = await markItDown.ConvertAsync("document.html");
+var markItDown = new MarkItDown(
+    logger: loggerFactory.CreateLogger<MarkItDown>(),
+    httpClient: httpClient);
+
+DocumentConverterResult urlResult = await markItDown.ConvertFromUrlAsync("https://contoso.example/blog");
+Console.WriteLine(urlResult.Title);
+```
 
-// Convert from URL
-var urlResult = await markItDown.ConvertFromUrlAsync("https://example.com");
+### Customise the pipeline with options
 
-// Convert from URI (file:, data:, http:, https:)
-var dataResult = await markItDown.ConvertUriAsync("data:text/html;base64,PGgxPkhlbGxvPC9oMT4=");
+```csharp
+using Azure;
+using MarkItDown;
+
+var options = new MarkItDownOptions
+{
+    // Plug in your own services (Azure AI, OpenAI, etc.)
+    ImageCaptioner = async (bytes, info, token) =>
+        await myCaptionService.DescribeAsync(bytes, info, token),
+    AudioTranscriber = async (bytes, info, token) =>
+        await speechClient.TranscribeAsync(bytes, info, token),
+    DocumentIntelligence = new DocumentIntelligenceOptions
+    {
+        Endpoint = "https://<your-resource>.cognitiveservices.azure.com/",
+        Credential = new AzureKeyCredential("<document-intelligence-key>")
+    }
+};
 
-// Convert from stream with optional overrides
-using var stream = File.OpenRead("document.html");
-var streamInfo = new StreamInfo(mimeType: "text/html", extension: ".html");
-var streamResult = await markItDown.ConvertAsync(stream, streamInfo);
+var markItDown = new MarkItDown(options);
 ```
 
-### Custom Converters
+### Custom converters
 
 Create your own format converters by implementing `IDocumentConverter`:
 
 ```csharp
-using MarkItDown.Core;
+using System.IO;
+using MarkItDown;
 
-public class MyCustomConverter : IDocumentConverter
+public sealed class MyCustomConverter : IDocumentConverter
 {
-    public bool Accepts(Stream stream, StreamInfo streamInfo, CancellationToken cancellationToken = default)
-    {
-        return streamInfo.Extension == ".mycustomformat";
-    }
+    public int Priority => ConverterPriority.SpecificFileFormat;
+
+    public bool AcceptsInput(StreamInfo streamInfo) =>
+        string.Equals(streamInfo.Extension, ".mycustom", StringComparison.OrdinalIgnoreCase);
 
-    public async Task<DocumentConverterResult> ConvertAsync(
-        Stream stream,
-        StreamInfo streamInfo,
+    public Task<DocumentConverterResult> ConvertAsync(
+        Stream stream,
+        StreamInfo streamInfo,
         CancellationToken cancellationToken = default)
     {
-        // Your conversion logic here
-        var markdown = "# Converted from custom format\n\nContent here...";
-        return new DocumentConverterResult(markdown, "Document Title");
+        stream.Seek(0, SeekOrigin.Begin);
+        using var reader = new StreamReader(stream, leaveOpen: true);
+        var markdown = "# Converted from custom format\n\n" + reader.ReadToEnd();
+        return Task.FromResult(new DocumentConverterResult(markdown, "Custom document"));
     }
 }
 
-// Register the custom converter
 var markItDown = new MarkItDown();
-markItDown.RegisterConverter(new MyCustomConverter(), ConverterPriority.SpecificFileFormat);
+markItDown.RegisterConverter(new MyCustomConverter());
 ```
 
 ## 🏗️ Architecture
@@ -287,16 +319,28 @@ dotnet test
 dotnet pack --configuration Release
 ```
 
+### Tests & Coverage
+
+```bash
+dotnet test --collect:"XPlat Code Coverage"
+```
+
+The command emits standard test results plus a Cobertura coverage report at
+`tests/MarkItDown.Tests/TestResults/<guid>/coverage.cobertura.xml`. Tools such as
+[ReportGenerator](https://github.com/danielpalme/ReportGenerator) can turn this into
+HTML or Markdown dashboards.
+
 ### Project Structure
 
 ```
 ├── src/
-│   ├── MarkItDown.Core/             # Core library
+│   ├── MarkItDown/                  # Core library
 │   │   ├── Converters/              # Format-specific converters (HTML, PDF, audio, etc.)
 │   │   ├── MarkItDown.cs            # Main conversion engine
 │   │   ├── StreamInfoGuesser.cs     # MIME/charset/extension detection helpers
 │   │   ├── MarkItDownOptions.cs     # Runtime configuration flags
 │   │   └── ...                      # Shared utilities (UriUtilities, MimeMapping, etc.)
+│   └── MarkItDown.Cli/              # CLI host (under active development)
 ├── tests/
 │   └── MarkItDown.Tests/            # xUnit + Shouldly tests, Python parity vectors (WIP)
 ├── Directory.Build.props            # Shared build + packaging settings
