| title | Extract PDF to Markdown in C# \| Smart Data Extractor \| Syncfusion |
|---|---|
| description | Extract PDF documents as Markdown (MD) in C# using Syncfusion<sup>®</sup> Smart Data Extractor library without Microsoft Office or Adobe dependencies |
| platform | document-processing |
| control | SmartDataExtractor |
| documentation | UG |
| keywords | Assemblies |
Markdown is a lightweight markup language that adds formatting elements to plain text documents. The Syncfusion® Smart Data Extractor library extracts structured information from PDF documents and scanned images, and outputs the content as Markdown (MD). It analyzes text blocks, tables, headers, and form fields to preserve layout and formatting.
Refer to the following links for the assemblies and NuGet packages required, by platform, to extract data as a Markdown file using the .NET Word Library (DocIO).
To extract data from an entire PDF document as Markdown using the ExtractDataAsMarkdown method of the DataExtractor class, refer to the following code example:
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" playgroundButtonLink="https://raw.githubusercontent.com/SyncfusionExamples/PDF-Examples/refs/heads/master/Data-Extraction/Smart-Data-Extractor/Extract-data-as-MD-from-PDF/.NET/Extract-data-as-MD-from-PDF/Program.cs" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as Markdown.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as Markdown.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports System.Text
Imports Syncfusion.SmartDataExtractor

' Open the input PDF file as a stream.
Using stream As New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Extract data as Markdown.
    Dim data As String = extractor.ExtractDataAsMarkdown(stream)
    ' Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8)
End Using
{% endhighlight %}
{% endtabs %}
N> To extract data from an image instead of a PDF, open the image file (for example, Input.jpg or Input.png) as the input stream. The rest of the code remains unchanged.
You can download a complete working sample from GitHub.
The following code demonstrates how to use the ExtractDataAsMarkdown method of the DataExtractor class to extract content from a selected page in a PDF and save it as a Markdown file by specifying its page index.
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page index for extraction (example: page 2).
    extractor.PageRange = new int[,] { { 2, 2 } };
    //Extract data as Markdown using the API.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page index for extraction (example: page 2).
    extractor.PageRange = new int[,] { { 2, 2 } };
    //Extract data as Markdown using the API.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports System.Text
Imports Syncfusion.SmartDataExtractor

' Open the input PDF file as a stream.
Using stream As New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Set the page index for extraction (example: page 2).
    extractor.PageRange = New Integer(,) {{2, 2}}
    ' Extract data as Markdown using the API.
    Dim data As String = extractor.ExtractDataAsMarkdown(stream)
    ' Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8)
End Using
{% endhighlight %}
{% endtabs %}
The following code demonstrates how to use the ExtractDataAsMarkdown method of the DataExtractor class to extract content from a range of pages in a PDF and save it as a Markdown file by specifying the page range.
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = new int[,] { { 1, 3 } };
    //Extract data as Markdown using the API.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = new int[,] { { 1, 3 } };
    //Extract data as Markdown using the API.
    string data = extractor.ExtractDataAsMarkdown(stream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports System.Text
Imports Syncfusion.SmartDataExtractor

' Open the input PDF file as a stream.
Using stream As New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = New Integer(,) {{1, 3}}
    ' Extract data as Markdown using the API.
    Dim data As String = extractor.ExtractDataAsMarkdown(stream)
    ' Save the extracted Markdown data into an output file.
    File.WriteAllText("Output.md", data, Encoding.UTF8)
End Using
{% endhighlight %}
{% endtabs %}
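The two samples above differ only in the values placed in the `int[,]` array, so the array can be built by a small helper that also rejects obviously invalid input. The following is a minimal sketch, not part of the library: `PageRanges`, `FromTo`, and `Single` are hypothetical names, and the 1-based page numbering is an assumption inferred from the "page 2" and "pages 1 to 3" samples.

```csharp
using System;

// Hypothetical helper (not a Syncfusion API) that builds the int[,] shape
// used by the PageRange samples: one row holding { first page, last page }.
public static class PageRanges
{
    // Builds a range from firstPage to lastPage, inclusive.
    public static int[,] FromTo(int firstPage, int lastPage)
    {
        // Assumption: pages are numbered from 1, as the samples suggest.
        if (firstPage < 1)
            throw new ArgumentOutOfRangeException(nameof(firstPage), "Page numbers are assumed to start at 1.");
        if (lastPage < firstPage)
            throw new ArgumentOutOfRangeException(nameof(lastPage), "The last page must not precede the first page.");
        return new int[,] { { firstPage, lastPage } };
    }

    // A single page is a range whose first and last page are the same.
    public static int[,] Single(int page) => FromTo(page, page);
}
```

With this helper, the assignments in the samples could be written as `extractor.PageRange = PageRanges.Single(2);` or `extractor.PageRange = PageRanges.FromTo(1, 3);`.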
The ImageNodeVisited event in the SaveOptions class (from the Syncfusion® DocIO library, used within Smart Data Extractor) allows control over how images are handled when generating a Markdown string. With this event, you can:
- Customize image names and storage paths, and save images externally using a FileStream.
- Replace Base64 content with a file path or URL for optimized storage and cloud reference.
- Generate a basic inbuilt report as a Markdown string, which can be directly consumed by LLMs or stored for further processing.
The following code shows how to use the ExtractDataAsMarkdown method of the DataExtractor class with the ImageNodeVisited event to customize image saving while exporting PDF or image files as Markdown.
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
using System.IO;
using Syncfusion.Office.Markdown;
using Syncfusion.SmartDataExtractor;

//Open the input PDF or Image file as a stream.
using (FileStream inputStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Hook the event to customize image handling.
    extractor.SaveOptions.ImageNodeVisited += SaveImage;
    //Extract Markdown content as string.
    string data = extractor.ExtractDataAsMarkdown(inputStream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("DataToMarkdown.md", data);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.Office.Markdown;
using Syncfusion.SmartDataExtractor;

//Open the input PDF or Image file as a stream.
using (FileStream inputStream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Hook the event to customize image handling.
    extractor.SaveOptions.ImageNodeVisited += SaveImage;
    //Extract Markdown content as string.
    string data = extractor.ExtractDataAsMarkdown(inputStream);
    //Save the extracted Markdown data into an output file.
    File.WriteAllText("DataToMarkdown.md", data);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports Syncfusion.Office.Markdown
Imports Syncfusion.SmartDataExtractor

' Open the input PDF or Image file as a stream.
Using inputStream As New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Hook the event to customize image handling.
    AddHandler extractor.SaveOptions.ImageNodeVisited, AddressOf SaveImage
    ' Extract Markdown content as string.
    Dim data As String = extractor.ExtractDataAsMarkdown(inputStream)
    ' Save the extracted Markdown data into an output file.
    File.WriteAllText("DataToMarkdown.md", data)
End Using
{% endhighlight %}
{% endtabs %}
The following code shows how to implement the event handler to customize the image path and save images externally.
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
//Event handler to save images externally
static void SaveImage(object sender, MdImageNodeVisitedEventArgs args)
{
    //Define output image path (customize naming logic as needed)
    string imagePath = @"D:\Temp\Image1.png";
    //Save the image stream to file
    using (FileStream fileStreamOutput = File.Create(imagePath))
    {
        args.ImageStream.CopyTo(fileStreamOutput);
    }
    //Set the URI to be used in the Markdown output
    args.Uri = imagePath;
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
//Event handler to save images externally
static void SaveImage(object sender, MdImageNodeVisitedEventArgs args)
{
    //Define output image path (customize naming logic as needed)
    string imagePath = @"D:\Temp\Image1.png";
    //Save the image stream to file
    using (FileStream fileStreamOutput = File.Create(imagePath))
    {
        args.ImageStream.CopyTo(fileStreamOutput);
    }
    //Set the URI to be used in the Markdown output
    args.Uri = imagePath;
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
' Event handler to save images externally
Private Sub SaveImage(sender As Object, args As MdImageNodeVisitedEventArgs)
    ' Define output image path (customize naming logic as needed)
    Dim imagePath As String = "D:\Temp\Image1.png"
    ' Save the image stream to file
    Using fileStreamOutput As FileStream = File.Create(imagePath)
        args.ImageStream.CopyTo(fileStreamOutput)
    End Using
    ' Set the URI to be used in the Markdown output
    args.Uri = imagePath
End Sub
{% endhighlight %}
{% endtabs %}
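The handler above writes every image to the same fixed path, so a document with several images would keep only the last one. One possible way to "customize naming logic" is a counter-based file name, sketched below. `ImageFiles` and `SaveNext` are hypothetical names, not part of the library, and the `.png` extension is simply carried over from the sample rather than detected from the image data.

```csharp
using System.IO;
using System.Threading;

// Hypothetical helper (not a Syncfusion API) that gives each saved image its own file.
public static class ImageFiles
{
    private static int counter;

    // Copies the image stream to "<directory>/Image<N>.png" and returns the path written.
    public static string SaveNext(Stream imageStream, string directory)
    {
        // Make sure the target folder exists before creating the file.
        Directory.CreateDirectory(directory);

        // Increment atomically so concurrent calls never reuse an index.
        int index = Interlocked.Increment(ref counter);

        // Assumption: ".png" is reused from the sample; the real format is not inspected.
        string imagePath = Path.Combine(directory, $"Image{index}.png");

        using (FileStream output = File.Create(imagePath))
        {
            imageStream.CopyTo(output);
        }
        return imagePath;
    }
}
```

Inside the `SaveImage` handler, the hard-coded path could then be replaced with `args.Uri = ImageFiles.SaveNext(args.ImageStream, @"D:\Temp");`.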
This section explains how common PDF elements are converted and preserved in Markdown format, ensuring that document structure and formatting remain consistent during the PDF to Markdown conversion process.
| PDF Elements | Preservation in Markdown |
|---|---|
| Header, Paragraph Title, Document Title | Headings (H2) |
| Paragraph | Paragraph |
| Image | Image (base64 string) |
| Table | Table |
| Text Inline Styles | Bold and Italic |
| Link text without title text | Links |
| Code blocks, Footer, Page Number, List, Block quotes, Subscript, Superscript | Text |
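
Because headers and titles are all mapped to H2 headings, the extracted Markdown string can be scanned for those headings to build a quick outline of the document. The sketch below is illustrative only: `MarkdownOutline` and `GetH2Titles` are hypothetical names, and it assumes the headings are emitted in the `## Title` form.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical post-processing helper (not a Syncfusion API).
public static class MarkdownOutline
{
    // Returns the text of every level-2 heading ("## Title") in document order.
    public static List<string> GetH2Titles(string markdown)
    {
        var titles = new List<string>();
        foreach (string rawLine in markdown.Split('\n'))
        {
            // Tolerate Windows line endings.
            string line = rawLine.TrimEnd('\r');

            // "### " does not match, because its third character is '#', not a space.
            if (line.StartsWith("## ", StringComparison.Ordinal))
            {
                titles.Add(line.Substring(3).Trim());
            }
        }
        return titles;
    }
}
```

For example, `MarkdownOutline.GetH2Titles(data)` could be called on the string returned by ExtractDataAsMarkdown before it is written to disk.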