| title | Extract PDF to JSON in C# | Smart Data Extractor | Syncfusion |
|---|---|
| description | Learn how to extract structured data from PDF documents as JSON in C# using the Syncfusion® Smart Data Extractor library for .NET applications. |
| platform | document-processing |
| control | SmartDataExtractor |
| documentation | UG |
| keywords | Assemblies |
JavaScript Object Notation (JSON) is a lightweight data‑interchange format that is easy for humans to read and write, and simple for machines to parse and generate. The Syncfusion® Smart Data Extractor library extracts structured information from PDF documents and scanned images, and outputs the content as JSON. It analyzes text blocks, tables, headers, and form fields to preserve structure, enabling developers to integrate PDF to JSON extraction into their applications.
Refer to the following links for the assemblies and NuGet packages required on different platforms to extract data as a JSON file using the Smart Data Extractor library.
To extract data from an entire PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using Syncfusion.SmartFormRecognizer;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using Syncfusion.SmartFormRecognizer;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Extract data as JSON.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports System.Text
Imports Syncfusion.SmartDataExtractor

' Open the input PDF file as a stream.
Using stream As New FileStream("Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Extract data as JSON.
    Dim data As String = extractor.ExtractDataAsJson(stream)
    ' Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8)
End Using
{% endhighlight %}
{% endtabs %}
N> If you want to extract data from an image instead of a PDF, replace the input stream with the image file (for example, Input.jpg or Input.png). The rest of the code remains unchanged.
You can download a complete working sample from GitHub.
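Before saving, it can be useful to confirm that the returned string parses as JSON and to write it in an indented form that is easier to review. The following is a minimal sketch that relies only on `System.Text.Json` from the .NET base class library. The `JsonFormatting` class and its `Indent` method are hypothetical helpers written for this illustration; they are not part of the Smart Data Extractor library.

```csharp
using System;
using System.Text.Json;

// Hypothetical helper, not part of the Syncfusion API.
public static class JsonFormatting
{
    // Parses the JSON text and returns it re-serialized with indentation.
    // JsonDocument.Parse throws a JsonException if the text is not well-formed JSON.
    public static string Indent(string json)
    {
        using (JsonDocument document = JsonDocument.Parse(json))
        {
            return JsonSerializer.Serialize(
                document.RootElement,
                new JsonSerializerOptions { WriteIndented = true });
        }
    }
}
```

With this helper in place, the save step in the example above could become `File.WriteAllText("Output.json", JsonFormatting.Indent(data), Encoding.UTF8);`.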
To extract data from a specific range of pages in a PDF document using the ExtractDataAsJson method of the DataExtractor class, refer to the following code example:
{% tabs %}
{% highlight c# tabtitle="C# [Cross-platform]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = new int[,] { { 1, 3 } };
    //Extract data as JSON string.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight c# tabtitle="C# [Windows-specific]" %}
using System.IO;
using Syncfusion.SmartDataExtractor;
using System.Text;

//Open the input PDF file as a stream.
using (FileStream stream = new FileStream("Input.pdf", FileMode.Open, FileAccess.Read))
{
    //Initialize the Data Extractor.
    DataExtractor extractor = new DataExtractor();
    //Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = new int[,] { { 1, 3 } };
    //Extract data as JSON string.
    string data = extractor.ExtractDataAsJson(stream);
    //Save the extracted JSON data into an output file.
    File.WriteAllText("Output.json", data, Encoding.UTF8);
}
{% endhighlight %}
{% highlight vb.net tabtitle="VB.NET [Windows-specific]" %}
Imports System.IO
Imports System.Text
Imports Syncfusion.SmartDataExtractor

' Open the input PDF file as a stream.
Using stream As New FileStream("D:/Input.pdf", FileMode.Open, FileAccess.Read)
    ' Initialize the Data Extractor.
    Dim extractor As New DataExtractor()
    ' Set the page range for extraction (pages 1 to 3).
    extractor.PageRange = New Integer(,) {{1, 3}}
    ' Extract data as JSON string.
    Dim data As String = extractor.ExtractDataAsJson(stream)
    ' Save the extracted JSON data into an output file.
    File.WriteAllText("D:/Output.json", data, Encoding.UTF8)
End Using
{% endhighlight %}
{% endtabs %}
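The examples above pass the page range as a two-dimensional array with one row holding the first and last page. When those numbers come from user input, a small guard can catch mistakes before extraction starts. The sketch below is illustrative only: `PageRangeHelper` is a hypothetical class, and it assumes page numbers start at 1 and that the range is inclusive, as the "pages 1 to 3" comment in the examples suggests.

```csharp
using System;

// Hypothetical helper, not part of the Syncfusion API.
public static class PageRangeHelper
{
    // Builds a single-row array in the same { { first, last } } shape
    // that the examples above assign to PageRange.
    public static int[,] Create(int firstPage, int lastPage)
    {
        // Assumption: page numbers are 1-based.
        if (firstPage < 1)
        {
            throw new ArgumentOutOfRangeException(
                nameof(firstPage), "Page numbers are assumed to start at 1.");
        }

        if (lastPage < firstPage)
        {
            throw new ArgumentOutOfRangeException(
                nameof(lastPage), "The last page must not precede the first page.");
        }

        return new int[,] { { firstPage, lastPage } };
    }
}
```

The result can then be assigned in place of the literal array, for example `extractor.PageRange = PageRangeHelper.Create(1, 3);`.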
The JSON output from the extraction contains structured attributes. For more details on the extracted JSON structure and attributes, refer to the JSON Attributes documentation.
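To get a first look at the structure of an extracted file without relying on any particular attribute names, the JSON can be walked generically. The following sketch uses only `System.Text.Json`. The `JsonOutline` class is a hypothetical helper for this illustration, and it makes no assumption about the attribute names the library emits.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper, not part of the Syncfusion API.
public static class JsonOutline
{
    // Returns the path of every object property and array item in the JSON text,
    // in document order. The root is written as "$".
    public static List<string> ListPaths(string json)
    {
        var paths = new List<string>();
        using (JsonDocument document = JsonDocument.Parse(json))
        {
            Walk(document.RootElement, "$", paths);
        }
        return paths;
    }

    private static void Walk(JsonElement element, string path, List<string> paths)
    {
        switch (element.ValueKind)
        {
            case JsonValueKind.Object:
                // Record each property, then descend into its value.
                foreach (JsonProperty property in element.EnumerateObject())
                {
                    string childPath = path + "." + property.Name;
                    paths.Add(childPath);
                    Walk(property.Value, childPath, paths);
                }
                break;

            case JsonValueKind.Array:
                // Record each item by index, then descend into it.
                int index = 0;
                foreach (JsonElement item in element.EnumerateArray())
                {
                    string itemPath = path + "[" + index + "]";
                    paths.Add(itemPath);
                    Walk(item, itemPath, paths);
                    index++;
                }
                break;
        }
    }
}
```

For example, `JsonOutline.ListPaths(File.ReadAllText("Output.json"))` returns one entry per property or array item, which can be printed to see how the extracted content is nested.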