Commit f59adc7

Update docs for parse with schema
1 parent a05bcf5 commit f59adc7

3 files changed

Lines changed: 50 additions & 4 deletions

docs/api.rst

Lines changed: 47 additions & 1 deletion
@@ -33,7 +33,7 @@ parse
    * ``api_cost_mapping`` (Union[dict, str]): Dictionary containing API cost details or the string path to a JSON file containing
      the cost details. Sample file available at ``tests/api_cost_mapping.json``
    * ``router_priority`` (str): What the routing strategy should prioritize. Options are ``"speed"`` and ``"accuracy"``. The router directs a file to either ``"STATIC_PARSE"`` or ``"LLM_PARSE"`` based on its type and the selected priority. If priority is "accuracy", it prefers LLM_PARSE unless the PDF has no images but contains embedded/hidden hyperlinks, in which case it uses ``STATIC_PARSE`` (because LLMs currently fail to parse hidden hyperlinks). If priority is "speed", it uses ``STATIC_PARSE`` for documents without images and ``LLM_PARSE`` for documents with images.
-   * ``api_provider`` (str): The API provider to use for LLM parsing. Options are ``openai``, ``huggingface``, ``togetherai``, ``openrouter``, and ``fireworks``. This parameter is only relevant when using LLM parsing.
+   * ``api_provider`` (str): The API provider to use for LLM parsing. Options are ``openai``, ``huggingface``, ``together``, ``openrouter``, and ``fireworks``. This parameter is only relevant when using LLM parsing.
 
    Return value format:
    A dictionary containing a subset or all of the following keys:
@@ -47,6 +47,30 @@ parse
    * ``token_usage``: Token usage statistics
    * ``pdf_path``: Path to the intermediate PDF generated when ``as_pdf`` is enabled and the kwarg ``save_dir`` is specified.
 
+
+parse_with_schema
+^^^^^^^^^^^^^^^^^
+
+.. py:function:: lexoid.api.parse_with_schema(path: str, schema: Dict, api: str = "openai", model: str = "gpt-4o-mini", **kwargs) -> List[List[Dict]]
+
+   Parses a PDF using an LLM to generate structured output conforming to a given JSON schema.
+
+   :param path: Path to the PDF file.
+   :param schema: JSON schema to which the parsed output should conform.
+   :param api: LLM API provider to use (``"openai"``, ``"huggingface"``, ``"together"``, ``"openrouter"``, or ``"fireworks"``).
+   :param model: LLM model name.
+   :param kwargs: Additional keyword arguments passed to the LLM (e.g., ``temperature``, ``max_tokens``).
+   :return: A list where each element represents a page, which in turn contains a list of dictionaries conforming to the provided schema.
+
+   Additional keyword arguments:
+
+   * ``temperature`` (float): Sampling temperature for LLM generation.
+   * ``max_tokens`` (int): Maximum number of tokens to generate.
+
+   Return value format:
+   A list of pages, where each page is represented as a list of dictionaries. Each dictionary conforms to the structure defined by the input ``schema``.
+
+
 Examples
 --------
 
@@ -92,6 +116,28 @@ Static Parsing
     # Parse using PDFMiner
     result = parse("document.pdf", parser_type="STATIC_PARSE", model="pdfminer")
 
+
+Parse with Schema
+^^^^^^^^^^^^^^^^^
+
+.. code-block:: python
+
+    from lexoid.api import parse_with_schema
+
+    sample_schema = [
+        {
+            "Disability Category": "string",
+            "Participants": "int",
+            "Ballots Completed": "int",
+            "Ballots Incomplete/Terminated": "int",
+            "Accuracy": ["string"],
+            "Time to complete": ["string"]
+        }
+    ]
+
+    pdf_path = "inputs/test_1.pdf"
+    result = parse_with_schema(path=pdf_path, schema=sample_schema, model="gpt-4o")
+
 Web Content
 ^^^^^^^^^^^
 
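
Note on ``router_priority``: the description carried as context in the first hunk of docs/api.rst amounts to a small decision table. The sketch below restates that table as code, purely as a reading aid. The function ``choose_parser`` and the flags ``has_images`` and ``has_hidden_links`` are hypothetical names chosen for illustration; they are not taken from Lexoid's source, and the real router may inspect more than these two properties.

```python
def choose_parser(priority: str, has_images: bool, has_hidden_links: bool) -> str:
    """Hypothetical restatement of the documented routing rules (not Lexoid code)."""
    if priority == "accuracy":
        # Documented exception: an image-free PDF with embedded/hidden
        # hyperlinks goes to static parsing, since LLMs miss those links.
        if not has_images and has_hidden_links:
            return "STATIC_PARSE"
        return "LLM_PARSE"
    if priority == "speed":
        # Documents with images go to the LLM; everything else is parsed statically.
        return "LLM_PARSE" if has_images else "STATIC_PARSE"
    raise ValueError(f"unknown priority: {priority!r}")


if __name__ == "__main__":
    for priority in ("speed", "accuracy"):
        for has_images in (False, True):
            for has_hidden_links in (False, True):
                choice = choose_parser(priority, has_images, has_hidden_links)
                print(priority, has_images, has_hidden_links, choice)
```

Under these assumptions the two priorities differ only for image-free documents without hidden hyperlinks: ``"speed"`` sends them to ``STATIC_PARSE``, while ``"accuracy"`` sends them to ``LLM_PARSE``.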

docs/conf.py

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 project = "Lexoid"
 copyright = "2025, Lexoid Contributors"
 author = "Lexoid Contributors"
-release = "0.1.13"
+release = "0.1.14"
 
 # -- General configuration ---------------------------------------------------
 # https://www.sphinx-doc.org/en/master/usage/configuration.html#general-configuration

lexoid/api.py

Lines changed: 2 additions & 2 deletions
@@ -308,9 +308,9 @@ def parse_with_schema(
     Args:
         path (str): Path to the PDF file.
         schema (Dict): JSON schema to which the parsed output should conform.
-        api (str, optional): LLM API provider.
+        api (str, optional): LLM API provider (One of "openai", "huggingface", "together", "openrouter", and "fireworks").
         model (str, optional): LLM model name.
-        **kwargs: Additional arguments for the parser.
+        **kwargs: Additional arguments for the parser (e.g.: temperature, max_tokens).
 
     Returns:
         List[List[Dict]]: List of dictionaries for each page, each conforming to the provided schema.
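
Note on the return shape: the docstring describes the result as ``List[List[Dict]]``, that is, one inner list of records per page. A minimal sketch of how a caller might flatten that shape into a single list of rows follows. The helper ``flatten_pages`` and the ``"page"`` key it adds are hypothetical, and ``sample_result`` is invented data that borrows two field names from the schema in the docs example; it is not output from a real ``parse_with_schema`` call.

```python
from typing import Dict, List


def flatten_pages(result: List[List[Dict]]) -> List[Dict]:
    """Flatten per-page records into one list, tagging each with its 1-based page number."""
    rows: List[Dict] = []
    for page_number, page in enumerate(result, start=1):
        for record in page:
            # Copy the record so the caller's data is not mutated.
            rows.append({"page": page_number, **record})
    return rows


# Invented stand-in for a parse_with_schema result (illustrative values only).
sample_result: List[List[Dict]] = [
    [{"Disability Category": "Blind", "Participants": 5}],
    [],  # a page on which nothing matched the schema
    [
        {"Disability Category": "Low Vision", "Participants": 5},
        {"Disability Category": "Dexterity", "Participants": 5},
    ],
]

if __name__ == "__main__":
    for row in flatten_pages(sample_result):
        print(row)
```

Keeping the page number on each row preserves the one piece of information that the nested shape encodes and a flat list would otherwise lose.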
