|
33 | 33 | * ``api_cost_mapping`` (Union[dict, str]): Dictionary containing API cost details or the string path to a JSON file containing |
34 | 34 | the cost details. Sample file available at ``tests/api_cost_mapping.json`` |
35 | 35 | * ``router_priority`` (str): What the routing strategy should prioritize. Options are ``"speed"`` and ``"accuracy"``. The router directs a file to either ``"STATIC_PARSE"`` or ``"LLM_PARSE"`` based on its type and the selected priority. With ``"accuracy"``, it prefers ``LLM_PARSE`` unless the PDF has no images but contains embedded or hidden hyperlinks, in which case it uses ``STATIC_PARSE`` (because LLMs currently fail to parse hidden hyperlinks). With ``"speed"``, it uses ``STATIC_PARSE`` for documents without images and ``LLM_PARSE`` for documents with images. |
36 | | - * ``api_provider`` (str): The API provider to use for LLM parsing. Options are ``openai``, ``huggingface``, ``togetherai``, ``openrouter``, and ``fireworks``. This parameter is only relevant when using LLM parsing. |
| 36 | + * ``api_provider`` (str): The API provider to use for LLM parsing. Options are ``openai``, ``huggingface``, ``together``, ``openrouter``, and ``fireworks``. This parameter is only relevant when using LLM parsing. |
37 | 37 |
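The ``router_priority`` rules described above can be read as a small decision function. The sketch below is only an illustration of that prose, not Lexoid's actual router: the function name ``choose_parser`` and the flags ``has_images`` and ``has_hidden_hyperlinks`` are hypothetical, and it assumes the document's properties are already known.

```python
def choose_parser(priority: str, has_images: bool, has_hidden_hyperlinks: bool) -> str:
    """Hypothetical helper mirroring the documented routing rules for PDFs."""
    if priority == "accuracy":
        # Prefer the LLM, except for image-free PDFs with hidden hyperlinks,
        # which the docs say LLMs currently fail to parse.
        if not has_images and has_hidden_hyperlinks:
            return "STATIC_PARSE"
        return "LLM_PARSE"
    if priority == "speed":
        # Static parsing for image-free documents, LLM parsing otherwise.
        return "LLM_PARSE" if has_images else "STATIC_PARSE"
    raise ValueError(f"Unknown router_priority: {priority!r}")


# Example: an image-free PDF with hidden links, routed for accuracy.
choice = choose_parser("accuracy", has_images=False, has_hidden_hyperlinks=True)
```

Raising ``ValueError`` for an unrecognized priority is a choice made for this sketch; the library may handle invalid values differently.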
|
38 | 38 | Return value format: |
39 | 39 | A dictionary containing a subset or all of the following keys: |
|
47 | 47 | * ``token_usage``: Token usage statistics |
48 | 48 | * ``pdf_path``: Path to the intermediate PDF generated when ``as_pdf`` is enabled and the kwarg ``save_dir`` is specified. |
49 | 49 |
|
| 50 | + |
| 51 | +parse_with_schema |
| 52 | +^^^^^^^^^^^^^^^^^ |
| 53 | + |
| 54 | +.. py:function:: lexoid.api.parse_with_schema(path: str, schema: Dict, api: str = "openai", model: str = "gpt-4o-mini", **kwargs) -> List[List[Dict]] |
| 55 | +
|
| 56 | + Parses a PDF using an LLM to generate structured output conforming to a given JSON schema. |
| 57 | + |
| 58 | + :param path: Path to the PDF file. |
| 59 | + :param schema: JSON schema to which the parsed output should conform. |
| 60 | + :param api: LLM API provider to use (``"openai"``, ``"huggingface"``, ``"together"``, ``"openrouter"``, or ``"fireworks"``). |
| 61 | + :param model: LLM model name. |
| 62 | + :param kwargs: Additional keyword arguments passed to the LLM (e.g., ``temperature``, ``max_tokens``). |
| 63 | + :return: A list where each element represents a page, which in turn contains a list of dictionaries conforming to the provided schema. |
| 64 | + |
| 65 | + Additional keyword arguments: |
| 66 | + |
| 67 | + * ``temperature`` (float): Sampling temperature for LLM generation. |
| 68 | + * ``max_tokens`` (int): Maximum number of tokens to generate. |
| 69 | + |
| 70 | + Return value format: |
| 71 | + A list of pages, where each page is represented as a list of dictionaries. Each dictionary conforms to the structure defined by the input ``schema``. |
| 72 | + |
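Because the return value is nested (pages, then records), callers may want to flatten it before further processing. The following is a minimal sketch under the assumption that the result has exactly the ``List[List[Dict]]`` shape described above; ``flatten_pages`` and the sample data are hypothetical and are not part of ``lexoid.api``.

```python
from typing import Any, Dict, List


def flatten_pages(result: List[List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    """Flatten a per-page result into one list, tagging each record with its page."""
    rows = []
    for page_number, page in enumerate(result, start=1):
        for record in page:
            # Copy the record so the original result is left untouched.
            rows.append({"page": page_number, **record})
    return rows


# Hand-written stand-in for a parse_with_schema result (two pages, three records).
sample_result = [
    [{"Disability Category": "Blind", "Participants": 5}],
    [
        {"Disability Category": "Low Vision", "Participants": 5},
        {"Disability Category": "Dexterity", "Participants": 5},
    ],
]

rows = flatten_pages(sample_result)
```

Note that the added ``page`` key would be overwritten by a schema field of the same name, so a different key may be needed for such schemas.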
| 73 | + |
50 | 74 | Examples |
51 | 75 | -------- |
52 | 76 |
|
@@ -92,6 +116,28 @@ Static Parsing |
92 | 116 | # Parse using PDFMiner |
93 | 117 | result = parse("document.pdf", parser_type="STATIC_PARSE", model="pdfminer") |
94 | 118 |
|
| 119 | +
|
| 120 | +Parse with Schema |
| 121 | +^^^^^^^^^^^^^^^^^ |
| 122 | + |
| 123 | +.. code-block:: python |
| 124 | +
|
| 125 | + from lexoid.api import parse_with_schema |
| 126 | +
|
| 127 | + sample_schema = [ |
| 128 | + { |
| 129 | + "Disability Category": "string", |
| 130 | + "Participants": "int", |
| 131 | + "Ballots Completed": "int", |
| 132 | + "Ballots Incomplete/Terminated": "int", |
| 133 | + "Accuracy": ["string"], |
| 134 | + "Time to complete": ["string"] |
| 135 | + } |
| 136 | + ] |
| 137 | +
|
| 138 | + pdf_path = "inputs/test_1.pdf" |
| 139 | + result = parse_with_schema(path=pdf_path, schema=sample_schema, model="gpt-4o") |
| 140 | +
|
95 | 141 | Web Content |
96 | 142 | ^^^^^^^^^^^ |
97 | 143 |
|
|