1 change: 1 addition & 0 deletions .github/workflows/deploy_docs.yml
@@ -40,6 +40,7 @@ jobs:

  deploy:
    needs: build-docs
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      pages: write
360 changes: 291 additions & 69 deletions docs/api.rst

Large diffs are not rendered by default.

12 changes: 8 additions & 4 deletions docs/benchmark.rst
@@ -22,7 +22,7 @@ The similarity metric is calculated using the following steps (see `calculate_si
3. Whitespace and Punctuation Normalization
Extra whitespace and punctuation are removed from both the parsed and ground truth texts. Therefore, the comparison is purely based on the sequence of characters/words, ignoring any formatting differences.

3. Sequence Matching
4. Sequence Matching
Python's ``SequenceMatcher`` compares the extracted text sequences, calculating a similarity ratio between 0 and 1 that reflects content preservation and accuracy.
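
The normalization and sequence-matching steps can be sketched with the standard library alone. This is a minimal illustration, not Lexoid's actual ``calculate_similarity`` implementation: the helper names ``normalize`` and ``similarity`` are hypothetical, only ASCII punctuation is stripped, and the earlier steps of the metric are omitted.

.. code-block:: python

    import re
    import string
    from difflib import SequenceMatcher

    # Translation table that deletes every ASCII punctuation character.
    _PUNCT_TABLE = str.maketrans("", "", string.punctuation)


    def normalize(text: str) -> str:
        """Drop ASCII punctuation and collapse runs of whitespace."""
        without_punct = text.translate(_PUNCT_TABLE)
        return re.sub(r"\s+", " ", without_punct).strip()


    def similarity(parsed: str, ground_truth: str) -> float:
        """Return a ratio in [0, 1] computed on the normalized texts."""
        matcher = SequenceMatcher(None, normalize(parsed), normalize(ground_truth))
        return matcher.ratio()


    # Texts that differ only in punctuation and spacing compare as identical.
    print(similarity("Hello,   world!", "Hello world"))

Because normalization runs before matching, two outputs that differ only in formatting receive the same score as exact copies.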

Running the Benchmarks
@@ -60,11 +60,15 @@ Customizing Benchmarks

You can modify the ``test_attributes`` list in the ``main()`` function to test different configurations:

* ``parser_type``: Switch between LLM and static parsing
* ``parser_type``: Switch between LLM and static parsing (``LLM_PARSE``, ``STATIC_PARSE``, ``AUTO``)
* ``model``: Test different LLM models
* ``framework``: Test different static parsing frameworks
* ``framework``: Test different static parsing frameworks (``pdfplumber``, ``pdfminer``, ``paddleocr``)
* ``pages_per_split``: Adjust document chunking
* ``max_threads``: Control parallel processing

.. note::

   The benchmark harness currently hard-codes ``max_processes=1`` when calling :py:func:`lexoid.api.parse`, so sweeping ``max_threads`` in ``tests/benchmark.py`` does not change the parallelism of ``parse()``.
   To benchmark parallelism, edit ``tests/benchmark.py`` to forward the sweep value to ``max_processes``.
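
One way to picture a configuration sweep, and the forwarding of ``max_threads`` to ``max_processes``, is the sketch below. It assumes configurations are plain dictionaries; the real structure of ``test_attributes`` in ``tests/benchmark.py`` may differ, and ``build_configs`` and ``to_parse_kwargs`` are hypothetical helpers, not part of Lexoid.

.. code-block:: python

    from itertools import product


    def build_configs(grid):
        """Expand {option: [values]} into one dict per combination."""
        keys = list(grid)
        return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


    def to_parse_kwargs(config):
        """Rename the sweep knob so that it reaches parse() as max_processes."""
        kwargs = dict(config)
        if "max_threads" in kwargs:
            kwargs["max_processes"] = kwargs.pop("max_threads")
        return kwargs


    grid = {
        "parser_type": ["LLM_PARSE", "STATIC_PARSE"],
        "max_threads": [1, 4],
    }
    for config in build_configs(grid):
        print(to_parse_kwargs(config))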

Benchmark Results
-----------------
99 changes: 99 additions & 0 deletions docs/cli.rst
@@ -0,0 +1,99 @@
Command-Line Interface
======================

Lexoid ships with a ``lexoid`` command (installed as a console script) for
parsing documents without writing Python code. You can also invoke it via
the module form ``python -m lexoid``.

.. code-block:: bash

    lexoid --help
    python -m lexoid --help

Commands
--------

The CLI exposes three sub-commands:

* ``lexoid parse`` — Convert a document into markdown (or JSON with metadata).
* ``lexoid schema`` — Extract structured data conforming to a JSON schema.
* ``lexoid latex`` — Convert a document into LaTeX.

Common options
^^^^^^^^^^^^^^

Available across all sub-commands:

* ``--input, -i`` (required): Path to an input file (PDF, image, HTML, DOCX, XLSX, PPTX, CSV, TXT, audio) or a URL (``http://``, ``https://``).
* ``--output, -o``: Path to an output file. If omitted, output goes to stdout. Status messages are written to stderr, so stdout stays clean and can be piped.
* ``--verbose, -v``: Enable detailed logging.
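
Keeping results on stdout and status messages on stderr is what makes the output safe to pipe. The pattern can be sketched as follows; this is an illustration, not Lexoid's CLI code, and the ``emit`` helper with its parameters is hypothetical.

.. code-block:: python

    import sys
    from typing import Optional, TextIO


    def emit(result: str, output_path: Optional[str] = None,
             out: Optional[TextIO] = None, err: Optional[TextIO] = None) -> None:
        """Write the result to a file or to stdout; send status text to stderr."""
        out = sys.stdout if out is None else out
        err = sys.stderr if err is None else err
        if output_path is None:
            out.write(result)  # only the parsed document reaches stdout
        else:
            with open(output_path, "w", encoding="utf-8") as fh:
                fh.write(result)
        print("Parsing complete", file=err)  # status never mixes with the result


    emit("# Title\n")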

``lexoid parse``
^^^^^^^^^^^^^^^^

.. code-block:: bash

    lexoid parse --input document.pdf
    lexoid parse --input document.pdf --output output.md
    lexoid parse --input document.pdf --format json --output result.json
    lexoid parse --input document.pdf --parser-type STATIC_PARSE
    lexoid parse --input document.pdf --model gpt-4o

Options:

* ``--parser-type, -p``: ``AUTO`` (default), ``LLM_PARSE``, or ``STATIC_PARSE``.
* ``--model, -m``: LLM model name. Default: ``gemini-2.5-flash``.
* ``--pages-per-split``: Pages per chunk. Default: ``4``.
* ``--max-processes``: Parallel processes. Default: ``4``.
* ``--framework``: Static parsing framework — ``pdfplumber`` or ``paddleocr``.
* ``--format``: ``markdown`` (default; raw markdown text) or ``json`` (full result with segments, metadata, and token usage).
* ``--api``: API provider override. One of ``openai``, ``gemini``, ``anthropic``, ``mistral``, ``together``, ``huggingface``, ``openrouter``, ``fireworks``, ``ollama``. If omitted, inferred from the model name.

``lexoid schema``
^^^^^^^^^^^^^^^^^

Extract structured data using a JSON schema. The schema can be passed as a
file path or as an inline JSON string.

.. code-block:: bash

    # Inline schema
    lexoid schema \
      --input document.pdf \
      --schema '{"type": "object", "properties": {"title": {"type": "string"}}}' \
      --output result.json

    # Schema from file
    lexoid schema --input document.pdf --schema schema.json --output result.json

    # Specify model and API explicitly
    lexoid schema --input document.pdf --schema schema.json --api openai --model gpt-4o

Options:

* ``--schema, -s`` (required): JSON schema — file path or inline JSON.
* ``--model, -m``: LLM model. Default: ``gpt-4o-mini``.
* ``--api``: API provider (auto-detected from model name if omitted).
* ``--example-schema``: Example data (JSON string or file path) illustrating a filled schema.
* ``--fill-single-schema``: Produce a single schema instance for the whole document instead of one per page.
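
Since ``--schema`` accepts either a file path or inline JSON, a caller has to decide which one it received. One plausible way to do so is sketched below; the ``load_schema`` helper is hypothetical and Lexoid's own resolution logic may differ.

.. code-block:: python

    import json
    import os


    def load_schema(arg: str) -> dict:
        """Treat ``arg`` as a file path if such a file exists, else as inline JSON."""
        if os.path.isfile(arg):
            with open(arg, encoding="utf-8") as fh:
                return json.load(fh)
        # Not an existing file: parse the argument itself as JSON.
        return json.loads(arg)


    inline = '{"type": "object", "properties": {"title": {"type": "string"}}}'
    print(load_schema(inline)["properties"])

Checking for an existing file first means a malformed inline string fails with a JSON decoding error rather than a confusing file-not-found error.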

``lexoid latex``
^^^^^^^^^^^^^^^^

.. code-block:: bash

    lexoid latex --input document.pdf
    lexoid latex --input document.pdf --output output.tex
    lexoid latex --input document.pdf --model gpt-4o

Options:

* ``--model, -m``: LLM model. Default: ``gpt-4o-mini``.
* ``--api``: API provider (auto-detected from model name if omitted).

API keys
--------

LLM commands require the relevant environment variable to be set
(see :doc:`installation`). The CLI checks for the required key based on
the resolved provider and raises a clear error if it is missing.
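
The kind of check described here can be sketched as a lookup from provider to environment variable. The provider names are the ``--api`` values listed above and the variable names are those documented in the installation guide, but the pairing table and the ``check_api_key`` helper are assumptions for illustration, not Lexoid's internal code.

.. code-block:: python

    import os

    # Assumed pairing of --api values with documented environment variables.
    REQUIRED_KEY = {
        "gemini": "GOOGLE_API_KEY",
        "openai": "OPENAI_API_KEY",
        "anthropic": "ANTHROPIC_API_KEY",
        "mistral": "MISTRAL_API_KEY",
        "huggingface": "HUGGINGFACEHUB_API_TOKEN",
        "together": "TOGETHER_API_KEY",
        "openrouter": "OPENROUTER_API_KEY",
        "fireworks": "FIREWORKS_API_KEY",
    }


    def check_api_key(provider, env=None):
        """Return the required variable name, or None if the provider needs no key."""
        env = os.environ if env is None else env
        var = REQUIRED_KEY.get(provider)
        if var is None:
            return None  # e.g. "ollama", which runs locally without a key
        if not env.get(var):
            raise RuntimeError(f"{var} is not set; it is required for provider '{provider}'")
        return var


    print(check_api_key("ollama"))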
31 changes: 21 additions & 10 deletions docs/index.rst
@@ -1,44 +1,55 @@
Welcome to Lexoid's Documentation
=================================

Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) PDF document parsing.
Lexoid is an efficient document parsing library that supports both LLM-based and non-LLM-based (static) parsing of PDFs, images, web pages, office documents, and audio files.

.. toctree::
   :maxdepth: 2
   :caption: Contents:

   installation
   api
   cli
   contributing
   benchmark

Key Features
------------

* Multiple parsing strategies (LLM-based and static parsing)
* Automatic parsing strategy selection
* Support for multiple LLM providers (OpenAI, Google, Meta/Llama, Together AI)
* Automatic parsing strategy selection (``AUTO`` mode) with optional ML-based LLM auto-selection
* Routing priorities: ``speed``, ``accuracy``, and ``cost``
* Support for many LLM providers (OpenAI, Google Gemini, Anthropic, Mistral, Hugging Face, Together AI, OpenRouter, Fireworks)
* Local LLM inference via Ollama, SmolDocling/granite-docling, and PaddleOCR-VL (no API key required)
* Schema-constrained extraction (``parse_with_schema``) accepting ``dict``, ``dataclass``, or Pydantic ``BaseModel``
* LaTeX conversion (``parse_to_latex``)
* Audio transcription to markdown (via Gemini)
* Multi-format input: PDF, images (PNG/JPG/TIFF/BMP/GIF), HTML, DOCX, XLSX, PPTX, CSV, TXT, audio, and URLs
* Recursive URL parsing
* Table detection and markdown conversion
* Hyperlink detection and preservation
* Recursive URL parsing
* Multi-format support
* Parallel processing support
* Permissive license
* Reference highlighting and bounding box extraction
* Reference highlighting and bounding box extraction (``return_bboxes``)
* Parallel processing via multiprocessing
* Command-line interface (``lexoid`` / ``python -m lexoid``)
* Permissive Apache 2.0 license

Supported API Providers
-----------------------

* Google
* Google (Gemini)
* OpenAI
* Anthropic (Claude)
* Mistral (OCR models)
* Hugging Face
* Together AI
* OpenRouter
* Fireworks
* Ollama (local inference)
* Local models (SmolDocling/granite-docling, PaddleOCR-VL)

Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
* :ref:`search`
64 changes: 59 additions & 5 deletions docs/installation.rst
@@ -8,27 +8,81 @@ Installing with pip

pip install lexoid

This installs both the Python library and the ``lexoid`` command-line entry
point. See :doc:`cli` for CLI usage.

Environment Setup
-----------------

To use LLM-based parsing, define the following environment variables or create a ``.env`` file with the following definitions:
To use LLM-based parsing, define the environment variables for the providers
you intend to use (in a shell, ``.env`` file, or your container environment):

.. code-block:: bash

    GOOGLE_API_KEY=your_google_api_key
    OPENAI_API_KEY=your_openai_api_key
    GOOGLE_API_KEY=your_google_api_key # Gemini
    OPENAI_API_KEY=your_openai_api_key # OpenAI / GPT
    ANTHROPIC_API_KEY=your_anthropic_api_key # Claude
    MISTRAL_API_KEY=your_mistral_api_key # Mistral OCR
    HUGGINGFACEHUB_API_TOKEN=your_huggingface_token
    TOGETHER_API_KEY=your_together_api_key
    OPENROUTER_API_KEY=your_openrouter_api_key
    FIREWORKS_API_KEY=your_fireworks_api_key

Only the providers you actually use require keys. Local backends (Ollama,
SmolDocling/granite-docling, PaddleOCR-VL) do not require an API key.

Additional environment variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

* ``DEFAULT_LLM`` — overrides the default LLM model. Default: ``gemini-2.5-flash``.
* ``DEFAULT_LOCAL_LM`` — overrides the default local model used by ``parse_with_local_model``. Default: ``ds4sd/SmolDocling-256M-preview``.
* ``DEFAULT_STATIC_FRAMEWORK`` — overrides the default static-parsing framework. Default: ``pdfplumber``.
* ``DEFAULT_MAX_IMAGE_DIMENSION`` — maximum pixel dimension for resizing page/image inputs. Default: ``1000``.
* ``OLLAMA_BASE_URL`` — base URL of the Ollama server. Default: ``http://localhost:11434``.
* ``OLLAMA_TIMEOUT`` — request timeout (seconds) for Ollama. Default: ``120``.
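
To see which values would apply in a given environment, the documented defaults can be resolved with ``os.environ``. This is a sketch for inspection only: the ``resolve_settings`` helper is hypothetical, and the numeric conversions are an assumption about how the values are consumed.

.. code-block:: python

    import os

    # Defaults as documented in the list above.
    DEFAULTS = {
        "DEFAULT_LLM": "gemini-2.5-flash",
        "DEFAULT_LOCAL_LM": "ds4sd/SmolDocling-256M-preview",
        "DEFAULT_STATIC_FRAMEWORK": "pdfplumber",
        "DEFAULT_MAX_IMAGE_DIMENSION": "1000",
        "OLLAMA_BASE_URL": "http://localhost:11434",
        "OLLAMA_TIMEOUT": "120",
    }


    def resolve_settings(env=None):
        """Return each setting, preferring the environment over the default."""
        env = os.environ if env is None else env
        settings = {name: env.get(name, default) for name, default in DEFAULTS.items()}
        # Assumed types: an integer pixel count and a timeout in seconds.
        settings["DEFAULT_MAX_IMAGE_DIMENSION"] = int(settings["DEFAULT_MAX_IMAGE_DIMENSION"])
        settings["OLLAMA_TIMEOUT"] = float(settings["OLLAMA_TIMEOUT"])
        return settings


    for name, value in resolve_settings().items():
        print(f"{name}={value}")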

Optional Dependencies
---------------------

To use ``Playwright`` for retrieving web content (instead of the ``requests`` library):
Playwright (for web content retrieval)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

To use Playwright for retrieving web content (instead of the bare ``requests``
library), install its browser dependencies after ``pip install lexoid``:

.. code-block:: bash

    playwright install --with-deps --only-shell chromium

LibreOffice (for DOCX to PDF on Linux)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

On Linux, ``.doc``/``.docx`` to PDF conversion uses LibreOffice's
``lowriter`` binary (because ``docx2pdf`` is unsupported on Linux). Install
it from your distribution's package manager, e.g.:

.. code-block:: bash

    sudo apt-get install libreoffice

On macOS/Windows, ``docx2pdf`` is used automatically (requires Microsoft Word
or a compatible installation).

Ollama (for local LLM parsing)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Install `Ollama <https://ollama.com>`_, pull a vision-capable model, and
keep the server running:

.. code-block:: bash

    ollama pull gemma4
    ollama serve

Then call ``parse(..., api_provider="ollama", model="gemma4:latest", max_processes=1)``.
Lexoid forces ``max_processes=1`` for Ollama-backed parsing to avoid local
multiprocess contention.
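
Before calling ``parse`` with the Ollama provider, it can help to confirm that the server is reachable and the model has been pulled. The sketch below queries Ollama's ``/api/tags`` endpoint, which lists local models, using only the standard library; the ``ollama_models`` helper is hypothetical and not part of Lexoid.

.. code-block:: python

    import json
    import os
    import urllib.request


    def ollama_models(base_url=None, timeout=5.0):
        """Return the names of locally pulled models, or None if unreachable."""
        if base_url is None:
            base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")
        url = base_url.rstrip("/") + "/api/tags"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                payload = json.load(resp)
        except (OSError, ValueError):
            # Connection refused, timeout, HTTP error, or a non-JSON response.
            return None
        return [model.get("name") for model in payload.get("models", [])]


    models = ollama_models()
    if models is None:
        print("Ollama server is not reachable")
    else:
        print("Available models:", models)

If the model name passed to ``parse`` (``gemma4:latest`` in the call above) is missing from the returned list, run ``ollama pull`` first.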

Building from Source
--------------------

@@ -57,4 +111,4 @@ To activate virtual environment:

.. code-block:: bash

source .venv/bin/activate
source .venv/bin/activate