feat: convenience unstructured-get-json.sh update #3971

Merged: cragwolfe merged 37 commits into main from crag/unstructured-get-json-update on Mar 31, 2025

@cragwolfe (Contributor) commented Mar 28, 2025

  • The script now supports:
    • the --vlm flag, which processes the document with the VLM strategy
    • optional --vlm-model and --vlm-provider arguments
    • optionally writing .html output, converted from the unstructured .json output
    • optionally opening that .html output in a browser

Tested with:

unstructured-get-json.sh --write-html --open-html --fast                                                                layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --hi-res                                                              layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --ocr-only                                                            layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm                                                                 layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider openai    --vlm-model gpt-4o                     layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider vertexai  --vlm-model gemini-2.0-flash-001       layout-parser-paper-p2.pdf
unstructured-get-json.sh --write-html --open-html --vlm --vlm-provider anthropic --vlm-model claude-3-5-sonnet-20241022 layout-parser-paper-p2.pdf
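
The four provider-independent runs above could be repeated with a small wrapper. This is only a sketch, not part of the PR: `run_matrix`, `SCRIPT`, and `DRY_RUN` are hypothetical names introduced for illustration, and the only script flags used are the ones shown in the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: re-run the convenience script once per partitioning strategy.
# run_matrix, SCRIPT and DRY_RUN are hypothetical helpers for this example;
# they are not options of unstructured-get-json.sh itself.
run_matrix() {
  local input="${1:-layout-parser-paper-p2.pdf}"
  local script="${SCRIPT:-unstructured-get-json.sh}"
  local strategy
  for strategy in --fast --hi-res --ocr-only --vlm; do
    if [ -n "${DRY_RUN:-}" ]; then
      # Print the command instead of executing it.
      echo "$script" --write-html --open-html "$strategy" "$input"
    else
      "$script" --write-html --open-html "$strategy" "$input"
    fi
  done
}
```

Calling `DRY_RUN=1 run_matrix layout-parser-paper-p2.pdf` prints the four commands without contacting any service; dropping `DRY_RUN` executes them, assuming the script is on `PATH` and any credentials it needs are already configured.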

layout-parser-paper-p2.pdf

@cragwolfe cragwolfe changed the title from "feat: more permissive conversion to html, script updates" to "feat: convenience unstructured-get-json.sh update" on Mar 29, 2025
@cragwolfe cragwolfe merged commit 19fc1fc into main Mar 31, 2025
43 checks passed
@cragwolfe cragwolfe deleted the crag/unstructured-get-json-update branch March 31, 2025 16:45