- Bump starlette to latest to remediate CVE-2025-54121 (MEDIUM) and CVE-2025-62727 (HIGH). Removes the `starlette==0.41.2` constraint pin.
- Bump python-multipart to latest to remediate CVE-2026-40347 (MEDIUM).
- Purge uv wheel cache after opencv swap: The 0.1.4 Dockerfile uninstalled the PyPI `opencv-python` wheel and installed the ffmpeg-free replacement, but the original wheel's extracted contents (including `libavcodec.so.59.*` and friends) remained in `~/.cache/uv/archive-v0/…/opencv_python.libs/`. Image scanners still flagged the 14 ffmpeg CVEs because they walk the whole filesystem. Added `uv cache clean` at the end of the opencv replacement `RUN` so the vulnerable libs are evicted from the final image layer.
- Replace PyPI opencv wheels with ffmpeg-free builds in Docker image: After `uv sync`, the Dockerfile now substitutes the installed PyPI opencv-python variant with a source-built `opencv-contrib-python-headless` wheel compiled with `WITH_FFMPEG=OFF`, eliminating 14 bundled ffmpeg CVEs. The contrib-headless variant is a strict superset of the cv2 API (core + contrib modules, no GUI) and can transparently replace `opencv-python`, `opencv-python-headless`, or `opencv-contrib-python`. Wheel is downloaded from the upstream `Unstructured-IO/unstructured` release and hash-verified. Mirrors unstructured#4336.
- security: fix(deps): upgrade vulnerable transitive dependencies [security]
- Bump all packages (refresh uv.lock), pulling `unstructured==0.22.12` which replaces NLTK with spaCy
- Replace `download_nltk_packages` calls with spaCy model pre-download in Makefile, Dockerfile, and CI
- Switch `uv sync --frozen` to `uv sync --locked` across Dockerfile, Makefile, and CI workflows
- Switch arm64 Docker build runner from custom `opensource-linux-arm64-4core` to GitHub-hosted `ubuntu-24.04-arm`
- Consolidate multiarch Docker manifest creation into a single `docker buildx imagetools create` call
- Skip inference tests in CD Docker smoke tests for both architectures (already covered by CI)
- Migrate to native uv for package management, replacing pip and pip-compile
- Replace black and flake8 with ruff for linting and formatting
- Remove all version pins from dependencies, use uv.lock for reproducibility
- Update Dockerfile, CI workflows, and Makefile to use uv throughout
- Fix flaky Korean OCR test assertions for tesseract compatibility
- Use `.python-version` file as single source of truth for Python version across all CI workflows
- Re-enable arm64 Docker image builds with a dedicated ARM runner, restoring multiarch support for both amd64 and arm64
- Switch all CI workflows to faster self-hosted runners (`opensource-linux-8core`)
- Split lint tools into a lightweight dependency group so the CI lint step no longer installs heavy runtime dependencies
- Add explicit dependencies for `backoff`, `pandas`, `psutil`, `pypdf`, and `requests` (previously only transitive via `unstructured[all-docs]`)
- Pre-download NLTK models before parallel test runs to prevent race conditions
- Pin uv version in Dockerfile for reproducible builds
- Remove `py3.12-pip` from Dockerfile (unused since uv migration)
- Drop mypy from CI (ruff covers linting sufficiently)
- Add retry logic to parallel-mode curl tests for transient connection failures
- Switch Dependabot from `pip` to `uv` ecosystem
- Remove unused `ARCH` variable from Makefile
- Refactored the Dockerfile to use the chainguard/wolfi-base image instead of the unstructured/base-image. This is to align with the recent change in the unstructured repo where the same change was made.
- Upgrade dependencies to address CVEs
- Upgrade pdfminer-six to 20260107 to fix ~15-18% performance regression from eager f-string evaluation
- Upgrade packages to resolve CVEs
- Upgrade version to pull in the latest unstructured version and bump versions of dependencies.
- Upgrade Pillow to 11.3.0 to address a CVE
- Return 422 HTTP code when PDF can't be processed
- Patch various CVEs
- Enable pytest concurrency
- Enable Claude Code
- Use Python 3.12 for testing
- Define version in one place
- Patch various CVEs
- Bump Python version to 3.12 because some packages no longer support 3.9
- Patch h11 CVE
- Bump httpcore version due to h11 dependency
- Patch various CVEs
- Fix Starlette vulnerability
- Patch various python CVEs
- Bump to `unstructured` 0.16.11
- No longer attempts to download NLTK asset from S3 which could result in a 403
- Update `strategy` parameter to allow `'` and `"` as input surrounding the value.
- Bump to `unstructured` 0.15.10
- Add `include_slide_notes` parameter, indicating whether slide notes in `ppt` and `pptx` files should be partitioned. Default is `True`. Now, when slide notes are present in the file, they will be included alongside other elements, which may shift the index numbers of non-note elements.
- Bump to `unstructured` 0.15.7
- Resolve NLTK CVE.
- Bump to `unstructured` 0.15.6
- Bump to `unstructured` 0.15.5
- Use the library's `detect_filetype` in API to determine mimetype
- Add content_type api parameter
- Bump to `unstructured` 0.15.1
- Remove constraint on `safetensors` that was preventing us from bumping `transformers`.
- Bump to `unstructured` 0.15.0
- Bump to `unstructured` 0.14.10
- Fix certain filetypes failing mimetype lookup in the new base image
- Replace rockylinux with chainguard/wolfi as a base image for `amd64`
- Bump to `unstructured` 0.14.6
- Bump to `unstructured-inference` 0.7.35
- Bump to `unstructured` 0.14.4
- Add handling for `pdf_infer_table_structure` to reflect the "tables off by default" behavior in `unstructured`.
- Fix list params such as `extract_image_block_types` not working via the python/js clients
- Allow for a different server port with the PORT variable
- Stop disabling the pdf_infer_table_structure parameter in auto strategy.
- Add support for `unique_element_ids` parameter.
- Add max lifetime, via MAX_LIFETIME_SECONDS env-var, to API containers
- Bump unstructured to 0.13.5
- Change default values for `pdf_infer_table_structure` and `skip_infer_table_types`. Mark `pdf_infer_table_structure` deprecated.
- Add support for the `starting_page_number` param.
- Bump unstructured to 0.12.4
- Add support for both `list[str]` and `str` input formats for `ocr_languages` parameter
- Adds support for additional MIME types from `unstructured`
- Document the support for gzip files and add additional testing
- Bump Pydantic to 2.5.x and remove it from explicit dependencies list (will be managed by fastapi)
- Introduce Form params description in the code, which will form openapi and swagger documentation
- Roll back some openapi customizations
- Keep backward compatibility for passing parameters in form of `list[str]` (will not be shown in the documentation)
- Bump unstructured to 0.12.2
- Fix bug that ignored `combine_under_n_chars` chunking option argument.
- Add hi_res_model_name to partition and deprecate model_name
- Bump unstructured to 0.12.0
- Add support for returning extracted image blocks as base64 encoded data stored in metadata fields
- Bump unstructured to 0.11.6
- Handle invalid hi_res_model_name kwarg
- Enable self-hosted authorization using UNSTRUCTURED_API_KEY env variable
- Bump unstructured to 0.11.0
- Bump unstructured to 0.10.30
- Make sure `multipage_sections` param defaults to `true` as per the readme
- Bump unstructured to 0.10.29
- Add `max_characters` param for chunking. This param gives users additional control to "chunk" elements into larger or smaller `CompositeElement`s
- Bump unstructured to 0.10.28
- Make sure chipperv2 is called when `hi_res_model_name==chipper`
- Bump unstructured to 0.10.26
- Bring parent_id metadata field back after fixing a backwards compatibility bug
- Restrict Chipper usage to one at a time. The model is very resource intensive, and this will prevent issues while we improve it.
- Bump unstructured to 0.10.25
- Use a generator when splitting pdfs in parallel mode
- Add a default memory minimum for 503 check
- Fix an UnboundLocalError when an invalid docx file is caught
- Bump unstructured to 0.10.23
- Simplify the error message for BadZipFile errors
- Bump unstructured to 0.10.21
- Fix an unhandled error when a non pdf file is sent with content-type pdf
- Fix an unhandled error when a non docx file is sent with content-type docx
- Fix an unhandled error when a non-Unstructured json schema is sent
- Bump unstructured to 0.10.19
- Bump unstructured to 0.10.18
- Remove spurious whitespace in `app-start.sh`. This fixes deployments in some envs such as Google Cloud Run.
- Adds `languages` kwarg. `ocr_languages` will eventually be deprecated and replaced by `languages` to specify what languages to use for OCR
- Adds a startup log and other minor cleanups
- Adds `chunking_strategy` kwarg and associated params. These params allow users to "chunk" elements into larger or smaller `CompositeElement`s
- Remove `parent_id` from the element metadata. New metadata fields are causing errors with existing installs. We'll readd this once a fix is widely available.
- Fix some pdfs incorrectly returning a file is encrypted error. The `pypdf.is_encrypted` check caused us to return this error even if the file is readable.
- Bump unstructured to 0.10.16
- Drop `detection_class_prob` from the element metadata. This broke backwards compatibility when library users called `partition_via_api`.
- Bump unstructured to 0.10.15
- Bump unstructured to 0.10.14
- Improve parallel mode retry handling
- Improve logging during error handling. We don't need to log stack traces for expected errors.
- Bump unstructured to 0.10.13
- Bump unstructured-inference to 0.5.25
- Remove dependency on unstructured-api-tools
- Add a top level error handler for more consistent response bodies
- Tesseract minor version bump to 5.3.2
- Update readme for parameter `hi_res_model_name`
- Fix a bug using `hi_res_model_name` in parallel mode
- Bump unstructured library to 0.10.12
- Bump unstructured-inference to 0.5.22
- Bump unstructured library to 0.10.8
- Bump unstructured-inference to 0.5.17
- Reject traffic when overloaded via `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`
- Docker image built with Python 3.10 rather than 3.8
- Fix incorrect handling on param skip_infer_table_types
- Pin `safetensors` to fix a build error with 0.0.38
- Fix page break has None page number bug
- Bump unstructured to 0.10.5
- Bump unstructured-ingest to 0.5.15
- Fix UnboundLocalError using pdfs in parallel mode
- Bump unstructured to 0.10.4
- Fix a bug in parallel mode causing `not a valid pdf` errors
- Bump unstructured to 0.10.2, unstructured-inference to 0.5.13
- Bump unstructured library to 0.9.2
- Fix a misleading error in make docker-test
- Bump unstructured library to 0.9.0
- Add table support for image with parameter `skip_infer_table_types`
- Add support for gzipped files
- Image tweak, move application entrypoint to scripts/app-start.sh
- Throw 400 error if a PDF is password protected
- Improve logging of params to single line json
- Add support for `include_page_breaks` parameter
- Support model name as api parameter
- Add retry parameters on fanout requests
- Bump unstructured library to 0.8.1
- Fix how to remove an element's coordinate information
- Add table extraction support for hi_res strategy
- Add support for `encoding` parameter
- Add support for `xml_keep_tags` parameter
- Add env variables for additional parallel mode tweaking
- Support .msg files
- Refactor parallel mode and add smoke test
- Fix header value for api key
- Bump unstructured library to 0.7.8 for bug fixes
- Update documentation and tests for filetypes to sync with partition.auto
- Add support for .rst, .tsv, .xml
- Move PYPDF2 to pypdf since PYPDF2 is deprecated
- Add support for `ocr_only` strategy and `ocr_languages` parameter
- Remove building `detectron2` from source in Dockerfile
- Convert strategy from fast to auto for images since there is no fast strategy for images
- Bump image to use python 3.8.17 instead of 3.8.15
- Add returning text/csv to pipeline_api
- Add support for csv files
- Add parallel processing mode for pages within a pdf
- Bump version of base image to use new stable version of tesseract
- Bump to unstructured==0.7.1 for various bug fixes.
- Supports additional filetypes: epub, odt, rtf
- Updating data type of optional os env var `ALLOWED_ORIGINS`
- Add optional CORS to api if os env var `ALLOWED_ORIGINS` is set
- Add config for unstructured.trace logger
- Fix image build steps to support detectron2 install from Mac M1/M2
- Upgrade to openssl 1.1.1 to accommodate the latest urllib3
- Bump unstructured for SpooledTemporaryFile fix
- Add msg and json types to supported
- Bump unstructured to the latest version
- Posting a bad .pdf results in a 400
- Remove coordinates field from response elements by default
- Add caching from the registry for `make docker-build`
- Add fix for empty content type error
- Bump unstructured-api-tools for better 'file type not supported' response messages
- Updated detectron version
- Update docker-build to use the public registry as a cache
- Adds a strategy parameter to pipeline_api
- Passing file, file_filename, and content_type to `partition`
- Sensible logging config
- Minor version bump
- Minor version bump
- Updated Dockerfile for public release
- Remove rate limiting in the API
- Add file type validation via UNSTRUCTURED_ALLOWED_MIMETYPES
- Major semver route also supported: /general/v0/general
- Changed pipeline name to `pipeline-general`
- Changed pipeline to handle a variety of documents not just emails
- Update Dockerfile, all supported library files.
- Add sample-docs for pdf and pdf image.
- Add emails pipeline Dockerfile
- Add pipeline notebook
- Initial pipeline setup
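
Two entries above describe lenient handling of request parameters: `strategy` values wrapped in `'` or `"`, and `ocr_languages` passed as either `str` or `list[str]`. The following is a minimal sketch of that kind of normalization, not the project's actual code. The helper names `normalize_strategy` and `normalize_ocr_languages` are hypothetical, and wrapping a lone string in a one-item list (rather than splitting it on a delimiter) is an assumption made for illustration.

```python
from typing import List, Union


def normalize_strategy(raw: str) -> str:
    """Strip one pair of matching surrounding quotes (' or ") from a value.

    Hypothetical helper: illustrates accepting "hi_res", 'hi_res', and hi_res alike.
    """
    value = raw.strip()
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ("'", '"'):
        value = value[1:-1]
    return value


def normalize_ocr_languages(raw: Union[str, List[str]]) -> List[str]:
    """Accept either a single string or a list of strings; always return a list.

    Hypothetical helper: a lone string is wrapped as-is (assumption), not split.
    """
    if isinstance(raw, str):
        return [raw]
    return list(raw)


if __name__ == "__main__":
    print(normalize_strategy('"hi_res"'))
    print(normalize_ocr_languages("eng"))
    print(normalize_ocr_languages(["eng", "kor"]))
```

Only a matching pair of quotes is removed, so a value with a stray quote on one side is passed through unchanged and can still be rejected by later validation.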