5 changes: 4 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,13 @@
## 0.17.1-dev1
## 0.17.1

### Enhancements

- **Add `image_url` for images in the HTML partitioner.** `<img>` tags with non-data content now include a new `image_url` metadata field containing the value of the `src` attribute.

- **Use `lxml` instead of `bs4` to parse hOCR data.** `lxml` is much faster than `bs4`, and the hOCR data format is regular (guaranteed because it is programmatically generated).

- **Bump `numpy` to `>2`.** Also upgrade `paddlepaddle`, `unstructured-paddleocr`, and `onnx` so they are compatible with `numpy>2`.

### Features

### Fixes
4 changes: 1 addition & 3 deletions requirements/base.in
@@ -10,9 +10,7 @@ emoji
dataclasses-json
python-iso639
langdetect
# NOTE(robinson) - numpy pin is because ONNX model weights are only compatible
# with numpy 1.x.x
numpy<2
numpy
rapidfuzz
backoff
typing-extensions
8 changes: 4 additions & 4 deletions requirements/base.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./base.in
#
anyio==4.8.0
anyio==4.9.0
# via httpx
backoff==2.2.1
# via -r ./base.in
@@ -34,7 +34,7 @@ dataclasses-json==0.6.7
# via
# -r ./base.in
# unstructured-client
deepdiff==8.3.0
deepdiff==8.4.2
# via unstructured-client
emoji==2.14.1
# via -r ./base.in
@@ -76,7 +76,7 @@ nest-asyncio==1.6.0
# via unstructured-client
nltk==3.9.1
# via -r ./base.in
numpy==1.26.4
numpy==2.0.2
# via -r ./base.in
olefile==0.47
# via python-oxmsg
@@ -90,7 +90,7 @@ psutil==7.0.0
# via -r ./base.in
pycparser==2.22
# via cffi
pypdf==5.3.1
pypdf==5.4.0
# via unstructured-client
python-dateutil==2.9.0.post0
# via unstructured-client
4 changes: 2 additions & 2 deletions requirements/dev.txt
@@ -15,9 +15,9 @@ click==8.1.8
# pip-tools
distlib==0.3.9
# via virtualenv
filelock==3.17.0
filelock==3.18.0
# via virtualenv
identify==2.6.8
identify==2.6.9
# via pre-commit
importlib-metadata==8.6.1
# via
2 changes: 1 addition & 1 deletion requirements/extra-csv.txt
@@ -4,7 +4,7 @@
#
# pip-compile ./extra-csv.in
#
numpy==1.26.4
numpy==2.0.2
# via
# -c ./base.txt
# pandas
4 changes: 2 additions & 2 deletions requirements/extra-paddleocr.in
@@ -1,5 +1,5 @@
-c ./deps/constraints.txt
-c base.txt

paddlepaddle==3.0.0b1
unstructured.paddleocr==2.8.1.0
paddlepaddle>=3.0.0b1
unstructured.paddleocr==2.10.0
115 changes: 62 additions & 53 deletions requirements/extra-paddleocr.txt
@@ -4,12 +4,24 @@
#
# pip-compile ./extra-paddleocr.in
#
anyio==4.8.0
albucore==0.0.23
# via
# albumentations
# unstructured-paddleocr
albumentations==2.0.5
# via unstructured-paddleocr
annotated-types==0.7.0
# via pydantic
anyio==4.9.0
# via
# -c ./base.txt
# httpx
astor==0.8.1
# via paddlepaddle
beautifulsoup4==4.13.3
# via
# -c ./base.txt
# unstructured-paddleocr
certifi==2025.1.31
# via
# -c ./base.txt
@@ -20,20 +32,20 @@ charset-normalizer==3.4.1
# via
# -c ./base.txt
# requests
contourpy==1.3.0
# via matplotlib
cycler==0.12.1
# via matplotlib
cython==3.0.12
# via unstructured-paddleocr
decorator==5.2.1
# via paddlepaddle
eval-type-backport==0.2.2
# via albumentations
exceptiongroup==1.2.2
# via
# -c ./base.txt
# anyio
fire==0.7.0
# via unstructured-paddleocr
fonttools==4.56.0
# via matplotlib
# via unstructured-paddleocr
h11==0.14.0
# via
# -c ./base.txt
@@ -53,32 +65,26 @@ idna==3.10
# httpx
# requests
imageio==2.37.0
# via
# imgaug
# scikit-image
imgaug==0.4.0
# via unstructured-paddleocr
importlib-resources==6.5.2
# via matplotlib
kiwisolver==1.4.7
# via matplotlib
# via scikit-image
lazy-loader==0.4
# via scikit-image
matplotlib==3.9.4
# via imgaug
lxml==5.3.1
# via
# -c ./base.txt
# python-docx
networkx==3.2.1
# via
# paddlepaddle
# scikit-image
numpy==1.26.4
numpy==2.0.2
# via
# -c ./base.txt
# contourpy
# albucore
# albumentations
# imageio
# imgaug
# matplotlib
# opencv-contrib-python
# opencv-python
# opencv-python-headless
# opt-einsum
# paddlepaddle
# scikit-image
@@ -89,44 +95,42 @@ numpy==1.26.4
opencv-contrib-python==4.11.0.86
# via unstructured-paddleocr
opencv-python==4.11.0.86
# via unstructured-paddleocr
opencv-python-headless==4.11.0.86
# via
# imgaug
# unstructured-paddleocr
# albucore
# albumentations
opt-einsum==3.3.0
# via paddlepaddle
packaging==24.2
# via
# -c ./base.txt
# lazy-loader
# matplotlib
# scikit-image
paddlepaddle==3.0.0b1
paddlepaddle==3.0.0rc1
# via -r ./extra-paddleocr.in
pdf2image==1.17.0
# via unstructured-paddleocr
pillow==11.1.0
# via
# imageio
# imgaug
# matplotlib
# paddlepaddle
# pdf2image
# scikit-image
# unstructured-paddleocr
protobuf==6.30.0
protobuf==6.30.1
# via
# -c ././deps/constraints.txt
# paddlepaddle
pyclipper==1.3.0.post6
# via unstructured-paddleocr
pyparsing==3.2.1
# via matplotlib
python-dateutil==2.9.0.post0
# via
# -c ./base.txt
# matplotlib
pyyaml==6.0.2
pydantic==2.10.6
# via albumentations
pydantic-core==2.27.2
# via pydantic
python-docx==1.1.2
# via unstructured-paddleocr
pyyaml==6.0.2
# via
# albumentations
# unstructured-paddleocr
rapidfuzz==3.12.2
# via
# -c ./base.txt
@@ -136,26 +140,27 @@ requests==2.32.3
# -c ./base.txt
# unstructured-paddleocr
scikit-image==0.24.0
# via
# imgaug
# unstructured-paddleocr
# via unstructured-paddleocr
scipy==1.13.1
# via
# imgaug
# albumentations
# scikit-image
shapely==2.0.7
# via
# imgaug
# unstructured-paddleocr
six==1.17.0
# via
# -c ./base.txt
# imgaug
# python-dateutil
# via unstructured-paddleocr
simsimd==6.2.1
# via albucore
sniffio==1.3.1
# via
# -c ./base.txt
# anyio
soupsieve==2.6
# via
# -c ./base.txt
# beautifulsoup4
stringzilla==3.12.3
# via albucore
termcolor==2.5.0
# via fire
tifffile==2024.8.30
# via scikit-image
tqdm==4.67.1
@@ -165,14 +170,18 @@ tqdm==4.67.1
typing-extensions==4.12.2
# via
# -c ./base.txt
# albucore
# albumentations
# anyio
# beautifulsoup4
# paddlepaddle
unstructured-paddleocr==2.8.1.0
# pydantic
# pydantic-core
# python-docx
unstructured-paddleocr==2.10.0
# via -r ./extra-paddleocr.in
urllib3==1.26.20
# via
# -c ././deps/constraints.txt
# -c ./base.txt
# requests
zipp==3.21.0
# via importlib-resources
5 changes: 3 additions & 2 deletions requirements/extra-pdf-image.in
@@ -1,7 +1,8 @@
-c ./deps/constraints.txt
-c base.txt

onnx
onnx>=1.17.0
onnxruntime>=1.19.0
pdf2image
pdfminer.six
pikepdf
@@ -11,5 +12,5 @@ google-cloud-vision
effdet
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference>=0.8.9
unstructured-inference>=0.8.10
unstructured.pytesseract>=0.3.12
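
A reviewer might want to confirm that the recompiled pins really moved `numpy` to the 2.x line. The following is a minimal sketch of such a check, not part of this PR: `parse_pins` and `major` are hypothetical helper names, and the sample text is copied from the pins shown in this diff. It assumes only the `name==version` layout that `pip-compile` emits, with `# via` comment lines ignored.

```python
import re

# Matches lines such as "numpy==2.0.2"; indented "# via ..." comment lines do not match.
PIN_RE = re.compile(r"^([A-Za-z0-9_.\-]+)==([^\s#]+)")


def parse_pins(text: str) -> dict[str, str]:
    """Map each normalized package name to its pinned version (hypothetical helper)."""
    pins = {}
    for line in text.splitlines():
        match = PIN_RE.match(line.strip())
        if match:
            # Normalize "unstructured.paddleocr" and "unstructured_paddleocr" alike.
            name = match.group(1).lower().replace("_", "-").replace(".", "-")
            pins[name] = match.group(2)
    return pins


def major(version: str) -> int:
    """Return the leading release number, e.g. 3 for "3.0.0rc1" (hypothetical helper)."""
    return int(version.split(".")[0])


# Sample copied from the compiled pins in this diff.
sample = """\
numpy==2.0.2
    # via -r ./base.in
paddlepaddle==3.0.0rc1
unstructured-paddleocr==2.10.0
"""

pins = parse_pins(sample)
assert major(pins["numpy"]) >= 2, "numpy is still pinned to 1.x"
```

In practice the same function could be pointed at the contents of `requirements/base.txt` or `requirements/extra-paddleocr.txt` instead of the inline sample.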