Merged
7 changes: 4 additions & 3 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -1,15 +1,16 @@
## 0.17.7
## 0.17.8

### Enhancements
- **Updated Dockerfile with `ENV HF_HUB_OFFLINE=1`** to prevent the container from trying to access the internet
- **Bump `unstructured-inference` to `1.0.5`** It includes a critical fix to ensure inference model initialization is thread-safe

### Features

### Fixes

## 0.17.7-dev0
## 0.17.7

### Enhancements
- **Updated Dockerfile with `ENV HF_HUB_OFFLINE=1`** to prevent the container from trying to access the internet

### Features

2 changes: 1 addition & 1 deletion requirements/extra-csv.txt
Original file line number Diff line number Diff line change
@@ -8,7 +8,7 @@ numpy==2.0.2
# via
# -c ./base.txt
# pandas
pandas==2.2.3
pandas==2.3.0
# via -r ./extra-csv.in
python-dateutil==2.9.0.post0
# via
2 changes: 1 addition & 1 deletion requirements/extra-pdf-image.in
Original file line number Diff line number Diff line change
@@ -12,5 +12,5 @@ google-cloud-vision
effdet
# Do not move to constraints.in, otherwise unstructured-inference will not be upgraded
# when unstructured library is.
unstructured-inference>=0.8.10
unstructured-inference>=1.0.5
unstructured.pytesseract>=0.3.12
16 changes: 8 additions & 8 deletions requirements/extra-pdf-image.txt
Original file line number Diff line number Diff line change
@@ -52,7 +52,7 @@ fsspec==2025.5.1
# torch
google-api-core[grpc]==2.25.0
# via google-cloud-vision
google-auth==2.40.2
google-auth==2.40.3
# via
# google-api-core
# google-cloud-vision
@@ -69,9 +69,9 @@ grpcio==1.72.1
# grpcio-status
grpcio-status==1.72.1
# via google-api-core
hf-xet==1.1.2
hf-xet==1.1.3
# via huggingface-hub
huggingface-hub==0.32.3
huggingface-hub==0.32.4
# via
# accelerate
# timm
@@ -139,7 +139,7 @@ packaging==25.0
# pikepdf
# transformers
# unstructured-pytesseract
pandas==2.2.3
pandas==2.3.0
# via unstructured-inference
pdf2image==1.17.0
# via -r ./extra-pdf-image.in
@@ -184,7 +184,7 @@ pyasn1==0.6.1
# rsa
pyasn1-modules==0.4.2
# via google-auth
pycocotools==2.0.9
pycocotools==2.0.10
# via effdet
pycparser==2.22
# via
@@ -253,14 +253,14 @@ tokenizers==0.21.1
# via
# -c ././deps/constraints.txt
# transformers
torch==2.7.0
torch==2.7.1
# via
# accelerate
# effdet
# timm
# torchvision
# unstructured-inference
torchvision==0.22.0
torchvision==0.22.1
# via
# effdet
# timm
@@ -280,7 +280,7 @@ typing-extensions==4.14.0
# torch
tzdata==2025.2
# via pandas
unstructured-inference==1.0.2
unstructured-inference==1.0.5
# via -r ./extra-pdf-image.in
unstructured-pytesseract==0.3.15
# via -r ./extra-pdf-image.in
2 changes: 1 addition & 1 deletion requirements/extra-xlsx.txt
Original file line number Diff line number Diff line change
@@ -14,7 +14,7 @@ numpy==2.0.2
# pandas
openpyxl==3.1.5
# via -r ./extra-xlsx.in
pandas==2.2.3
pandas==2.3.0
# via -r ./extra-xlsx.in
python-dateutil==2.9.0.post0
# via
6 changes: 3 additions & 3 deletions requirements/huggingface.txt
Original file line number Diff line number Diff line change
@@ -25,9 +25,9 @@ fsspec==2025.5.1
# via
# huggingface-hub
# torch
hf-xet==1.1.2
hf-xet==1.1.3
# via huggingface-hub
huggingface-hub==0.32.3
huggingface-hub==0.32.4
# via
# tokenizers
# transformers
@@ -90,7 +90,7 @@ tokenizers==0.21.1
# via
# -c ././deps/constraints.txt
# transformers
torch==2.7.0
torch==2.7.1
# via -r ./huggingface.in
tqdm==4.67.1
# via
Original file line number Diff line number Diff line change
@@ -26,7 +26,7 @@ <h1 class="Title" id="5d45a28d875e403c7294a15f22a0162f">
Large Model
</th>
<th style="border: 1px solid black;">
Notes
| Notes
</th>
</tr>
</thead>
Original file line number Diff line number Diff line change
@@ -168,33 +168,21 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
Dataset
</th>
<th style="border: 1px solid black;">
|
| Base Model'|
</th>
<th style="border: 1px solid black;">
Base Model'|
</th>
<th style="border: 1px solid black;">
Large Model |
</th>
<th style="border: 1px solid black;">
Notes
| Notes
</th>
</tr>
</thead>
<tbody>
<tr style="border: 1px solid black;">
<td style="border: 1px solid black;">
PubLayNet
</td>
<td style="border: 1px solid black;">
B8]|
PubLayNet B8]|
</td>
<td style="border: 1px solid black;">
F/M
</td>
<td style="border: 1px solid black;">
M
</td>
<td style="border: 1px solid black;">
Layouts of modern scientific documents
</td>
@@ -203,14 +191,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
<td style="border: 1px solid black;">
PRImA
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
M
</td>
<td style="border: 1px solid black;">
-
</td>
<td style="border: 1px solid black;">
Layouts of scanned modern magazines and scientific report
</td>
@@ -219,14 +202,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
<td style="border: 1px solid black;">
Newspaper
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
F
</td>
<td style="border: 1px solid black;">
-
</td>
<td style="border: 1px solid black;">
Layouts of scanned US newspapers from the 20th century
</td>
@@ -235,11 +213,6 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
<td style="border: 1px solid black;">
TableBank
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
F
</td>
<td style="border: 1px solid black;">
F
</td>
@@ -251,14 +224,9 @@ <h1 class="Title" id="5a1838a8f40b4523094652cf14ab974c">
<td style="border: 1px solid black;">
HJDataset
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
F/M
</td>
<td style="border: 1px solid black;">
-
</td>
<td style="border: 1px solid black;">
Layouts of history Japanese documents
</td>
@@ -348,7 +316,10 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
<thead>
<tr style="border: 1px solid black;">
<th style="border: 1px solid black;">
block.pad(top, bottom, right,
block.pad(top, bottom,
</th>
<th style="border: 1px solid black;">
right,
</th>
<th style="border: 1px solid black;">
left)
@@ -365,6 +336,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Scale the current block given the ratio in x and y direction
</td>
@@ -375,6 +348,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Move the current block with the shift distances in x and y direction
</td>
@@ -385,6 +360,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Whether block] is inside of block2
</td>
@@ -395,6 +372,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs
</td>
@@ -405,6 +384,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Return the union region of blockl and block2. Coordinate type to be determined based on the inputs
</td>
@@ -415,6 +396,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Convert the absolute coordinates of block to relative coordinates to block2
</td>
@@ -425,6 +408,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates
</td>
@@ -435,6 +420,8 @@ <h1 class="Title" id="2b81bd7a3f21b84379bfcd4bb175c5d1">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
</td>
<td style="border: 1px solid black;">
Obtain the image segments in the block region
</td>
Original file line number Diff line number Diff line change
@@ -48,7 +48,7 @@
"element_id": "dddac446da6c93dc1449ecb5d997c423",
"text": "Dataset | Base Model\" Large Model | Notes PubLayNet [38] P/M M Layouts of modern scientific documents PRImA [3) M - Layouts of scanned modern magazines and scientific reports Newspaper [17] P - Layouts of scanned US newspapers from the 20th century ‘TableBank (18) P P Table region on modern scientific and business document HJDataset (31) | F/M - Layouts of history Japanese documents",
"metadata": {
"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model!|</th><th>Large Model</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet [33]</td><td>P/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA [3]</td><td>M</td><td></td><td>Layouts of scanned modern magazines and scientific reports</td></tr><tr><td>Newspaper [17]</td><td>P</td><td></td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank [18]</td><td>P</td><td></td><td>Table region on modern scientific and business document</td></tr><tr><td>HIDataset [31]</td><td>P/M</td><td></td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"filetype": "image/jpeg",
"languages": [
"eng"
Original file line number Diff line number Diff line change
@@ -1459,7 +1459,7 @@
"start_index": 65
}
],
"text_as_html": "<table><thead><tr><th>Dataset</th><th>|</th><th>Base Model'|</th><th>Large Model |</th><th>Notes</th></tr></thead><tbody><tr><td>PubLayNet</td><td>B8]|</td><td>F/M</td><td>M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td></td><td>M</td><td>-</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td></td><td>F</td><td>-</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td></td><td>F</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td></td><td>F/M</td><td>-</td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"text_as_html": "<table><thead><tr><th>Dataset</th><th>| Base Model'|</th><th>| Notes</th></tr></thead><tbody><tr><td>PubLayNet B8]|</td><td>F/M</td><td>Layouts of modern scientific documents</td></tr><tr><td>PRImA</td><td>M</td><td>Layouts of scanned modern magazines and scientific report</td></tr><tr><td>Newspaper</td><td>F</td><td>Layouts of scanned US newspapers from the 20th century</td></tr><tr><td>TableBank</td><td>F</td><td>Table region on modern scientific and business document</td></tr><tr><td>HJDataset</td><td>F/M</td><td>Layouts of history Japanese documents</td></tr></tbody></table>",
"filetype": "application/pdf",
"languages": [
"eng"
@@ -2153,7 +2153,7 @@
"element_id": "64bc79d1132a89c71837f420d6e4e2dc",
"text": "Operation Name Description block.pad(top, bottom, right, left) Enlarge the current block according to the input block.scale(fx, fy) Scale the current block given the ratio in x and y direction block.shift(dx, dy) Move the current block with the shift distances in x and y direction block1.is in(block2) Whether block1 is inside of block2 block1.intersect(block2) Return the intersection region of block1 and block2. Coordinate type to be determined based on the inputs. block1.union(block2) Return the union region of block1 and block2. Coordinate type to be determined based on the inputs. block1.relative to(block2) Convert the absolute coordinates of block1 to relative coordinates to block2 block1.condition on(block2) Calculate the absolute coordinates of block1 given the canvas block2’s absolute coordinates block.crop image(image) Obtain the image segments in the block region",
"metadata": {
"text_as_html": "<table><thead><tr><th>block.pad(top, bottom, right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
"text_as_html": "<table><thead><tr><th>block.pad(top, bottom,</th><th>right,</th><th>left)</th><th>Enlarge the current block according to the input</th></tr></thead><tbody><tr><td>block.scale(fx, fy)</td><td></td><td></td><td>Scale the current block given the ratio in x and y direction</td></tr><tr><td>block.shift(dx, dy)</td><td></td><td></td><td>Move the current block with the shift distances in x and y direction</td></tr><tr><td>block1.is_in(block2)</td><td></td><td></td><td>Whether block] is inside of block2</td></tr><tr><td>block1. intersect (block2)</td><td></td><td></td><td>Return the intersection region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.union(block2)</td><td></td><td></td><td>Return the union region of blockl and block2. Coordinate type to be determined based on the inputs</td></tr><tr><td>block1.relative_to(block2)</td><td></td><td></td><td>Convert the absolute coordinates of block to relative coordinates to block2</td></tr><tr><td>block1.condition_on(block2)</td><td></td><td></td><td>Calculate the absolute coordinates of blockl given the canvas block2’s absolute coordinates</td></tr><tr><td>block. crop_image (image)</td><td></td><td></td><td>Obtain the image segments in the block region</td></tr></tbody></table>",
"filetype": "application/pdf",
"languages": [
"eng"
2 changes: 1 addition & 1 deletion unstructured/__version__.py
Original file line number Diff line number Diff line change
@@ -1 +1 @@
__version__ = "0.17.7" # pragma: no cover
__version__ = "0.17.8" # pragma: no cover