
Commit 8daa154

fix: ndjson file type detection (#4349)
This PR fixes a bug where NDJSON detection misclassifies multi-line single JSON files as NDJSON.

---

> [!NOTE]
> **Medium Risk**
> Changes JSON/NDJSON detection heuristics and `detect_filetype` routing, which can affect which partitioner is invoked for JSON-like inputs. The logic is stricter and covered by new edge-case tests, but misclassification could still impact downstream parsing behavior.
>
> **Overview**
> Fixes NDJSON file-type detection so **multi-line single JSON objects** (including `.json` and `.ipynb` notebook payloads) are no longer misrouted to `partition_ndjson`, where they crashed.
>
> Updates `is_ndjson_processable` to require the *first line* to independently parse as a JSON object (with special handling for potentially truncated long single-line records), adds a bounded read helper (`json_disambiguation_text`) for disambiguation beyond the 4KB `text_head`, and changes JSON/NDJSON disambiguation to default to `FileType.JSON` when NDJSON criteria aren’t met.
>
> Adds a focused test suite for these NDJSON edge cases, bumps version to `0.22.27`, updates the changelog, and makes CI apt installs more resilient by wrapping them in a retry helper.
>
> <sup>Reviewed by [Cursor Bugbot](https://cursor.com/bugbot) for commit e82dbd7. Bugbot is set up for automated code reviews on this repo. Configure [here](https://www.cursor.com/dashboard/bugbot).</sup>
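The first-line rule described in the overview can be sketched roughly as follows. This is a simplified illustration, not the code from the commit: `looks_like_ndjson` is a hypothetical name, and the sketch assumes the caller already holds the text as a `str`.

```python
import json


def looks_like_ndjson(text: str, allow_truncated_single_line: bool = False) -> bool:
    """Hypothetical stand-in for the first-line check described in the overview."""
    text = text.lstrip()
    if not text.startswith("{"):
        return False

    # Split off the first line; `newline` is "" when the input has no newline at all.
    first_line, newline, _rest = text.partition("\n")
    try:
        return isinstance(json.loads(first_line), dict)
    except json.JSONDecodeError:
        # Only a single-line input may be a long record cut short by a bounded read.
        return allow_truncated_single_line and not newline


print(looks_like_ndjson('{"a": 1}\n{"b": 2}\n'))      # True: first line is a full object
print(looks_like_ndjson('{\n "id": "Sample-1"\n}'))   # False: first line is just "{"
print(looks_like_ndjson('[{"a": 1}]'))                # False: array, not an object per line
```

A pretty-printed object fails because its first line is only a fragment, which is the property the stricter check relies on.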
1 parent 199f255 commit 8daa154

5 files changed

Lines changed: 269 additions & 30 deletions


.github/workflows/ci.yml

Lines changed: 56 additions & 17 deletions
@@ -85,11 +85,24 @@ jobs:
         UNS_API_KEY: ${{ secrets.UNS_API_KEY }}
         TESSERACT_VERSION: "5.5.1"
       run: |
-        sudo apt-get update
-        sudo apt-get install -y libmagic-dev poppler-utils libreoffice
-        sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
-        sudo apt-get update
-        sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
+        retry() {
+          local n=1 max=5 delay=15
+          while true; do
+            "$@" && return 0
+            if (( n >= max )); then
+              echo "Command failed after $n attempts: $*" >&2
+              return 1
+            fi
+            echo "Attempt $n/$max failed for: $*. Retrying in ${delay}s..." >&2
+            sleep "$delay"
+            n=$((n+1))
+          done
+        }
+        retry sudo apt-get update
+        retry sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+        retry sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
+        retry sudo apt-get update -o "APT::Update::Error-Mode=any"
+        retry sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
         tesseract --version
         installed_tesseract_version=$(tesseract --version | grep -oP '(?<=tesseract )\d+\.\d+\.\d+')
         if [ "$installed_tesseract_version" != "${{env.TESSERACT_VERSION}}" ]; then
@@ -161,11 +174,24 @@ jobs:
         uv sync --locked ${{ matrix.uv-extras }} --group test
     - name: Install system dependencies
       run: |
-        sudo apt-get update
-        sudo apt-get install -y libmagic-dev poppler-utils libreoffice
-        sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
-        sudo apt-get update
-        sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
+        retry() {
+          local n=1 max=5 delay=15
+          while true; do
+            "$@" && return 0
+            if (( n >= max )); then
+              echo "Command failed after $n attempts: $*" >&2
+              return 1
+            fi
+            echo "Attempt $n/$max failed for: $*. Retrying in ${delay}s..." >&2
+            sleep "$delay"
+            n=$((n+1))
+          done
+        }
+        retry sudo apt-get update
+        retry sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+        retry sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
+        retry sudo apt-get update -o "APT::Update::Error-Mode=any"
+        retry sudo apt-get install -y tesseract-ocr tesseract-ocr-kor
         tesseract --version
     - name: Test
       env:
@@ -237,13 +263,26 @@ jobs:
         OCR_AGENT: "unstructured.partition.utils.ocr_models.tesseract_ocr.OCRAgentTesseract"
         CI: "true"
       run: |
-        sudo apt-get update
-        sudo apt-get install -y libmagic-dev poppler-utils libreoffice
-        sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
-        sudo apt-get update
-        sudo apt-get install -y tesseract-ocr
-        sudo apt-get install -y tesseract-ocr-kor
-        sudo apt-get install diffstat
+        retry() {
+          local n=1 max=5 delay=15
+          while true; do
+            "$@" && return 0
+            if (( n >= max )); then
+              echo "Command failed after $n attempts: $*" >&2
+              return 1
+            fi
+            echo "Attempt $n/$max failed for: $*. Retrying in ${delay}s..." >&2
+            sleep "$delay"
+            n=$((n+1))
+          done
+        }
+        retry sudo apt-get update
+        retry sudo apt-get install -y libmagic-dev poppler-utils libreoffice
+        retry sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
+        retry sudo apt-get update -o "APT::Update::Error-Mode=any"
+        retry sudo apt-get install -y tesseract-ocr
+        retry sudo apt-get install -y tesseract-ocr-kor
+        retry sudo apt-get install -y diffstat
         tesseract --version
         uv run --no-sync ./test_unstructured_ingest/test-ingest-src.sh

CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -1,3 +1,9 @@
+## 0.22.27
+
+### Fixes
+
+- **Stop misclassifying multi-line JSON files as NDJSON**: `is_ndjson_processable` previously returned `True` for any text starting with `{`, so `.json` and `.ipynb` files containing a single multi-line JSON object (e.g. Jupyter notebooks) were routed to `partition_ndjson`, which then crashed in its `splitlines()`-based parser.
+
 ## 0.22.26

 ### Enhancements
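
To illustrate the failure mode named in the 0.22.27 changelog entry, here is a minimal sketch of a line-by-line parser. `parse_ndjson_by_lines` is a hypothetical stand-in and is assumed to behave only roughly like the `splitlines()`-based parsing in `partition_ndjson`; it is not the library's code.

```python
import json


def parse_ndjson_by_lines(text: str) -> list[dict]:
    """Simplified, hypothetical line-by-line parser (not the library's implementation)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


ndjson_text = '{"a": 1}\n{"b": 2}\n'
pretty_json_text = '{\n  "cells": [],\n  "nbformat": 4\n}\n'

# Real NDJSON: every line is a complete JSON object.
records = parse_ndjson_by_lines(ndjson_text)

# Pretty-printed single object: the first fragment is just "{", which is not valid JSON.
try:
    parse_ndjson_by_lines(pretty_json_text)
    multiline_failed = False
except json.JSONDecodeError:
    multiline_failed = True

print(records, multiline_failed)
```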

test_unstructured/file_utils/test_filetype.py

Lines changed: 91 additions & 0 deletions
@@ -26,6 +26,7 @@
     _ZipFileDetector,
     detect_filetype,
     is_json_processable,
+    is_ndjson_processable,
 )
 from unstructured.file_utils.model import FileType, create_file_type

@@ -538,6 +539,96 @@ def and_it_affirms_JSON_is_NOT_an_array_of_objects_from_text():
     assert is_json_processable(file_text=text) is False


+# ================================================================================================
+# Describe `is_ndjson_processable()`
+# ================================================================================================
+
+
+def it_recognizes_real_ndjson_with_multiple_object_lines():
+    assert is_ndjson_processable(file_text='{"a": 1}\n{"b": 2}\n') is True
+
+
+def it_recognizes_single_line_ndjson_with_trailing_newline():
+    assert is_ndjson_processable(file_text='{"a": 1}\n{"b": 2}') is True
+
+
+def it_rejects_a_multiline_single_json_object():
+    # The bug: was True; now must be False so partition_ndjson does not get this payload.
+    text = '{\n "id": "Sample-1",\n "name": "Sample 1"\n}'
+    assert is_ndjson_processable(file_text=text) is False
+
+
+def it_accepts_a_single_line_json_object_as_one_record_ndjson():
+    """A single-line JSON object is a valid 1-record NDJSON payload.
+
+    `partition_ndjson` parses it via `splitlines()` and yields one record. Existing callers
+    rely on this; only multi-line single objects are pathological.
+    """
+    assert is_ndjson_processable(file_text='{"a": 1}') is True
+
+
+def it_rejects_a_json_array_of_objects():
+    assert is_ndjson_processable(file_text='[{"a": 1}, {"b": 2}]') is False
+
+
+def it_rejects_whitespace_only():
+    assert is_ndjson_processable(file_text=" \n ") is False
+
+
+def it_rejects_garbage_text():
+    assert is_ndjson_processable(file_text="not json at all") is False
+
+
+def it_rejects_a_jupyter_notebook_payload():
+    """Jupyter notebooks are a single multi-line JSON object — must not route to NDJSON."""
+    notebook_text = (
+        "{\n"
+        ' "cells": [],\n'
+        ' "metadata": {"kernelspec": {"name": "python3"}},\n'
+        ' "nbformat": 4,\n'
+        ' "nbformat_minor": 5\n'
+        "}\n"
+    )
+    assert is_ndjson_processable(file_text=notebook_text) is False
+
+
+def it_rejects_ndjson_first_line_is_a_bare_value_not_an_object():
+    # NDJSON of bare values is uncommon and partition_ndjson expects dicts. Be strict.
+    assert is_ndjson_processable(file_text="1\n2\n3\n") is False
+
+
+def it_routes_not_unstructured_payload_json_away_from_ndjson_via_detect_filetype():
+    file_type = detect_filetype(example_doc_path("not-unstructured-payload.json"))
+    # A multi-line single-object JSON file used to get classified as NDJSON. It should now end up
+    # as JSON (and partition_json will reject it with the existing schema-mismatch error).
+    assert file_type == FileType.JSON
+
+
+def it_classifies_ndjson_correctly_when_first_record_exceeds_text_head_prefix():
+    """NDJSON whose first record is longer than the 4096-char text_head prefix.
+
+    `_disambiguate_json_file_type` reads past `text_head` to find the first newline, so the
+    heuristic must not rely on the first record fitting in the prefix. Both single-record and
+    multi-record cases are exercised — both must round-trip as `FileType.NDJSON`.
+    """
+    big_value = "x" * 5000
+    payload_one_record = json.dumps({"text": big_value, "type": "NarrativeText"}).encode()
+    payload_many_records = (
+        payload_one_record + b"\n" + json.dumps({"text": "tiny", "type": "Title"}).encode()
+    )
+
+    assert detect_filetype(file=io.BytesIO(payload_one_record)) == FileType.NDJSON
+    assert detect_filetype(file=io.BytesIO(payload_many_records)) == FileType.NDJSON
+    assert is_ndjson_processable(file=io.BytesIO(payload_one_record)) is True
+
+
+def it_classifies_multiline_json_as_json_when_first_newline_exceeds_text_head_prefix():
+    big_value = "x" * 5000
+    payload = ('{"text": "' + big_value + '",\n "type": "NarrativeText"\n}').encode()
+
+    assert detect_filetype(file=io.BytesIO(payload)) == FileType.JSON
+
+
 # ================================================================================================
 # MODULE-LEVEL FIXTURES
 # ================================================================================================

unstructured/__version__.py

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.22.26"  # pragma: no cover
+__version__ = "0.22.27"  # pragma: no cover

unstructured/file_utils/filetype.py

Lines changed: 115 additions & 12 deletions
@@ -38,7 +38,7 @@
 import tempfile
 import zipfile
 from functools import cached_property
-from typing import IO, Callable, Iterator, Optional
+from typing import IO, Callable, Iterator, Optional, cast

 import filetype as ft
 from olefile import OleFileIO
@@ -54,6 +54,9 @@
 from unstructured.partition.common.metadata import set_element_hierarchy
 from unstructured.utils import get_call_args_applying_defaults

+_JSON_DISAMBIGUATION_CHUNK_SIZE = 8192
+_JSON_DISAMBIGUATION_MAX_CHARS = 1024 * 1024
+
 try:
     importlib.import_module("magic")
     LIBMAGIC_AVAILABLE = True
@@ -136,19 +139,50 @@ def is_ndjson_processable(
     file: Optional[IO[bytes]] = None,
     file_text: Optional[str] = None,
     encoding: Optional[str] = "utf-8",
+    allow_truncated_single_line: bool = False,
 ) -> bool:
-    """True when file looks like a JSON array of objects.
+    """True when file looks like newline-delimited JSON objects.

-    Uses regex on a file prefix, so not entirely reliable but good enough if you already know the
-    file is JSON.
+    NDJSON is a sequence of one JSON value per line, conventionally an object on each line. A
+    payload that parses as a single JSON value (e.g. a multi-line `{...}` object or a `[...]`
+    array) is *not* NDJSON and must not be matched here, otherwise `partition_ndjson` will fail
+    later when it splits the text by lines and tries to parse each fragment.
     """
     exactly_one(filename=filename, file=file, file_text=file_text)

+    allow_truncated = allow_truncated_single_line
     if file_text is None:
-        file_text = _FileTypeDetectionContext.new(
+        file_text, allow_truncated = _FileTypeDetectionContext.new(
             file_path=filename, file=file, encoding=encoding
-        ).text_head
-    return file_text.lstrip().startswith("{")
+        ).json_disambiguation_text
+
+    text = file_text.lstrip()
+    if not text or not text.startswith("{"):
+        return False
+
+    newline_idx = text.find("\n")
+
+    if newline_idx == -1:
+        # Single-line input. A complete `{...}` parses as a dict and is treated as 1-record
+        # NDJSON (existing tests and `partition_ndjson` rely on this). When the caller knows this
+        # is a truncated first line from a JSON-like payload, a parse failure is still compatible
+        # with a long 1-record NDJSON payload.
+        try:
+            return isinstance(json.loads(text), dict)
+        except json.JSONDecodeError:
+            return allow_truncated
+
+    # Multi-line input. NDJSON requires each record to be on its own line, so the first line
+    # must independently parse as a JSON object. A pretty-printed single JSON object has its
+    # first line be just `{` (or similar fragment) which won't parse alone — that's how we
+    # distinguish it from real NDJSON.
+    first_line = text[:newline_idx].rstrip()
+    if not first_line:
+        return False
+    try:
+        return isinstance(json.loads(first_line), dict)
+    except json.JSONDecodeError:
+        return False


 class _FileTypeDetector:
@@ -224,12 +258,21 @@ def _file_type_from_content_type(self) -> FileType | None:

     @property
     def _disambiguate_json_file_type(self) -> FileType:
-        """Disambiguate JSON/NDJSON file-type based on file contents."""
-        if is_json_processable(file_text=self._ctx.text_head):
-            return FileType.JSON
-        if is_ndjson_processable(file_text=self._ctx.text_head):
+        """Disambiguate JSON/NDJSON file-type based on file contents.
+
+        NDJSON is detected first because it has the strictest signature (multiple JSON values
+        separated by newlines, with the first line independently parsable). Anything else that
+        libmagic flagged as JSON is classified as `FileType.JSON`; the JSON partitioner has its
+        own `is_json_processable` schema check and will reject non-conforming payloads with a
+        clear error.
+        """
+        file_text, allow_truncated_single_line = self._ctx.json_disambiguation_text
+        if is_ndjson_processable(
+            file_text=file_text,
+            allow_truncated_single_line=allow_truncated_single_line,
+        ):
             return FileType.NDJSON
-        raise ValueError("Unable to process JSON file")
+        return FileType.JSON

     @property
     def _file_type_from_guessed_mime_type(self) -> FileType | None:
@@ -553,13 +596,73 @@ def text_head(self) -> str:
             with open(file_path, encoding=encoding) as f:
                 return f.read(4096)

+    @cached_property
+    def json_disambiguation_text(self) -> tuple[str, bool]:
+        """Text prefix for JSON/NDJSON disambiguation and whether the first line was truncated."""
+
+        if file := self._file_arg:
+            file.seek(0)
+            content, first_line_truncated = self._read_until_newline_or_limit(file)
+            file.seek(0)
+            if isinstance(content, str):
+                return content, first_line_truncated
+            return content.decode(encoding=self.encoding, errors="ignore"), first_line_truncated
+
+        file_path = self.file_path
+        assert file_path is not None  # -- guaranteed by `._validate` --
+
+        try:
+            with open(file_path, encoding=self.encoding) as f:
+                content, first_line_truncated = self._read_until_newline_or_limit(f)
+                assert isinstance(content, str)
+                return content, first_line_truncated
+        except UnicodeDecodeError:
+            encoding, _ = detect_file_encoding(filename=file_path)
+            with open(file_path, encoding=encoding) as f:
+                content, first_line_truncated = self._read_until_newline_or_limit(f)
+                assert isinstance(content, str)
+                return content, first_line_truncated
+
     def _validate(self) -> None:
         """Raise if the context is invalid."""
         if self.file_path and not os.path.isfile(self.file_path):
             raise FileNotFoundError(f"no such file {self._file_path_arg}")
         if not self.file_path and not self._file_arg:
             raise ValueError("either `file_path` or `file` argument must be provided")

+    @staticmethod
+    def _read_until_newline_or_limit(file: IO) -> tuple[str | bytes, bool]:
+        """Read through the first newline, stopping at a bounded prefix if none is found."""
+        chunks: list[str | bytes] = []
+        chars_read = 0
+
+        while chars_read < _JSON_DISAMBIGUATION_MAX_CHARS:
+            chars_to_read = min(
+                _JSON_DISAMBIGUATION_CHUNK_SIZE,
+                _JSON_DISAMBIGUATION_MAX_CHARS - chars_read,
+            )
+            chunk = file.read(chars_to_read)
+            if not chunk:
+                return _FileTypeDetectionContext._join_text_chunks(chunks), False
+
+            newline = b"\n" if isinstance(chunk, bytes) else "\n"
+            newline_idx = chunk.find(newline)
+            if newline_idx != -1:
+                chunks.append(chunk[: newline_idx + 1])
+                return _FileTypeDetectionContext._join_text_chunks(chunks), False
+
+            chunks.append(chunk)
+            chars_read += len(chunk)
+
+        return _FileTypeDetectionContext._join_text_chunks(chunks), True
+
+    @staticmethod
+    def _join_text_chunks(chunks: list[str | bytes]) -> str | bytes:
+        """Join chunks without mixing text and bytes types."""
+        if chunks and isinstance(chunks[0], bytes):
+            return b"".join(cast(list[bytes], chunks))
+        return "".join(cast(list[str], chunks))
+

 class _OleFileDetector:
     """Detect and differentiate a CFB file, aka. "OLE" file.
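
The bounded first-line read added in `unstructured/file_utils/filetype.py` can be exercised in isolation with a bytes-only sketch like the one below. `read_first_line` is a hypothetical name, and the sketch assumes a binary stream only (the version in the diff also handles text streams). It shows why a record longer than the 4096-character `text_head` prefix can still be read whole, and when the truncation flag is set.

```python
import io
from typing import IO


def read_first_line(
    stream: IO[bytes], chunk_size: int = 8192, max_bytes: int = 1024 * 1024
) -> tuple[bytes, bool]:
    """Return (prefix, truncated): bytes through the first newline, or up to max_bytes."""
    chunks: list[bytes] = []
    total = 0
    while total < max_bytes:
        chunk = stream.read(min(chunk_size, max_bytes - total))
        if not chunk:
            return b"".join(chunks), False  # end of stream before the limit
        newline_idx = chunk.find(b"\n")
        if newline_idx != -1:
            chunks.append(chunk[: newline_idx + 1])
            return b"".join(chunks), False  # complete first line
        chunks.append(chunk)
        total += len(chunk)
    return b"".join(chunks), True  # limit reached with no newline seen


first_record = b'{"text": "' + b"x" * 5000 + b'"}'

# A 5012-byte first record followed by a second record: the whole first line comes back.
line, truncated = read_first_line(io.BytesIO(first_record + b'\n{"text": "tiny"}'))

# With an artificially small limit, the first line is cut and flagged as truncated.
short_prefix, was_cut = read_first_line(io.BytesIO(first_record), max_bytes=4096)

print(len(line), truncated, len(short_prefix), was_cut)
```

In the diff, the truncation flag is what lets `is_ndjson_processable` treat an unparsable single-line prefix as a possibly long one-record NDJSON payload instead of rejecting it.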
