[Connectors Python] Add max_text_document_size param to configs to keep big documents from being ingested (#4013)
## Closes elastic/search-team#14454
Introduces a hard per-document size cap for the Elasticsearch bulk sink
so that clusters are not overwhelmed by oversized **text** documents,
for example docs whose `body` was inflated by the Data Extraction
Service (which bypasses `service.max_file_download_size`) or
structured-only docs from API connectors with very long fields.
The cap is **scoped to non-binary documents**: docs that carry an
`_attachment` key (the binary path) remain governed by
`service.max_file_download_size` and are never affected by this cap.
The intent is to give operators one knob for the text/structured path
without changing how binary attachments flow.
### What changed
- New config option **`elasticsearch.bulk.max_text_document_size`**
  (MiB, default `3`, `0` disables, must be `>= 0`).
- `Sink._run`: every non-DELETE bulk op whose `doc` body has **no
  `_attachment` key** is measured against the cap; oversized docs are
  **skipped, logged at `WARNING`, and counted** as
  `docs_dropped_too_large`.
- Size measurement matches the actual **wire bytes** the ES client
  sends, mirroring `elastic_transport._serializer.JsonSerializer`:
  `len(json.dumps(op, ensure_ascii=False, separators=(",", ":")).encode("utf-8", "surrogatepass"))`.
  This avoids over-counting i18n / emoji content (default `json.dumps`
  uses `ensure_ascii=True`, escaping each non-ASCII character into 6+
  ASCII bytes, e.g. `é` → `\u00e9` measured as 6 bytes vs 2 UTF-8 bytes
  on the wire).
- The `_attachment` gate uses **key presence**, not value, so docs that
  pass an empty/None/falsy `_attachment` are still treated as the binary
  path and exempted.
- `Sink.__init__` rejects negative `max_text_document_size` with a clear
  `ValueError` at construction. Without this guard, a negative cap would
  be truthy and silently drop every non-attachment doc.
- `DELETE` ops are exempt (no body).
- Wired through `SyncOrchestrator.async_bulk` via
  `options["max_text_document_size"]`, defaulting to
  `DEFAULT_MAX_TEXT_DOCUMENT_SIZE` from `connectors.config`, and
  threaded into `_default_config()`.
- Added a commented entry + upgrade-warning paragraph in
  `config.yml.example`.
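
For reviewers who want to poke at the measurement, gate, and validation rules in isolation, here is a minimal standalone sketch. It is not the code in `Sink`: the helper names `wire_size`, `is_text_doc`, and `validate_cap` are hypothetical, and only the `json.dumps(...).encode(...)` expression is taken from the description above.

```python
import json


def wire_size(obj: object) -> int:
    """Byte length of compact UTF-8 JSON, using the expression quoted above."""
    compact = json.dumps(obj, ensure_ascii=False, separators=(",", ":"))
    return len(compact.encode("utf-8", "surrogatepass"))


def is_text_doc(doc: dict) -> bool:
    """Key presence, not truthiness: {'_attachment': None} is still binary path."""
    return "_attachment" not in doc


def validate_cap(max_text_document_size):
    """Reject negative caps; None and 0 are allowed and mean 'disabled'."""
    if max_text_document_size is not None and max_text_document_size < 0:
        raise ValueError(
            f"max_text_document_size must be >= 0, got {max_text_document_size}"
        )
    return max_text_document_size


# Subtract the two surrounding quote characters to isolate the payload.
escaped_bytes = len(json.dumps("é")) - 2  # default ensure_ascii=True -> \u00e9
wire_bytes = wire_size("é") - 2           # raw UTF-8 encoding of é
print(escaped_bytes, wire_bytes)          # 6 2
```

The `surrogatepass` error handler matters for lone surrogates, which a plain `.encode("utf-8")` would reject with `UnicodeEncodeError`.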
### Behavior
Default: documents whose serialized bulk-op JSON exceeds **3 MiB**
(UTF-8 wire bytes), and that have no `_attachment` key, are dropped
with a log line:
```
Dropping doc id=<id> index=<index> op=<op>: serialized text size <N>B exceeds elasticsearch.bulk.max_text_document_size (<M>B)
```
Set `elasticsearch.bulk.max_text_document_size: 0` to disable the cap
entirely and restore previous behavior.
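
The drop decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `Sink._run` code: `filter_oversized` is a hypothetical helper, and the op shape (`_op_type`, `_index`, `_id`, `doc`) is assumed for the example. Only the log format, the MiB unit, and the exemption rules come from this PR.

```python
import json
import logging

logger = logging.getLogger("sink_sketch")  # hypothetical logger name
MIB = 1024 * 1024


def filter_oversized(ops, max_text_document_size=3):
    """Return (kept_ops, dropped_count) for a batch of bulk ops.

    `max_text_document_size` is in MiB; None or 0 disables the cap.
    """
    if not max_text_document_size:
        return list(ops), 0

    cap_bytes = int(max_text_document_size * MIB)
    kept, dropped = [], 0
    for op in ops:
        doc = op.get("doc")
        # DELETE ops carry no body; docs with an `_attachment` key are binary path.
        exempt = (
            op.get("_op_type") == "delete"
            or not isinstance(doc, dict)
            or "_attachment" in doc
        )
        if not exempt:
            size = len(
                json.dumps(op, ensure_ascii=False, separators=(",", ":")).encode(
                    "utf-8", "surrogatepass"
                )
            )
            if size > cap_bytes:
                logger.warning(
                    "Dropping doc id=%s index=%s op=%s: serialized text size %dB "
                    "exceeds elasticsearch.bulk.max_text_document_size (%dB)",
                    op.get("_id"), op.get("_index"), op.get("_op_type"),
                    size, cap_bytes,
                )
                dropped += 1
                continue
        kept.append(op)
    return kept, dropped
```

With a 1 MiB cap and a batch holding one small doc, one 2 MiB text doc, one 2 MiB doc with `_attachment`, and one delete op, only the 2 MiB text doc would be dropped.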
### Out of scope (deferred to follow-ups)
This PR fixes the **DES-on text path** specifically. Known related
issues that are **not** addressed here and need separate PRs:
- Removing the `use_text_extraction_service` short-circuit in
  `is_file_size_within_limit` so DES-extracted file size is also bounded.
- Fixing connectors (e.g. GitLab) that pass `file_size=0` and bypass the
  metadata-based file-size check.
- Adding a streaming byte counter in `download_to_temp_file` so the
  actual download is bounded even when metadata is missing/wrong.
- Switching `bulk_size` / `chunk_mem_size` accounting from
  `pympler.asizeof` to exact serialized bytes (currently the cap and the
  chunk flush use different units).
- Feeding pre-serialized bytes to `client.bulk(...)` to skip
  transport-side re-serialization (this PR does serialize twice for
  non-`_attachment` docs: once to measure, once on the wire).
## Testing
### Unit tests
`tests/test_sink.py` (cap-specific):
- `test_sink_drops_doc_exceeding_max_text_document_size`: oversized text
  doc is dropped, warning string and counter pinned exactly.
- `test_sink_does_not_drop_doc_within_max_text_document_size`: small
  text doc passes through.
- `test_sink_max_text_document_size_disabled[None, 0]`: falsy cap
  disables.
- `test_sink_does_not_drop_delete_op_even_if_oversized`: DELETE exempt.
- `test_sink_drops_only_oversized_doc_in_mixed_batch`: only the
  oversized doc is dropped from a mixed batch; the rest are sent.
- `test_sink_does_not_drop_doc_with_attachment_even_if_oversized`:
  binary path exemption.
- `test_sink_attachment_gate_uses_key_presence_not_value[None, "", 0, False, []]`:
  gate is key-presence, not value-truthiness.
- `test_sink_drops_oversized_doc_when_attachment_key_absent`: explicit
  key-absent sanity.
- `test_sink_drops_doc_with_body_only_when_oversized`: DES-text shape
  (`body`, no `_attachment`).
- `test_sink_drops_structured_only_doc_when_oversized`: structured-only
  doc shape (no `body`, no `_attachment`).
- `test_sink_measures_serialized_size_in_wire_utf8_bytes`: regression
  for the i18n/emoji over-count; a 200 000 × `é` body fits the 1 MiB cap
  on the wire (~0.4 MiB UTF-8) but would have been over the cap
  (~1.15 MiB) under the old `json.dumps` default measurement.
- `test_sink_rejects_negative_max_text_document_size[-1, -3, -1024]`:
  negative cap is rejected at construction with `ValueError`.
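
The arithmetic behind the wire-byte regression case in the list above can be reproduced outside the test suite. This is a standalone sketch, not the test itself: it measures a bare `{"body": ...}` dict, whereas the real test measures a full bulk op whose exact shape is not shown here.

```python
import json

MIB = 1024 * 1024
doc = {"body": "é" * 200_000}  # hypothetical stand-in for the test document

# Default json.dumps: ensure_ascii=True turns each é into the 6-byte \u00e9.
escaped = len(json.dumps(doc))

# Wire measurement from this PR: compact separators, raw UTF-8 (2 bytes per é).
wire = len(
    json.dumps(doc, ensure_ascii=False, separators=(",", ":")).encode(
        "utf-8", "surrogatepass"
    )
)

print(f"escaped: {escaped / MIB:.2f} MiB, wire: {wire / MIB:.2f} MiB")
print("fits 1 MiB cap on the wire:", wire <= MIB)
print("would exceed cap if escaped:", escaped > MIB)
```

For the bare dict this comes to 1 200 012 escaped bytes against 400 011 wire bytes, in line with the approximate figures quoted in the test description.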
`tests/test_config.py`:
- `test_default_max_text_document_size` asserts the default is `3`.
All 119 cap+config tests pass.
### Functional smoke
The `dir` source ftest still completes end-to-end with the cap enabled:
```
DATA_SIZE=medium MAX_DURATION=1800 REFRESH_RATE=2 \
make -C app/connectors_service ftest NAME=dir
```
(Earlier runs in this PR exercised the older `pympler.asizeof`
measurement path; the current measurement is wire-byte and the log
format is the new `serialized text size ...` form.)
## Checklists
#### Pre-Review Checklist
- [x] this PR does NOT contain credentials of any kind
- [x] this PR has a meaningful title
- [x] this PR links to all relevant github issues
- [x] this PR has a thorough description
- [x] Covered the changes with automated tests
- [x] Tested the changes locally
- [x] Added a label for each target release version
- [ ] For bugfixes: backport safely to all minor branches still receiving patch releases
- [x] Considered corresponding documentation changes
- [x] Contributed any configuration settings changes to the configuration reference
#### Changes Requiring Extra Attention
- **Behavior change (default-on):** by default, non-binary docs whose
  serialized bulk-op JSON exceeds 3 MiB will be dropped (with a warning +
  `docs_dropped_too_large` counter) instead of being sent. Operators
  relying on ingesting very large structured/text docs must set
  `elasticsearch.bulk.max_text_document_size: 0` or raise the threshold.
  Binary attachments (`_attachment` present) are unchanged.
## Release Note
Add `elasticsearch.bulk.max_text_document_size` (MiB, default `3`, `0`
disables) to drop oversized **text** documents (docs without an
`_attachment` key) before they hit the Elasticsearch bulk API,
protecting clusters from being overwhelmed by huge extracted/structured
documents. The size is measured against the actual UTF-8 wire bytes the
ES client sends. Dropped docs are logged at `WARNING` and counted as
`docs_dropped_too_large`. Binary attachments remain governed by
`service.max_file_download_size`.
**Upgrade note:** earlier connectors versions had no per-document text
cap, so a sync that previously succeeded with multi-MiB structured/text
docs may begin reporting `docs_dropped_too_large` after upgrade. Set
`elasticsearch.bulk.max_text_document_size: 0` to preserve previous
behavior, or raise the value to fit your largest legitimate doc.
---------
Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com>