Skip to content

add fuzzing infrastructure, expand fuzz coverage, fix CI integration#1

Open
osyniakov wants to merge 1 commit into
mainfrom
claude/add-fuzzing-tests-3EM9A
Open

add fuzzing infrastructure, expand fuzz coverage, fix CI integration#1
osyniakov wants to merge 1 commit into
mainfrom
claude/add-fuzzing-tests-3EM9A

Conversation

@osyniakov

@osyniakov osyniakov commented Apr 10, 2026

Copy link
Copy Markdown
Owner

Summary

Adds cargo-fuzz targets and ClusterFuzzLite CI to satisfy the OSSF Scorecard
Fuzzing check (was scoring 0). Also fixes a real panic discovered during
fuzzing, adds four more fuzz targets covering the highest-priority security
surfaces, and mitigates a panic found in tantivy's query grammar parser.


Changes

Fuzzing infrastructure

  • quickwit/quickwit-fuzz/ — cargo-fuzz crate, member of the parent workspace
  • .clusterfuzzlite/Dockerfile — extends gcr.io/oss-fuzz-base/base-builder-rust; installs protoc 3.20.3 (Ubuntu 20.04's bundled 3.6.x predates --experimental_allow_proto3_optional)
  • .clusterfuzzlite/build.sh — builds all fuzz targets with RUSTFLAGS="--cfg tokio_unstable" cargo +nightly fuzz build --fuzz-dir quickwit-fuzz
  • .github/workflows/clusterfuzzlite.yml — code-change mode (push/PR); path filters skip frontend-only changes
  • .github/workflows/clusterfuzzlite-batch.yml — batch mode on nightly schedule (1 hour); grows the corpus and finds new bugs independently of PR CI

Bug fix found by fuzzer

quickwit-datetime/src/date_time_parsing.rsparse_timestamp_str panicked
on inputs containing multi-byte UTF-8 characters in the subsecond position.
The old code used str.len().min(9) (byte count) to slice, which can land
mid-character. Fixed with a byte-level scan that always stops at a valid char boundary.

Tantivy query grammar panic (mitigation)

fuzz_query_string found a panic in tantivy's UserInputLeaf::set_field
(query-grammar/src/user_input_ast.rs:51: "Exist query without a field isn't
allowed") triggered by bare exist-style inputs such as *:. This is a bug in
the upstream tantivy query grammar parser with no tracked issue yet.

Standard catch_unwind is insufficient here because libfuzzer-sys installs a
panic hook that calls abort() before stack unwinding begins. The workaround
temporarily replaces the hook with a no-op around the call, then restores it,
so the fuzzer continues rather than aborting.

Fuzz targets (8 total)

Target What it fuzzes Why
fuzz_query_dsl ElasticQueryDsl JSON deserialization Full ES query DSL from user HTTP requests
fuzz_query_string Lucene UserInputQuery parser Complex parser; found tantivy panic on *:
fuzz_datetime parse_date_time_str across all formats Found and fixed the subsecond UTF-8 panic
fuzz_doc_mapper DocMapper::doc_from_json_str Document ingestion path
fuzz_doc_mapper_config DocMapperBuilder deserialization + try_build() Schema definition via index-creation API; covers regex tokenizer ReDoS
fuzz_java_datetime_format StrptimeParser::from_java_datetime_format / from_strptime Format-string parsing reachable from ES range query "format" field
fuzz_otlp_spans parse_otlp_spans_protobuf + parse_otlp_spans_json External tracing agents at gRPC boundary; recursive AnyValue has no depth limit
fuzz_otlp_logs parse_otlp_logs_protobuf + parse_otlp_logs_json External log collectors at gRPC boundary; same recursion risk

Other

  • .dockerignore — added !.clusterfuzzlite/ exception (was excluded by **/.*)
  • quickwit/Cargo.toml — added quickwit-fuzz to workspace members
  • quickwit/deny.toml — NCSA license exception for libfuzzer-sys
  • LICENSE-3rdparty.csv — added arbitrary and libfuzzer-sys entries

Test plan

  • All 8 fuzz targets build: RUSTFLAGS="--cfg tokio_unstable" cargo +nightly fuzz build --fuzz-dir quickwit-fuzz
  • All 8 targets run without crashes or panics
  • fuzz_datetime no longer panics on multi-byte UTF-8 subsecond inputs
  • fuzz_query_string no longer aborts on *: and similar tantivy-panicking inputs
  • ClusterFuzzLite code-change CI passes end-to-end
  • Frontend-only PRs skip the fuzzing workflow (path filter)
  • Nightly batch workflow runs independently of PR CI
  • OSSF Scorecard detects google/clusterfuzzlite action → Fuzzing score > 0

https://claude.ai/code/session_01PKpEBTpgHSndurjdPbJodB

@osyniakov osyniakov changed the title add fuzzing infrastructure and fix datetime parsing panic add fuzzing infrastructure, fix datetime panic, expand fuzz coverage Apr 12, 2026
@osyniakov osyniakov changed the title add fuzzing infrastructure, fix datetime panic, expand fuzz coverage add fuzzing infrastructure, expand fuzz coverage, fix CI integration Apr 12, 2026
osyniakov pushed a commit that referenced this pull request Jun 30, 2026
…ming engine (default off, no-op rollout) (quickwit-oss#6441)

* test(merge): engine-parity tests + share MS-7 serial lock

Adds `merge::tests::parity` with two tests that run the same realistic
input fixture through both `merge_sorted_parquet_files` (in-memory
engine) and `execute_merge_operation` (streaming engine over the same
`LocalFileByteSource` the executor uses in production), then asserts
row-by-row equivalence on every visible column. These gate the upcoming
YAML flag that flips regular merges to the streaming engine: parity
must hold before the default is flipped in production.

The streaming engine writes a process-global atomic
(`PEAK_BODY_COL_PAGE_CACHE_LEN`) that the MS-7 tests reset-then-read.
Any test that runs a streaming merge must serialise against MS-7 or
inflate its readings. Move `ms7_serial_lock` from the streaming-tests
submodule to module scope (still `#[cfg(test)] pub(crate)`) so the new
parity tests acquire the same lock.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(merge-executor): YAML flag to route regular merges through the streaming engine

Wires the streaming Parquet merge engine into the regular (non-promotion)
merge path behind a node-level YAML flag,
`indexer.parquet_merge_use_streaming_engine`, defaulted to false. When
true, `ParquetMergeExecutor::handle` runs every merge through
`execute_merge_operation` (the column-major, page-bounded streaming
engine) instead of the in-memory `merge_sorted_parquet_files`.
Promotion merges (`target_prefix_len_override.is_some()`) continue to
take the streaming path unconditionally — the in-memory engine can't
handle mixed `rg_partition_prefix_len` inputs.

The in-memory engine stays in place as the runtime fallback. If the
streaming engine hits a bug in production, an operator can flip the
flag back to `false` via YAML without redeploying. Once the streaming
path has soaked, the fallback branch and `merge_sorted_parquet_files`
itself can be removed.

The flag is plumbed `IndexerConfig` → `IndexingService` →
`ParquetMergePipelineParams` → `ParquetMergeExecutor::new`, and exercised
end-to-end by the engine parity tests in the previous commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(config): exercise parquet_merge_use_streaming_engine in YAML/JSON/TOML fixtures

Adds the new flag to the YAML, JSON, and TOML node-config test fixtures
and bumps the expected `IndexerConfig` in `node_config_parse_*` to
`parquet_merge_use_streaming_engine: true`. Catches parse / serde
regressions on the field — e.g., a rename or a default-fn typo would
fail the test instead of silently parsing as `false`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(merge-pipeline): end-to-end streaming-engine flag verification

Adds `test_merge_pipeline_end_to_end_with_streaming_engine_flag`, an
integration test that runs the full actor chain (planner → downloader
→ executor → uploader → publisher) with
`ParquetMergePipelineParams::use_streaming_engine = true`. Asserts:

1. Publish fired with the right replaced_split_ids (merge ran
   end-to-end through the executor).
2. `PEAK_BODY_COL_PAGE_CACHE_LEN > 0` after the merge. The streaming
   engine increments this on every body-col page assembly; the
   in-memory engine never touches it. Non-zero is direct evidence
   the streaming path executed — not a silent fallback to in-memory.
3. The merge output row count and metric names are correct.

To make assertion (2) work cross-crate, exposes
`PEAK_BODY_COL_PAGE_CACHE_LEN` as `pub` under
`#[cfg(any(test, feature = "testsuite"))]`. The visibility widening is
test-only — production builds never see the symbol.

This is the closest analog to the sesh-mode "production-path" rule
that is feasible today: the metrics pipeline's OTLP gRPC ingest path
is not yet wired into `quickwit-serve`, so the closest end-to-end
test is the actor-chain integration test that this PR adds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* docs(adr): record dual Parquet merge engines as deviation #1

Captures the intentional, time-bounded divergence from ADR-003 §4
introduced by the streaming-engine wire-in: two engines coexist in
production behind `IndexerConfig::parquet_merge_use_streaming_engine`,
with the in-memory engine retained as the runtime fallback.

Documents:
- The ADR-003 §4 quote the deviation diverges from (page-granular
  streaming, bounded memory).
- The current dual-engine implementation and routing logic.
- Why this exists (production safety, staged rollout, parity is
  strong-but-not-total).
- Explicit exit criteria: default flipped to `true`, ≥ 2-week
  production soak with no merge-correctness incidents, no rollback.
  When met, a follow-up PR deletes the in-memory branch and engine,
  the flag, and the parity tests.

This is the first deviation recorded under the EVOLUTION.md framework.
Indexes the doc in `deviations/README.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(merge-pipeline): share full correctness contract between both engine tests

Extracts the steps-5-through-8 assertions (replaced_split_ids, staged
metadata, Parquet file content, Parquet KV headers) into
`assert_cpu_mem_merge_outputs_correct` and calls it from both
`test_merge_pipeline_end_to_end` (in-memory engine) and
`test_merge_pipeline_end_to_end_with_streaming_engine_flag` (streaming
engine). The streaming-engine test had been doing only a small subset
of the checks — row count and metric names. It now runs the full
contract: time_range, num_merge_ops, sort_fields, row_keys_proto,
zonemap_regexes, low_cardinality_tags, all 100 timestamps,
sorted_series monotonicity, cpu/mem sort-order semantics, and every
`qh.*` Parquet KV header.

By construction both engines must produce a file that satisfies the
same contract — the helper is the executable parity between engines
at the pipeline-integration level, complementing the column-level
parity tests in `quickwit-parquet-engine::merge::tests::parity`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(merge-pipeline,merge-engine): multi-metric + multi-RG + m:n disjointness

Expands test coverage along three axes the existing helpers didn't hit:

1. **Multi-input, multi-metric pipeline tests** (new file
   `parquet_merge_pipeline_multi_metric_test.rs`). Three inputs, each
   carrying three metrics with overlapping per-metric timeseries IDs
   and overlapping-but-distinct timestamps — the merge must
   row-by-row interleave across all three inputs. Output writer uses
   `row_group_size = 50` so the 180-row merge output breaks into
   four row groups, exercising the writer's multi-RG path in both
   engines. Both engine variants (in-memory + streaming) covered.
   Streaming-engine test asserts `PEAK_BODY_COL_PAGE_CACHE_LEN > 0`
   to confirm the flag routed through the streaming path.

2. **Engine-level multi-output contract** in
   `merge::tests::parity::assert_engine_parity`. Beyond the existing
   engine-vs-engine column equivalence, every parity test now also
   verifies on the in-memory engine's outputs (equivalent to the
   streaming engine's): sum of per-output row counts equals total
   input rows, each output internally monotonic on `sorted_series`,
   and across outputs the partition is disjoint (no two outputs
   share any `sorted_series` value). This is the m:n
   non-overlap contract.

3. **Multi-metric overlapping-input m:n** test
   `parity_multi_metric_overlapping_inputs_multi_output` exercises
   the strengthened contract with three inputs × three metrics where
   per-metric keyspaces overlap across inputs. n = 3 outputs target.

Honest scope note in the new pipeline test module's doc: the actor
pipeline today hardcodes `num_outputs = 1` in `ParquetMergeExecutor`,
so n > 1 is not reachable end-to-end through the actor system. The
new engine-level test covers the n > 1 correctness contract for now;
when the executor is taught to accept `num_outputs > 1` from the
merge policy, the pipeline tests can grow an n > 1 variant.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* feat(merge-executor): compute num_outputs from target_split_size_bytes

Replaces the hardcoded `MergeConfig { num_outputs: 1, ... }` in
`ParquetMergeExecutor::handle` with a per-merge computation:

    num_outputs = max(1, ceil(total_input_bytes / target_split_size_bytes))

So a merge that ingests more than one target's worth of data spreads
across multiple output files; merges that fit in one target keep
producing a single output (preserving today's behavior for the
common case). The engine clamps the request to the number of
`sorted_series` boundaries actually available, so the value is an
upper bound, not an exact count.

Plumbing: `IndexerConfig` already carries `target_split_size_bytes`
in `ParquetMergePolicyConfig`. Pass that through
`ParquetMergePipelineParams.target_split_size_bytes` →
`ParquetMergeExecutor::new`. Default for tests:
`256 * 1024 * 1024` (matches the production default).

Latent multi-output bug fixed at the same time: with n>1, the
executor used to assign the planner-supplied `merge_split_id` to
**every** output split, which would have collided on the rename to
`{split_id}.parquet`. First output keeps the planner ID for
observability continuity; subsequent outputs use the fresh IDs
generated by `merge_parquet_split_metadata`.

Also exposes `quickwit_parquet_engine::merge::streaming::ms7_serial_lock`
as `pub` under the `testsuite` feature so cross-crate streaming tests
(in `quickwit-indexing`) can serialise against the same global lock
the in-crate MS-7 tests use. The streaming engine writes to a
process-global atomic on every merge — without shared locking, the
existing pipeline streaming-engine test races `store(0)` against
other tests' merges. Adds the appropriate
`#[allow(clippy::await_holding_lock)]` to the in-crate
`test_merge_pipeline_end_to_end_with_streaming_engine_flag` to
match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* test(merge-pipeline): bonus — prefix_len=1 multi-RG inputs + m:n outputs

Adds the bonus scenario: three multi-metric inputs each written with
`rg_partition_prefix_len = 1` and one row group per distinct
metric_name (via `row_group_size = ROWS_PER_METRIC_PER_INPUT` so the
writer flushes at every metric boundary after sorting). Merged with a
small `ParquetMergePipelineParams::target_split_size_bytes = 500`
that forces the executor's `num_outputs` calculation to ask the
engine for multiple outputs — exercising the m:n merge path now
reachable through the actor pipeline (PR's earlier commit removed
the `num_outputs = 1` hardcode).

Both engines covered:

- `test_prefix_aligned_multi_metric_three_input_multi_output_in_memory_engine`
- `test_prefix_aligned_multi_metric_three_input_multi_output_streaming_engine`

The streaming-engine variant also asserts
`PEAK_BODY_COL_PAGE_CACHE_LEN > 0` (under `ms7_serial_lock`) so a
silent fallback to the in-memory path would fail.

The shared assertion helper
`assert_three_input_three_metric_multi_output_correct` checks the
m:n contract end-to-end at the pipeline level:

- All three input splits replaced.
- ≥ 2 output splits staged (proves splitting happened).
- Sum of per-output row counts = total input rows.
- Each output internally monotonic on `sorted_series`.
- Across outputs, the `sorted_series` partition is disjoint — no
  two outputs share any key, which is the "non-overlapping output"
  contract the engine promises.
- Union of metric_names / services across outputs = full input set.
- Every output has `num_merge_ops = 1`, `row_keys_proto`, and a
  `metric_name` zonemap regex.

To pin the test to exactly one merge (not a cascade of merges over
the now-multiple staged outputs), `make_pipeline_params` now takes
`max_merge_ops` and the bonus tests set it to `1`: outputs land at
`num_merge_ops = 1`, equal to the policy ceiling, and the planner
refuses to merge them again. The existing n=1 tests stay at 5
(headroom — they produce a single output that can't trigger another
merge anyway, since `merge_factor = 3`).

Updates the module doc to drop the now-stale scope note about m:n
not being reachable through the pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* style: nightly rustfmt + drop useless borrows in assert!

Reformats doc comments / format strings under nightly rustfmt
(`wrap_comments`, `format_strings`), and removes two redundant `&` in
`assert!` arguments flagged by clippy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Add quickwit-fuzz crate with 8 fuzz targets covering query parsing,
  datetime handling, doc mapping, OTLP log/span ingestion
- Add .clusterfuzzlite/Dockerfile and build.sh for ClusterFuzzLite builds;
  install protoc 3.20.3 to satisfy proto3 optional field requirements
- Add clusterfuzzlite.yml workflow (code-change mode, 30s) for PR regression
  checks, and clusterfuzzlite-batch.yml (batch mode, 3600s) on nightly schedule
- Fix fuzz_query_string: upstream tantivy panics on bare exists queries like
  '*:' (a bug in the tantivy query grammar parser with no tracked issue yet);
  work around by replacing the libfuzzer-sys abort hook with a no-op before
  catch_unwind so the known panic is caught rather than aborting the fuzzer
- Fix date_time_parsing: subsecond digit slicing used char-boundary arithmetic
  that could panic on non-ASCII input; switch to byte-level position scan
- Allow NCSA license exception in deny.toml for libfuzzer-sys (dev-only, never shipped)
- Add libfuzzer-sys and arbitrary to LICENSE-3rdparty.csv
- Expose .clusterfuzzlite/ directory in .dockerignore (was excluded by **/.*
  wildcard)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01PKpEBTpgHSndurjdPbJodB
@osyniakov osyniakov force-pushed the claude/add-fuzzing-tests-3EM9A branch from e528b6e to 4b5da4b Compare June 30, 2026 19:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants