Commit 4b52e23

Merge pull request #865 from ClickHouse/add-quickwit-entry
Add Quickwit entry
2 parents 62d1944 + 3216a67 commit 4b52e23

15 files changed

Lines changed: 794 additions & 1 deletion

README.md

Lines changed: 1 addition & 1 deletion
@@ -311,7 +311,7 @@ Please help us add more systems and run the benchmarks on more types of VMs:
 - [ ] MS SQL Server with Column Store Index (without publishing)
 - [ ] OceanBase
 - [ ] Planetscale (without publishing)
-- [ ] Quickwit
+- [x] Quickwit
 - [ ] Redshift Spectrum
 - [ ] Seafowl
 - [ ] ShitholeDB

quickwit/README.md

Lines changed: 49 additions & 0 deletions
@@ -0,0 +1,49 @@
+# Quickwit
+
+[Quickwit](https://quickwit.io) is a Rust-based search engine for log analytics, built on top of [Tantivy](https://github.com/quickwit-oss/tantivy). It exposes an Elasticsearch-compatible REST API for ingestion and search, but does not implement an SQL endpoint, so this benchmark uses the native Elasticsearch query DSL directly.
+
+## Methodology
+
+Infrastructure:
+- Single-node Quickwit **v0.9.0-rc** (Docker `quickwit/quickwit:v0.9.0-rc`).
+
+Stable **0.8.2** is missing `cardinality`, `wildcard`, and several other features the benchmark relies on, so we use the v0.9 release candidate. The v0.9 line is still unreleased — as soon as a stable v0.9.x ships, bump `QW_IMAGE` in `benchmark.sh`.
+
+Index configuration (`index_config.yaml`):
+- All scalar fields declared with `fast: true` so they can participate in aggregations and sorts.
+- Keyword-like text fields use the `raw` tokenizer with the `raw` fast-field normalizer to mimic Elasticsearch's `keyword` mapping.
+- `EventTime` is set as the index's timestamp field, providing time-based pruning.
+
+Ingestion (`benchmark.sh`):
+- Streams `hits.json.gz` decompressed into `quickwit tool local-ingest`, which builds splits directly on local storage. We do **not** use the Elasticsearch bulk endpoint: v0.9's sharded ingest-v2 API caps single-node throughput to a few MB/s in our testing and stalls waiting for shards to scale. `local-ingest` bypasses the ingest pipeline entirely.
+- The server picks up the new splits on its next metastore poll (default 30 s).
+
+Queries (`queries.json`):
+- Each query in `queries.sql` is hand-translated to the Elasticsearch DSL on the corresponding line of `queries.json`, and submitted to `/api/v1/_elastic/hits/_search`.
+- Timing is taken from the `took` field returned by Quickwit (milliseconds, engine-internal).
+- Queries that are not expressible in Quickwit's DSL are recorded as `null`.
+
+## Unsupported queries
+
+The following ClickBench queries cannot currently be expressed in Quickwit's Elasticsearch-compatible DSL and are reported as `null`:
+
+| Q | Reason |
+|----|-----------------------------------------------------------------------|
+| 19 | `extract(minute FROM …)` — no scripted/runtime fields |
+| 26 | `ORDER BY` on text field — `sort by field on type text is currently not supported` |
+| 27 | `ORDER BY` on text field |
+| 28 | `AVG(length(URL))` — no scripted/runtime fields |
+| 29 | `REGEXP_REPLACE` — not supported |
+| 30 | `SUM(col + N)` — no scripted aggregations |
+| 36 | `ClientIP - N` — no scripted aggregations |
+| 40 | `CASE WHEN …` — no scripted/runtime fields |
+
+All other 35 queries run through the native Elasticsearch DSL, including `cardinality` (Q5/6/9/10/11/12/14) and `wildcard` (Q21/22/23/24).
+
+## Running
+
+```bash
+bash benchmark.sh
+```
+
+Installs Docker and Quickwit, creates the index, downloads `hits.json.gz`, runs `local-ingest`, then runs `run.sh` to time each query three times with caches dropped between runs.
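
The query translation and timing described under Methodology can be illustrated with a simple filter-and-count query such as `SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0`. The following is a minimal sketch, not the contents of `queries.json` or `run.sh`, neither of which is shown here. The request body, the use of `track_total_hits`, the `QW_URL` variable, and the helper names `q_count_body`, `took_to_seconds`, and `run_query` are assumptions for illustration. Only the endpoint path and the `took` field (milliseconds) come from the README above.

```shell
#!/bin/bash
# Illustrative sketch only: helper names and the request body are assumptions,
# not taken from queries.json or run.sh.

QW_URL="${QW_URL:-http://localhost:7280}"

# Elasticsearch DSL for: SELECT COUNT(*) FROM hits WHERE AdvEngineID <> 0
# "size": 0 suppresses hit documents, since only the total count is needed.
q_count_body() {
    cat <<'EOF'
{"size": 0, "track_total_hits": true, "query": {"bool": {"must_not": [{"term": {"AdvEngineID": 0}}]}}}
EOF
}

# Convert an engine-reported `took` value (milliseconds) to seconds.
took_to_seconds() {
    awk -v ms="$1" 'BEGIN { printf "%.3f\n", ms / 1000 }'
}

# POST a query body to the Elasticsearch-compatible search endpoint and print
# the elapsed time in seconds, or "null" if no numeric `took` came back.
# Requires curl and jq.
run_query() {
    local took
    took=$(curl -sS -X POST "$QW_URL/api/v1/_elastic/hits/_search" \
        -H 'Content-Type: application/json' \
        --data-binary "$1" | jq -r '.took')
    if [[ "$took" =~ ^[0-9]+$ ]]; then
        took_to_seconds "$took"
    else
        echo "null"
        return 1
    fi
}
```

With a server running, `run_query "$(q_count_body)"` would print a single number of seconds. Because `took` is measured inside the engine, it excludes HTTP and client overhead, which is worth keeping in mind when comparing against systems timed from the client side.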

quickwit/benchmark.sh

Lines changed: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
+#!/bin/bash
+set -eo pipefail
+
+export DEBIAN_FRONTEND=noninteractive
+
+# Install prerequisites quietly
+sudo apt-get update -qq >/dev/null
+sudo apt-get install -y -qq wget curl jq bc docker.io >/dev/null
+sudo systemctl start docker
+
+# We use the Quickwit v0.9 release candidate. Stable v0.8.2 is missing
+# `cardinality`, `wildcard`, and several other features the benchmark relies
+# on; only the v0.9 line (still unreleased as of writing) provides them.
+QW_IMAGE="quickwit/quickwit:v0.9.0-rc"
+sudo docker pull -q "$QW_IMAGE" >/dev/null
+
+# Quickwit's data directory (shared between the server and the local-ingest
+# container).
+QW_DATA="$(pwd)/qwdata"
+sudo rm -rf "$QW_DATA"
+mkdir -p "$QW_DATA"
+
+# Start the server in the background. Quickwit defaults: REST on 7280, gRPC on 7281.
+# Mount node-config.yaml on top of the image's default config to bump the
+# searcher timeouts (defaults are 30s, which is too low for some of the
+# nested high-cardinality aggregations on the full 100M-row dataset).
+sudo docker run -d --name qw --network host \
+    -v "$QW_DATA":/quickwit/qwdata \
+    -v "$(pwd)/node-config.yaml":/quickwit/config/quickwit.yaml \
+    "$QW_IMAGE" run >/dev/null
+echo "Quickwit container started"
+
+# Wait for the server to come up.
+for i in $(seq 1 60); do
+    if curl -sS -f http://localhost:7280/api/v1/version >/dev/null 2>&1; then
+        echo "Quickwit is ready"
+        break
+    fi
+    sleep 1
+done
+
+# Create the index from the YAML config.
+curl -sS -X POST http://localhost:7280/api/v1/indexes \
+    -H 'Content-Type: application/yaml' \
+    --data-binary @index_config.yaml | jq -r '.index_uid // .message'
+
+# Download the data quietly (the dataset is ~14 GB; full progress would
+# dominate the captured benchmark log).
+wget --continue -q 'https://datasets.clickhouse.com/hits_compatible/hits.json.gz'
+
+START=$(date +%s)
+
+# Use `quickwit tool local-ingest` instead of the Elasticsearch-compatible
+# bulk endpoint. v0.9's sharded ingest-v2 API caps single-node throughput
+# to a few MB/s and gets stuck waiting for shards to scale, while
+# `local-ingest` builds splits directly and writes them to the index
+# storage. The running server picks up new splits on its next metastore
+# poll (default 30s).
+#
+# local-ingest emits a "Num docs ... Thrghput ... Time" progress line
+# roughly once per second; we throttle that to once per ~30 seconds so
+# the captured log stays compact, and pass the surrounding lines through
+# unchanged.
+zcat hits.json.gz | sudo docker run --rm -i --network host \
+    -v "$QW_DATA":/quickwit/qwdata \
+    "$QW_IMAGE" tool local-ingest --index hits -y 2>&1 \
+    | awk '/Num docs/ { n = systime(); if (n - last >= 30) { print; fflush(); last = n } next }
+           { print; fflush() }'
+
+# Wait long enough for the server to refresh its metastore view.
+sleep 35
+
+# Show stats.
+curl -sS "http://localhost:7280/api/v1/indexes/hits/describe" \
+    | jq '{num_published_docs, num_published_splits, size_published_splits}' \
+    | tee stats.json
+
+END=$(date +%s)
+echo "Load time: $((END - START))"
+
+# Data size on disk.
+echo -n "Data size: "
+sudo du -sb "$QW_DATA" | awk '{print $1}'
+
+# Run queries
+chmod +x run.sh
+./run.sh
+
+sudo docker stop qw 2>/dev/null || true
+sudo docker rm qw 2>/dev/null || true
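
The readiness loop in `benchmark.sh` falls through silently if the server never answers within 60 attempts, so a failure only surfaces later when the index-creation request fails. One possible alternative, shown here as a sketch rather than as part of the commit, is a small retry helper that returns a non-zero status on timeout. The function name `wait_for` and its argument order are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper, not part of benchmark.sh: retry a command until it
# succeeds or the attempt budget is exhausted.
# usage: wait_for <max_attempts> <delay_secs> <command> [args...]
wait_for() {
    local attempts=$1 delay=$2 i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        # Discard the probe's own output; only its exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$delay"
    done
    echo "timed out after $attempts attempts: $*" >&2
    return 1
}
```

Under these assumptions the readiness check could be written as `wait_for 60 1 curl -sS -f http://localhost:7280/api/v1/version`. Since the script runs with `set -e`, a timeout would then stop the benchmark at the point of failure instead of at the next command.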

quickwit/index_config.yaml

Lines changed: 149 additions & 0 deletions
@@ -0,0 +1,149 @@
+version: 0.8
+
+index_id: hits
+
+doc_mapping:
+  mode: strict
+  timestamp_field: EventTime
+  field_mappings:
+    - {name: WatchID, type: i64, indexed: true, fast: true}
+    - {name: JavaEnable, type: i64, indexed: true, fast: true}
+    - {name: Title, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: GoodEvent, type: i64, indexed: true, fast: true}
+    - name: EventTime
+      type: datetime
+      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
+      output_format: unix_timestamp_secs
+      indexed: true
+      fast: true
+      fast_precision: seconds
+    - name: EventDate
+      type: datetime
+      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
+      output_format: unix_timestamp_secs
+      indexed: true
+      fast: true
+      fast_precision: seconds
+    - {name: CounterID, type: i64, indexed: true, fast: true}
+    - {name: ClientIP, type: i64, indexed: true, fast: true}
+    - {name: RegionID, type: i64, indexed: true, fast: true}
+    - {name: UserID, type: i64, indexed: true, fast: true}
+    - {name: CounterClass, type: i64, indexed: true, fast: true}
+    - {name: OS, type: i64, indexed: true, fast: true}
+    - {name: UserAgent, type: i64, indexed: true, fast: true}
+    - {name: URL, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: Referer, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: IsRefresh, type: i64, indexed: true, fast: true}
+    - {name: RefererCategoryID, type: i64, indexed: true, fast: true}
+    - {name: RefererRegionID, type: i64, indexed: true, fast: true}
+    - {name: URLCategoryID, type: i64, indexed: true, fast: true}
+    - {name: URLRegionID, type: i64, indexed: true, fast: true}
+    - {name: ResolutionWidth, type: i64, indexed: true, fast: true}
+    - {name: ResolutionHeight, type: i64, indexed: true, fast: true}
+    - {name: ResolutionDepth, type: i64, indexed: true, fast: true}
+    - {name: FlashMajor, type: i64, indexed: true, fast: true}
+    - {name: FlashMinor, type: i64, indexed: true, fast: true}
+    - {name: FlashMinor2, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: NetMajor, type: i64, indexed: true, fast: true}
+    - {name: NetMinor, type: i64, indexed: true, fast: true}
+    - {name: UserAgentMajor, type: i64, indexed: true, fast: true}
+    - {name: UserAgentMinor, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: CookieEnable, type: i64, indexed: true, fast: true}
+    - {name: JavascriptEnable, type: i64, indexed: true, fast: true}
+    - {name: IsMobile, type: i64, indexed: true, fast: true}
+    - {name: MobilePhone, type: i64, indexed: true, fast: true}
+    - {name: MobilePhoneModel, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: Params, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: IPNetworkID, type: i64, indexed: true, fast: true}
+    - {name: TraficSourceID, type: i64, indexed: true, fast: true}
+    - {name: SearchEngineID, type: i64, indexed: true, fast: true}
+    - {name: SearchPhrase, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: AdvEngineID, type: i64, indexed: true, fast: true}
+    - {name: IsArtifical, type: i64, indexed: true, fast: true}
+    - {name: WindowClientWidth, type: i64, indexed: true, fast: true}
+    - {name: WindowClientHeight, type: i64, indexed: true, fast: true}
+    - {name: ClientTimeZone, type: i64, indexed: true, fast: true}
+    - name: ClientEventTime
+      type: datetime
+      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
+      output_format: unix_timestamp_secs
+      indexed: true
+      fast: true
+      fast_precision: seconds
+    - {name: SilverlightVersion1, type: i64, indexed: true, fast: true}
+    - {name: SilverlightVersion2, type: i64, indexed: true, fast: true}
+    - {name: SilverlightVersion3, type: i64, indexed: true, fast: true}
+    - {name: SilverlightVersion4, type: i64, indexed: true, fast: true}
+    - {name: PageCharset, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: CodeVersion, type: i64, indexed: true, fast: true}
+    - {name: IsLink, type: i64, indexed: true, fast: true}
+    - {name: IsDownload, type: i64, indexed: true, fast: true}
+    - {name: IsNotBounce, type: i64, indexed: true, fast: true}
+    - {name: FUniqID, type: i64, indexed: true, fast: true}
+    - {name: OriginalURL, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: HID, type: i64, indexed: true, fast: true}
+    - {name: IsOldCounter, type: i64, indexed: true, fast: true}
+    - {name: IsEvent, type: i64, indexed: true, fast: true}
+    - {name: IsParameter, type: i64, indexed: true, fast: true}
+    - {name: DontCountHits, type: i64, indexed: true, fast: true}
+    - {name: WithHash, type: i64, indexed: true, fast: true}
+    - {name: HitColor, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - name: LocalEventTime
+      type: datetime
+      input_formats: ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d", unix_timestamp, rfc3339]
+      output_format: unix_timestamp_secs
+      indexed: true
+      fast: true
+      fast_precision: seconds
+    - {name: Age, type: i64, indexed: true, fast: true}
+    - {name: Sex, type: i64, indexed: true, fast: true}
+    - {name: Income, type: i64, indexed: true, fast: true}
+    - {name: Interests, type: i64, indexed: true, fast: true}
+    - {name: Robotness, type: i64, indexed: true, fast: true}
+    - {name: RemoteIP, type: i64, indexed: true, fast: true}
+    - {name: WindowName, type: i64, indexed: true, fast: true}
+    - {name: OpenerName, type: i64, indexed: true, fast: true}
+    - {name: HistoryLength, type: i64, indexed: true, fast: true}
+    - {name: BrowserLanguage, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: BrowserCountry, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: SocialNetwork, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: SocialAction, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: HTTPError, type: i64, indexed: true, fast: true}
+    - {name: SendTiming, type: i64, indexed: true, fast: true}
+    - {name: DNSTiming, type: i64, indexed: true, fast: true}
+    - {name: ConnectTiming, type: i64, indexed: true, fast: true}
+    - {name: ResponseStartTiming, type: i64, indexed: true, fast: true}
+    - {name: ResponseEndTiming, type: i64, indexed: true, fast: true}
+    - {name: FetchTiming, type: i64, indexed: true, fast: true}
+    - {name: SocialSourceNetworkID, type: i64, indexed: true, fast: true}
+    - {name: SocialSourcePage, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: ParamPrice, type: i64, indexed: true, fast: true}
+    - {name: ParamOrderID, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: ParamCurrency, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: ParamCurrencyID, type: i64, indexed: true, fast: true}
+    - {name: OpenstatServiceName, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: OpenstatCampaignID, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: OpenstatAdID, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: OpenstatSourceID, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: UTMSource, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: UTMMedium, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: UTMCampaign, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: UTMContent, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: UTMTerm, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: FromTag, type: text, tokenizer: raw, fast: {normalizer: raw}}
+    - {name: HasGCLID, type: i64, indexed: true, fast: true}
+    - {name: RefererHash, type: i64, indexed: true, fast: true}
+    - {name: URLHash, type: i64, indexed: true, fast: true}
+    - {name: CLID, type: i64, indexed: true, fast: true}
+
+  store_source: false
+
+indexing_settings:
+  commit_timeout_secs: 30
+  merge_policy:
+    type: stable_log
+    merge_factor: 10
+    max_merge_factor: 12
+
+search_settings:
+  default_search_fields: []

quickwit/node-config.yaml

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
+version: 0.8
+
+searcher:
+  # Bump the per-request and leaf-search timeouts above the 30s default —
+  # a few of the high-cardinality aggregations on the full 100M-row ClickBench
+  # dataset (e.g. WatchID + ClientIP nested terms) take longer than that.
+  request_timeout_secs: 60
+  leaf_request_timeout_secs: 60
+
+  # Disable the per-split partial result cache so warm runs don't replay a
+  # memoized answer. The other in-memory caches (fast_field_cache,
+  # split_footer_cache, predicate_cache) are data-level caches (analogous to
+  # ClickHouse's query condition cache) and are kept at their defaults;
+  # run.sh restarts the container before each query so they also start cold
+  # for the first run.
+  partial_request_cache:
+    capacity: 0
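
The comments above refer to `run.sh`, which is not among the files shown here. To make the described procedure concrete (one request body per line of `queries.json`, three timed runs per query, `null` for unsupported queries, a cold start before each query), the following is a hypothetical sketch of such a loop. It is not the repository's `run.sh`: the function names `prepare_cold`, `time_query`, and `run_all`, the `TRIES` variable, and the bracketed output row format are all assumptions.

```shell
#!/bin/bash
# Hypothetical sketch of a timing loop; this is NOT the repository's run.sh.

TRIES=3

# Hook called once before each supported query. A real script might restart
# the container and drop the OS page cache here. No-op in this sketch.
prepare_cold() { :; }

# Hook that executes one request body and prints the elapsed seconds.
# Placeholder: nothing is executed in this sketch, so it prints null.
time_query() { echo "null"; }

# Read one request body per line from stdin and print one "[t1,t2,t3]," row
# per input line. A literal `null` line marks an unsupported query and is
# reported as null without being executed.
run_all() {
    local body i row t
    while IFS= read -r body; do
        row=""
        if [ "$body" != "null" ]; then
            prepare_cold
        fi
        for ((i = 1; i <= TRIES; i++)); do
            if [ "$body" = "null" ]; then
                t="null"
            else
                t=$(time_query "$body") || t="null"
            fi
            # Join the per-run timings with commas.
            row+="${row:+,}$t"
        done
        echo "[$row],"
    done
}
```

Under these assumptions, `run_all < queries.json` would emit one row per query once `time_query` is replaced with a function that posts the body to the search endpoint and converts `took` to seconds. Keeping the cold-start step in a separate hook makes it easy to see which runs are cold (the first) and which are warm (the second and third).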
