perf(ingest): sub-slice indexing in splitOnDelimiter (closes #354)#467
Conversation
The old implementation appended bytes one-at-a-time into a growing
[]byte per part, allocating a fresh backing array (with several grows)
for every space and comma split in every line-protocol line. Sub-slice
indexing emits `data[start:i]` sub-slices that alias the input buffer
— no per-byte append, no per-part allocation.
Behavior is unchanged at the edges (consecutive delimiters skip empty
parts; leading/trailing delimiters handled identically; escape branch
preserves the \\x pair; quote tracking unchanged).
The returned parts now ALIAS the input — verified by trace that every
downstream caller copies via string(...) or unescape() before storing
anything beyond the parse call (models.Record holds map[string]string
and map[string]interface{} only; no []byte references escape).
Also pre-sizes the parts slice to cap=4 (typical telegraf line splits
3-4 on space, 4-8 on comma); makes the first append allocation-free.
Bench (Apple M3 Max), telegraf-style line + 10-line batch:
BenchmarkLineProtocolParser_ParseLine:
before: 1547 ns, 1728 B, 50 allocs
after: 938 ns, 1152 B, 24 allocs (-39% ns, -52% allocs)
BenchmarkLineProtocolParser_ParseBatch:
before: 9215 ns, 12152 B, 282 allocs
after: 6185 ns, 10952 B, 142 allocs (-33% ns, -50% allocs)
BenchmarkSplitOnDelimiter/space_delimiter (new micro):
after: 139 ns, 96 B, 1 alloc
Adds BenchmarkSplitOnDelimiter covering both space + comma paths so
future regressions on the ingest hot path don't go unnoticed. All 33
existing line-protocol tests pass under -race.
|
@gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by switching from a per-byte append loop to sub-slice indexing, which avoids unnecessary allocations on the ingest hot path. Additionally, a benchmark has been added in internal/ingest/lineprotocol_test.go to verify the performance improvements. There are no review comments to address, and I have no additional feedback to provide.
There was a problem hiding this comment.
Code Review
This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by switching from a per-byte append loop to sub-slice indexing, which avoids unnecessary allocations on the ingest hot path. It also adds a benchmark suite to verify the performance improvements. The review feedback suggests a fast-path for empty inputs and dynamically sizing the initial slice capacity based on the delimiter to further reduce allocations.
Partial accept of Gemini's R1 suggestion: (1) Empty-input fast path — ACCEPTED. Free to add; not reachable on the production hot path today (callers pre-check len(line) > 0) but defensive for hypothetical future callers. Saves the make([]) allocation when the function is called with empty input. (2) Dynamic cap=8 on comma path — DECLINED with bench evidence. Micro-bench on a 6-part input shows the expected -1 alloc / -38 ns win, but the full-path ParseBatch regresses ~9% / +1.9 KB because most real comma-splits have ≤4 parts (1-tag measurements like `cpu,host=h`, single-field writes like `v=1`). Oversizing those to cap=8 costs more in aggregate than the rare 5+-part case saves. Comment added next to the cap=4 line so the next reader doesn't re-attempt without re-benching. Bench (cap=4 retained, empty-input added): BenchmarkSplitOnDelimiter/space_delimiter: 138 ns, 96 B, 1 alloc BenchmarkSplitOnDelimiter/comma_delimiter: 148 ns, 288 B, 2 allocs BenchmarkLineProtocolParser_ParseLine: 915 ns, 1152 B, 24 allocs BenchmarkLineProtocolParser_ParseBatch: 6273 ns, 10952 B, 142 allocs All matches the pre-Gemini-fix baseline; empty-input fast path has no measurable cost on non-empty inputs (callers always pass non-empty on the hot path). All 33 line-protocol tests pass under -race x 2.
|
Round 1 addressed in faa16f4: partial accept with bench evidence.
All 33 line-protocol tests pass under @gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by replacing the per-byte append loop with sub-slice indexing to reduce allocations on the ingestion hot path. It also adds benchmarks to verify these performance improvements. The reviewer suggested further optimizing memory usage by deferring the allocation of the parts slice until a delimiter is actually encountered or the end of the data is reached, preventing unnecessary allocations for inputs with no delimiters.
| parts := make([][]byte, 0, 4) | ||
| start := 0 | ||
| inQuotes := false | ||
|
|
||
| // Note on `continue` below: bytes consumed by the escape and quote | ||
| // branches are NOT dropped — they remain inside the current sub-slice | ||
| // because sub-slicing captures the whole [start:next-delim] range. | ||
| // `continue` here only skips the delimiter check; the byte itself is | ||
| // implicitly captured when the next delimiter or end-of-data fires. | ||
| for i := 0; i < len(data); i++ { | ||
| if data[i] == '\\' && i+1 < len(data) { | ||
| // Escaped character - include both backslash and next char | ||
| current = append(current, data[i], data[i+1]) | ||
| // Escape consumes the next byte verbatim. Advance past it | ||
| // so the delimiter check below doesn't fire on, e.g., an | ||
| // escaped space ('\ ') or escaped comma ('\,'). | ||
| i++ | ||
| } else if data[i] == '"' { | ||
| continue | ||
| } | ||
| if data[i] == '"' { | ||
| inQuotes = !inQuotes | ||
| current = append(current, data[i]) | ||
| } else if data[i] == delim && !inQuotes { | ||
| if len(current) > 0 { | ||
| parts = append(parts, current) | ||
| current = nil | ||
| continue | ||
| } | ||
| if data[i] == delim && !inQuotes { | ||
| if i > start { | ||
| parts = append(parts, data[start:i]) | ||
| } | ||
| } else { | ||
| current = append(current, data[i]) | ||
| start = i + 1 | ||
| } | ||
| } | ||
|
|
||
| if len(current) > 0 { | ||
| parts = append(parts, current) | ||
| if len(data) > start { | ||
| parts = append(parts, data[start:]) | ||
| } | ||
|
|
||
| return parts |
There was a problem hiding this comment.
To further optimize memory usage on the ingestion hot path, we can defer the allocation of the parts slice until we actually find a delimiter or reach the end of the data. For inputs with no delimiters (such as measurements without tags, or single-field writes), this avoids allocating a slice with a capacity of 4, instead allocating a single-element slice of capacity 1. This reduces memory footprint and allocation overhead for these very common cases.
var parts [][]byte
start := 0
inQuotes := false
// Note on `continue` below: bytes consumed by the escape and quote
// branches are NOT dropped — they remain inside the current sub-slice
// because sub-slicing captures the whole [start:next-delim] range.
// `continue` here only skips the delimiter check; the byte itself is
// implicitly captured when the next delimiter or end-of-data fires.
for i := 0; i < len(data); i++ {
if data[i] == '\\' && i+1 < len(data) {
// Escape consumes the next byte verbatim. Advance past it
// so the delimiter check below doesn't fire on, e.g., an
// escaped space ('\ ') or escaped comma ('\,').
i++
continue
}
if data[i] == '"' {
inQuotes = !inQuotes
continue
}
if data[i] == delim && !inQuotes {
if i > start {
if parts == nil {
parts = make([][]byte, 0, 4)
}
parts = append(parts, data[start:i])
}
start = i + 1
}
}
if len(data) > start {
if parts == nil {
parts = [][]byte{data[start:]}
} else {
parts = append(parts, data[start:])
}
}
return partsThere was a problem hiding this comment.
Benched first per [feedback_bench_before_accepting_gemini_perf]. Accepted — real win on the full-path bench:
```
B/op (10-line batch)
empty-input only: 10952
lazy alloc (suggested): 10232 (-720 bytes / -6.6%)
```
Latency is ~flat (~6300 ns, in noise band). The savings come from no-delimiter inputs inside ParseBatch — the micro-benches don't move because both test inputs DO have delimiters.
The lazy-path `[][]byte{data[start:]}` single-element literal is a meaningful improvement over `make([][]byte, 0, 4) → append → 1-element slice` for the very common "measurement with no tags / single-field write" case.
Tests pass under -race × 2. Comment updated to explain the lazy alloc + retain the cap=4 R1 rationale.
Deferred the make([][]byte, 0, 4) until we actually find a delimiter or non-empty trailing data. For no-delimiter inputs (measurements with no tags `cpu`, single-field writes `v=1`), the function now returns a 1-element literal slice instead of allocating cap=4. Bench (ParseBatch, 10 lines, 10s benchtime, count=3): before R2: 10952 B / 142 allocs after R2: 10232 B / 142 allocs (-720 bytes / -6.6%) Latency flat at ~6300 ns (noise). Micro-benches unchanged because both test inputs have delimiters — the lazy path only triggers when the input has zero delimiters. Tests pass under -race x 2.
|
Round 2 addressed in 92b69f0: lazy parts allocation accepted with bench evidence. ParseBatch on 10-line batch: 10952 B → 10232 B (-720 bytes / -6.6%) with same alloc count. The win lands on the no-delimiter inputs (measurements with no tags, single-field writes) — those now return a 1-element literal slice instead of a cap=4 allocation. Latency flat (~6300 ns). Tests pass under @gemini-code-assist review please |
There was a problem hiding this comment.
Code Review
This pull request refactors the splitOnDelimiter function in internal/ingest/lineprotocol.go to use sub-slice indexing instead of a per-byte append loop, significantly reducing allocations on the ingest hot path. It also introduces lazy allocation for the parts slice to optimize cases with no delimiters. Additionally, a benchmark suite BenchmarkSplitOnDelimiter has been added to verify these performance improvements. There are no review comments to address, and we have no further feedback to provide.
Three small follow-ups that landed alongside the larger v26.06.1 features and bug fixes: - arc_query_client_disconnects_total Prometheus counter (PR #466, closes #426) — observability follow-up to PR #424's streaming-on- client-disconnect fix. Placed under the existing #424 entry. - Arrow streaming tests now leak-detect with CheckedAllocator (PR #465, closes #427) — test-hygiene follow-up to the same #424 work; buildArrowBatch helper now releases builders + cols defensively so CheckedAllocator + AssertSize(t,0) catches future Arrow leaks. - Line Protocol parser sub-slice indexing in splitOnDelimiter (PR #467, closes #354) — ingest hot-path perf: ParseLine -41% ns / -52% allocs, ParseBatch -31% ns / -50% allocs. Includes the bench table and notes the declined dynamic-cap-8 suggestion from Gemini review (caused a +9% regression on the full path). Doc-only; +35 lines in RELEASE_NOTES_2026.06.1.md. Co-authored-by: Ignacio Van Droogenbroeck <ignacio@vandroogenbroeck.net>
Real-world throughput numbers from a 30-second line-protocol bench: 4.1M → 4.63M rec/s (+12.9%), p50 2.41 → 2.14 ms, p99 8.85 → 8.46 ms. Attribution: ParseLine microbench -41% × ~30% pipeline share predicts +12% RPS, matches observed +12.9%. The win is the parser; latency also benefits from -50% allocations per line (less GC pressure).
Line Protocol parser sub-slice indexing (PR #467) lifted the end-to-end ingest bench from 4.1M → 4.6M rec/s and trimmed p50 2.41 → 2.14 ms, p99 8.85 → 8.46 ms. Updating the README ingestion table to match.
Closes #354.
Summary
Bench (Apple M3 Max)
Internal review (configuration matrix + deep reviewer): no Blockers/High. Two Mediums addressed — clarifying comment on the `continue` / sub-slice capture semantics, and pre-sized slice (S1).
Test plan
🤖 Generated with Claude Code