Skip to content

perf(ingest): sub-slice indexing in splitOnDelimiter (closes #354)#467

Merged
xe-nvdk merged 3 commits into
mainfrom
fix/354-splitondelimiter-subslice
May 27, 2026
Merged

perf(ingest): sub-slice indexing in splitOnDelimiter (closes #354)#467
xe-nvdk merged 3 commits into
mainfrom
fix/354-splitondelimiter-subslice

Conversation

@xe-nvdk
Copy link
Copy Markdown
Member

@xe-nvdk xe-nvdk commented May 27, 2026

Closes #354.

Summary

  • Rewrites `splitOnDelimiter` from per-byte append to sub-slice indexing — returned parts alias the input buffer.
  • Pre-sizes the parts slice (`make([][]byte, 0, 4)`) so the first append is allocation-free on typical telegraf lines.
  • Edge cases verified identical to original: consecutive/leading/trailing delimiters, escape-pair preservation, quote tracking, trailing backslash handling.
  • All downstream callers traced — every part is consumed via `string(...)` or `unescape()` before storage; no `[]byte` references escape the parse call.

Bench (Apple M3 Max)

Bench Before After Delta
ParseLine 1547 ns, 50 allocs 938 ns, 24 allocs -39% ns, -52% allocs
ParseBatch (10 lines) 9215 ns, 282 allocs 6185 ns, 142 allocs -33% ns, -50% allocs
SplitOnDelimiter (new direct bench) 139 ns, 1 alloc n/a

Internal review (configuration matrix + deep reviewer): no Blockers/High. Two Mediums addressed — clarifying comment on the `continue` / sub-slice capture semantics, and pre-sized slice (S1).

Test plan

  • `go test -race ./internal/ingest/ -run 'TestLineProtocol|TestSplit' -count=1` — all 33 line-protocol tests pass
  • `go test -bench=... -benchmem` — numbers above
  • `go vet` + `gofmt` clean
  • Gemini Code Assist review

🤖 Generated with Claude Code

The old implementation appended bytes one-at-a-time into a growing
[]byte per part, allocating a fresh backing array (with several grows)
for every space and comma split in every line-protocol line. Sub-slice
indexing emits `data[start:i]` sub-slices that alias the input buffer
— no per-byte append, no per-part allocation.

Behavior is unchanged at the edges (consecutive delimiters skip empty
parts; leading/trailing delimiters handled identically; escape branch
preserves the \\x pair; quote tracking unchanged).

The returned parts now ALIAS the input — verified by trace that every
downstream caller copies via string(...) or unescape() before storing
anything beyond the parse call (models.Record holds map[string]string
and map[string]interface{} only; no []byte references escape).

Also pre-sizes the parts slice to cap=4 (typical telegraf line splits
3-4 on space, 4-8 on comma); makes the first append allocation-free.

Bench (Apple M3 Max), telegraf-style line + 10-line batch:

  BenchmarkLineProtocolParser_ParseLine:
    before:  1547 ns, 1728 B,  50 allocs
    after:    938 ns, 1152 B,  24 allocs   (-39% ns, -52% allocs)

  BenchmarkLineProtocolParser_ParseBatch:
    before:  9215 ns, 12152 B, 282 allocs
    after:   6185 ns, 10952 B, 142 allocs  (-33% ns, -50% allocs)

  BenchmarkSplitOnDelimiter/space_delimiter (new micro):
    after:    139 ns,    96 B,   1 alloc

Adds BenchmarkSplitOnDelimiter covering both space + comma paths so
future regressions on the ingest hot path don't go unnoticed. All 33
existing line-protocol tests pass under -race.
@xe-nvdk
Copy link
Copy Markdown
Member Author

xe-nvdk commented May 27, 2026

@gemini-code-assist review please

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by switching from a per-byte append loop to sub-slice indexing, which avoids unnecessary allocations on the ingest hot path. Additionally, a benchmark has been added in internal/ingest/lineprotocol_test.go to verify the performance improvements. There are no review comments to address, and I have no additional feedback to provide.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by switching from a per-byte append loop to sub-slice indexing, which avoids unnecessary allocations on the ingest hot path. It also adds a benchmark suite to verify the performance improvements. The review feedback suggests a fast-path for empty inputs and dynamically sizing the initial slice capacity based on the delimiter to further reduce allocations.

Comment thread internal/ingest/lineprotocol.go Outdated
Partial accept of Gemini's R1 suggestion:

(1) Empty-input fast path — ACCEPTED. Free to add; not reachable
on the production hot path today (callers pre-check len(line) > 0)
but defensive for hypothetical future callers. Saves the make([])
allocation when the function is called with empty input.

(2) Dynamic cap=8 on comma path — DECLINED with bench evidence.
Micro-bench on a 6-part input shows the expected -1 alloc / -38 ns
win, but the full-path ParseBatch regresses ~9% / +1.9 KB because
most real comma-splits have ≤4 parts (1-tag measurements like
`cpu,host=h`, single-field writes like `v=1`). Oversizing those to
cap=8 costs more in aggregate than the rare 5+-part case saves.
Comment added next to the cap=4 line so the next reader doesn't
re-attempt without re-benching.

Bench (cap=4 retained, empty-input added):
  BenchmarkSplitOnDelimiter/space_delimiter:    138 ns,  96 B, 1 alloc
  BenchmarkSplitOnDelimiter/comma_delimiter:    148 ns, 288 B, 2 allocs
  BenchmarkLineProtocolParser_ParseLine:        915 ns, 1152 B, 24 allocs
  BenchmarkLineProtocolParser_ParseBatch:      6273 ns, 10952 B, 142 allocs

All matches the pre-Gemini-fix baseline; empty-input fast path has
no measurable cost on non-empty inputs (callers always pass non-empty
on the hot path).

All 33 line-protocol tests pass under -race x 2.
@xe-nvdk
Copy link
Copy Markdown
Member Author

xe-nvdk commented May 27, 2026

Round 1 addressed in faa16f4: partial accept with bench evidence.

  • Empty-input fast path: accepted (free, defensive)
  • Dynamic cap=8 on comma: declined — full-path ParseBatch regresses +9% / +1.9 KB because most real comma-splits have ≤4 parts. Inline comment added so the next reader doesn't re-attempt without re-benching.

All 33 line-protocol tests pass under -race × 2; benches stable at baseline + empty-input win.

@gemini-code-assist review please

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request optimizes the splitOnDelimiter function in internal/ingest/lineprotocol.go by replacing the per-byte append loop with sub-slice indexing to reduce allocations on the ingestion hot path. It also adds benchmarks to verify these performance improvements. The reviewer suggested further optimizing memory usage by deferring the allocation of the parts slice until a delimiter is actually encountered or the end of the data is reached, preventing unnecessary allocations for inputs with no delimiters.

Comment thread internal/ingest/lineprotocol.go Outdated
Comment on lines 174 to 207
parts := make([][]byte, 0, 4)
start := 0
inQuotes := false

// Note on `continue` below: bytes consumed by the escape and quote
// branches are NOT dropped — they remain inside the current sub-slice
// because sub-slicing captures the whole [start:next-delim] range.
// `continue` here only skips the delimiter check; the byte itself is
// implicitly captured when the next delimiter or end-of-data fires.
for i := 0; i < len(data); i++ {
if data[i] == '\\' && i+1 < len(data) {
// Escaped character - include both backslash and next char
current = append(current, data[i], data[i+1])
// Escape consumes the next byte verbatim. Advance past it
// so the delimiter check below doesn't fire on, e.g., an
// escaped space ('\ ') or escaped comma ('\,').
i++
} else if data[i] == '"' {
continue
}
if data[i] == '"' {
inQuotes = !inQuotes
current = append(current, data[i])
} else if data[i] == delim && !inQuotes {
if len(current) > 0 {
parts = append(parts, current)
current = nil
continue
}
if data[i] == delim && !inQuotes {
if i > start {
parts = append(parts, data[start:i])
}
} else {
current = append(current, data[i])
start = i + 1
}
}

if len(current) > 0 {
parts = append(parts, current)
if len(data) > start {
parts = append(parts, data[start:])
}

return parts
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

To further optimize memory usage on the ingestion hot path, we can defer the allocation of the parts slice until we actually find a delimiter or reach the end of the data. For inputs with no delimiters (such as measurements without tags, or single-field writes), this avoids allocating a slice with a capacity of 4, instead allocating a single-element slice of capacity 1. This reduces memory footprint and allocation overhead for these very common cases.

	var parts [][]byte
	start := 0
	inQuotes := false

	// Note on `continue` below: bytes consumed by the escape and quote
	// branches are NOT dropped — they remain inside the current sub-slice
	// because sub-slicing captures the whole [start:next-delim] range.
	// `continue` here only skips the delimiter check; the byte itself is
	// implicitly captured when the next delimiter or end-of-data fires.
	for i := 0; i < len(data); i++ {
		if data[i] == '\\' && i+1 < len(data) {
			// Escape consumes the next byte verbatim. Advance past it
			// so the delimiter check below doesn't fire on, e.g., an
			// escaped space ('\ ') or escaped comma ('\,').
			i++
			continue
		}
		if data[i] == '"' {
			inQuotes = !inQuotes
			continue
		}
		if data[i] == delim && !inQuotes {
			if i > start {
				if parts == nil {
					parts = make([][]byte, 0, 4)
				}
				parts = append(parts, data[start:i])
			}
			start = i + 1
		}
	}

	if len(data) > start {
		if parts == nil {
			parts = [][]byte{data[start:]}
		} else {
			parts = append(parts, data[start:])
		}
	}

	return parts

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Benched first per [feedback_bench_before_accepting_gemini_perf]. Accepted — real win on the full-path bench:

```
B/op (10-line batch)
empty-input only: 10952
lazy alloc (suggested): 10232 (-720 bytes / -6.6%)
```

Latency is ~flat (~6300 ns, in noise band). The savings come from no-delimiter inputs inside ParseBatch — the micro-benches don't move because both test inputs DO have delimiters.

The lazy-path `[][]byte{data[start:]}` single-element literal is a meaningful improvement over `make([][]byte, 0, 4) → append → 1-element slice` for the very common "measurement with no tags / single-field write" case.

Tests pass under -race × 2. Comment updated to explain the lazy alloc + retain the cap=4 R1 rationale.

Deferred the make([][]byte, 0, 4) until we actually find a delimiter
or non-empty trailing data. For no-delimiter inputs (measurements
with no tags `cpu`, single-field writes `v=1`), the function now
returns a 1-element literal slice instead of allocating cap=4.

Bench (ParseBatch, 10 lines, 10s benchtime, count=3):
  before R2: 10952 B / 142 allocs
  after  R2: 10232 B / 142 allocs  (-720 bytes / -6.6%)

Latency flat at ~6300 ns (noise). Micro-benches unchanged because
both test inputs have delimiters — the lazy path only triggers when
the input has zero delimiters.

Tests pass under -race x 2.
@xe-nvdk
Copy link
Copy Markdown
Member Author

xe-nvdk commented May 27, 2026

Round 2 addressed in 92b69f0: lazy parts allocation accepted with bench evidence.

ParseBatch on 10-line batch: 10952 B → 10232 B (-720 bytes / -6.6%) with same alloc count. The win lands on the no-delimiter inputs (measurements with no tags, single-field writes) — those now return a 1-element literal slice instead of a cap=4 allocation. Latency flat (~6300 ns).

Tests pass under -race × 2.

@gemini-code-assist review please

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the splitOnDelimiter function in internal/ingest/lineprotocol.go to use sub-slice indexing instead of a per-byte append loop, significantly reducing allocations on the ingest hot path. It also introduces lazy allocation for the parts slice to optimize cases with no delimiters. Additionally, a benchmark suite BenchmarkSplitOnDelimiter has been added to verify these performance improvements. There are no review comments to address, and we have no further feedback to provide.

@xe-nvdk xe-nvdk merged commit 89a4f25 into main May 27, 2026
6 checks passed
@xe-nvdk xe-nvdk deleted the fix/354-splitondelimiter-subslice branch May 27, 2026 19:44
xe-nvdk added a commit that referenced this pull request May 27, 2026
Three small follow-ups that landed alongside the larger v26.06.1
features and bug fixes:

- arc_query_client_disconnects_total Prometheus counter (PR #466,
  closes #426) — observability follow-up to PR #424's streaming-on-
  client-disconnect fix. Placed under the existing #424 entry.
- Arrow streaming tests now leak-detect with CheckedAllocator (PR
  #465, closes #427) — test-hygiene follow-up to the same #424 work;
  buildArrowBatch helper now releases builders + cols defensively so
  CheckedAllocator + AssertSize(t,0) catches future Arrow leaks.
- Line Protocol parser sub-slice indexing in splitOnDelimiter (PR
  #467, closes #354) — ingest hot-path perf: ParseLine -41% ns /
  -52% allocs, ParseBatch -31% ns / -50% allocs. Includes the bench
  table and notes the declined dynamic-cap-8 suggestion from Gemini
  review (caused a +9% regression on the full path).

Doc-only; +35 lines in RELEASE_NOTES_2026.06.1.md.

Co-authored-by: Ignacio Van Droogenbroeck <ignacio@vandroogenbroeck.net>
xe-nvdk pushed a commit that referenced this pull request May 27, 2026
Real-world throughput numbers from a 30-second line-protocol bench:
4.1M → 4.63M rec/s (+12.9%), p50 2.41 → 2.14 ms, p99 8.85 → 8.46 ms.

Attribution: ParseLine microbench -41% × ~30% pipeline share predicts
+12% RPS, matches observed +12.9%. The win is the parser; latency
also benefits from -50% allocations per line (less GC pressure).
xe-nvdk pushed a commit that referenced this pull request May 27, 2026
Line Protocol parser sub-slice indexing (PR #467) lifted the
end-to-end ingest bench from 4.1M → 4.6M rec/s and trimmed p50
2.41 → 2.14 ms, p99 8.85 → 8.46 ms. Updating the README ingestion
table to match.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

low(ingest): splitOnDelimiter byte-by-byte append

1 participant