Skip to content

Optimize telemetry event publishing#297

Open
IlyaasK wants to merge 1 commit into
mainfrom
hypeship/profile-vmtelemetry
Open

Optimize telemetry event publishing#297
IlyaasK wants to merge 1 commit into
mainfrom
hypeship/profile-vmtelemetry

Conversation

@IlyaasK

@IlyaasK IlyaasK commented Jun 24, 2026

Copy link
Copy Markdown

Summary

  • avoid JSON-marshaling every telemetry envelope on the normal small-event publish path
  • use a conservative JSON-size estimate so escaped metadata still falls back to the full truncation check
  • assign seq before truncation and write seq+event to the ring under one lock
  • avoid wake-channel allocation when no readers are blocked
  • reuse per-session metadata for events without source metadata
  • add publish benchmarks and regression tests for seq assignment, large metadata fallback, escaped metadata fallback, assigned-seq truncation, and session metadata stability

Measurement

Before:

  • BenchmarkEventStreamPublish: ~1.9-2.1 us/op, 384 B/op, 3 allocs/op
  • BenchmarkEventStreamPublishRead: ~2.0-2.1 us/op, 496 B/op, 4 allocs/op
  • BenchmarkTelemetrySessionPublish: ~3.6-4.0 us/op, 857-858 B/op, 9 allocs/op

After:

  • BenchmarkEventStreamPublish: ~48-50 ns/op, 0 B/op, 0 allocs/op
  • BenchmarkEventStreamPublishRead: ~283-290 ns/op, 112 B/op, 1 alloc/op
  • BenchmarkTelemetrySessionPublish: ~136-144 ns/op, 0 B/op, 0 allocs/op

Tests

  • go test ./lib/events ./lib/telemetry ./cmd/api/api
  • go test ./lib/events ./lib/telemetry -run '^$' -bench 'Benchmark(EventStream|TelemetrySession)' -benchmem -count=3
  • go test -race ./lib/events ./lib/telemetry ./cmd/api/api
  • go test $(go list ./... | grep -v /e2e$)
  • go vet ./...

Note: a broad race run also hit an unrelated lib/devtoolsproxy Chromium temp-dir cleanup failure; the touched telemetry/event/API packages pass under race.

@IlyaasK IlyaasK requested a review from Sayan- June 24, 2026 21:27
@IlyaasK IlyaasK marked this pull request as ready for review June 24, 2026 21:34
Comment thread server/lib/events/event.go
@IlyaasK IlyaasK force-pushed the hypeship/profile-vmtelemetry branch from 61ab04c to 140f7c2 Compare June 24, 2026 21:54
@IlyaasK IlyaasK marked this pull request as draft June 24, 2026 21:54

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 140f7c2. Configure here.

env, _ = truncateIfNeeded(env)
es.ring.publish(env)
return env
return es.ring.publishNext(env)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Truncation checks size before seq

Medium Severity

EventStream.Publish now runs truncateIfNeeded before publishNext assigns seq, so the slow-path size check marshals with "seq":0. Envelopes whose JSON length is just under maxS2RecordBytes at seq zero can pass without truncation, then exceed the S2 1 MiB cap once the real monotonic seq is written and consumers remarshal for SSE or storage.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 140f7c2. Configure here.

@IlyaasK IlyaasK force-pushed the hypeship/profile-vmtelemetry branch from 140f7c2 to 80e5ed1 Compare June 24, 2026 23:06
@IlyaasK IlyaasK marked this pull request as ready for review June 25, 2026 13:25
@IlyaasK IlyaasK requested a review from archandatta June 25, 2026 13:32

@Sayan- Sayan- left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reviewed — nice, well-benchmarked optimization pass. the publish-path wins are real and the seq-before-truncation fix is a genuine correctness improvement over the prior shape. one design question worth acknowledging, plus a few smaller things. nothing blocking.

heads up: I checked both Cursor Bugbot inline findings (fast-path escaping; truncation-before-seq) against current HEAD — both are already resolved and covered by new regression tests, so they don't need action.

main design question (likely an acceptable trade — just want it acknowledged)

  • ringbuffer.go:147 (+ eventsstorage.go:49, cmd/api/api/events.go:106) — Reader.Read moved RLockLock to keep the new waiters count accurate, so blocking readers now serialize. that means the S2 durability writer and every SSE stream contend on the exclusive lock, and each publish wakes the whole herd to queue for it. given the crux of this PR is write perf (zero-alloc publish), serializing the far-less-frequent readers is probably a reasonable trade — and the read critical section is tiny (a few field reads + one Envelope copy, no I/O or marshal under the lock), so I'd expect negligible real impact. mostly want it called out as intentional. if S2 read latency is ever a hard constraint, a quick measurement under realistic SSE fan-out would settle it; the way to keep both wins would be an atomic waiters counter with a re-check after registering so Read stays on RLock (but that double-check ordering is fiddly, which is likely why the simple one-lock form was chosen).

questions / suggestions

  • telemetry.go:83 — nil-metadata events now share one sessionMetadata map by pointer. safe today (write-once in Start, frozen before publication, read-only downstream), but it drops the old structural guard where each event got its own map. the safety now rests on an unenforced invariant: a published event's Source.Metadata is read-only downstream. if any future consumer mutates the metadata of an event pulled off the ring, it races every other reader of that shared map → fatal concurrent map read and map write (hard crash, intermittent). worth making the invariant explicit with a comment at the assignment and on the sessionMetadata field.
  • ringbuffer.go:79truncateIfNeeded (up to 2× json.Marshal of a ~1MB envelope) now runs under the ring write lock, so an oversized event briefly blocks all readers+publishers. the marshal only depends on seq's digit-width (≤20 bytes), so it could be hoisted out of the lock with a worst-case-seq pad, leaving only seq-assign + ring-write locked. current form is correct, just holds the lock during the rare big marshal.

tests

  • solid, targeted coverage: the truncation-boundary test (TestEventStreamPublishTruncatesWithAssignedSeq) straddles the real 1MB limit at seq:0 vs the 19-digit seq, so a regression back to truncating-at-seq-0 would fail it; seq assignment/ordering and escaped+large-metadata fallback are covered too. the waiters wake path is already exercised by the pre-existing TestConcurrentPublishRead and TestRingBufferResetWithActiveReader (both block a reader and rely on a publish waking it — a dropped wakeup would hang them to timeout). no new test directly asserts the waiters==0 skip itself, but the benchmark's 0 allocs/op evidences it.

nits

  • event.go:128 — the warn log dropped "seq"; it's available now (assigned before truncateIfNeeded) and useful for correlating which oversized event hit the limit.
  • event.go:133,149estimatedEnvelopeBytes / maxEscapedJSONLen have no doc comments while the rest of the file is commented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants