Skip to content

add OTEL-compatible tracing + make targets + scripts#139

Open
djs55 wants to merge 19 commits intocontainerd:mainfrom
djs55:djs55/otel-tracing
Open

add OTEL-compatible tracing + make targets + scripts#139
djs55 wants to merge 19 commits intocontainerd:mainfrom
djs55:djs55/otel-tracing

Conversation

@djs55
Copy link
Copy Markdown
Contributor

@djs55 djs55 commented Mar 26, 2026

Help us understand (and later optimise) VM/container start time.

If you run this:

make test-tracing
make jaeger-open

It produces this:

Screenshot 2026-03-26 at 16 24 54

The biggest thing on the critical path is the VM start + initial connection, which needs breaking down more (future investigation)

We can also share Mermaid diagrams which can be rendered on GitHub by running:

make jaeger-gantt

which produces this:

gantt
    dateFormat x
    axisFormat %s.%L s
    title Trace for nerdbox-vm

    section VM.NewInstance+Start
    nerdbox/VM.NewInstance+Start :VM_NewInstance+Start_9f6fda, 0, 83
    . nerdbox/libkrun.VMStart :libkrun_VMStart_90885a, 5, 82
    . nerdbox/libkrun.WaitForTTRPC :libkrun_WaitForTTRPC_10af7c, 5, 82
    . nerdbox/TTRPC.Task.Create :TTRPC_Task_Create_5e832c, 83, 85
    . . nerdbox/containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_3a4175, 84, 85
    . . . containerd.task.v3.Task/Create :containerd_task_v3_Task_Create_61f5b3, 83, 85
    . . . . task.Create :task_Create_a0b789, 84, 85
    . . . . . runc.NewContainer :runc_NewContainer_f4a190, 84, 85
    . . . . . . crun.create :crun_create_beaf4f, 84, 85
    . nerdbox/TTRPC.Task.Start :TTRPC_Task_Start_c8de57, 85, 87
    . . nerdbox/containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_f0139b, 85, 87
    . . . containerd.task.v3.Task/Start :containerd_task_v3_Task_Start_1d5297, 85, 87
    . . . . task.Start :task_Start_84961b, 85, 87
    . . . . . . container.Start :container_Start_5372d6, 85, 87
    . . . . . . . crun.start :crun_start_ce160c, 85, 87
Loading

The first patch adds the infra and tracing for the host part. This gets most of the benefit IMHO.

The second patch extends the tracing to the VM: this is a bit complicated because I didn't want to assume there was a network, so the spans are relayed back to the host and then forwarded to the collector.

The clock is interesting because the RTC only has 1s granularity, so to correct we send the host timestamp to the VM before the tracing.

Copilot AI review requested due to automatic review settings March 26, 2026 16:32
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds OpenTelemetry (OTel) tracing across the VM/container startup path, including relaying VM-originated spans back to the host, plus developer tooling (Jaeger + Mermaid Gantt) and an integration test to visualize traces.

Changes:

  • Introduce a VM-side TTRPC traces service + channel-backed OTel span exporter to stream spans out of the VM.
  • Instrument key host/shim + VM code paths with spans and propagate context over TTRPC.
  • Add tracing integration test, Jaeger/Makefile helpers, Mermaid conversion script, and tracing documentation.

Reviewed changes

Copilot reviewed 20 out of 21 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
plugins/vminit/tracing/plugin.go Registers VM tracing plugin and installs OTel tracer provider/exporter.
internal/vminit/tracing/service.go TTRPC streaming service that forwards exported spans to the host.
internal/vminit/tracing/exporter.go Channel-backed OTel SpanExporter implementation for VM span buffering.
internal/vminit/task/service.go Adds VM-side spans around Create/Start and networking wait/connect.
internal/vminit/runc/container.go Adds VM-side spans around container creation and mount setup.
internal/vminit/process/init.go Adds VM-side spans around crun create/start operations.
internal/vminit/ctrnetworking/ctrnetworking.go Adds VM-side spans per-network setup within errgroup goroutines.
internal/vm/libkrun/instance.go Adds host-side spans for VMStart/WaitForTTRPC + TTRPC client interceptor.
internal/shim/task/tracing.go Implements host-side relay of VM span stream to an OTLP HTTP exporter.
internal/shim/task/service.go Adds shim spans and starts VM trace relay during Create.
integration/tracing_test.go Build-tagged integration test that exports spans to Jaeger and relays VM spans.
hack/trace-to-mermaid.py Fetches latest Jaeger trace and renders it as a Mermaid Gantt chart.
docs/tracing.md Documentation for running tracing workflow with Jaeger + Mermaid.
cmd/vminitd/main.go Ensures tracing plugin is loaded by vminitd.
api/proto/nerdbox/services/traces/v1/traces.proto New Traces service + Span/KeyValue messages for VM→host span streaming.
api/services/traces/v1/traces.pb.go Generated protobuf types for traces service.
api/services/traces/v1/traces_ttrpc.pb.go Generated TTRPC bindings for traces service.
api/services/traces/v1/doc.go Package doc for traces API.
Makefile Adds test-tracing, Jaeger lifecycle, UI open, and Gantt generation targets.
go.mod Adds OTel SDK + OTLP HTTP exporter dependencies.
go.sum Records new module checksums for added dependencies.
Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:337

  • This wait loop has multiple return paths before the explicit ttrpcWaitSpan.End() call, so the WaitForTTRPC span may never be ended/exported. Use a defer ttrpcWaitSpan.End() right after starting the span, or restructure to end the span before each early return.
	_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
	startedAt := time.Now()
	for {
		select {
		case err := <-errC:
			if err != nil {
				return fmt.Errorf("failure running vm: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Millisecond):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/vminit/ctrnetworking/ctrnetworking.go
Comment thread internal/shim/task/service.go
Comment thread internal/shim/task/tracing.go Outdated
Comment thread plugins/vminit/tracing/plugin.go Outdated
Comment thread Makefile Outdated
Comment thread internal/vm/libkrun/instance.go Outdated
@djs55 djs55 force-pushed the djs55/otel-tracing branch from a7fd67c to 830e59d Compare March 26, 2026 16:38
Copilot AI review requested due to automatic review settings March 26, 2026 16:43
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 830e59d to ca1697e Compare March 26, 2026 16:43
@djs55 djs55 force-pushed the djs55/otel-tracing branch from ca1697e to f06a31f Compare March 26, 2026 16:51
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:341

  • Spans started for VMStart/WaitForTTRPC are only ended on the success path. If Start() returns early (errC error, ctx.Done(), or vmStartTimeout), these spans never end, which produces incomplete traces and can leak span state. Consider deferring span.End() immediately after each tracer.Start (or using a single parent span with defer) so all exit paths close the spans.
	tracer := otel.Tracer("nerdbox")
	_, krunStartSpan := tracer.Start(ctx, "libkrun.VMStart")
	errC := make(chan error)
	go func() {
		defer close(errC)
		if err := v.vmc.Start(); err != nil {
			errC <- err
		}
	}()

	v.shutdownCallbacks = []func(context.Context) error{
		func(context.Context) error {
			cerr := v.vmc.Shutdown()
			select {
			case err := <-errC:
				if err != nil {
					return fmt.Errorf("failure running vm: %w", err)
				}
			default:
			}
			return cerr
		},
	}

	var conn net.Conn
	// Initial TTRPC ping deadline. On Windows, the vsock listen-mode proxy
	// has more overhead (host→Unix socket→vsock→guest→vsock→Unix socket→host)
	// so we start with a longer deadline.
	d := 2 * time.Millisecond
	if runtime.GOOS == "windows" {
		d = 500 * time.Millisecond
	}
	_, ttrpcWaitSpan := tracer.Start(ctx, "libkrun.WaitForTTRPC")
	startedAt := time.Now()
	for {
		select {
		case err := <-errC:
			if err != nil {
				return fmt.Errorf("failure running vm: %w", err)
			}
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(time.Millisecond):
		}
		if time.Since(startedAt) > vmStartTimeout {
			log.G(ctx).WithField("timeout", vmStartTimeout).Warn("Timeout while waiting for VM to start")
			return fmt.Errorf("VM did not start within %s", vmStartTimeout)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/shim/task/tracing.go Outdated
Comment thread Makefile
@djs55 djs55 force-pushed the djs55/otel-tracing branch from f06a31f to b45ef30 Compare March 26, 2026 16:55
Copilot AI review requested due to automatic review settings March 26, 2026 17:04
@djs55 djs55 force-pushed the djs55/otel-tracing branch from b45ef30 to d5c5c51 Compare March 26, 2026 17:04
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 5 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/vminit/tracing/exporter.go Outdated
Comment thread internal/vm/libkrun/instance.go
Comment thread Makefile
Comment thread hack/trace-to-mermaid.py Outdated
Comment thread hack/trace-to-mermaid.py
@djs55 djs55 force-pushed the djs55/otel-tracing branch from d5c5c51 to f70b6f4 Compare March 26, 2026 17:22
Copilot AI review requested due to automatic review settings March 26, 2026 18:47
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 4 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile
Comment thread internal/vminit/tracing/exporter.go Outdated
Comment thread internal/tracing/relay.go Outdated
Comment thread internal/tracing/relay.go Outdated
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 7087d9a to 415b643 Compare March 26, 2026 19:06
Copilot AI review requested due to automatic review settings March 26, 2026 19:15
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 415b643 to 378cb97 Compare March 26, 2026 19:15
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 22 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

internal/vm/libkrun/instance.go:304

  • errC is unbuffered and the VM start goroutine does a blocking errC <- err. If Start returns early (e.g., ctx.Done() or timeout) before receiving from errC, that goroutine can deadlock forever on send (and never reach close(errC)), leaking the goroutine and potentially hanging shutdown logic that expects the channel to close. Make errC buffered (size 1) and/or make the send non-blocking so errors can't block when the caller has already returned.
	errC := make(chan error)
	go func() {
		defer close(errC)
		if err := v.vmc.Start(); err != nil {
			errC <- err
		}
	}()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread Makefile
Comment thread internal/shim/task/service.go Outdated
Comment thread internal/vminit/tracing/service.go
@djs55 djs55 marked this pull request as ready for review March 26, 2026 21:47
@djs55 djs55 force-pushed the djs55/otel-tracing branch from 378cb97 to e82db3d Compare April 1, 2026 10:45
Copilot AI review requested due to automatic review settings April 1, 2026 15:01
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 16 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/vm/libkrun/instance.go Outdated
Comment thread internal/vminit/task/service.go Outdated
Comment thread internal/vminit/task/service.go Outdated
Comment thread internal/vminit/task/service.go Outdated
Comment thread internal/vminit/task/service.go Outdated
Comment thread internal/shim/task/service.go Outdated
Comment thread internal/shim/task/service.go
Comment thread internal/tracing/relay.go Outdated
Comment thread internal/tracing/relay.go
Comment thread hack/trace-to-mermaid.py Outdated
@djs55 djs55 force-pushed the djs55/otel-tracing branch from ffb3c23 to 99e7a9f Compare April 3, 2026 14:03
Copilot AI review requested due to automatic review settings April 3, 2026 14:17
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

internal/vminit/task/service.go:314

  • In Start, ctx is overwritten with the context returned from vmTracer.Start for waitForCtrNetConnect, but after waitNetSpan.End() the function continues using that derived ctx for subsequent operations (getContainer, logs, etc.). This again nests later work under an already-ended span. Consider using a scoped variable (e.g., waitCtx) for waitForCtrNetConnect and then continuing with startCtx (or the original ctx) for the rest of the method.
	startCtx := ctx // preserve parent context for starting sibling spans

	ctx, waitNetSpan := vmTracer.Start(startCtx, "waitForCtrNetConnect")
	if err := s.waitForCtrNetConnect(ctx, r.ID); err != nil {
		waitNetSpan.End()
		return nil, errgrpc.ToGRPC(err)
	}
	waitNetSpan.End()

	container, err := s.getContainer(r.ID)
	if err != nil {
		return nil, err

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/tracing/relay.go Outdated
@dmcgowan
Copy link
Copy Markdown
Member

dmcgowan commented Apr 8, 2026

My main concern with this one is binary size. Luckily it seems the initrd size increase is reasonable. However, the shim doubles in size. If we were to keep the exporter only in the runtime, would we need to use a similar tracing API in the shim to pull them up to the runtime?

This is the analysis I am looking at

Commit Description Shim Delta Initrd Delta
8493dd7 base (pre-otel) 8,763,064 (8.4 MB) 5,910,307 (5.6 MB)
48325ca +startup tracing 10,735,512 (10.2 MB) +1,972,448 (+23%) 5,910,307 (5.6 MB) 0
77cd1bc +VM→host span relay 16,039,448 (15.3 MB) +5,303,936 (+49%) 6,038,453 (5.8 MB) +128,146 (+2%)
70e56d8 +tracing plugin 16,302,840 (15.5 MB) +263,392 (+2%) 6,038,948 (5.8 MB) +495 (~0%)
Total +7,539,776 (+86%) +128,641 (+2%)

@djs55 djs55 force-pushed the djs55/otel-tracing branch from 70e56d8 to f754e9f Compare April 9, 2026 09:17
@djs55
Copy link
Copy Markdown
Contributor Author

djs55 commented Apr 9, 2026

Yeah I agree the binary size is bad. OTEL always bothers me: every time I use it all I want is create a tree of named spans over HTTP and yet the required deps are huge. I'm probably only using 0.1% of its capabilities.

I did a quick experiment to try to reduce the dependencies. Rather than using OTEL, I've added a minimal span sending thing which only requires json (I think I need this to remain wire-compatible with Jaeger etc). I noticed the net/http pulled in a bunch of crypto so I dropped that too if we assume everything is localhost HTTP.

Comparing binary sizes to main (Mac/arm64)

 ls -lh _output.main _output.pr
_output.main:
total 76288
-rwxr-xr-x@ 1 djs55  staff   9.6M  9 Apr 10:04 containerd-shim-nerdbox-v1
-rw-r--r--@ 1 djs55  staff   5.1M  9 Apr 10:05 nerdbox-initrd

_output.pr:
total 72088
-rwxr-xr-x@ 1 djs55  staff   9.7M  9 Apr 10:13 containerd-shim-nerdbox-v1
-rw-r--r--@ 1 djs55  staff   4.4M  9 Apr 10:14 nerdbox-initrd
  • so the shim is up 0.1M, probably json
  • the initrd is smaller (!), this seems to be because the otelttrpc dep was already in main since the beginning but we can use the mini-tracing thing instead.

Let me know what you think. I pushed these patches on top of the previous ones, so the branch is a bit of a mess and would need more tidying 😅

@djs55 djs55 changed the title add OTEL tracing + make targets + scripts add OTEL-compatible tracing + make targets + scripts Apr 9, 2026
djs55 and others added 18 commits April 9, 2026 10:26
Add OpenTelemetry spans to the host-side VM startup path (libkrun.VMStart,
libkrun.WaitForTTRPC) with developer tooling: Jaeger Makefile targets,
a Mermaid Gantt chart generator, tracing integration test, and docs.

Signed-off-by: David Scott <dave@recoil.org>
Add a ttrpc Traces streaming service inside the VM that exports OTel spans
over vsock to the host shim. The shim relays them to the OTLP collector
with clock-skew correction. Includes VM-side spans for container create,
start, mount setup, and networking, plus the proto definitions and
generated code.

Signed-off-by: David Scott <dave@recoil.org>
Register a host-side OTel tracing plugin in the shim that configures
a TracerProvider with an OTLP HTTP exporter, enabling end-to-end trace
context propagation across the shim's ttrpc calls.

Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Replace the OTel SDK-based Exporter with the minimal internal
tracing.Collector, removing the last OTel SDK dependency from the
vminit tracing service.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Replace all go.opentelemetry.io/* imports with the internal tracing
package. The test now uses tracing.Init, tracing.Start, and
tracing.ForwardTraces directly instead of manually wiring up OTel
exporters, providers, and propagators.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
Signed-off-by: David Scott <dave@recoil.org>
@djs55 djs55 force-pushed the djs55/otel-tracing branch from f754e9f to 76b8f4e Compare April 9, 2026 09:26
Signed-off-by: David Scott <dave@recoil.org>
Copilot AI review requested due to automatic review settings April 9, 2026 09:35
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 27 out of 28 changed files in this pull request and generated 7 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread internal/tracing/flush.go
Comment on lines +119 to +129
func spanToOTLPJSON(s *Span) otlpSpan {
return otlpSpan{
TraceID: s.TraceID.String(),
SpanID: s.SpanID.String(),
ParentSpanID: s.ParentSpanID.String(),
Name: s.Name,
Kind: 1, // SPAN_KIND_INTERNAL
StartTimeUnixNano: strconv.FormatInt(s.StartTime.UnixNano(), 10),
EndTimeUnixNano: strconv.FormatInt(s.EndTime.UnixNano(), 10),
Status: otlpStatus{Code: 1}, // STATUS_CODE_OK
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spanToOTLPJSON always sets ParentSpanID to a non-empty string. For root spans this becomes an all-zero parent span ID, which is invalid in OTLP and can confuse trace reconstruction. Consider omitting parentSpanId when ParentSpanID is the zero value (set it to the empty string so the omitempty tag works).

Copilot uses AI. Check for mistakes.
Comment thread internal/tracing/relay.go
Comment on lines +67 to +75
span := otlpSpan{
TraceID: hex.EncodeToString(s.TraceID),
SpanID: hex.EncodeToString(s.SpanID),
ParentSpanID: hex.EncodeToString(s.ParentSpanID),
Name: s.Name,
Kind: int(s.Kind),
StartTimeUnixNano: strconv.FormatInt(startNano, 10),
EndTimeUnixNano: strconv.FormatInt(endNano, 10),
Status: otlpStatus{
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exportVMSpan always encodes and sets ParentSpanID. For root spans this becomes an all-zero parent span ID, which is invalid in OTLP and can break parent/child relationships in Jaeger. Consider leaving ParentSpanID empty when the incoming ParentSpanID is empty/all-zeros so the omitempty tag on otlpSpan.ParentSpanID can drop it.

Copilot uses AI. Check for mistakes.
Comment thread internal/tracing/flush.go
addr += ":80"
}
}
return &OTLPEndpoint{addr: addr, path: "/v1/traces"}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseOTLPEndpoint ignores any path component in OTEL_EXPORTER_OTLP_ENDPOINT and always forces path: "/v1/traces". This breaks valid endpoints like http://collector:4318/v1/traces (you'll POST to /v1/traces on the host but can't honor custom paths/reverse proxies). Consider using u.Path when non-empty (and defaulting to /v1/traces only when the path is empty).

Suggested change
return &OTLPEndpoint{addr: addr, path: "/v1/traces"}
path := "/v1/traces"
if u.Path != "" {
path = u.Path
}
return &OTLPEndpoint{addr: addr, path: path}

Copilot uses AI. Check for mistakes.
Comment thread internal/tracing/flush.go
Comment on lines +188 to +201
u, err := url.Parse(endpoint)
if err != nil || u.Host == "" {
// Treat as host:port directly.
return &OTLPEndpoint{addr: endpoint, path: "/v1/traces"}
}
addr := u.Host
if u.Port() == "" {
if u.Scheme == "https" {
addr += ":443"
} else {
addr += ":80"
}
}
return &OTLPEndpoint{addr: addr, path: "/v1/traces"}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ParseOTLPEndpoint treats https://... as port 443, but postOTLP sends a plaintext HTTP/1.1 request over raw TCP (no TLS). This will fail against real HTTPS collectors and can be a security footgun if someone expects encryption. Consider either (a) explicitly rejecting https endpoints with a clear error/log and documenting that only plain HTTP is supported, or (b) implementing TLS for HTTPS endpoints.

Copilot uses AI. Check for mistakes.
Comment thread internal/tracing/flush.go
Comment on lines +64 to +68
// Shutdown flushes remaining spans and stops the background goroutine.
func (f *Flusher) Shutdown(ctx context.Context) error {
close(f.done)
return f.flush(ctx)
}
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flusher.Shutdown closes f.done unconditionally. If the shutdown callback is invoked more than once (or Shutdown is called defensively), this will panic with "close of closed channel". Consider making shutdown idempotent (e.g., sync.Once around the close, or a non-blocking select on f.done).

Copilot uses AI. Check for mistakes.
Comment thread internal/tracing/flush.go
Comment on lines +162 to +172
// Read just the status line to check for errors.
var resp [128]byte
n, _ := conn.Read(resp[:])
if n < 12 {
return fmt.Errorf("OTLP response too short")
}
// "HTTP/1.1 200" — status code starts at byte 9
if resp[9] >= '4' {
return fmt.Errorf("OTLP export failed: %s", string(resp[:n]))
}
return nil
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

postOTLP checks resp[9] >= '4' to detect errors, which treats any 3xx response as success and also assumes the status code is always at byte 9. Consider parsing the HTTP status line more robustly and treating non-2xx responses as errors (at minimum, require resp[9] == '2').

Copilot uses AI. Check for mistakes.
Comment thread docs/tracing.md

Nerdbox supports distributed tracing across the full container startup path,
including spans that originate inside the VM. Traces are exported via
OTLP/HTTP (JSON) to any compatible collector (e.g., Jaeger).
Copy link

Copilot AI Apr 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docs say traces are exported via OTLP/HTTP to any compatible collector, but the current implementation sends a plaintext HTTP/1.1 request over raw TCP and does not support HTTPS/TLS endpoints. Consider documenting this limitation explicitly (or adding TLS support) so users don't assume https://... endpoints will work.

Suggested change
OTLP/HTTP (JSON) to any compatible collector (e.g., Jaeger).
OTLP/HTTP (JSON) to compatible collectors over plaintext `http://`
endpoints (for example, a local Jaeger setup). HTTPS/TLS (`https://`)
endpoints are not currently supported.

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants