Refactor beholder lifecycle and chip ingress batching#1862

Draft
thomaska wants to merge 1 commit into main from infoplat-3436-chipingress-publishBatch

Conversation


@thomaska thomaska commented Feb 27, 2026

Ticket: https://smartcontract-it.atlassian.net/browse/INFOPLAT-3436

Summary

This refactor makes beholder.Client the explicit lifecycle owner for chip-ingress batching and removes constructor-time background startup.

What changed

  • beholder.Client now owns start/stop for the optional batch emitter service
  • ChipIngressBatchEmitterService no longer starts runtime goroutines in the constructor
  • LOOP now starts/stops the beholder client directly instead of using ManagedServices()
  • batching paths no longer retain caller request contexts after enqueue
  • pkg/chipingress/batch.Client keeps simple single-owner Start/Stop semantics

Metrics

Added batch client metrics for:

  • request count / failures
  • batch size in messages
  • batch size in bytes
  • request latency
  • config info gauge

Tests

Added:

  • beholder lifecycle coverage
  • batch/emitter metric assertions
  • benchmark smoke coverage for batch queueing and emitter enqueue paths

Supports

smartcontractkit/chainlink#21327

Copilot AI review requested due to automatic review settings February 27, 2026 14:43
@thomaska thomaska requested a review from a team as a code owner February 27, 2026 14:43
@github-actions

👋 thomaska, thanks for creating this pull request!

To help reviewers, please consider creating future PRs as drafts first. This allows you to self-review and make any final changes before notifying the team.

Once you're ready, you can mark it as "Ready for review" to request feedback. Thanks!

Contributor

Copilot AI left a comment

Pull request overview

This pull request replaces the per-event ChIP Ingress emission with a batched approach to reduce overhead from N gRPC calls + N Kafka transactions to 1 call + 1 transaction per flush interval. The implementation introduces a new ChipIngressBatchEmitter that buffers events per (domain, entity) pair and flushes them periodically using PublishBatch.

Changes:

  • Introduced ChipIngressBatchEmitter with per-(domain, entity) worker goroutines for batching events
  • Added chipIngressEmitterWorker to handle batch assembly and sending with configurable timeouts
  • Removed goroutine wrapper from DualSourceEmitter.Emit() since batching is now non-blocking (channel send)
  • Added 4 new configuration parameters with sensible defaults (BufferSize: 100, MaxBatchSize: 50, SendInterval: 500ms, SendTimeout: 10s)
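The per-(domain, entity) worker described above can be sketched roughly like this, using the listed defaults; the names are illustrative, not the actual pkg/beholder identifiers:

```go
package main

import (
	"fmt"
	"time"
)

// worker buffers events for one (domain, entity) pair and flushes them
// periodically, mirroring the SendInterval / MaxBatchSize config above.
type worker struct {
	ch           chan []byte
	maxBatchSize int
	sendInterval time.Duration
	flush        func(batch [][]byte) // would call PublishBatch
	stop         chan struct{}
}

func (w *worker) run() {
	ticker := time.NewTicker(w.sendInterval)
	defer ticker.Stop()
	for {
		select {
		case <-w.stop:
			w.flush(w.drain()) // final drain on shutdown
			return
		case <-ticker.C:
			if batch := w.drain(); len(batch) > 0 {
				w.flush(batch)
			}
		}
	}
}

// drain collects up to maxBatchSize buffered payloads without blocking.
func (w *worker) drain() [][]byte {
	var batch [][]byte
	for len(batch) < w.maxBatchSize {
		select {
		case p := <-w.ch:
			batch = append(batch, p)
		default:
			return batch
		}
	}
	return batch
}

func main() {
	flushed := make(chan int, 1)
	w := &worker{
		ch:           make(chan []byte, 100), // BufferSize
		maxBatchSize: 50,                     // MaxBatchSize
		sendInterval: 10 * time.Millisecond,  // shortened from 500ms for the demo
		stop:         make(chan struct{}),
	}
	w.flush = func(b [][]byte) {
		if len(b) > 0 {
			flushed <- len(b)
		}
	}
	for i := 0; i < 3; i++ {
		w.ch <- []byte("event")
	}
	go w.run()
	fmt.Println(<-flushed) // 3
	close(w.stop)
}
```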

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File — Description
pkg/beholder/chip_ingress_batch_emitter.go — New batch emitter with per-worker buffering and periodic flushing via PublishBatch
pkg/beholder/chip_ingress_emitter_worker.go — Worker implementation handling batch assembly, channel draining, and exponential backoff logging for drops
pkg/beholder/chip_ingress_batch_emitter_test.go — Comprehensive test coverage (10 tests) for batching, max batch size, isolation, buffer overflow, lifecycle, errors, and defaults
pkg/beholder/dual_source_emitter.go — Simplified Emit() by removing the goroutine wrapper since ChipIngressBatchEmitter.Emit() is non-blocking
pkg/beholder/client.go — Updated to create and start ChipIngressBatchEmitter instead of ChipIngressEmitter; added a comment about closure ordering
pkg/beholder/config.go — Added 4 new config fields with inline documentation and default values
pkg/beholder/config_test.go — Updated expected output to include the new config fields
Comments suppressed due to low confidence (2)

pkg/beholder/config.go:50

  • The comment states "Zero disables batching" but the implementation in NewChipIngressBatchEmitter treats zero as "use default" and sets it to 500ms. The comment should be corrected to match the actual behavior, e.g., "Flush interval per worker (default 500ms when zero or unset)".
	ChipIngressSendInterval time.Duration // Flush interval per worker (default 500ms). Zero disables batching.

pkg/beholder/client.go:248

  • The messageLoggerProvider appears twice in the shutdowner slice. This will cause it to be shut down twice, which could lead to errors or undefined behavior. Remove one of the duplicate entries.
		for _, provider := range []shutdowner{messageLoggerProvider, loggerProvider, tracerProvider, meterProvider, messageLoggerProvider} {
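A minimal sketch of the deduplicated loop the comment asks for; the shutdowner shape is assumed here (the real interface likely takes a context):

```go
package main

import "fmt"

// shutdowner is an assumed shape for the providers in the flagged slice.
type shutdowner interface{ Shutdown() error }

type provider struct{ name string }

func (p *provider) Shutdown() error { return nil }

// shutdownAll shuts each provider down exactly once, in slice order.
func shutdownAll(ps []shutdowner) int {
	n := 0
	for _, p := range ps {
		if err := p.Shutdown(); err == nil {
			n++
		}
	}
	return n
}

func main() {
	messageLoggerProvider := &provider{"messageLogger"}
	loggerProvider := &provider{"logger"}
	tracerProvider := &provider{"tracer"}
	meterProvider := &provider{"meter"}
	// messageLoggerProvider appears once, unlike the flagged line.
	n := shutdownAll([]shutdowner{messageLoggerProvider, loggerProvider, tracerProvider, meterProvider})
	fmt.Println(n) // 4
}
```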


@github-actions

github-actions bot commented Feb 27, 2026

⚠️ API Diff Results - github.com/smartcontractkit/chainlink-common

⚠️ Breaking Changes (2)

pkg/beholder (1)
  • NewDualSourceEmitter — Type changed:
func(
  Emitter, 
  Emitter, 
  + bool
)
(Emitter, error)
pkg/beholder.Client (1)
  • Close — 🗑️ Removed

✅ Compatible Changes (26)

pkg/beholder (2)
  • ChipIngressBatchEmitterService — ➕ Added

  • NewChipIngressBatchEmitterService — ➕ Added

pkg/beholder.(*Client) (1)
  • ManagedServices — ➕ Added
pkg/beholder.BeholderClient (1)
  • Service — ➕ Added
pkg/beholder.Client (1)
  • Service — ➕ Added
pkg/beholder.Config (8)
  • ChipIngressBatchEmitterEnabled — ➕ Added

  • ChipIngressBufferSize — ➕ Added

  • ChipIngressDrainTimeout — ➕ Added

  • ChipIngressLogger — ➕ Added

  • ChipIngressMaxBatchSize — ➕ Added

  • ChipIngressMaxConcurrentSends — ➕ Added

  • ChipIngressSendInterval — ➕ Added

  • ChipIngressSendTimeout — ➕ Added

pkg/beholder.writerClientConfig (8)
  • ChipIngressBatchEmitterEnabled — ➕ Added

  • ChipIngressBufferSize — ➕ Added

  • ChipIngressDrainTimeout — ➕ Added

  • ChipIngressLogger — ➕ Added

  • ChipIngressMaxBatchSize — ➕ Added

  • ChipIngressMaxConcurrentSends — ➕ Added

  • ChipIngressSendInterval — ➕ Added

  • ChipIngressSendTimeout — ➕ Added

pkg/loop.EnvConfig (1)
  • ChipIngressBatchEmitterEnabled — ➕ Added
pkg/services.HealthReporter (3)
  • HealthReport — ➕ Added

  • Name — ➕ Added

  • Ready — ➕ Added

pkg/services.Service (1)
  • Start — ➕ Added

📄 View full apidiff report

@smartcontractkit smartcontractkit deleted a comment from github-actions bot Feb 27, 2026
return nil, err
}

chipIngressEmitter, err := NewChipIngressEmitter(chipIngressClient)
Author

add a feature flag

// via chipingress.Client.PublishBatch on a periodic interval.
// It satisfies the Emitter interface so it can be used as a drop-in replacement
// for ChipIngressEmitter.
type ChipIngressBatchEmitter struct {
Author

name it Service

return e, nil
}

func (e *ChipIngressBatchEmitter) start(_ context.Context) error {

What's the role of this function if it always returns nil?

Author

It was mostly added as a placeholder, but it can be omitted as well.
After checking, it's also omitted in EventHandler in core/services/workflows/syncer/v2/handler.go, so omitting it is probably more consistent.


// NewChipIngressBatchEmitter creates a batch emitter backed by the given chipingress client.
// Call Start() to begin health monitoring, and Close() to stop all workers.
func NewChipIngressBatchEmitter(client chipingress.Client, lggr logger.Logger, cfg Config) (*ChipIngressBatchEmitter, error) {

This is purely stylistic, so feel free to ignore it: make the logger the last param, and after renaming the struct to ChipIngressBatchService, make sure to adjust the name of the constructor as well.

var events []chipingress.CloudEvent

for len(w.ch) > 0 && len(events) < int(w.maxBatchSize) { // #nosec G115
payload := <-w.ch

If I'm not mistaken, this can block if the channel is drained by another goroutine.

I'd use a select instead:

max := int(w.maxBatchSize)

for len(events) < max {
	select {
	case payload := <-w.ch:
		event, err := w.payloadToEvent(payload)
		if err != nil {
			w.lggr.Warnf("failed to build CloudEvent, dropping: %v", err)
			continue
		}
		events = append(events, event)
	default:
		return
	}
}


queueErr := e.batchClient.QueueMessage(eventPb, func(sendErr error) {
if sendErr != nil {
e.metrics.eventsDropped.Add(context.Background(), 1, metricAttrs)
Contributor

use the context passed in from the parameters

ctx, cancel := b.stopCh.CtxWithTimeout(b.shutdownTimeout)
// Use a standalone timeout context so the shutdown wait isn't cancelled
// by close(b.stopCh) below.
ctx, cancel := context.WithTimeout(context.Background(), b.shutdownTimeout)
Contributor

not sure I follow why we are doing this?

Author

@thomaska thomaska Mar 16, 2026

This is a potential issue that opus pointed out, and it made sense to me.
TL;DR: the timeout would never be respected and the drain didn't have time to run properly, because the same context was being cancelled right after it was created.
Longer version:
In L121, close(b.stopCh) closes the context,
which is the same one used in L113: ctx, cancel := b.stopCh.CtxWithTimeout(b.shutdownTimeout),
thus the "Done" branch in the following select executes instantaneously along with the warning message:

		select {
		case <-done:
			// All successfully shutdown
--->		case <-ctx.Done(): // timeout or context cancelled
			b.log.Warnw("timed out waiting for shutdown to finish, force closing", "timeout", b.shutdownTimeout)
		}

Does this make any sense to you?

Comment on lines 27 to 48
@@ -42,6 +43,7 @@ func NewDualSourceEmitter(chipIngressEmitter Emitter, otelCollectorEmitter Emitt
chipIngressEmitter: chipIngressEmitter,
otelCollectorEmitter: otelCollectorEmitter,
log: logger,
nonBlockingEmitter: nonBlockingChipIngress,
stopCh: make(services.StopChan),
}, nil
Contributor

it should always be non-blocking

Author

@thomaska thomaska Mar 16, 2026

The eventual implementation is always non-blocking; maybe the name is just not very descriptive?
Would something like chipIngressBatchEmitterEnabled be better? This is essentially the feature flag being propagated.

Comment on lines +69 to +85
} else {
// Legacy ChipIngressEmitter.Emit is a synchronous gRPC call;
// fire-and-forget via goroutine to avoid blocking the caller.
if err := d.wg.TryAdd(1); err != nil {
return err
}
go func(ctx context.Context) {
defer d.wg.Done()
var cancel context.CancelFunc
ctx, cancel = d.stopCh.Ctx(ctx)
defer cancel()

if err := d.chipIngressEmitter.Emit(ctx, body, attrKVs...); err != nil {
d.log.Infof("failed to emit to chip ingress: %v", err)
}
}(context.WithoutCancel(ctx))
}
Contributor

if we pass in the batch client, can simply just queue the message ?

Author

I'm not sure I understand :/ Should we completely remove the previous implementation and use the batch client everywhere? If so, should we remove the feature flag as well?

Contributor

disregard my comment

envChipIngressInsecureConnection = "CL_CHIP_INGRESS_INSECURE_CONNECTION"
envChipIngressEndpoint = "CL_CHIP_INGRESS_ENDPOINT"
envChipIngressInsecureConnection = "CL_CHIP_INGRESS_INSECURE_CONNECTION"
envChipIngressBatchEmitterEnabled = "CL_CHIP_INGRESS_BATCH_EMITTER_ENABLED"
Contributor

maybe this should always be enabled

Author

I think we were discussing with @pkcll to merge this initially with the flag disabled

Contributor

makes sense

if sendErr != nil {
e.metrics.eventsDropped.Add(context.Background(), 1, metricAttrs)
} else {
e.metrics.eventsSent.Add(context.Background(), 1, metricAttrs)
Contributor

same as above

}
})
if queueErr != nil {
e.metrics.eventsDropped.Add(context.Background(), 1, metricAttrs)
Contributor

same as above

Contributor

@pkcll pkcll Mar 13, 2026

Please add metrics to batch client to observe batching behavior (could be done in separate PR)

  • batch req size (in messages) vs max batch size
  • batch req size in bytes vs max gRPC req size
  • req latency
  • [optional] report batch client configuration as a gauge metric like we do for beholder

pkcll
pkcll previously approved these changes Mar 16, 2026
if err != nil {
return nil, fmt.Errorf("failed to create chip ingress batch emitter: %w", err)
}
if err = batchEmitterService.Start(context.Background()); err != nil {
Contributor

You need to pass the parent component context to batchEmitterService.Start

if err != nil {
return nil, fmt.Errorf("failed to create chip ingress emitter: %w", err)
var chipIngressEmitter Emitter
if cfg.ChipIngressBatchEmitterEnabled {
Contributor

this code needs to be added to both grpc client and http beholder clients

Author

pkg/beholder/httpclient.go currently has no chip ingress client.
Should we add support for it in this PR?

Contributor

Please create a follow up ticket to sync httpclient with client.go implementation, they have diverged already and there are gaps

}

// NewChipIngressBatchEmitter creates a batch emitter backed by the given chipingress client.
func NewChipIngressBatchEmitter(client chipingress.Client, cfg Config, lggr logger.Logger) (*ChipIngressBatchEmitter, error) {
Contributor

You need to be able to pass a parent context to it so it can start and stop gracefully.

// and logs will be sent via OTLP using the regular Logger instead of calling Emit
emitter := NewMessageEmitter(messageLogger)

var batchEmitterService *ChipIngressBatchEmitter
Contributor

To avoid confusion, let's call ChipIngressBatchEmitter something with the word Service in it,
e.g. ChipIngressBatchEmitterService,
so that it's clear it's long-running and implements Start/Stop.

Contributor

pkg/beholder/chip_ingress_batch_emitter.go -> pkg/beholder/chip_ingress_batch_emitter_service.go ?

gopkg.in/yaml.v2 v2.4.0 // indirect
)

replace github.com/smartcontractkit/chainlink-common/pkg/chipingress => ./pkg/chipingress
Contributor

Please remove

Comment on lines +296 to +298
for i := len(s.managedServices) - 1; i >= 0; i-- {
s.Logger.ErrorIfFn(s.managedServices[i].Close, "Failed to close managed service")
}
Contributor

Are we not closing the beholder client? Let's do that instead, and let it handle these.

Refactor beholder lifecycle so ownership is explicit and centralized around beholder.Client, and move chip-ingress batching runtime startup out of constructors.

Main changes:
- make beholder.Client the top-level lifecycle owner for the optional ChipIngressBatchEmitterService instead of exposing child services to callers
- start the batch emitter from Client.Start(ctx) and shut it down from Client.Close(), keeping provider shutdown and chip client close ordering in one place
- keep ManagedServices() only as a compatibility shim and stop relying on it for LOOP lifecycle management
- update LOOP server startup/shutdown to manage the beholder client directly
- remove constructor-time goroutine startup from the batch emitter service so constructors only wire objects
- make emit-before-start / emit-after-close behavior explicit via service state
- stop retaining caller request contexts past enqueue in async batching paths
- align DualSourceEmitter flag naming with ChipIngressBatchEmitterEnabled to match the actual behavior contract
- fold in the LOOP env config rebase fix so chip ingress and CRE settings are declared exactly once

Chip-ingress batching:
- keep pkg/chipingress/batch.Client as a non-service batching primitive with single-owner Start/Stop semantics
- add batching metrics for request count/failures, batch size in messages, batch size in bytes, request latency, and batch config info
- expose max gRPC request size as a batch client option for metric comparison

Tests and validation:
- add lifecycle coverage for beholder client and managed-services compatibility
- add metric assertions for batch client and batch emitter using OTel metric collection
- add benchmark coverage for batch queueing and emitter enqueue paths
- verify with go test in ./pkg/loop, ./pkg/beholder, and ./batch under pkg/chipingress