Skip to content

CLD-769: Improve catalog client reliability#573

Merged
DimitriosNaikopoulos merged 4 commits into
mainfrom
CLD-769-catalog-client-reliability
Nov 18, 2025
Merged

CLD-769: Improve catalog client reliability#573
DimitriosNaikopoulos merged 4 commits into
mainfrom
CLD-769-catalog-client-reliability

Conversation

@DimitriosNaikopoulos

@DimitriosNaikopoulos DimitriosNaikopoulos commented Nov 12, 2025

Copy link
Copy Markdown
Contributor

CLD-769: Improve catalog client reliability

  • Add keepalive pings to ensure that a stream stays alive even when there is no traffic
  • Add grpc retry policy for retriable transient errors. This will allow the grpc client to retry connecting to catalog when a network error is received improving the user experience instead of instantly failing the workflow
  • Add connection object to the CatalogClient. This will allow us to close the grpc connection when needed. This works on top of closing our side of the stream with the CloseStream functions

@changeset-bot

changeset-bot Bot commented Nov 12, 2025

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 43d9545

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
chainlink-deployments-framework Minor

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@cl-sonarqube-production

Copy link
Copy Markdown

@DimitriosNaikopoulos DimitriosNaikopoulos marked this pull request as ready for review November 13, 2025 16:12
@DimitriosNaikopoulos DimitriosNaikopoulos requested review from a team as code owners November 13, 2025 16:12
Copilot AI review requested due to automatic review settings November 13, 2025 16:12

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the reliability of the catalog gRPC client by implementing connection management improvements and retry mechanisms for handling transient network errors. The changes help maintain long-lived bidirectional streams and provide graceful connection cleanup.

Key changes:

  • Added keepalive pings (20s intervals with 10s timeout) to maintain stream health during idle periods
  • Implemented gRPC retry policy with exponential backoff for transient errors (UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL, RESOURCE_EXHAUSTED)
  • Added connection lifecycle management with Close() method and connection tracking

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

File Description
datastore/catalog/remote/grpc.go Adds keepalive configuration, retry policy, connection field, and Close() method to CatalogClient
datastore/catalog/remote/grpc_test.go Adds test coverage for the new Close() method
.changeset/eleven-dragons-laugh.md Documents the changes for the changelog

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

// Close closes the underlying gRPC connection.
func (c *CatalogClient) Close() error {
if c.cachedStream != nil {
return errors.New("stream is not closed")

Copilot AI Nov 13, 2025

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message 'stream is not closed' could be clearer about the expected action. Consider changing it to 'cannot close connection while stream is open; call CloseStream() first' to better guide the user on how to resolve this issue.

Suggested change
return errors.New("stream is not closed")
return errors.New("cannot close connection while stream is open; call CloseStream() first")

Copilot uses AI. Check for mistakes.

// Keepalive for long-lived bidirectional streams
// Ping every 20 seconds, wait up to 10 seconds for a response
opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{

@graham-chainlink graham-chainlink Nov 14, 2025

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe u may have done it, we should do the same on the server side as well?

eg

grpc.NewServer(
    grpc.KeepaliveParams(keepalive.ServerParameters{
        MaxConnectionIdle: 5 * time.Minute, // close connections after idle time
        MaxConnectionAge:  30 * time.Minute,
        Time:              2 * time.Minute, // send keepalive pings every 2 mins
        Timeout:           20 * time.Second,
    }),
    grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
        MinTime:             30 * time.Second, // minimum time between client pings
        PermitWithoutStream: true,
    }),
)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!! Will add the same on the server side. I think I did not touch that part on the server side yet.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"google.golang.org/protobuf/proto"
)

const retryPolicy = `{

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought retrypolicy is only for unary connection, we are using bidirectional here, how would that work?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should only apply to the initial request for the creation of the stream. Retries afterwards cannot be handled automatically. This could solve issues with some network flake that could stop the creation of the stream and therefore stop the pipeline for no reason

}

// Close closes the underlying gRPC connection.
func (c *CatalogClient) Close() error {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking do we need to support concurrent call to Close or CloseStream?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so since we only open one stream and then we close it 🤔 I don't think we can use the client concurrently because of the cacheStream

.

Since we are using the same cached stream wouldn't we need to be able to enforce some sort of ordering in terms of reads and writes?

Just thinking if a changeset wants to update something and the use that something for something else if this can happen in parallel then we will have a configuration issue because of the data race 🤔

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sense, i dont think a stream can be used concurrently, each should open its own stream.

@DimitriosNaikopoulos DimitriosNaikopoulos added this pull request to the merge queue Nov 18, 2025
Merged via the queue into main with commit f7a31c2 Nov 18, 2025
16 checks passed
@DimitriosNaikopoulos DimitriosNaikopoulos deleted the CLD-769-catalog-client-reliability branch November 18, 2025 09:15
github-merge-queue Bot pushed a commit that referenced this pull request Nov 18, 2025
This PR was opened by the [Changesets
release](https://github.com/changesets/action) GitHub action. When
you're ready to do a release, you can merge this and the packages will
be published to npm automatically. If you're not ready to do a release
yet, that's fine, whenever you add more changesets to main, this PR will
be updated.


# Releases
## chainlink-deployments-framework@0.67.0

### Minor Changes

-
[#573](#573)
[`f7a31c2`](f7a31c2)
Thanks [@DimitriosNaikopoulos](https://github.com/DimitriosNaikopoulos)!
- add grpc keepalive, retries and connection closure functionality

-
[#580](#580)
[`0baab99`](0baab99)
Thanks [@jadepark-dev](https://github.com/jadepark-dev)! - expose TON
CTF configs to caller

-
[#577](#577)
[`a1074b1`](a1074b1)
Thanks [@jkongie](https://github.com/jkongie)! - Updates `go-ethereum`
to v1.16.7

-
[#579](#579)
[`5d15395`](5d15395)
Thanks [@giogam](https://github.com/giogam)! - feat: adds 'all'
datastore config option

---------

Co-authored-by: app-token-issuer-engops[bot] <144731339+app-token-issuer-engops[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants