CLD-769: Improve catalog client reliability#573
Conversation
🦋 Changeset detectedLatest commit: 43d9545 The changes in this PR will be included in the next version bump. This PR includes changesets to release 1 package
Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
There was a problem hiding this comment.
Pull Request Overview
This PR enhances the reliability of the catalog gRPC client by implementing connection management improvements and retry mechanisms for handling transient network errors. The changes help maintain long-lived bidirectional streams and provide graceful connection cleanup.
Key changes:
- Added keepalive pings (20s intervals with 10s timeout) to maintain stream health during idle periods
- Implemented gRPC retry policy with exponential backoff for transient errors (UNAVAILABLE, DEADLINE_EXCEEDED, INTERNAL, RESOURCE_EXHAUSTED)
- Added connection lifecycle management with
Close()method and connection tracking
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datastore/catalog/remote/grpc.go | Adds keepalive configuration, retry policy, connection field, and Close() method to CatalogClient |
| datastore/catalog/remote/grpc_test.go | Adds test coverage for the new Close() method |
| .changeset/eleven-dragons-laugh.md | Documents the changes for the changelog |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Close closes the underlying gRPC connection. | ||
| func (c *CatalogClient) Close() error { | ||
| if c.cachedStream != nil { | ||
| return errors.New("stream is not closed") |
There was a problem hiding this comment.
The error message 'stream is not closed' could be clearer about the expected action. Consider changing it to 'cannot close connection while stream is open; call CloseStream() first' to better guide the user on how to resolve this issue.
| return errors.New("stream is not closed") | |
| return errors.New("cannot close connection while stream is open; call CloseStream() first") |
|
|
||
| // Keepalive for long-lived bidirectional streams | ||
| // Ping every 20 seconds, wait up to 10 seconds for a response | ||
| opts = append(opts, grpc.WithKeepaliveParams(keepalive.ClientParameters{ |
There was a problem hiding this comment.
Maybe u may have done it, we should do the same on the server side as well?
eg
grpc.NewServer(
grpc.KeepaliveParams(keepalive.ServerParameters{
MaxConnectionIdle: 5 * time.Minute, // close connections after idle time
MaxConnectionAge: 30 * time.Minute,
Time: 2 * time.Minute, // send keepalive pings every 2 mins
Timeout: 20 * time.Second,
}),
grpc.KeepaliveEnforcementPolicy(keepalive.EnforcementPolicy{
MinTime: 30 * time.Second, // minimum time between client pings
PermitWithoutStream: true,
}),
)
There was a problem hiding this comment.
Good idea!! Will add the same on the server side. I think I did not touch that part on the server side yet.
There was a problem hiding this comment.
Ok added on the server as well https://github.com/smartcontractkit/op-catalog/pull/171
| "google.golang.org/protobuf/proto" | ||
| ) | ||
|
|
||
| const retryPolicy = `{ |
There was a problem hiding this comment.
i thought retrypolicy is only for unary connection, we are using bidirectional here, how would that work?
There was a problem hiding this comment.
This should only apply to the initial request for the creation of the stream. Retries afterwards cannot be handled automatically. This could solve issues with some network flake that could stop the creation of the stream and therefore stop the pipeline for no reason
| } | ||
|
|
||
| // Close closes the underlying gRPC connection. | ||
| func (c *CatalogClient) Close() error { |
There was a problem hiding this comment.
Just checking do we need to support concurrent call to Close or CloseStream?
There was a problem hiding this comment.
I don't think so since we only open one stream and then we close it 🤔 I don't think we can use the client concurrently because of the cacheStream
.Since we are using the same cached stream wouldn't we need to be able to enforce some sort of ordering in terms of reads and writes?
Just thinking if a changeset wants to update something and the use that something for something else if this can happen in parallel then we will have a configuration issue because of the data race 🤔
There was a problem hiding this comment.
make sense, i dont think a stream can be used concurrently, each should open its own stream.
This PR was opened by the [Changesets release](https://github.com/changesets/action) GitHub action. When you're ready to do a release, you can merge this and the packages will be published to npm automatically. If you're not ready to do a release yet, that's fine, whenever you add more changesets to main, this PR will be updated. # Releases ## chainlink-deployments-framework@0.67.0 ### Minor Changes - [#573](#573) [`f7a31c2`](f7a31c2) Thanks [@DimitriosNaikopoulos](https://github.com/DimitriosNaikopoulos)! - add grpc keepalive, retries and connection closure functionality - [#580](#580) [`0baab99`](0baab99) Thanks [@jadepark-dev](https://github.com/jadepark-dev)! - expose TON CTF configs to caller - [#577](#577) [`a1074b1`](a1074b1) Thanks [@jkongie](https://github.com/jkongie)! - Updates `go-ethereum` to v1.16.7 - [#579](#579) [`5d15395`](5d15395) Thanks [@giogam](https://github.com/giogam)! - feat: adds 'all' datastore config option --------- Co-authored-by: app-token-issuer-engops[bot] <144731339+app-token-issuer-engops[bot]@users.noreply.github.com>




CLD-769: Improve catalog client reliability
CatalogClient. This will allow us to close the grpc connection when needed. This works on top of closing our side of the stream with theCloseStreamfunctions