Skip to content

xds: XdsDepManager should ignore updates after shutdown#12034

Merged
ejona86 merged 2 commits intogrpc:masterfrom
ejona86:xdsdepman-shutdown-race
Apr 23, 2025
Merged

xds: XdsDepManager should ignore updates after shutdown#12034
ejona86 merged 2 commits intogrpc:masterfrom
ejona86:xdsdepman-shutdown-race

Conversation

@ejona86
Copy link
Copy Markdown
Member

@ejona86 ejona86 commented Apr 23, 2025

This prevents a NPE and subsequent channel panic when trying to build a config (because there are no watchers, so waitingOnResource==false) without any listener and route.

java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
    at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
    at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)

I think this fully-fixes the problem today, but not tomorrow. subscribeToCluster() is racy as well, but not yet used.

This was noticed when idleTimeout was firing, with some other code calling getState(true) to wake the channel back up. That may have made this panic more visible than it would be otherwise, but that has not been investigated.

b/412474567

This prevents a NPE and subsequent channel panic when trying to build a
config (because there are no watchers, so waitingOnResource==false)
without any listener and route.
```
java.lang.NullPointerException: Cannot invoke "io.grpc.xds.XdsDependencyManager$RdsUpdateSupplier.getRdsUpdate()" because "routeSource" is null
    at io.grpc.xds.XdsDependencyManager.buildUpdate(XdsDependencyManager.java:295)
    at io.grpc.xds.XdsDependencyManager.maybePublishConfig(XdsDependencyManager.java:266)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:899)
    at io.grpc.xds.XdsDependencyManager$EdsWatcher.onChanged(XdsDependencyManager.java:888)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.notifyWatcher(XdsClientImpl.java:929)
    at io.grpc.xds.client.XdsClientImpl$ResourceSubscriber.lambda$onData$0(XdsClientImpl.java:837)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:96)
```

I think this fully-fixes the problem today, but not tomorrow.
subscribeToCluster() is racy as well, but not yet used.

This was noticed when idleTimeout was firing, with some other code
calling getState(true) to wake the channel back up. That may have made
this panic more visible than it would be otherwise, but that has not
been investigated.

b/412474567
@ejona86 ejona86 requested a review from kannanjgithub April 23, 2025 01:43
Flow control prevents all but the first update from occurring before the
XdsClient unsubscriptions take effect.
@ejona86 ejona86 merged commit 25199e9 into grpc:master Apr 23, 2025
16 checks passed
@ejona86 ejona86 deleted the xdsdepman-shutdown-race branch April 23, 2025 16:18
@Override
public void onError(Status error) {
checkNotNull(error, "error");
if (cancelled) {
Copy link
Copy Markdown
Member

@shivaspeaks shivaspeaks Apr 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey Eric, one small doubt in this. How did we decide to put this after checkNotNull? I thought if we have already cancelled the watcher, we should just return. Checking not null should probably go after cancelled check?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checkNotNull() is checking the API was called correctly. Enforcing API requirements is no less relevant if the watcher is cancelled, although it is true that this implementation need not the results. If it is null here, it is a bug that needs fixing, independent of XdsDependencyManager's state. The robustness principal has been largely found to be a mistake. It is better to be strict of inputs, so bugs have a short lifetime and the system works as one expects. I want things to fail as early as possible, as there they are most obvious.

(I'd agree that the ordering could have been the other way here, and I did consider it. But what value is skipping the null check?)

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 24, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants