fix(router): release executor schema refs on graphMux shutdown #3011

Open
arutkowski00 wants to merge 1 commit into wundergraph:main from mondaycom:adamru/upstream-bugfix-executor-close
@arutkowski00

Problem

After config hot-reload, the shut-down Executor retained federation/client schema AST and plan config references, keeping the previous graph generation alive.

Fix

Add Executor.Close() to nil schema, plan config, and resolver references. Call it from graphMux.Shutdown() after plan caches are closed.
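A minimal sketch of what such a `Close()` method could look like. The field names follow the diff quoted in the review thread; the `Document`, `PlanConfiguration`, and `Resolver` types, the `[]string` type for `RenameTypeNames`, and the nil-receiver guard are hypothetical stand-ins for illustration, not the router's actual definitions.

```go
package main

import "fmt"

// Hypothetical stand-ins for the router's real schema, plan config,
// and resolver types. Only the Executor field names come from the PR.
type Document struct{ Types []string }
type PlanConfiguration struct{ DataSources []string }
type Resolver struct{}

type Executor struct {
	ClientSchema    *Document
	RouterSchema    *Document
	PlanConfig      PlanConfiguration
	RenameTypeNames []string
	Resolver        *Resolver
}

// Close drops the references held by the executor so that the data they
// point to becomes collectable once nothing else refers to it.
// The nil-receiver guard is an assumption added for this sketch.
func (e *Executor) Close() {
	if e == nil {
		return
	}
	e.ClientSchema = nil
	e.RouterSchema = nil
	e.PlanConfig = PlanConfiguration{} // reset to the zero value
	e.RenameTypeNames = nil
	e.Resolver = nil
}

func main() {
	e := &Executor{
		ClientSchema:    &Document{Types: []string{"Query"}},
		RouterSchema:    &Document{Types: []string{"Query", "_Service"}},
		PlanConfig:      PlanConfiguration{DataSources: []string{"accounts"}},
		RenameTypeNames: []string{"User"},
		Resolver:        &Resolver{},
	}
	e.Close()
	e.Close() // calling Close a second time is harmless in this sketch
	fmt.Println(e.ClientSchema == nil, e.RouterSchema == nil, e.Resolver == nil)
}
```

Note that this sketch does no synchronization: any goroutine still reading these fields while `Close()` runs would race with it.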

Test plan

  • TestExecutorCloseReleasesSchemaReferences
  • go test ./router/core/...

Part of config hot-reload memory leak investigation (monday.com #3221).

Add Executor.Close() to nil federation/client schemas, plan config, and
resolver after the mux drains in-flight work and closes plan caches, so
the previous graph generation can be garbage-collected after hot-reload.
@arutkowski00 arutkowski00 requested a review from a team as a code owner June 24, 2026 21:13
@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor

Warning

Review limit reached

@arutkowski00, we couldn't start this review because you've reached your PR review rate limit.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9cc0c63c-351e-4736-82e8-9c936c0851aa

📥 Commits

Reviewing files that changed from the base of the PR and between 7b40f14 and 46cfe20.

📒 Files selected for processing (3)
  • router/core/executor.go
  • router/core/executor_test.go
  • router/core/graph_server.go

Comment thread router/core/executor.go
Comment on lines +60 to +64
e.ClientSchema = nil
e.RouterSchema = nil
e.PlanConfig = plan.Configuration{}
e.RenameTypeNames = nil
e.Resolver = nil

@dkorittki dkorittki Jun 26, 2026

Contributor

The problem is that websockets use this Executor and are closed asynchronously. When this function is called, websocket connections are still alive (they have been signaled to close, but that can take a moment) and might still use the Executor, which has just nilled its components. This can cause nil panics, or at least unwanted errors. I need to check this in more depth and work out how to solve it.

Comment thread router/core/executor.go
e.RouterSchema = nil
e.PlanConfig = plan.Configuration{}
e.RenameTypeNames = nil
e.Resolver = nil

@dkorittki dkorittki Jun 29, 2026

Contributor

Suggested change
e.Resolver = nil

I think we can exclude this one. Compared to the other fields it's not heavy, and leaving it in place reduces the risk surface. Also, the instance this pointer points to is referenced by other pointers in other places as well (the WS handler/graphql handler), which are still active when we are here. So GC can't clean it up immediately anyway.
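To illustrate the point about shared references, here is a small sketch. It reads the comment as saying that the resolver is also held by the handlers; that reading, and the `resolver`, `executor`, and `wsHandler` types and the `release` helper, are assumptions made for this example. Clearing one pointer does not make the object unreachable while another holder still points at it.

```go
package main

import "fmt"

// Hypothetical stand-in types; only the Resolver field name comes from the PR.
type resolver struct{ id int }
type executor struct{ Resolver *resolver }
type wsHandler struct{ resolver *resolver }

// release clears the executor's reference only; other holders are unaffected.
func release(e *executor) {
	e.Resolver = nil
}

func main() {
	r := &resolver{id: 1}
	e := &executor{Resolver: r}
	h := &wsHandler{resolver: r}

	release(e)

	// The handler still reaches the same resolver, so the garbage collector
	// must keep it alive until the handler itself goes away.
	fmt.Println(e.Resolver == nil, h.resolver.id)
}
```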

EDIT: Also fixed by #3010 but review there still pending

@dkorittki

dkorittki commented Jun 29, 2026

Contributor

I did a deeper analysis and found other places this could lead to nil panics. A good risk reduction is described in #3011 (comment) but this alone is not sufficient. There are still races:

#1: In-flight request handling can now panic. Any request on the old graph mux unlucky enough to be in parse/normalize/validate/plan when we are here can hit it. Can happen on a busy server. These panics are recovered but technically this should not be hard to fix.

#2: When an active WS connection on the old graph mux initiates a new subscription it would panic, no recovery. Best thing would be the WS handler does not accept that anymore since we are about to close it anyways. At least we should have a recovery in place until we come up with a long term solution here.

#3: Persisted query cache rewarming can nil panic and crash the router. Rewarming happens on PQL manifest changes. If we are warming the cache (can take seconds) and then we are here the warming callback parses and plans its stored queries and that accesses these fields. Not so likely to happen but still it can crash the router. At the very least we should have a recovery until we have a long term solution.
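The short-term "recovery" mentioned for these cases could be sketched as a wrapper that turns a nil-pointer panic into an error. The `schema`, `executor`, `planQuery`, and `safely` names are hypothetical and not taken from the router code. The example is single-goroutine on purpose, so it only shows the recovery mechanics, not the underlying race.

```go
package main

import "fmt"

// Hypothetical stand-ins for the schema and the executor that holds it.
type schema struct{ name string }
type executor struct{ clientSchema *schema }

// planQuery dereferences the schema, so it panics with a nil pointer
// dereference once the executor's schema has been cleared.
func planQuery(e *executor, query string) string {
	return e.clientSchema.name + ":" + query
}

// safely runs fn and converts a panic raised inside it into an error.
// recover only has an effect in the goroutine that panicked, so each
// goroutine doing such work needs its own guard.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered from panic: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	e := &executor{clientSchema: &schema{name: "v1"}}

	err := safely(func() { _ = planQuery(e, "{ me }") })
	fmt.Println("before close, error:", err)

	e.clientSchema = nil // simulates the effect of Close() on the old graph

	err = safely(func() { _ = planQuery(e, "{ me }") })
	fmt.Println("after close, got error:", err != nil)
}
```

A guard like this keeps the process alive but does not remove the data race between the goroutine clearing the fields and the goroutines reading them; that would still need a longer-term fix such as refusing new work on the old mux or waiting for it to drain.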

So overall it seems to be possible if we omit nilling e.Resolver, which doesn't make sense anyway. I'll probably cherry-pick your commit into a new pull request and work on solving these issues myself.

2 participants