Skip to content

fix: Do not clear the schema cache during retries#4869

Open
mkleczek wants to merge 1 commit into
PostgREST:mainfrom
mkleczek:push-olktzqvzkonu
Open

fix: Do not clear the schema cache during retries#4869
mkleczek wants to merge 1 commit into
PostgREST:mainfrom
mkleczek:push-olktzqvzkonu

Conversation

@mkleczek
Copy link
Copy Markdown
Collaborator

@mkleczek mkleczek commented May 4, 2026

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

retryingSchemaCacheLoad should not clear existing schema cache upon failure - there is no reason to do that.
If there is a communication issue with the database server or db is down, clients are going to get 502 anyway. If it was a glitch when loading the schema cache - the clients are going to use old (stale) schema cache for some time until next retry re-loads it successfully.

Fixes #4873

@mkleczek mkleczek requested a review from steve-chavez May 4, 2026 07:49
@mkleczek mkleczek force-pushed the push-olktzqvzkonu branch from c468e80 to 36be579 Compare May 4, 2026 17:35
@steve-chavez
Copy link
Copy Markdown
Member

Clarifying the motivation on #4873

Comment thread CHANGELOG.md Outdated
@mkleczek mkleczek force-pushed the push-olktzqvzkonu branch 2 times, most recently from 5201ea9 to 82cf2e7 Compare May 4, 2026 18:08
@steve-chavez
Copy link
Copy Markdown
Member

Follow up question on #4873 (comment), not sure if it's correct to do the fix like this.

Comment thread test/io/test_io.py
Comment thread test/io/fixtures/roles.sql
@mkleczek mkleczek force-pushed the push-olktzqvzkonu branch 3 times, most recently from 02f8fbc to 4d6a87d Compare May 12, 2026 06:29
@mkleczek
Copy link
Copy Markdown
Collaborator Author

Follow up question on #4873 (comment), not sure if it's correct to do the fix like this.

@steve-chavez - So what's the conclusion after #4873 (comment), #4873 (comment) and all discussions in #4873 ?

Are we going to proceed with this PR?

@steve-chavez
Copy link
Copy Markdown
Member

@mkleczek We need to clarify what are we going to do with the request "waiting" (not sure how to call this) mentioned on #4873 (comment) because there's also a scenario here where we void the schema cache and we force clients to wait while the schema cache is reloaded again. If that "waiting" is useless, let's settle it on #4873 (comment).

@mkleczek mkleczek force-pushed the push-olktzqvzkonu branch 7 times, most recently from 350e946 to f92122f Compare May 19, 2026 10:56
retryingSchemaCacheLoad should not clear existing schema cache upon failure - there is no reason to do that. If there is a communication issue with the database server or db is down, clients are going to get 502 anyway. If it was a glitch when loading the schema cache - the clients are going to use old (stale) schema cache for some time until next retry re-loads it successfully.
@mkleczek mkleczek force-pushed the push-olktzqvzkonu branch from f92122f to 3b60b73 Compare May 20, 2026 10:35
@steve-chavez
Copy link
Copy Markdown
Member

steve-chavez commented May 20, 2026

Now that we clarified the "waiting" was ineffective on #4937, let's proceed with reviewing this.

Comment thread test/io/test_io.py
Comment on lines 748 to +766
@@ -763,7 +763,7 @@ def test_admin_ready_includes_schema_cache_state(defaultenv, metapostgrest):
assert response.status_code == 503

response = postgrest.session.get("/projects", timeout=1)
assert response.status_code == 503
assert response.status_code == 200
Copy link
Copy Markdown
Member

@steve-chavez steve-chavez May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkleczek One thing that is confusing, this is the full test we're changing here:

def test_admin_ready_includes_schema_cache_state(defaultenv, metapostgrest):
"Should get a failed response from the admin server ready endpoint when the schema cache is not loaded"
role = "timeout_authenticator"
env = {
**defaultenv,
"PGUSER": role,
"PGRST_DB_ANON_ROLE": role,
"PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP": "500",
}
with run(env=env) as postgrest:
# The schema cache query takes at least 500ms, due to PGRST_INTERNAL_SCHEMA_CACHE_QUERY_SLEEP above.
# Make it impossible to load the schema cache, by setting statement timeout to 400ms.
set_statement_timeout(metapostgrest, role, 400)
# force a reconnection so the new role setting is picked up
postgrest.process.send_signal(signal.SIGUSR1)
postgrest.wait_until_scache_starts_loading()
response = postgrest.admin.get("/ready", timeout=1)
assert response.status_code == 503
response = postgrest.session.get("/projects", timeout=1)
assert response.status_code == 503
reset_statement_timeout(metapostgrest, role)

So the /ready endpoint will give 503 but a request to /projects will give 200? An admin will not be able to correlate the time the health check was failing with the users' requests failing. I don't think this behavior is correct.

Copy link
Copy Markdown
Collaborator Author

@mkleczek mkleczek May 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm... but what is there to correlate if a request returns 200? There is no error in this case. Schema cache is stale but we still can handle the query.

But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But this is a good question: what should admin server return from /ready during schema cache reloading? 200 or 503?

Under this change, it looks we should return 200 on /ready once the startup scache load is done, no matter if it fails later.

And perhaps only return 503 if the db is down (detected by connection error or perhaps by pool metric).

But TBH not sure if any of the above are right.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

A statement timeout can void the schema cache

3 participants