fix(dab): restart dab-postgres on-failure (mirror the dab-mongo fix) #12
Merged
…lose the DB
The dab-postgres service had a pg_isready healthcheck and a service_healthy
startup gate but no restart policy (unlike dab-mongo). When postgres:17 dies
mid-run (crash/OOM) it is never restarted, the container leaves dab-net, and
`dab-postgres` stops resolving for the rest of the trial ('could not translate
host name dab-postgres'). For hybrid pg+duckdb datasets like PANCANCER_ATLAS the
clinical data then becomes unreachable and the agent can only abstain — observed
on PANCANCER_ATLAS q2/q3 (both 'UNABLE TO DETERMINE', reward 0), which alone
account for ~0.056 of stratified pass@1.
Add 'restart: on-failure' to dab-postgres, mirroring the dab-mongo fix, so the
populated data dir comes straight back up and main's healthcheck recovers.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pull request overview
This PR improves the reliability of the DAB plugin’s generated docker-compose stack by ensuring the dab-postgres container is automatically restarted if it crashes mid-run, matching the existing behavior for dab-mongo.
Changes:
- Add `restart: on-failure` to the generated `dab-postgres` service in `compose.py`.
- Add a unit test asserting the postgres service includes the restart policy.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| packages/razorback-plugin-dab/src/razorback_plugin_dab/generate/compose.py | Adds restart: on-failure to the generated dab-postgres service to recover from mid-run crashes/OOMs. |
| packages/razorback-plugin-dab/tests/unit/test_compose_postgres.py | Adds a unit test to enforce the postgres restart policy in emitted compose YAML. |
Problem
The `dab-postgres` service in the DAB plugin's compose generation has a `pg_isready` healthcheck and a `service_healthy` startup gate, but no `restart:` policy, unlike `dab-mongo`, which already has `restart: on-failure`.

When `postgres:17` dies mid-run (crash/OOM), it is never restarted, the container leaves `dab-net`, and `dab-postgres` stops resolving for the rest of the trial (`could not translate host name "dab-postgres": Temporary failure in name resolution`). For hybrid Postgres+DuckDB datasets like PANCANCER_ATLAS, the clinical data then becomes unreachable and the agent can only abstain.

Observed impact

In a codex/gpt-5.5 full DAB run, `PANCANCER_ATLAS` q2 and q3 both returned `UNABLE TO DETERMINE` (reward 0). Trial evidence:

- … (`dab-postgres:5432` reachable at start),
- `could not translate host name "dab-postgres"` on every psycopg2 retry,
- … (`q1`) in the same run reached `dab-postgres` fine minutes earlier → transient, restart-drops-DNS signature.

These two queries alone account for ~0.056 of stratified pass@1 (≈ the whole observed gap vs the Opus baseline on this surface).
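When triaging other trials for the same signature, a small text check on the connection error can separate "the service name stopped resolving" from "the server refused the connection". The sketch below is illustrative only and is not part of this PR; the helper name `looks_like_dns_drop` and the marker list are assumptions, built from the error string quoted above.

```python
# Hypothetical triage helper (not part of the PR): flags error text that
# matches the name-resolution failure quoted in the trial evidence.
DNS_DROP_MARKERS = (
    "could not translate host name",
    "temporary failure in name resolution",
)


def looks_like_dns_drop(error_text: str, host: str = "dab-postgres") -> bool:
    """Return True when the error mentions `host` and a name-resolution marker."""
    text = error_text.lower()
    mentions_host = host.lower() in text
    has_marker = any(marker in text for marker in DNS_DROP_MARKERS)
    return mentions_host and has_marker


if __name__ == "__main__":
    sample = (
        'could not translate host name "dab-postgres": '
        "Temporary failure in name resolution"
    )
    print(looks_like_dns_drop(sample))
```

A "connection refused" error for the same host would not match, which is the distinction that matters here: a refused connection means the container is still on the network, while a resolution failure suggests it has left `dab-net`.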
Fix
Add `"restart": "on-failure"` to the `dab-postgres` service, mirroring the existing `dab-mongo` fix. The populated data dir comes straight back up and `main`'s `service_healthy` gate recovers instead of the trial losing the DB for good. No other change: the healthcheck and `depends_on: {condition: service_healthy}` were already present.

Test

Adds `test_postgres_has_restart_policy` (mirrors `test_mongo_has_restart_and_cache_cap`). `test_compose_postgres.py` + `test_compose_mongo.py` → 11 passed.

🤖 Generated with Claude Code
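For readers without the diff open, the shape of the change and its test can be sketched roughly as below. This is a minimal sketch, not the code in `compose.py`: the function names `build_postgres_service` and `build_main_service`, the exact healthcheck command, and the test name are assumptions made for illustration. Only the `restart: on-failure` key, the `pg_isready` healthcheck, the `service_healthy` condition, the `postgres:17` image, and the `dab-net` network come from the description above.

```python
# Illustrative stand-ins for the generated compose services (hypothetical names).

def build_postgres_service() -> dict:
    """Return a dab-postgres service mapping with the restart policy added."""
    return {
        "image": "postgres:17",
        "networks": ["dab-net"],
        # Healthcheck details are assumed; the PR only says pg_isready is used.
        "healthcheck": {"test": ["CMD-SHELL", "pg_isready"]},
        # The one-line fix: restart the container if it exits with an error.
        "restart": "on-failure",
    }


def build_main_service() -> dict:
    """Return the part of `main` that gates startup on a healthy postgres."""
    return {
        "networks": ["dab-net"],
        "depends_on": {"dab-postgres": {"condition": "service_healthy"}},
    }


def test_postgres_service_restarts_on_failure() -> None:
    """Sketch of a unit test in the spirit of the one this PR adds."""
    service = build_postgres_service()
    assert service.get("restart") == "on-failure"


if __name__ == "__main__":
    test_postgres_service_restarts_on_failure()
    print("ok")
```

Note that `on-failure` restarts the container only on a non-zero exit, so a deliberate clean stop is left alone.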