feat: add init container to web deployment for database migrations#274
feat: add init container to web deployment for database migrations#274chaholl wants to merge 4 commits into
Conversation
- Add init container to handle database migrations and initialization - Prevents health check timeouts by separating initialization from web serving - Init container runs the full entrypoint script then exits cleanly - Preserves existing extraInitContainers functionality - Add comprehensive test coverage for init container functionality Resolves startup issues where health checks were failing due to time-consuming database migrations running in the main container.
| image: "{{ .Values.langfuse.web.image.repository }}:{{ coalesce .Values.langfuse.web.image.tag .Values.langfuse.image.tag .Chart.AppVersion }}" | ||
| imagePullPolicy: {{ .Values.langfuse.web.image.pullPolicy | default .Values.langfuse.image.pullPolicy }} | ||
| # Run the full entrypoint script for initialization (DB setup, migrations, etc.) then exit | ||
| command: ["echo", "Init completed successfully"] |
There was a problem hiding this comment.
Init container command only echoes a message. To perform DB migrations as intended, consider running the full entrypoint script or allowing an override through values.
There was a problem hiding this comment.
since script is called on ENTRYPOINT in the docker image, we don't need to change it (and changing it may cause problems).
# Docker ENTRYPOINT (dumb-init) is covered by semantic versioning, not the entrypoint.sh itself
# Reasoning: ENTRYPOINT is overridden by some self-hosted deployments, thus changing this is breaking
ENTRYPOINT ["dumb-init", "--", "./web/entrypoint.sh"]
At the end of the script it executes the command args:
echo "[DEBUG] Entrypoint script completed successfully. Running CMD: $@"
exec "$@"
There was a problem hiding this comment.
Arguably, we could change the command on the main container to skip running entrypoint.sh on start up but it may be a pain to maintain that logic since it's non-trivial:
# startup command - use dd-trace if NEXT_PUBLIC_LANGFUSE_CLOUD_REGION is configured
CMD if [ -n "$NEXT_PUBLIC_LANGFUSE_CLOUD_REGION" ]; then \
node --import dd-trace/initialize.mjs ./web/server.js --keepAliveTimeout 110000; \
else \
node ./web/server.js --keepAliveTimeout 110000; \
fi
There was a problem hiding this comment.
I made anther minor tweak. Even with the init container, the liveness prob is still kicking in too quickly and killing the web pod. Adjusting the default to 60s fixes that issue.
- Changed from 'command' to 'args' to allow entrypoint script to run - This ensures the full initialization logic executes properly - The entrypoint script handles environment setup, DB migrations, etc. - Container exits cleanly after running 'echo Init completed successfully'
- Update default initialDelaySeconds from 20s to 60s in both values.yaml and deployment template - Gives main container more time to start after init container completes - Prevents premature health check failures during startup
| path: "/api/public/health" | ||
| # -- Initial delay seconds for livenessProbe. | ||
| initialDelaySeconds: 20 | ||
| initialDelaySeconds: 60 |
There was a problem hiding this comment.
Is this necessary if an init container is in place? My understanding is that the health check would only start once the main container in the pod becomes active.
There was a problem hiding this comment.
that's what I thought as well! :)
But unfortunately the main container is still talking longer than 20s to spin up and become ready to serve the health api. Maybe because it's still running the entrypoint script
|
@chaholl We wondered whether chart hooks are a more applicable function here to schedule a single upgrade job before updating the containers. The init container may still suffer from race conditions as each pod attempts to start simultaneously, i.e. it would only address parts of the problem. An additional step as part of this PR could be too actively disable migrations in the main container, but that's probably not necessary as they'll complete quickly. |
|
For sure, chart hooks would be much nicer. But we're still running entrypoint.sh on main container start. We'd need to 'fix' that to really gain the benefit of running an upgrade job vs multiple init containers. |
- Auto-generated documentation using helm-docs - Reflects updated initialDelaySeconds from 20s to 60s
If we were to set |
|
let me test that with the original 20s value and we'll see how it goes |
|
@chaholl Thank you so much! Let me know how it goes. |
|
Had a look into this. Init containers are the only option without causing a world of pain! The problem with chart hooks is that they'll run either before we install the rest of the chart, or afterwards. Before is too soon because the databases aren't deployed, after is too late because the web has already been deployed. It would be easy enough if there were no subcharts, we'd just set the hook-weights, but with subcharts, it depends on what's exposed as values and whether they use hooks internally. If we stick with the init containers, the locking mechanism in prisma should prevent trouble |
|
I did some further investigation into this since the startup time is pretty slow for a next.js app: The problem is caused by Prisma connection pool startup. Now, of course, we need a database connection, so there isn't much we can do about that! However, a lot of the initialization stuff that happens, like creating an initial org, slows things down even more. Since it's happening in: https://github.com/langfuse/langfuse/blob/main/web/src/initialize.ts, that stuff is blocking the webserver from starting. So the probe endpoint isn't available to return a health indicator. I found that adding this to values got it to start more reliably: I guess an existing external database is faster to connect, so you probably don't notice this problem in the production cloud environment, but since the database client is initialized at load time, it'll try to connect straight away and potentially block other things from happening: None of this directly affects this PR, I just include it to justify the long initial delay before the probe kicks in. |
Summary
This PR adds an initialization container to the web deployment to resolve health check timeout issues during startup.
Problem
The main web container performs database migrations and other initialization tasks on startup, which can take a significant amount of time. This causes health check probes to timeout and the pod to be killed before it can fully start.
Solution
Changes
charts/langfuse/templates/web/deployment.yaml- Added init container configurationcharts/langfuse/tests/init-container_test.yaml- Comprehensive test coverageKey Benefits
extraInitContainersfunctionalityTesting
Closes #272
Important
Adds an init container to
deployment.yamlfor database migrations, preventing health check timeouts, with tests ininit-container_test.yaml.langfuse-web-inittodeployment.yamlfor database migrations and initialization tasks.echo "Init completed successfully".extraInitContainersfunctionality.init-container_test.yamlfor testing init container configuration and functionality.This description was created by
for ee806f0. You can customize this summary. It will automatically update as commits are pushed.