feat: add init container to web deployment for database migrations by chaholl · Pull Request #274 · langfuse/langfuse-k8s

chaholl · 2025-11-14T12:07:19Z

Summary

This PR adds an initialization container to the web deployment to resolve health check timeout issues during startup.

Problem

The main web container performs database migrations and other initialization tasks on startup, which can take a significant amount of time. This causes health check probes to time out and the pod to be killed before it can fully start.

Solution

- Add an init container that runs the full entrypoint script for initialization
- The init container handles database migrations, environment setup, and all initialization logic
- The main container starts only after initialization is complete, allowing it to pass health checks immediately

Changes

- Modified: charts/langfuse/templates/web/deployment.yaml - Added init container configuration
- Added: charts/langfuse/tests/init-container_test.yaml - Comprehensive test coverage

Key Benefits

✅ Prevents health check timeouts during startup
✅ Separates initialization concerns from web serving
✅ Runs full entrypoint script with proper error handling
✅ Preserves existing extraInitContainers functionality
✅ Fully tested with no regressions

Testing

- All existing tests pass (50/50)
- New comprehensive test coverage for init container functionality
- Helm linting passes without issues

Closes #272

Important

Adds an init container to deployment.yaml for database migrations, preventing health check timeouts, with tests in init-container_test.yaml.

Behavior:
- Adds init container langfuse-web-init to deployment.yaml for database migrations and initialization tasks.
- Main container starts post-initialization, preventing health check timeouts.
Configuration:
- Init container runs full entrypoint script and exits with echo "Init completed successfully".
- Uses same image as main container and includes environment variables.
- Preserves extraInitContainers functionality.
Testing:
- Adds init-container_test.yaml for testing init container configuration and functionality.
- Tests ensure init container exists, uses correct image, and handles custom init containers.

This description was created by Ellipsis for ee806f0. You can customize this summary. It will automatically update as commits are pushed.

- Add init container to handle database migrations and initialization
- Prevents health check timeouts by separating initialization from web serving
- Init container runs the full entrypoint script then exits cleanly
- Preserves existing extraInitContainers functionality
- Add comprehensive test coverage for init container functionality

Resolves startup issues where health checks were failing due to time-consuming database migrations running in the main container.

CLAassistant · 2025-11-14T12:07:25Z

All committers have signed the CLA.

ellipsis-dev · 2025-11-14T12:09:32Z

+          image: "{{ .Values.langfuse.web.image.repository }}:{{ coalesce .Values.langfuse.web.image.tag .Values.langfuse.image.tag .Chart.AppVersion }}"
+          imagePullPolicy: {{ .Values.langfuse.web.image.pullPolicy | default .Values.langfuse.image.pullPolicy }}
+          # Run the full entrypoint script for initialization (DB setup, migrations, etc.) then exit
+          command: ["echo", "Init completed successfully"]


Init container command only echoes a message. To perform DB migrations as intended, consider running the full entrypoint script or allowing an override through values.

Since the script is called by the ENTRYPOINT in the Docker image, we don't need to change it (and changing it may cause problems).

# Docker ENTRYPOINT (dumb-init) is covered by semantic versioning, not the entrypoint.sh itself
# Reasoning: ENTRYPOINT is overridden by some self-hosted deployments, thus changing this is breaking
ENTRYPOINT ["dumb-init", "--", "./web/entrypoint.sh"]

At the end of the script it executes the command args:

echo "[DEBUG] Entrypoint script completed successfully. Running CMD: $@"
exec "$@"

Arguably, we could change the command on the main container to skip running entrypoint.sh on start up but it may be a pain to maintain that logic since it's non-trivial:

# startup command - use dd-trace if NEXT_PUBLIC_LANGFUSE_CLOUD_REGION is configured
CMD if [ -n "$NEXT_PUBLIC_LANGFUSE_CLOUD_REGION" ]; then \
      node --import dd-trace/initialize.mjs ./web/server.js --keepAliveTimeout 110000; \
    else \
      node ./web/server.js --keepAliveTimeout 110000; \
    fi

I made another minor tweak. Even with the init container, the liveness probe is still kicking in too quickly and killing the web pod. Adjusting the default to 60s fixes that issue.

- Changed from 'command' to 'args' to allow entrypoint script to run
- This ensures the full initialization logic executes properly
- The entrypoint script handles environment setup, DB migrations, etc.
- Container exits cleanly after running 'echo Init completed successfully'
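
The difference between `command` and `args` that this commit relies on can be sketched as follows. In Kubernetes, `command` replaces the image ENTRYPOINT, while `args` replaces only the image CMD, so with `args` the entrypoint script still runs and then hands off to the given arguments through `exec "$@"`. This is a minimal illustration rather than the exact template from the PR; the container name is taken from the PR summary and the image expression is copied from the diff above.

```yaml
# Sketch only: an init container fragment for a pod spec.
initContainers:
  - name: langfuse-web-init
    image: "{{ .Values.langfuse.web.image.repository }}:{{ coalesce .Values.langfuse.web.image.tag .Values.langfuse.image.tag .Chart.AppVersion }}"
    # With `command`, the image ENTRYPOINT (dumb-init -- ./web/entrypoint.sh)
    # would be replaced and the script would never run:
    #   command: ["echo", "Init completed successfully"]
    # With `args`, only the image CMD is replaced. The ENTRYPOINT still runs
    # entrypoint.sh, which finishes with `exec "$@"` and so runs the echo.
    args: ["echo", "Init completed successfully"]
```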

- Update default initialDelaySeconds from 20s to 60s in both values.yaml and deployment template
- Gives main container more time to start after init container completes
- Prevents premature health check failures during startup

Steffen911 · 2025-11-14T14:25:14Z

      path: "/api/public/health"
      # -- Initial delay seconds for livenessProbe.
-      initialDelaySeconds: 20
+      initialDelaySeconds: 60


Is this necessary if an init container is in place? My understanding is that the health check would only start once the main container in the pod becomes active.

that's what I thought as well! :)

But unfortunately the main container is still taking longer than 20s to spin up and become ready to serve the health API. Maybe because it's still running the entrypoint script.
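
For context on why the delay still matters: `initialDelaySeconds` is counted from the moment the main container starts, so time spent in init containers does not count against it, but anything the main container itself does before it can serve `/api/public/health` does. A values override along these lines would set the delay explicitly. This is a sketch that assumes the probe settings live under `langfuse.web.livenessProbe`; the actual key path should be checked against the chart's values.yaml.

```yaml
# Assumed key path; verify against the chart's values.yaml.
langfuse:
  web:
    livenessProbe:
      # Seconds to wait after the main container starts before the first probe.
      initialDelaySeconds: 60
```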

Steffen911 · 2025-11-14T14:27:41Z

@chaholl We wondered whether chart hooks are a more applicable function here to schedule a single upgrade job before updating the containers. The init container may still suffer from race conditions as each pod attempts to start simultaneously, i.e. it would only address parts of the problem.
What do you think about those approaches?

An additional step as part of this PR could be to actively disable migrations in the main container, but that's probably not necessary as they'll complete quickly.

chaholl · 2025-11-14T14:32:45Z

For sure, chart hooks would be much nicer. But we're still running entrypoint.sh on main container start. We'd need to 'fix' that to really gain the benefit of running an upgrade job vs multiple init containers.

- Auto-generated documentation using helm-docs
- Reflects updated initialDelaySeconds from 20s to 60s

Steffen911 · 2025-11-14T14:35:57Z

> For sure, chart hooks would be much nicer. But we're still running entrypoint.sh on main container start. We'd need to 'fix' that to really gain the benefit of running an upgrade job vs multiple init containers.

If we were to set LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED=true and LANGFUSE_AUTO_POSTGRES_MIGRATION_DISABLED=true the migrations should be skipped within the entrypoint. In that case, the central init job should work well.
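
As a rough sketch, the two flags could be passed through the chart's `additionalEnv` list (the same key used in the values example later in this thread). One assumption to flag: if `additionalEnv` is also applied to the container that is supposed to run the migrations, they would be skipped there as well, so a real setup would need to scope these variables to the main web container only.

```yaml
langfuse:
  additionalEnv:
    # Skip automatic migrations in the entrypoint script.
    - name: LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED
      value: "true"
    - name: LANGFUSE_AUTO_POSTGRES_MIGRATION_DISABLED
      value: "true"
```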

chaholl · 2025-11-14T14:37:04Z

let me test that with the original 20s value and we'll see how it goes

Steffen911 · 2025-11-14T14:38:44Z

@chaholl Thank you so much! Let me know how it goes.
If this doesn't have a large benefit, I'll be happy to test and merge this PR early next week as this is a solid improvement already.

chaholl · 2025-11-14T15:10:29Z

Had a look into this. Init containers are the only option without causing a world of pain!

The problem with chart hooks is that they'll run either before we install the rest of the chart, or afterwards. Before is too soon because the databases aren't deployed, after is too late because the web has already been deployed. It would be easy enough if there were no subcharts, we'd just set the hook-weights, but with subcharts, it depends on what's exposed as values and whether they use hooks internally.

If we stick with the init containers, the locking mechanism in Prisma should prevent trouble.
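
For reference, the chart-hook approach discussed above would look roughly like the Job below. The `helm.sh/hook`, `helm.sh/hook-weight`, and `helm.sh/hook-delete-policy` annotations are standard Helm; the Job name, container name, and echo arguments are hypothetical placeholders. With `pre-install`, Helm runs the Job before the chart's other resources are created, which is the ordering problem described above when the databases come from subcharts.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: langfuse-migrate  # hypothetical name
  annotations:
    # Run before the release's other resources are installed or upgraded.
    "helm.sh/hook": pre-install,pre-upgrade
    # Hooks of the same kind run in ascending weight order.
    "helm.sh/hook-weight": "0"
    # Remove the previous hook Job before creating a new one.
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate  # hypothetical name
          image: "{{ .Values.langfuse.web.image.repository }}:{{ coalesce .Values.langfuse.web.image.tag .Values.langfuse.image.tag .Chart.AppVersion }}"
          # Database connection env vars omitted for brevity.
          # As with the init container, `args` leaves the image ENTRYPOINT in place.
          args: ["echo", "Migrations completed"]
```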

chaholl · 2025-11-17T11:27:11Z

I did some further investigation into this since the startup time is pretty slow for a Next.js app:

▲ Next.js 15.5.4

Local: http://localhost:3000
Network: http://0.0.0.0:3000
✓ Starting...
✓ Ready in 31.3s

The problem is caused by Prisma connection pool startup. Now, of course, we need a database connection, so there isn't much we can do about that! However, a lot of the initialization stuff that happens, like creating an initial org, slows things down even more.

Since it's happening in: https://github.com/langfuse/langfuse/blob/main/web/src/initialize.ts, that stuff is blocking the webserver from starting. So the probe endpoint isn't available to return a health indicator.

I found that adding this to values got it to start more reliably:

langfuse:
  additionalEnv:    
    - name: NEXT_PUBLIC_LANGFUSE_RUN_NEXT_INIT
      value: "false"

I guess an existing external database is faster to connect, so you probably don't notice this problem in the production cloud environment, but since the database client is initialized at load time, it'll try to connect straight away and potentially block other things from happening:

https://github.com/langfuse/langfuse/blob/3042f1aef011cd7c4401aa23055c0f460fca219c/packages/shared/src/db.ts#L111C1-L112C66

None of this directly affects this PR, I just include it to justify the long initial delay before the probe kicks in.

chaholl requested a review from Steffen911 as a code owner November 14, 2025 12:07

dosubot Bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Nov 14, 2025

ellipsis-dev Bot reviewed Nov 14, 2025


chaholl added 2 commits November 14, 2025 12:25

fix: increase liveness probe initial delay to 60s

23fd646


Steffen911 reviewed Nov 14, 2025


docs: update chart documentation for liveness probe changes

552cdf1

