feat: add init container to web deployment for database migrations #274

Open
chaholl wants to merge 4 commits into langfuse:main from chaholl:feature/web-init-container

Conversation

chaholl commented Nov 14, 2025

Summary

This PR adds an initialization container to the web deployment to resolve health check timeout issues during startup.

Problem

The main web container performs database migrations and other initialization tasks on startup, which can take a significant amount of time. This causes health check probes to time out and the pod to be killed before it can fully start.

Solution

  • Add an init container that runs the full entrypoint script for initialization
  • The init container handles database migrations, environment setup, and all initialization logic
  • Main container starts only after initialization is complete, allowing it to pass health checks immediately
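
To make the ordering concrete: Kubernetes runs every entry under initContainers to completion before it starts anything under containers. A minimal sketch of the resulting pod spec follows. Only the name langfuse-web-init and the echo message come from this PR; the image reference and the main container name are placeholders, and the real template carries more fields (env, resources, and so on).

```yaml
# Hypothetical rendered fragment of the web Deployment's pod spec.
spec:
  initContainers:
    # Must exit with status 0 before the main container is started.
    - name: langfuse-web-init
      image: "registry.example.com/langfuse-web:1.0.0"  # placeholder image
      # 'args' replaces only the image's CMD, so the image's ENTRYPOINT
      # (the initialization script) still runs first, then execs this echo.
      args: ["echo", "Init completed successfully"]
  containers:
    # Started only after every init container has completed successfully.
    - name: langfuse-web                                 # assumed name
      image: "registry.example.com/langfuse-web:1.0.0"  # placeholder image
```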

Changes

  • Modified: charts/langfuse/templates/web/deployment.yaml - Added init container configuration
  • Added: charts/langfuse/tests/init-container_test.yaml - Comprehensive test coverage

Key Benefits

  • ✅ Prevents health check timeouts during startup
  • ✅ Separates initialization concerns from web serving
  • ✅ Runs full entrypoint script with proper error handling
  • ✅ Preserves existing extraInitContainers functionality
  • ✅ Fully tested with no regressions

Testing

  • All existing tests pass (50/50)
  • New comprehensive test coverage for init container functionality
  • Helm linting passes without issues

Closes #272


Important

Adds an init container to deployment.yaml for database migrations, preventing health check timeouts, with tests in init-container_test.yaml.

  • Behavior:
    • Adds init container langfuse-web-init to deployment.yaml for database migrations and initialization tasks.
    • Main container starts post-initialization, preventing health check timeouts.
  • Configuration:
    • Init container runs full entrypoint script and exits with echo "Init completed successfully".
    • Uses same image as main container and includes environment variables.
    • Preserves extraInitContainers functionality.
  • Testing:
    • Adds init-container_test.yaml for testing init container configuration and functionality.
    • Tests ensure init container exists, uses correct image, and handles custom init containers.

This description was created by Ellipsis for ee806f0. It will automatically update as commits are pushed.

- Add init container to handle database migrations and initialization
- Prevents health check timeouts by separating initialization from web serving
- Init container runs the full entrypoint script then exits cleanly
- Preserves existing extraInitContainers functionality
- Add comprehensive test coverage for init container functionality

Resolves startup issues where health checks were failing due to
time-consuming database migrations running in the main container.
@chaholl chaholl requested a review from Steffen911 as a code owner November 14, 2025 12:07
CLAassistant commented Nov 14, 2025

CLA assistant check
All committers have signed the CLA.

dosubot (bot) added the size:L label (This PR changes 100-499 lines, ignoring generated files.) on Nov 14, 2025
image: "{{ .Values.langfuse.web.image.repository }}:{{ coalesce .Values.langfuse.web.image.tag .Values.langfuse.image.tag .Chart.AppVersion }}"
imagePullPolicy: {{ .Values.langfuse.web.image.pullPolicy | default .Values.langfuse.image.pullPolicy }}
# Run the full entrypoint script for initialization (DB setup, migrations, etc.) then exit
command: ["echo", "Init completed successfully"]
Contributor

Init container command only echoes a message. To perform DB migrations as intended, consider running the full entrypoint script or allowing an override through values.

Author

Since the script is called by the ENTRYPOINT in the Docker image, we don't need to change it (and changing it may cause problems).

# Docker ENTRYPOINT (dumb-init) is covered by semantic versioning, not the entrypoint.sh itself
# Reasoning: ENTRYPOINT is overridden by some self-hosted deployments, thus changing this is breaking
ENTRYPOINT ["dumb-init", "--", "./web/entrypoint.sh"]

At the end of the script it executes the command args:

echo "[DEBUG] Entrypoint script completed successfully. Running CMD: $@"
exec "$@"

Author

Arguably, we could change the command on the main container to skip running entrypoint.sh on startup, but that logic may be a pain to maintain since it's non-trivial:

# startup command - use dd-trace if NEXT_PUBLIC_LANGFUSE_CLOUD_REGION is configured
CMD if [ -n "$NEXT_PUBLIC_LANGFUSE_CLOUD_REGION" ]; then \
    node --import dd-trace/initialize.mjs ./web/server.js --keepAliveTimeout 110000; \
    else \
    node ./web/server.js --keepAliveTimeout 110000; \
    fi

Author

I made another minor tweak. Even with the init container, the liveness probe is still kicking in too quickly and killing the web pod. Raising the default initialDelaySeconds to 60s fixes that issue.

- Changed from 'command' to 'args' to allow entrypoint script to run
- This ensures the full initialization logic executes properly
- The entrypoint script handles environment setup, DB migrations, etc.
- Container exits cleanly after running 'echo Init completed successfully'
- Update default initialDelaySeconds from 20s to 60s in both values.yaml and deployment template
- Gives main container more time to start after init container completes
- Prevents premature health check failures during startup
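
The switch from command to args matters because of how Kubernetes maps these fields onto the image: command replaces the image's ENTRYPOINT, while args replaces only its CMD. A side-by-side sketch, assuming the image's ENTRYPOINT is the dumb-init plus ./web/entrypoint.sh line quoted earlier in this thread; the container names and image are placeholders:

```yaml
initContainers:
  # 'command' overrides ENTRYPOINT: entrypoint.sh never runs, so no
  # migrations happen and the container only prints the message.
  - name: init-with-command                            # hypothetical name
    image: "registry.example.com/langfuse-web:1.0.0"   # placeholder image
    command: ["echo", "Init completed successfully"]

  # 'args' overrides CMD only: ENTRYPOINT still runs entrypoint.sh, which
  # ends with exec "$@" and therefore runs the echo last, then exits.
  - name: init-with-args                               # hypothetical name
    image: "registry.example.com/langfuse-web:1.0.0"   # placeholder image
    args: ["echo", "Init completed successfully"]
```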
path: "/api/public/health"
# -- Initial delay seconds for livenessProbe.
-initialDelaySeconds: 20
+initialDelaySeconds: 60
Member

Is this necessary if an init container is in place? My understanding is that the health check would only start once the main container in the pod becomes active.

Author

that's what I thought as well! :)

But unfortunately the main container is still taking longer than 20s to spin up and become ready to serve the health API, maybe because it's still running the entrypoint script.
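
For anyone who wants to tune the delay per installation rather than rely on the chart default, a values override along these lines should work. The key path langfuse.web.livenessProbe.initialDelaySeconds is an assumption inferred from the values.yaml comment in the diff above, so check it against the chart's values.yaml before use.

```yaml
# Values override (sketch). The key path is assumed, not confirmed here.
langfuse:
  web:
    livenessProbe:
      # Seconds to wait after the main container starts before the first
      # liveness check; init container run time is not counted.
      initialDelaySeconds: 60
```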

@Steffen911
Member

@chaholl We wondered whether chart hooks are a more applicable function here to schedule a single upgrade job before updating the containers. The init container may still suffer from race conditions as each pod attempts to start simultaneously, i.e. it would only address parts of the problem.
What do you think about those approaches?

An additional step as part of this PR could be to actively disable migrations in the main container, but that's probably not necessary as they'll complete quickly.

@chaholl
Author

chaholl commented Nov 14, 2025

For sure, chart hooks would be much nicer. But we're still running entrypoint.sh on main container start. We'd need to 'fix' that to really gain the benefit of running an upgrade job vs multiple init containers.

- Auto-generated documentation using helm-docs
- Reflects updated initialDelaySeconds from 20s to 60s
@Steffen911
Member

> For sure, chart hooks would be much nicer. But we're still running entrypoint.sh on main container start. We'd need to 'fix' that to really gain the benefit of running an upgrade job vs multiple init containers.

If we were to set LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED=true and LANGFUSE_AUTO_POSTGRES_MIGRATION_DISABLED=true the migrations should be skipped within the entrypoint. In that case, the central init job should work well.
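
As a sketch of what that could look like in chart values, the two variables named above can be passed through the langfuse.additionalEnv key that is used elsewhere in this thread. Whether these values are applied to both the init container and the main container depends on the template, which is not shown here, so treat the scope as an assumption.

```yaml
# Values override (sketch): skip automatic migrations in the entrypoint.
langfuse:
  additionalEnv:
    - name: LANGFUSE_AUTO_CLICKHOUSE_MIGRATION_DISABLED
      value: "true"
    - name: LANGFUSE_AUTO_POSTGRES_MIGRATION_DISABLED
      value: "true"
```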

@chaholl
Author

chaholl commented Nov 14, 2025

let me test that with the original 20s value and we'll see how it goes

@Steffen911
Member

@chaholl Thank you so much! Let me know how it goes.
If this doesn't have a large benefit, I'll be happy to test and merge this PR early next week as this is a solid improvement already.

@chaholl
Author

chaholl commented Nov 14, 2025

Had a look into this. Init containers are the only option without causing a world of pain!

The problem with chart hooks is that they'll run either before we install the rest of the chart or afterwards. Before is too soon because the databases aren't deployed; after is too late because the web has already been deployed. It would be easy enough if there were no subcharts: we'd just set the hook weights. With subcharts, it depends on what's exposed as values and whether they use hooks internally.
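
For reference, a chart hook of the kind discussed here is an ordinary Job carrying helm.sh/hook annotations. The sketch below is hypothetical and not part of this PR: the Job name, image, and args are placeholders. As a pre-install hook it would run before the release's other resources are created, which is the timing problem described above.

```yaml
# Hypothetical migration Job run as a Helm hook (illustration only).
apiVersion: batch/v1
kind: Job
metadata:
  name: langfuse-migrate                # hypothetical name
  annotations:
    # Run before the release's regular resources are installed or upgraded.
    "helm.sh/hook": pre-install,pre-upgrade
    # Hooks with lower weights run first; the value must be a string.
    "helm.sh/hook-weight": "0"
    # Remove any previous hook Job before creating a new one, and clean up
    # after a successful run.
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: "registry.example.com/langfuse-web:1.0.0"  # placeholder image
          # Placeholder: relies on the image ENTRYPOINT doing the migrations.
          args: ["echo", "Migrations completed"]
```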

If we stick with the init containers, the locking mechanism in Prisma should prevent trouble.

@chaholl
Author

chaholl commented Nov 17, 2025

I did some further investigation into this since the startup time is pretty slow for a Next.js app:

▲ Next.js 15.5.4

Local: http://localhost:3000
Network: http://0.0.0.0:3000
✓ Starting...
✓ Ready in 31.3s

The problem is caused by Prisma connection pool startup. Now, of course, we need a database connection, so there isn't much we can do about that! However, a lot of the initialization stuff that happens, like creating an initial org, slows things down even more.

Since that happens in https://github.com/langfuse/langfuse/blob/main/web/src/initialize.ts, it blocks the web server from starting, so the probe endpoint isn't available to return a health indicator.

I found that adding this to values got it to start more reliably:

langfuse:
  additionalEnv:
    - name: NEXT_PUBLIC_LANGFUSE_RUN_NEXT_INIT
      value: "false"

I guess an existing external database is faster to connect to, so you probably don't notice this problem in the production cloud environment. But since the database client is initialized at load time, it'll try to connect straight away and potentially block other things from happening:

https://github.com/langfuse/langfuse/blob/3042f1aef011cd7c4401aa23055c0f460fca219c/packages/shared/src/db.ts#L111C1-L112C66

None of this directly affects this PR, I just include it to justify the long initial delay before the probe kicks in.

Labels

size:L This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

bug: Installing the helm chart for a new self-hosted deployment leaves the database in an ususeable state

3 participants