| layout | default |
|---|---|
| title | Langfuse Tutorial - Chapter 3: Prompt Management |
| nav_order | 3 |
| has_children | false |
| parent | Langfuse Tutorial |
Welcome to Chapter 3: Prompt Management. In this part of the Langfuse Tutorial -- LLM Observability, Evaluation, and Prompt Operations -- you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.
Version, release, and A/B test prompts directly from Langfuse.
Prompts are the most frequently changed part of any LLM application. A small wording tweak can dramatically affect quality, cost, and safety. Langfuse lets you store prompts centrally, version them automatically, attach labels for release management, and fetch them at runtime -- all without redeploying your application. In this chapter you will:
- Understand the prompt lifecycle from creation to monitoring.
- Work with both Chat and Text prompt types.
- Use variables and templates effectively.
- Fetch and cache prompts in your application.
- Monitor prompt performance across versions.
- Walk through a complete prompt iteration workflow.
- Learn prompt engineering best practices within Langfuse.
A prompt in Langfuse goes through a clear lifecycle. Understanding these stages helps you manage changes safely:
```mermaid
graph LR
    A[Create Prompt] --> B[New Version<br/>auto-incremented]
    B --> C[Assign Label<br/>e.g. staging]
    C --> D[Deploy to App<br/>SDK fetches by label]
    D --> E[Monitor Performance<br/>scores, cost, latency]
    E -->|iterate| B

    style A fill:#e0f2fe,stroke:#0284c7
    style B fill:#fef3c7,stroke:#d97706
    style C fill:#f3e8ff,stroke:#9333ea
    style D fill:#dcfce7,stroke:#16a34a
    style E fill:#fce7f3,stroke:#db2777
```
- Create -- Define a prompt in the Langfuse UI or via the API. Give it a descriptive name like `support_reply` or `summarizer_v2`.
- Version -- Every edit creates a new, immutable version. Versions are auto-incremented integers (1, 2, 3, ...).
- Label -- Assign labels like `production`, `staging`, or `beta` to specific versions. Labels are movable pointers -- you can relabel instantly to roll back.
- Deploy -- Your application fetches the prompt by name and label at runtime. No redeploy needed.
- Monitor -- Attach the `prompt_version` to your traces and use Langfuse analytics to compare quality, cost, and latency across versions.
Langfuse supports two prompt types. Choosing the right one depends on how your LLM expects its input.
A text prompt is a single string with optional {{variables}}. It is ideal for completion-style models or when you build your own message array.
```
You are a helpful customer support agent for {{company_name}}.
The customer's name is {{customer_name}} and their issue is: {{issue}}.
Respond politely and provide a clear solution based on the following context:
{{context}}
```
When you call prompt.compile(...), Langfuse returns the rendered string.
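For illustration, here is a minimal sketch of registering and rendering a text prompt with the Python SDK. The prompt name `support_reply_text` is a placeholder, not part of the tutorial's running example:

```python
from langfuse import Langfuse

langfuse = Langfuse()

# Register a text prompt (each save creates a new immutable version).
langfuse.create_prompt(
    name="support_reply_text",  # placeholder name for this sketch
    type="text",
    prompt=(
        "You are a helpful customer support agent for {{company_name}}.\n"
        "The customer's name is {{customer_name}} and their issue is: {{issue}}.\n"
        "Respond politely using the following context:\n{{context}}"
    ),
    labels=["staging"],
)

# Fetch it back and render; compile() returns a single string for text prompts.
prompt = langfuse.get_prompt("support_reply_text", label="staging")
rendered = prompt.compile(
    company_name="Acme Corp",
    customer_name="Alex",
    issue="billing error on invoice #789",
    context="We refunded invoice #789.",
)
print(rendered)  # plain string, ready for a completion-style model
```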
A chat prompt is an array of messages, each with a role and content. This maps directly to the format expected by OpenAI, Anthropic, and most chat-based APIs.
```json
[
  {
    "role": "system",
    "content": "You are a support agent for {{company_name}}. Be concise and helpful."
  },
  {
    "role": "user",
    "content": "Hi, my name is {{customer_name}}. I have an issue with {{issue}}.\n\nContext: {{context}}"
  }
]
```

When you call prompt.compile(...) on a chat prompt, Langfuse returns a list of message dictionaries ready to pass to your LLM client.
| Use case | Prompt type |
|---|---|
| OpenAI / Anthropic chat completions | Chat |
| Single-prompt completions or embeddings | Text |
| Complex multi-turn templates | Chat |
| Simple string interpolation | Text |
You can create prompts through the UI or the SDK.
- Navigate to Prompts in the sidebar.
- Click New Prompt.
- Choose Chat or Text type.
- Enter a name (e.g., `support_reply`).
- Write your template with `{{variable}}` placeholders.
- Click Save -- this creates version 1.
```python
from langfuse import Langfuse

langfuse = Langfuse()

# Create a chat prompt
langfuse.create_prompt(
    name="support_reply",
    type="chat",
    prompt=[
        {
            "role": "system",
            "content": "You are a support agent for {{company_name}}. Be helpful and concise.",
        },
        {
            "role": "user",
            "content": "Customer {{customer_name}} asks: {{issue}}\n\nContext: {{context}}",
        },
    ],
    labels=["staging"],  # immediately label this version
    config={
        "model": "gpt-4o-mini",
        "temperature": 0.3,
        "max_tokens": 500,
    },
)
```

The optional `config` object lets you store model parameters alongside the prompt. Your application can read these at runtime to stay in sync.
Variables are the bridge between your static prompt template and the dynamic data in each request. They use double-curly-brace syntax: {{variable_name}}.
When you write a prompt template, any {{token}} becomes a variable. Langfuse automatically detects them and lists them in the UI.
The compile method replaces variables with the values you provide:
```python
prompt = langfuse.get_prompt("support_reply", label="production")

messages = prompt.compile(
    customer_name="Alex",
    issue="billing error on invoice #789",
    context="We refunded invoice #789 and updated the payment method on file.",
    company_name="Acme Corp",
)
# messages is now a list of dicts ready for your LLM client
```

- Name variables descriptively: `{{customer_name}}` is better than `{{name}}`.
- Document expected types: In the prompt description field, note whether a variable expects a string, a list, or structured data.
- Provide defaults in your code: If a variable might be missing, handle it gracefully before calling `compile` (see the sketch after this list).
- Avoid secrets: Never pass API keys, passwords, or tokens as prompt variables. They would be stored in Langfuse.
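One way to handle the "provide defaults" point above is to merge each request's variables over a dictionary of safe defaults before compiling. The defaults and helper below are purely illustrative:

```python
# Illustrative defaults for variables that may be missing on some requests.
DEFAULT_VARS = {
    "company_name": "Acme Corp",
    "context": "No additional context provided.",
}

def compile_with_defaults(prompt, **variables):
    """Fill in missing or None variables with safe defaults before compile()."""
    provided = {k: v for k, v in variables.items() if v is not None}
    return prompt.compile(**{**DEFAULT_VARS, **provided})

messages = compile_with_defaults(
    prompt,  # a prompt fetched earlier via langfuse.get_prompt(...)
    customer_name="Alex",
    issue="billing error on invoice #789",
    context=None,  # falls back to the default above
)
```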
The config field on a prompt is a free-form dictionary. A common pattern is to store model settings there:
```python
prompt = langfuse.get_prompt("support_reply", label="production")

# Read model config from the prompt
model = prompt.config.get("model", "gpt-4o-mini")
temperature = prompt.config.get("temperature", 0.7)
max_tokens = prompt.config.get("max_tokens", 500)

messages = prompt.compile(customer_name="Alex", issue="billing", context="...", company_name="Acme")

resp = client.chat.completions.create(
    model=model,
    temperature=temperature,
    max_tokens=max_tokens,
    messages=messages,
)
```

This way, non-engineers on your team can adjust model parameters from the Langfuse UI without touching code.
Every time you save a prompt (in the UI or via create_prompt), Langfuse creates a new immutable version. Versions are integers that increment automatically. You can never edit an existing version -- only create a new one.
Labels are movable pointers to versions. Think of them like Git tags that you can reassign:
- `production` -- the version your live application uses.
- `staging` -- the version being tested before promotion.
- `beta` -- an experimental version for a subset of users.
- `latest` -- Langfuse automatically assigns this to the newest version.
To promote a staging prompt to production, simply move the production label to the staging version. Instant rollout, instant rollback.
Always record which prompt version generated each response. This makes performance comparison possible:
```python
prompt = langfuse.get_prompt("support_reply", label="production")
messages = prompt.compile(customer_name="Alex", issue="billing", context="...", company_name="Acme")

trace = langfuse.trace(name="support-query", user_id="user_123")
span = trace.span(
    name="support-llm",
    input=messages,
    metadata={
        "prompt_name": prompt.name,
        "prompt_version": prompt.version,
        "prompt_label": "production",
    },
)
```

Want to test two prompt versions head-to-head? Assign different labels and split traffic in your code:
```python
import hashlib

def get_prompt_label(user_id: str) -> str:
    """Deterministic split: same user always gets the same variant."""
    hash_val = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return "production" if hash_val % 2 == 0 else "beta"

label = get_prompt_label(user_id)
prompt = langfuse.get_prompt("support_reply", label=label)

# Tag the trace so you can filter by variant later
trace = langfuse.trace(
    name="support-query",
    user_id=user_id,
    tags=[f"prompt-variant:{label}"],
)
```

Then in the Langfuse dashboard, filter traces by the `prompt-variant:production` and `prompt-variant:beta` tags to compare scores, latency, and cost side by side.
Fetching prompts from Langfuse on every request adds a network call. The SDK includes built-in caching to minimize this overhead.
The Python SDK caches prompts in memory with a default TTL (time-to-live). When you call get_prompt, it returns the cached version if the TTL has not expired.
```python
# Cache for 5 minutes (300 seconds)
prompt = langfuse.get_prompt("support_reply", label="production", cache_ttl_seconds=300)

# Force a fresh fetch (bypass cache)
prompt = langfuse.get_prompt("support_reply", label="production", cache_ttl_seconds=0)
```

| Environment | Recommended TTL | Reason |
|---|---|---|
| Development | 0 (no cache) | See prompt changes immediately. |
| Staging | 30-60 seconds | Quick iteration, but reduce API calls. |
| Production | 300-600 seconds | Stable prompts; minimize latency. |
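One way to apply this table is to choose the TTL from an environment variable so each deployment tier gets the recommended caching behavior. The `APP_ENV` variable name and the exact values below are assumptions; adapt them to your setup:

```python
import os

# Assumed mapping from deployment tier to cache TTL, following the table above.
CACHE_TTL_BY_ENV = {"development": 0, "staging": 60, "production": 300}

app_env = os.getenv("APP_ENV", "development")  # hypothetical environment variable
ttl = CACHE_TTL_BY_ENV.get(app_env, 0)

prompt = langfuse.get_prompt("support_reply", label="production", cache_ttl_seconds=ttl)
```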
For latency-sensitive applications, fetch your prompts during initialization:
```python
# At application startup
prompts = {
    "support_reply": langfuse.get_prompt("support_reply", label="production"),
    "summarizer": langfuse.get_prompt("summarizer", label="production"),
}

# During request handling -- already cached
messages = prompts["support_reply"].compile(...)
```

Connecting prompts to evaluation (covered in detail in Chapter 4) closes the feedback loop. Here is the pattern:
- Tag traces with prompt metadata -- name, version, label.
- Score traces -- either with LLM-as-judge or human feedback.
- Filter by prompt version in the Langfuse dashboard to compare.
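A minimal sketch of this pattern, reusing the tracing calls from earlier in the chapter (the score name `helpfulness` and its value are placeholders):

```python
prompt = langfuse.get_prompt("support_reply", label="production")

# 1. Tag the trace with prompt metadata
trace = langfuse.trace(
    name="support-query",
    metadata={"prompt_name": prompt.name, "prompt_version": prompt.version},
)

# 2. Score the trace, e.g. from human feedback or an LLM-as-judge run
trace.score(name="helpfulness", value=0.9)  # placeholder score

# 3. In the dashboard, filter by prompt_version (or tags) to compare versions
```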
This lets you answer questions like:
- Did version 5 of `support_reply` improve helpfulness scores compared to version 4?
- Which prompt variant has lower cost per successful response?
- Are there regressions in safety scores after the latest prompt edit?
Let's walk through a realistic workflow where you iterate on a prompt, test it, and promote it to production.
```python
langfuse.create_prompt(
    name="ticket_classifier",
    type="chat",
    prompt=[
        {
            "role": "system",
            "content": (
                "Classify the support ticket into one of these categories: "
                "billing, technical, account, other.\n"
                "Respond with only the category name."
            ),
        },
        {"role": "user", "content": "{{ticket_text}}"},
    ],
    labels=["production"],
    config={"model": "gpt-4o-mini", "temperature": 0},
)
```

After monitoring, you notice that tickets about refunds are being classified as `other` instead of `billing`. Time to iterate.
```python
langfuse.create_prompt(
    name="ticket_classifier",
    type="chat",
    prompt=[
        {
            "role": "system",
            "content": (
                "Classify the support ticket into one of these categories: "
                "billing (includes refunds, invoices, payments), "
                "technical (includes bugs, errors, integrations), "
                "account (includes login, password, profile), "
                "other.\n"
                "Respond with only the category name in lowercase."
            ),
        },
        {"role": "user", "content": "{{ticket_text}}"},
    ],
    labels=["staging"],
    config={"model": "gpt-4o-mini", "temperature": 0},
)
```

This creates version 2 with the `staging` label. Version 1 still has `production`.
```python
prompt = langfuse.get_prompt("ticket_classifier", label="staging")

test_tickets = [
    "I need a refund for my last invoice",
    "The API returns a 500 error",
    "I cannot log into my account",
    "I want to request a feature",
]

for ticket in test_tickets:
    messages = prompt.compile(ticket_text=ticket)
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, temperature=0)
    category = resp.choices[0].message.content.strip()

    trace = langfuse.trace(name="classifier-test", tags=["staging-test"])
    trace.span(
        name="classify",
        input=ticket,
        output=category,
        metadata={"prompt_version": prompt.version},
    )
    print(f"Ticket: {ticket} -> {category}")

langfuse.flush()
```

If the staging results look good, promote by moving the `production` label:
```python
# In the Langfuse UI: go to Prompts > ticket_classifier > Version 2 > Add label "production"
# Or via SDK:
langfuse.create_prompt(
    name="ticket_classifier",
    type="chat",
    prompt=[...],  # same content as version 2
    labels=["production"],  # this version now gets the production label
    config={"model": "gpt-4o-mini", "temperature": 0},
)
```

Your live application immediately picks up the new version on the next cache refresh -- no redeploy required.
These tips will help you get the most out of Langfuse prompt management:
- Put instructions in the `system` message.
- Put user input in the `user` message.
- Use numbered lists or bullet points for multi-step instructions.
- Separate concerns: one prompt per task (classify, summarize, generate).
- Name prompts by function: `ticket_classifier`, `support_reply`, `doc_summarizer`.
- Avoid generic names like `prompt_1` or `test`.
- Use the description field to document what the prompt does and what variables it expects.
- Only parameterize what actually changes between requests.
- Hard-code instructions, formatting rules, and output schemas in the template itself.
- The fewer variables, the less room for injection or misuse.
- Always test new prompt versions on a sample of real inputs before moving the `production` label.
- Use Langfuse evaluation (Chapter 4) to compare scores between versions.
- Keep at least one known-good version labeled `production` at all times.
- Write a brief note in the prompt description when you create a new version explaining what changed and why.
- Clean up old labels that are no longer in use.
- Review prompt performance weekly to catch regressions early.
- The prompt lifecycle: Create, Version, Label, Deploy, Monitor.
- The difference between Chat and Text prompt types.
- How to use variables, templates, and the config object.
- How to cache prompts for performance.
- How to link prompt versions to traces for performance monitoring.
- A complete prompt iteration workflow from creation through production promotion.
- Prompt engineering best practices within Langfuse.
| Previous: Chapter 2 -- Tracing Fundamentals | Next: Chapter 4 -- Evaluation |
- tutorial: Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations
- tutorial slug: langfuse-tutorial
- chapter focus: Chapter 3: Prompt Management
- system context: Langfuse Tutorial
- objective: move from surface-level usage to repeatable engineering operation
- Define the runtime boundary for Chapter 3: Prompt Management.
- Separate control-plane decisions from data-plane execution.
- Capture input contracts, transformation points, and output contracts.
- Trace state transitions across request lifecycle stages.
- Identify extension hooks and policy interception points.
- Map ownership boundaries for team and automation workflows.
- Specify rollback and recovery paths for unsafe changes.
- Track observability signals for correctness, latency, and cost.
| Decision Area | Low-Risk Path | High-Control Path | Tradeoff |
|---|---|---|---|
| Runtime mode | managed defaults | explicit policy config | speed vs control |
| State handling | local ephemeral | durable persisted state | simplicity vs auditability |
| Tool integration | direct API use | mediated adapter layer | velocity vs governance |
| Rollout method | manual change | staged + canary rollout | effort vs safety |
| Incident response | best effort logs | runbooks + SLO alerts | cost vs reliability |
| Failure Mode | Early Signal | Root Cause Pattern | Countermeasure |
|---|---|---|---|
| stale context | inconsistent outputs | missing refresh window | enforce context TTL and refresh hooks |
| policy drift | unexpected execution | ad hoc overrides | centralize policy profiles |
| auth mismatch | 401/403 bursts | credential sprawl | rotation schedule + scope minimization |
| schema breakage | parser/validation errors | unmanaged upstream changes | contract tests per release |
| retry storms | queue congestion | no backoff controls | jittered backoff + circuit breakers |
| silent regressions | quality drop without alerts | weak baseline metrics | eval harness with thresholds |
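For instance, the retry-storm countermeasure in the table above can be approximated with a small jittered-backoff wrapper around any network call. This is a generic sketch, not a Langfuse API:

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.5, max_delay=8.0):
    """Retry fn with exponential backoff plus jitter to avoid synchronized retry storms."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller / circuit breaker
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter spreads out retries

# Usage: wrap an upstream call that might be flaky, e.g. the prompt fetch.
# prompt = call_with_backoff(lambda: langfuse.get_prompt("support_reply", label="production"))
```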
- Establish a reproducible baseline environment.
- Capture chapter-specific success criteria before changes.
- Implement minimal viable path with explicit interfaces.
- Add observability before expanding feature scope.
- Run deterministic tests for happy-path behavior.
- Inject failure scenarios for negative-path validation.
- Compare output quality against baseline snapshots.
- Promote through staged environments with rollback gates.
- Record operational lessons in release notes.
- chapter-level assumptions are explicit and testable
- API/tool boundaries are documented with input/output examples
- failure handling includes retry, timeout, and fallback policy
- security controls include auth scopes and secret rotation plans
- observability includes logs, metrics, traces, and alert thresholds
- deployment guidance includes canary and rollback paths
- docs include links to upstream sources and related tracks
- post-release verification confirms expected behavior under load
- LiteLLM Tutorial
- LangChain Tutorial
- LlamaIndex Tutorial
- Vercel AI SDK Tutorial
- Chapter 1: Getting Started
- Build a minimal end-to-end implementation for Chapter 3: Prompt Management.
- Add instrumentation and measure baseline latency and error rate.
- Introduce one controlled failure and confirm graceful recovery.
- Add policy constraints and verify they are enforced consistently.
- Run a staged rollout and document rollback decision criteria.
- Which execution boundary matters most for this chapter and why?
- What signal detects regressions earliest in your environment?
- What tradeoff did you make between delivery speed and governance?
- How would you recover from the highest-impact failure mode?
- What must be automated before scaling to team-wide adoption?
- tutorial context: Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations
- trigger condition: incoming request volume spikes after release
- initial hypothesis: identify the smallest reproducible failure boundary
- immediate action: protect user-facing stability before optimization work
- engineering control: introduce adaptive concurrency limits and queue bounds
- verification target: latency p95 and p99 stay within defined SLO windows
- rollback trigger: pre-defined quality gate fails for two consecutive checks
- communication step: publish incident status with owner and ETA
- learning capture: add postmortem and convert findings into automated tests
- tutorial context: Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations
- trigger condition: tool dependency latency increases under concurrency
- initial hypothesis: identify the smallest reproducible failure boundary
- immediate action: protect user-facing stability before optimization work
- engineering control: enable staged retries with jitter and circuit breaker fallback
- verification target: error budget burn rate remains below escalation threshold
- rollback trigger: pre-defined quality gate fails for two consecutive checks
- communication step: publish incident status with owner and ETA
- learning capture: add postmortem and convert findings into automated tests
Most teams struggle here because the hard part is not writing more code, but deciding clear boundaries for prompts, labels, and the Langfuse client so behavior stays predictable as complexity grows.
In practical terms, this chapter helps you avoid three common failures:
- coupling core logic too tightly to one implementation path
- missing the handoff boundaries between setup, execution, and validation
- shipping changes without clear rollback or observability strategy
After working through this chapter, you should be able to reason about Chapter 3: Prompt Management as an operating subsystem inside Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations, with explicit contracts for inputs, state transitions, and outputs.
Use the implementation notes around prompt names, the `production` label, and compiled messages as your checklist when adapting these patterns to your own repository.
Under the hood, Chapter 3: Prompt Management usually follows a repeatable control path:
- Context bootstrap: initialize runtime config and prerequisites for `prompt`.
- Input normalization: shape incoming data so `langfuse` receives stable contracts.
- Core execution: run the main logic branch and propagate intermediate state through `label`.
- Policy and safety checks: enforce limits, auth scopes, and failure boundaries.
- Output composition: return canonical result payloads for downstream consumers.
- Operational telemetry: emit logs/metrics needed for debugging and performance tuning.
When debugging, walk this sequence in order and confirm each stage has explicit success/failure conditions.
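To make the sequence above concrete, here is a minimal request-handler skeleton that reuses the SDK calls shown earlier in this chapter. The function name and the `client` LLM handle are illustrative assumptions:

```python
def handle_support_request(user_id: str, issue: str, context: str) -> str:
    # 1. Context bootstrap: fetch the prompt (and its config) for this request.
    prompt = langfuse.get_prompt("support_reply", label="production", cache_ttl_seconds=300)

    # 2. Input normalization: render a stable message contract from raw inputs.
    messages = prompt.compile(
        customer_name=user_id, issue=issue, context=context, company_name="Acme Corp"
    )

    # 3. Core execution: call the model with parameters stored on the prompt.
    resp = client.chat.completions.create(
        model=prompt.config.get("model", "gpt-4o-mini"),
        temperature=prompt.config.get("temperature", 0.3),
        messages=messages,
    )
    answer = resp.choices[0].message.content

    # 4./5. Policy checks and output composition would run here (omitted in this sketch).

    # 6. Operational telemetry: record the trace with prompt metadata for later comparison.
    trace = langfuse.trace(name="support-query", user_id=user_id)
    trace.span(
        name="support-llm",
        input=messages,
        output=answer,
        metadata={"prompt_name": prompt.name, "prompt_version": prompt.version},
    )
    return answer
```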
Use the following upstream sources to verify implementation details while reading this chapter:
- Langfuse Repository
  Why it matters: authoritative reference on Langfuse Repository (github.com).
- Langfuse Releases
  Why it matters: authoritative reference on Langfuse Releases (github.com).
- Langfuse Docs
  Why it matters: authoritative reference on Langfuse Docs (langfuse.com).
Suggested trace strategy:
- search upstream code for `prompt` and `langfuse` to map concrete implementation paths
- compare docs claims against actual runtime/config code before reusing patterns in production