docs(blog): add engineering post on the code evaluators runtime #3072

Open
wochinge wants to merge 8 commits into main from feature/lf-2189-write-blog-post-about-code-based-evals

wochinge (Contributor) commented Jun 8, 2026

Summary

  • Adds the engineering blog post "Designing the runtime for Langfuse code evaluators" (/blog/2026-06-12-code-evaluators-execution-model): requirements, the options we ruled out, the Lambda + tenant-isolation design, and how it held up in production.
  • Replaces the earlier draft ("running customer code safely") based on review feedback: descriptive title, less incident-led, more technical depth, magnitudes instead of exact tenant/run numbers.

Linear

vercel Bot commented Jun 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| langfuse-docs | Ready | Preview, Comment | Jun 12, 2026 2:52pm |

dosubot Bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and docs labels Jun 8, 2026
github-actions Bot commented Jun 8, 2026

@claude review

claude Bot left a comment

LGTM, straightforward blog post addition.

Extended reasoning...

Overview

This PR adds a single MDX blog post (content/blog/2026-06-08-running-customer-code-safely.mdx) plus two PNG screenshots for redacted GuardDuty findings. No code, components, or build configuration changes.

Security risks

None for this docs repo. The post discusses Langfuse's production security architecture for running untrusted user code, but the content itself is descriptive prose — no secrets, credentials, or sensitive endpoints exposed (GuardDuty screenshots are described as redacted). The author key tobiaswochinger is already present in data/authors.json, and all referenced components (BlogHeader, Video, Frame) are standard MDX components used throughout the blog.

Level of scrutiny

Low. This is a pure content addition to a docs site — no logic, no configuration that affects production, no shared component changes. The Vercel preview build provides immediate visual verification, and the worst-case failure mode is a broken page render that's easy to spot and revert.

Other factors

No bugs were flagged by the bug hunting system. The PR description explicitly notes the intentional omission of ogImage to avoid a missing-asset reference. Prose, links, and Mermaid diagram all look internally consistent.

claude Bot left a comment

Docs-only blog post addition, looks good.

Extended reasoning...

Overview

This PR adds a single engineering blog post (content/blog/2026-06-08-running-customer-code-safely.mdx) plus two redacted GuardDuty screenshots under public/images/blog/2026-06-08-running-customer-code-safely/. No functional code, no schema, no config — just an MDX file using already-established components (BlogHeader, globally-registered Frame/Video, mermaid code block).

Security risks

None. The post is content only and renders through the standard MDX pipeline. The screenshots are pre-redacted GuardDuty captures (already vetted as part of authoring). The content discusses Langfuse's own security model — no secrets or credentials in the post itself.

Level of scrutiny

Low. Frontmatter, BlogHeader props, and component usage match existing 2026 posts. Image paths in the diff resolve to the two new files included in the same PR. No build-affecting changes.

Other factors

The only finding from the bug hunter is a nit about omitting the date= prop on BlogHeader. It is purely cosmetic (an empty span renders instead of a visible date string), the bug hunter acknowledges other recent engineering posts also omit it, and it does not block the build or feed indexing — the frontmatter date is still authoritative. Not worth blocking on.

6 outdated comment threads on content/blog/2026-06-08-running-customer-code-safely.mdx
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
wochinge force-pushed the feature/lf-2189-write-blog-post-about-code-based-evals branch from 76afd47 to 1b1b216 June 8, 2026 21:47
13 outdated comment threads on content/blog/2026-06-08-running-customer-code-safely.mdx

chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d52abcfc24

Comment thread content/blog/2026-06-08-running-customer-code-safely.mdx Outdated
Comment thread content/blog/2026-06-08-running-customer-code-safely.mdx Outdated
Replace the MDX-breaking {{...}} tokens with magnitude-only figures so
the post renders, and add a parallel skeleton draft of the
engineering-focused rewrite for side-by-side iteration.
Comment thread content/blog/2026-06-10-code-evaluators-execution-model.mdx Outdated
wochinge added 2 commits June 11, 2026 11:16
First manual pass over the skeleton: rewritten options section,
least-privilege stance, and production-incident recap.
wochinge changed the title from "docs(blog): add post on running untrusted code for code evaluators" to "docs(blog): add engineering post on the code evaluators runtime" Jun 11, 2026
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx Outdated
Drop the original running-customer-code-safely draft in favor of the
rewritten engineering post; include redacted GuardDuty screenshot.
wochinge force-pushed the feature/lf-2189-write-blog-post-about-code-based-evals branch from 28c6a23 to 7d1822d June 11, 2026 12:26
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx Outdated
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx Outdated
Fix grammar, align limits with the docs table (separate 256 KB result
cap), straighten apostrophes, and rename the image folder to match the
post slug.
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx Outdated
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx
Comment thread content/blog/2026-06-12-code-evaluators-execution-model.mdx Outdated
hassiebp self-requested a review June 12, 2026 09:42

hassiebp (Contributor) left a comment

Really like this iteration, thanks Tobi! Can you please cycle in @annabellscha to get this ready for website + HackerNews? Marc also wanted to do a loop here.

dosubot Bot added the lgtm (This PR has been approved by a maintainer) label Jun 12, 2026
Explain why in-process isolation is hard to get right (single
enforcement layer in a shared process), spell out DSL, and soften the
agent-sandbox contrast per review feedback.

claude Bot left a comment

LGTM — prior review feedback and bot findings are addressed in the latest revisions.

Extended reasoning...

Overview

This PR adds a single engineering blog post — content/blog/2026-06-12-code-evaluators-execution-model.mdx — plus two GuardDuty screenshots under public/images/blog/2026-06-12-code-evaluators-execution-model/, and removes a stray out/_headers file that was accidentally tracked (the out/ directory is gitignored). No code logic, routing, build config, or component changes.

Security risks

None meaningful. The content is prose plus two screenshots that the author already scrubbed of identifying ASN/org information per Steffen911's earlier request. The out/_headers deletion does not affect the Vercel deployment (the primary deploy target shown in the PR timeline); on Cloudflare static-export deployments it would remove an X-Robots-Tag: noindex header, but out/ is build output that should not be in version control in the first place, so cleaning it up is correct.

Level of scrutiny

Low. This is a docs-only change in content/blog/, a directory whose only consumer is the fumadocs-mdx blog source. The post has been through multiple reviewer rounds (annabellscha, AkioNuernberger, Steffen911, hassiebp) and multiple bot rounds (mine, codex). The latest commits explicitly target review feedback: title rephrased away from incident framing, exact tenant/run numbers genericized, DSL defined, in-process-isolation paragraph expanded with concrete failure modes, size limits now match content/docs/evaluation/evaluation-methods/code-evaluators.mdx:476-478 exactly, image folder renamed to match post slug, smart quotes normalized, "pip install" → "uv add" suggestion incorporated. Spot-checking the file confirms no remaining curly quotes (grep for U+2018/19/201C/201D returns nothing) and no unresolved {{…}} placeholders.

Other factors

The PR carries the lgtm label and hassiebp's two outstanding inline asks ("Define what DSL abbreviated for", "Why exactly are they hard to get right?") were both replied to with "done" by the author, with corresponding edits visible in the diff. The bug-hunting system found no bugs in this iteration. Since substantive changes have landed since my prior reviews and all flagged issues are resolved, a final approval is the right action rather than staying silent.

annabellscha (Contributor) left a comment

small comments, other than that LGTM


## Why code evaluators

Evaluation is how you find out whether your LLM app is actually any good. [Reading traces yourself](/blog/2026-06-09-ai-is-eating-ai-engineering) is where your opinion of your application gets formed, and nothing replaces it. But nobody reads millions of traces by hand. The standard move is to turn the error modes you found while reading into [LLM-as-a-judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score every trace for things like agent helpfulness, conciseness, or tone.

> The standard move is to turn the error modes you found

Mention error analysis here and link to the error analysis academy page.

Comment on lines +30 to +36
Evaluation is how you find out whether your LLM app is actually any good. [Reading traces yourself](/blog/2026-06-09-ai-is-eating-ai-engineering) is where your opinion of your application gets formed, and nothing replaces it. But nobody reads millions of traces by hand. The standard move is to turn the error modes you found while reading into [LLM-as-a-judge evaluators](/docs/evaluation/evaluation-methods/llm-as-a-judge) that score every trace for things like agent helpfulness, conciseness, or tone.

But a lot of what teams want to check is not subjective at all. Is the output valid JSON? Does the answer match the expected value? A model is an expensive, non-deterministic way to answer questions that a few lines of code answer perfectly every time.

That is what code evaluators are: you provide the Python or TypeScript code to evaluate observations, configure when it should be triggered, and Langfuse runs it for you at scale.

Here is the "is the output valid JSON" check from above as an actual evaluator:
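
The evaluator code itself falls outside the quoted diff range. As a rough sketch only, and assuming a plain Python function rather than the actual Langfuse evaluator interface (the names `is_valid_json` and `evaluate`, and the shape of the returned dict, are hypothetical and not taken from the post or the Langfuse API), such a check might look like this:

```python
import json
from typing import Any


def is_valid_json(output: Any) -> bool:
    """Return True only if `output` is a string that parses as JSON."""
    if not isinstance(output, str):
        return False
    try:
        json.loads(output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


def evaluate(output: Any) -> dict:
    """Hypothetical entry point; the name and return shape are assumptions."""
    valid = is_valid_json(output)
    return {
        "name": "valid_json",
        "value": 1.0 if valid else 0.0,
        "comment": "parses as JSON" if valid else "does not parse as JSON",
    }
```

Under these assumptions, `evaluate('{"a": 1}')` scores 1.0 and `evaluate('{a: 1}')` scores 0.0, with the same answer on every run, which is the point the paragraph above makes about deterministic checks.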

Let's look at the example from above as an actual evaluator:

Labels

docs, lgtm (This PR has been approved by a maintainer), size:L (This PR changes 100-499 lines, ignoring generated files.)

6 participants