
Commit 90b2a83

init non-determinism creep post
1 parent 018b942 commit 90b2a83

7 files changed

Lines changed: 87 additions & 22 deletions

File tree

docs/index.html.pm

Lines changed: 6 additions & 2 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,7 @@
11
#lang pollen
22

3+
◊(require "projects/caffeine.rkt" "projects/manatee.rkt")
4+
35
◊h2{About}
46

57
◊cyan{We are an independent research lab focused on exploring interesting ideas in the interaction between system reliability and programming languages.}
@@ -8,12 +10,14 @@ Founded by ◊link["https://www.linkedin.com/in/robertdurst/"]{Rob Durst} in 202
810

911
◊h2{Projects}
1012

11-
◊project["https://caffeine-lang.run/" "Caffeine" "Active Development" "https://caffeine-lang.run/images/temp_caffeine_icon.png" "Gleam" "https://gleam.run/"]{A programming language for generating reliability artifacts from service expectation definitions. Grounded in assume/guarantee contracts, Caffeine helps developers and AI agents assert reasonable system construction at design time ◊em{and} in production. We ◊link["/status"]{dogfood it} here.}
13+
◊|caffeine|
14+
15+
◊|manatee|
1216

1317
◊h2{Notes}
1418

1519
◊note-list{
16-
◊note["2026-05-31" "/notes/2026-05-31-caffeine-relays.html"]{New Caffeine Feature: Relays}
20+
◊note["2026-06-14" "/notes/2026-06-14-non-determinism-creep.html"]{Non-Determinism Creep}
1721
}
1822

1923
◊small{◊link["/notes"]{all notes →}}

docs/index.ptree

Lines changed: 1 addition & 1 deletion
@@ -4,4 +4,4 @@ index.html
44
manifesto.html
55
status.html
66
notes/index.html
7-
notes/2026-05-31-caffeine-relays.html
7+
notes/2026-06-14-non-determinism-creep.html

docs/manatee_logo.png

297 KB

docs/notes/2026-06-14-non-determinism-creep.html.pm

Lines changed: 15 additions & 15 deletions
@@ -7,9 +7,9 @@
77

88
◊h2{Non-Determinism Creep}
99

10-
Outside of Brickell Research, I hold a day job where I have spent some time working on an important, yet a surfaces value trivial agentic feature. In order to seaprate this post here from that work, I will make up a scenario that captures the essence of the problem.
10+
Outside of Brickell Research, I hold a day job where I have spent some time working on an important, yet superficially trivial agentic feature. In order to separate this post here from that work, I will make up a scenario that captures the essence of the problem.
1111

12-
I have spent a fair amount of time the past few days watching the world cup on Fubo and for one of the games I was watching, I acidentally recorded the game. While not teribly useful for me - the games are in friendly timezones to myself - for folks on the other side of the globe, watching live games is likely impossible. Realizing this myself, let's say that Fubo has an api for game recording and I am interested in building an agentic AI voice agent that can help schedule the recordings.
12+
I have spent a fair amount of time the past few days watching the World Cup on Fubo, and during one of the games I accidentally recorded it. While not terribly useful for me (the games are in friendly time zones for me), for folks on the other side of the globe, watching live games is likely impossible. With that in mind, let's say that Fubo has an API for game recording and I am interested in building an AI voice agent that can help schedule the recordings.
1313
1414
Now funny enough, if you're a cranky realist like myself, your first instinct might be "can't we just build a simple form with deterministic dropdowns that the user can select from?" Easy, right? Well, it turns out that this feature is ingrained within an immersive voice hotline for football fans. This hotline helps troubled football fans decide which team is for them by reasoning, live, through a set of non-deterministic questions posed by an agent. Then, at the end of the call, we set up the recording so the game is available at their time of choosing.
1515

@@ -24,22 +24,22 @@ Maybe this scenario feels a bit contrived? While it is, bear with me. TLDR we've
2424
2525
◊h3{A Beginning}
2626
27-
Ok, so regardles of whatever fun survey we prompt the agent to help the user hone in on their preffered team, the agent needs to be able to work with the user to establish at what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
27+
Ok, so regardless of whatever fun survey we prompt the agent to help the user hone in on their preferred team, the agent needs to be able to work with the user to establish at what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
2828
2929
For starters, let's say the call itself is a one-shot, black box call and all we get at the end is a transcript. This was more or less where I started my journey at work.
3030

3131
Well, now we have the issue of taking a transcript and figuring out which time the user and the agent agreed upon.
3232

3333
◊code-block{
34-
let transcripts = call(agent, patient)
34+
let transcripts = call(agent, user)
3535
let time_to_schedule_recording = extract_time(transcripts)
3636
}
3737

3838
At the beginning, the call volume was low enough that we just had a human operator handle the calls and extract the time manually. All worked well.
3939

4040
◊h3{Scaling}
4141

42-
However, our service was so awesome that Telemundo picked it up and started to grow the call volume. Furthermore our Spanish speaker base was growing rapidly and none of our human operators were billingual.
42+
However, our service was so awesome that Telemundo picked it up and call volume started to grow. Furthermore, our Spanish speaker base was growing rapidly and none of our human operators were bilingual.
4343

4444
So we started to automate the call handling process using AI models.
4545

@@ -54,13 +54,13 @@ Operating the call for a few weeks like this, we started to experience problems:
5454
◊ol{
5555
◊li{Misinterpretations of the time and the team from what the user actually selected.}
5656
◊li{Evals non-deterministically failing with false positives and false negatives.}
57-
◊li{Pure hallucinations around the user's preferences. For some reason the model interpretted the user's preference as England when no such team was selected.}
58-
◊li{Randomly interpretting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
57+
◊li{Pure hallucinations around the user's preferences. For some reason the model interpreted the user's preference as England when no such team was selected.}
58+
◊li{Randomly interpreting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
5959
}
6060
61-
And thus we enterered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
61+
And thus we entered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
6262
63-
◊h3{Seeing the Forrest for the Trees}
63+
◊h3{Seeing the Forest for the Trees}
6464
6565
Now going back to our pseudocode, consider the system we'd built:
6666

@@ -80,12 +80,12 @@ Now going back to our pseudocode, consider the system we'd built:
8080
run_evals(evals)
8181
}
8282

83-
Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. And that in itself wasn't necesarily an issue, it was the fact that we completely relied on the model to interpret the user's intent in order for the system to function as intended.
83+
Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. That in itself wasn't necessarily an issue; the problem was that we relied completely on the model to interpret the user's intent in order for the system to function as intended.
8484

8585
Yet this was hardly realistic... right? Consider the above but annotated with the potential failures.
8686

8787
◊ol{
88-
◊li{Original call: most discuss team preference and time with the user}
88+
◊li{Original call: must discuss team preference and time with the user}
8989
◊li{Recording time extraction: re-interprets the user's intent to extract the recording time}
9090
◊li{Team preference extraction: re-interprets the user's intent to extract the team preference}
9191
◊li{Recording time eval: re-interprets the model's output to verify the recording was scheduled correctly}
@@ -112,25 +112,25 @@ Let's just say, in order to assign some numbers, correctness looks like:
112112
team_preference_eval = 99%
113113
odds_of_correct_evals = 97 * 99 = 96.0%
114114

115-
odds_of_total_correctnes = 97.2 * 96.0 = 93.3%
115+
odds_of_total_correctness = 97.2 * 96.0 = 93.3%
116116
}
117117

118118
So in this case, the system as a whole is correct 93.3% of the time; of the 2,000 calls we make a week, about 133 will either be handled incorrectly or have evals that report false positives/negatives. That is untenable. Not only will your users lose confidence in your feature, but your internal ops team will lose faith in the evals.
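The arithmetic above can be reproduced with a short sketch. Python is used here purely for illustration, the function names are hypothetical, and the accuracies are the made-up figures from this post rather than measurements. It also assumes the stages fail independently, and it treats the 97.2% factor from the total above as a single upstream stage.

```python
from math import prod


def chain_correctness(stage_accuracies):
    """Probability that every stage is correct, assuming independent stages."""
    return prod(stage_accuracies)


def expected_bad_calls(correctness, calls_per_week):
    """Expected number of calls per week where at least one stage is wrong."""
    return (1 - correctness) * calls_per_week


# Illustrative figures from the post, not measured values.
evals = chain_correctness([0.97, 0.99])    # recording time eval, team preference eval
total = chain_correctness([0.972, evals])  # everything upstream of the evals, then the evals

print(f"evals correct: {evals:.1%}")
print(f"total correct: {total:.1%}")
print(f"bad calls per week: {expected_bad_calls(total, 2000):.0f}")
```

Every extra model-interpreted stage adds another factor below 1.0 to the product, which is why the total can only drift downward as stages are added.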
119119

120-
And this, for all intent and purposes, is a pretty simple example.
120+
And this, for all intents and purposes, is a pretty simple example.
121121

122122
◊h3{Stop Re-Interpreting}
123123

124124
A critical insight is that within the call we have two things at the time of scheduling the recording: (1) we have already interpreted the time and (2) we have the ability to go back and forth with the user, live, in case of booking failure or the need to adjust. Thus, if we give the agent a tool, we can take advantage of those two things to fundamentally stop re-interpreting.
125125

126126
Furthermore, since we control the tool to schedule, we can put safety guards there, ensuring the booked time and booked team are both valid.
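As a rough sketch of what such a guarded tool might look like (Python for illustration only; `schedule_recording`, `ToolResult`, and the specific checks are hypothetical, and the actual booking call is stubbed out), the handler validates the agent's arguments deterministically and returns a message the agent can relay to the user while they are still on the line:

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ToolResult:
    ok: bool
    message: str  # relayed to the user by the agent


def schedule_recording(team, start_iso, valid_teams, now=None):
    """Hypothetical tool handler: validate the arguments, then book (stubbed)."""
    now = now or datetime.now(timezone.utc)
    if team not in valid_teams:
        return ToolResult(False, f"Unknown team {team!r}; ask the user to pick again.")
    try:
        start = datetime.fromisoformat(start_iso)
    except ValueError:
        return ToolResult(False, "Time is not a valid ISO 8601 timestamp; ask again.")
    if start.tzinfo is None:
        return ToolResult(False, "Time has no timezone; confirm the user's timezone.")
    if start <= now:
        return ToolResult(False, "Time is in the past; ask for a later time.")
    # A real implementation would call the recording API here.
    return ToolResult(True, f"Recording for {team} scheduled at {start.isoformat()}.")
```

Because a rejected booking comes back during the call, the agent can correct course with the user instead of a downstream model guessing at intent from a transcript.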
127127

128-
Thus, we can actually drammatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
128+
Thus, we can actually dramatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
129129
130130
◊code-block{
131131
original_call = 99.9%
132132
133-
odds_of_total_correctnes = 99.9%
133+
odds_of_total_correctness = 99.9%
134134
}
135135
136136
In the context of the 2,000 call volume, we now expect only 2 incorrect calls per week! Now we're starting to talk about a reasonably reliable system.

docs/pollen.rkt

Lines changed: 27 additions & 4 deletions
@@ -5,7 +5,16 @@
55

66
(module setup racket/base
77
(provide (all-defined-out))
8-
(define poly-targets '(html)))
8+
(require racket/runtime-path racket/path)
9+
(define poly-targets '(html))
10+
;; Pollen's cache keys off each page's mtime, its template, and pollen.rkt —
11+
;; but NOT modules pulled in via ◊(require ...). Watch every project card so
12+
;; editing projects/*.rkt invalidates the cache and the page re-renders.
13+
(define-runtime-path projects-dir "projects")
14+
(define cache-watchlist
15+
(for/list ([f (in-directory projects-dir)]
16+
#:when (path-has-extension? f #".rkt"))
17+
f)))
918

1019
;; Root function — auto-wraps paragraphs
1120
(define (root . elements)
@@ -38,13 +47,27 @@
3847
(define (link url . text)
3948
`(a ((href ,url)) ,@text))
4049

41-
;; Project card
42-
(define (project url name status img lang lang-url . description)
50+
;; Project card. Each project lives in its own file under projects/ and calls
51+
;; this to produce its card; index.html.pm requires those files and drops the
52+
;; cards in. Description is the rest arg so markup (em, link, …) still works.
53+
(define (project #:url url
54+
#:name name
55+
#:status status
56+
#:img img
57+
#:lang lang
58+
#:lang-url lang-url
59+
#:adoption [adoption #f]
60+
. description)
4361
`(div ((class "project"))
4462
(img ((src ,img) (alt ,name) (class "project-logo")))
4563
(h3 (a ((href ,url)) ,name))
4664
(p ((class "project-desc")) ,@description)
47-
(p ((class "project-status")) (span ((class "pink")) "Status: ") (em ,status) " · Written in " (a ((href ,lang-url)) ,lang))))
65+
(p ((class "project-status"))
66+
(span ((class "pink")) "Status: ") (em ,status)
67+
,@(if adoption
68+
`(" · " (span ((class "pink")) "Adoption: ") (em ,adoption))
69+
'())
70+
" · Written in " (a ((href ,lang-url)) ,lang))))
4871

4972
;; Datadog dashboard embed
5073
(define (datadog-embed src)

docs/projects/caffeine.rkt

Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
1+
#lang racket/base
2+
3+
;; Caffeine project card. Edit the fields below in isolation.
4+
(require "../pollen.rkt")
5+
(provide caffeine)
6+
7+
(define caffeine
8+
(project
9+
#:url "https://caffeine-lang.run/"
10+
#:name "Caffeine"
11+
#:status "Active Development"
12+
#:img "https://caffeine-lang.run/images/temp_caffeine_icon.png"
13+
#:lang "Gleam"
14+
#:lang-url "https://gleam.run/"
15+
#:adoption "Production (industry)"
16+
;; Description — plain strings, with markup as txexprs / link helper:
17+
"A programming language for generating reliability artifacts from service "
18+
"expectation definitions. Grounded in assume/guarantee contracts, Caffeine "
19+
"helps developers and AI agents assert reasonable system construction at "
20+
"design time " '(em "and") " in production. We "
21+
(link "/status" "dogfood it") " here."))

docs/projects/manatee.rkt

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,17 @@
1+
#lang racket/base
2+
3+
;; Manatee project card. Edit the fields below in isolation.
4+
(require "../pollen.rkt")
5+
(provide manatee)
6+
7+
(define manatee
8+
(project
9+
#:url "https://github.com/Brickell-Research/manatee"
10+
#:name "Manatee"
11+
#:status "Active Development"
12+
#:img "/manatee_logo.png"
13+
#:lang "Gleam"
14+
#:lang-url "https://gleam.run/"
15+
#:adoption "Experimental"
16+
;; Description — plain strings, with markup as txexprs / link helper:
17+
"A small runtime and DSL for analyzing the reliability of agentic systems. Specifically focused on the concept of non-determinism creep and compounding uncertainty (a.k.a. tolerance stacking in more traditional mechanical engineering)."))
