◊cyan{We are an independent research lab focused on exploring interesting ideas in the interaction between system reliability and programming languages.}
@@ -8,12 +10,14 @@ Founded by ◊link["https://www.linkedin.com/in/robertdurst/"]{Rob Durst} in 202
8
10
9
11
◊h2{Projects}
10
12
11
-
◊project["https://caffeine-lang.run/" "Caffeine" "Active Development" "https://caffeine-lang.run/images/temp_caffeine_icon.png" "Gleam" "https://gleam.run/"]{A programming language for generating reliability artifacts from service expectation definitions. Grounded in assume/guarantee contracts, Caffeine helps developers and AI agents assert reasonable system construction at design time ◊em{and} in production. We ◊link["/status"]{dogfood it} here.}
docs/notes/2026-06-14-non-determinism-creep.html.pm
15 additions & 15 deletions
@@ -7,9 +7,9 @@
7
7
8
8
◊h2{Non-Determinism Creep}
9
9
10
-
Outside of Brickell Research, I hold a day job where I have spent some time working on an important, yet a surfaces value trivial agentic feature. In order to seaprate this post here from that work, I will make up a scenario that captures the essence of the problem.
10
+
Outside of Brickell Research, I hold a day job where I have spent some time working on an important, yet superficially trivial agentic feature. In order to separate this post here from that work, I will make up a scenario that captures the essence of the problem.
11
11
12
-
I have spent a fair amount of time the past few days watching the world cup on Fubo and for one of the games I was watching, I acidentally recorded the game. While not teribly useful for me - the games are in friendly timezones to myself - for folks on the other side of the globe, watching live games is likely impossible. Realizing this myself, let's say that Fubo has an api for game recording and I am interested in building an agentic AI voice agent that can help schedule the recordings.
12
+
I have spent a fair amount of time the past few days watching the World Cup on Fubo, and for one of the games I was watching, I accidentally recorded the game. While not terribly useful for me - the games are in friendly timezones for me - for folks on the other side of the globe, watching live games is likely impossible. With that in mind, let's say that Fubo has an API for game recording and I am interested in building an AI voice agent that can help schedule the recordings.
13
13
14
14
Now funny enough, if you're a cranky realist like myself, your first instinct might be "can't we just build a simple form with deterministic dropdowns that the user can select from?" Easy, right? Well, it turns out that this feature is embedded within an immersive voice hotline for football fans. This hotline helps troubled football fans decide which team is for them by reasoning, live, through a set of non-deterministic questions posed by an agent. Then, at the end of the call, we set up the game recording so the game is available at the time of their choosing.
15
15
@@ -24,22 +24,22 @@ Maybe this scenario feels a bit contrived? While it is, bear with me. TLDR we've
24
24
25
25
◊h3{A Beginning}
26
26
27
-
Ok, so regardles of whatever fun survey we prompt the agent to help the user hone in on their preffered team, the agent needs to be able to work with the user to establish at what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
27
+
Ok, so regardless of whatever fun survey we prompt the agent with to help the user home in on their preferred team, the agent needs to be able to work with the user to establish at what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
28
28
29
29
For starters, let's say the call itself is a one-shot, black box call and all we get at the end is a transcript. This was more or less where I started my journey at work.
30
30
31
31
Well, now we have the issue of taking a transcript and figuring out what time the user and the agent agreed upon.
32
32
33
33
◊code-block{
34
-
let transcripts = call(agent, patient)
34
+
let transcript = call(agent, user)
35
35
let time_to_schedule_recording = extract_time(transcript)
36
36
}
37
37
38
38
At the beginning, the call volume was low enough that we just had a human operator handle the calls and extract the time manually. All worked well.
39
39
40
40
◊h3{Scaling}
41
41
42
-
However, our service was so awesome that Telemundo picked it up and started to grow the call volume. Furthermore our Spanish speaker base was growing rapidly and none of our human operators were billingual.
42
+
However, our service was so awesome that Telemundo picked it up and the call volume started to grow. Furthermore, our Spanish speaker base was growing rapidly and none of our human operators were bilingual.
43
43
44
44
So we started to automate the call handling process using AI models.
45
45
@@ -54,13 +54,13 @@ Operating the call for a few weeks like this, we started to experience problems:
54
54
◊ol{
55
55
◊li{Misinterpretations of the time and the team from what the user actually selected.}
56
56
◊li{Evals non-deterministically failing with false positives and false negatives.}
57
-
◊li{Pure hallucinations around the user's preferences. For some reason the model interpretted the user's preference as England when no such team was selected.}
58
-
◊li{Randomly interpretting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
57
+
◊li{Pure hallucinations around the user's preferences. For some reason the model interpreted the user's preference as England when no such team was selected.}
58
+
◊li{Randomly interpreting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
59
59
}
60
60
61
-
And thus we enterered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
61
+
And thus we entered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
62
62
63
-
◊h3{Seeing the Forrest for the Trees}
63
+
◊h3{Seeing the Forest for the Trees}
64
64
65
65
Now going back to our pseudocode, consider the system we'd built:
66
66
@@ -80,12 +80,12 @@ Now going back to our pseudocode, consider the system we'd built:
80
80
run_evals(evals)
81
81
}
82
82
83
-
Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. And that in itself wasn't necesarily an issue, it was the fact that we completely relied on the model to interpret the user's intent in order for the system to function as intended.
83
+
Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. That in itself wasn't necessarily an issue; the issue was that we relied completely on the model to interpret the user's intent in order for the system to function as intended.
84
84
85
85
Yet this was hardly realistic... right? Consider the above but annotated with the potential failures.
86
86
87
87
◊ol{
88
-
◊li{Original call: most discuss team preference and time with the user}
88
+
◊li{Original call: must discuss team preference and time with the user}
89
89
◊li{Recording time extraction: re-interprets the user's intent to extract the recording time}
90
90
◊li{Team preference extraction: re-interprets the user's intent to extract the team preference}
91
91
◊li{Recording time eval: re-interprets the model's output to verify the recording was scheduled correctly}
@@ -112,25 +112,25 @@ Let's just say, in order to assign some numbers, correctness looks like:
112
112
team_preference_eval = 99%
113
113
odds_of_correct_evals = 97 * 99 = 96.0%
114
114
115
-
odds_of_total_correctnes = 97.2 * 96.0 = 93.3%
115
+
odds_of_total_correctness = 97.2 * 96.0 = 93.3%
116
116
}
117
117
118
118
So in this case, the system is correct 93.3% of the time, or, for the 2,000 calls we make a week, roughly 133 will either be incorrect or have evals that are false positives/negatives. That is untenable. Not only will your users lose confidence in your feature, but your internal ops team will lose faith in the evals.
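As a rough sketch of that arithmetic in Python: the helper names below are made up for illustration, and the calculation assumes every stage fails independently of the others. That assumption is likely optimistic in places and pessimistic in others, since real failures tend to be correlated. The rates are the ones quoted above: 97.2% for everything before the evals, then 97% and 99% for the two evals.

```python
from math import prod


def compound_correctness(stage_rates):
    """Chance that every stage is correct, assuming independent stages."""
    return prod(stage_rates)


def expected_weekly_failures(stage_rates, calls_per_week):
    """Expected number of calls per week where at least one stage is wrong."""
    return calls_per_week * (1 - compound_correctness(stage_rates))


# Rates taken from the numbers above; independence is an assumption.
pipeline = [0.972, 0.97, 0.99]

print(f"{compound_correctness(pipeline):.1%}")          # 93.3%
print(round(expected_weekly_failures(pipeline, 2000)))  # 133
```

Every extra model-interpreted stage multiplies in another number below 1, so the total can only go down as stages are added.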
119
119
120
-
And this, for all intent and purposes, is a pretty simple example.
120
+
And this, for all intents and purposes, is a pretty simple example.
121
121
122
122
◊h3{Stop Re-Interpreting}
123
123
124
124
A critical insight is that within the call we have two things at the time of scheduling the recording: (1) we have interpreted the time and (2) we have the ability to go back and forth with the user, live, in case of booking failure or the need to adjust. Thus, if we give the agent a tool, we can take advantage of those two things to fundamentally stop re-interpreting.
125
125
126
126
Furthermore, since we control the tool to schedule, we can put safety guards there, ensuring the booked time and booked team are both valid.
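A minimal sketch of what such a guarded tool could look like, in Python. Everything here is hypothetical: the `schedule_recording` name, the `ToolResult` shape, the team list, and the choice of ISO 8601 timestamps are all assumptions for illustration, and the actual recording API call is left out entirely.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical team list; a real one would come from the fixture data.
VALID_TEAMS = {"Argentina", "Brazil", "Mexico"}


@dataclass
class ToolResult:
    ok: bool
    message: str  # returned to the agent so it can react during the call


def schedule_recording(team: str, watch_at_iso: str, now: datetime) -> ToolResult:
    """Deterministic guards around scheduling; `now` must be timezone-aware."""
    if team not in VALID_TEAMS:
        return ToolResult(False, f"Unknown team {team!r}; ask the user to pick again.")
    try:
        watch_at = datetime.fromisoformat(watch_at_iso)
    except ValueError:
        return ToolResult(False, "Could not parse the time; confirm it with the user.")
    if watch_at.tzinfo is None:
        return ToolResult(False, "Time has no timezone; ask the user for one.")
    if watch_at <= now:
        return ToolResult(False, "That time is in the past; ask for a later one.")
    # A real implementation would call the recording API here.
    return ToolResult(True, f"Recording for {team} scheduled at {watch_at.isoformat()}.")


now = datetime(2026, 6, 14, 12, 0, tzinfo=timezone.utc)
print(schedule_recording("Brazil", "2026-06-20T19:00:00+00:00", now).ok)   # True
print(schedule_recording("England", "2026-06-20T19:00:00+00:00", now).ok)  # False
```

The point of the design is that a rejected call comes back to the agent while the user is still on the line, so a bad team or time gets corrected in conversation instead of being re-interpreted from a transcript afterwards.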
127
127
128
-
Thus, we can actually drammatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
128
+
Thus, we can actually dramatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
129
129
130
130
◊code-block{
131
131
original_call = 99.9%
132
132
133
-
odds_of_total_correctnes = 99.9%
133
+
odds_of_total_correctness = 99.9%
134
134
}
135
135
136
136
In the context of the 2,000 call volume, we now expect only 2 incorrect calls per week! Now we're starting to talk about a reasonably reliable system.
;; Description — plain strings, with markup as txexprs / link helper:
17
+
"A small runtime and DSL for analyzing the reliability of agentic systems. Specifically focused on the concept of non-determinism creep and compounding uncertainty (a.k.a. tolerance stacking in more traditional mechanical engineering)."))