init non-determinism creep post

robertDurst · robertDurst · commit 90b2a83eafbf · 2026-06-14T14:23:51.000-06:00
diff --git a/docs/index.html.pm b/docs/index.html.pm
@@ -1,5 +1,7 @@
 #lang pollen
 
+◊(require "projects/caffeine.rkt" "projects/manatee.rkt")
+
 ◊h2{About}
 
 ◊cyan{We are an independent research lab focused on exploring interesting ideas in the interaction between system reliability and programming languages.}
@@ -8,12 +10,14 @@ Founded by ◊link["https://www.linkedin.com/in/robertdurst/"]{Rob Durst} in 202
 
 ◊h2{Projects}
 
-◊project["https://caffeine-lang.run/" "Caffeine" "Active Development" "https://caffeine-lang.run/images/temp_caffeine_icon.png" "Gleam" "https://gleam.run/"]{A programming language for generating reliability artifacts from service expectation definitions. Grounded in assume/guarantee contracts, Caffeine helps developers and AI agents assert reasonable system construction at design time ◊em{and} in production. We ◊link["/status"]{dogfood it} here.}
+◊|caffeine|
+
+◊|manatee|
 
 ◊h2{Notes}
 
 ◊note-list{
-  ◊note["2026-05-31" "/notes/2026-05-31-caffeine-relays.html"]{New Caffeine Feature: Relays}
+  ◊note["2026-06-14" "/notes/2026-06-14-non-determinism-creep.html"]{Non-Determinism Creep}
 }
 
 ◊small{◊link["/notes"]{all notes →}}
diff --git a/docs/index.ptree b/docs/index.ptree
@@ -4,4 +4,4 @@ index.html
 manifesto.html
 status.html
 notes/index.html
-notes/2026-05-31-caffeine-relays.html
+notes/2026-06-14-non-determinism-creep.html
diff --git a/docs/manatee_logo.png b/docs/manatee_logo.png
diff --git a/docs/notes/2026-06-14-non-determinism-creep.html.pm b/docs/notes/2026-06-14-non-determinism-creep.html.pm
@@ -7,9 +7,9 @@
 
 ◊h2{Non-Determinism Creep}
 
-Outside of Brickell Research, I hold a day job where I have spent some time working on an important, yet a surfaces value trivial agentic feature. In order to seaprate this post here from that work, I will make up a scenario that captures the essence of the problem.
+Outside of Brickell Research, I hold a day job where I have spent some time working on an important yet superficially trivial agentic feature. To keep this post separate from that work, I will make up a scenario that captures the essence of the problem.
 
-I have spent a fair amount of time the past few days watching the world cup on Fubo and for one of the games I was watching, I acidentally recorded the game. While not teribly useful for me - the games are in friendly timezones to myself - for folks on the other side of the globe, watching live games is likely impossible. Realizing this myself, let's say that Fubo has an api for game recording and I am interested in building an agentic AI voice agent that can help schedule the recordings.
+I have spent a fair amount of time over the past few days watching the World Cup on Fubo, and during one of the games I accidentally recorded it. While not terribly useful to me - the games air in timezones that are friendly to me - for folks on the other side of the globe, watching live games is likely impossible. With that in mind, let's say that Fubo has an API for game recording and I am interested in building an AI voice agent that can help schedule the recordings.
 
 Now funny enough, if you're a cranky realist like myself, your first instinct might be "can't we just build a simple form with deterministic drop downs that the user can select from?" Easy, right? Well, it turns out that this feature is ingrained within an immersive voice experience hotline for football fans. This hotline helps troubled football fans decide which team is for them by reasoning, live, through a set of non-deterministic questions posed by an agent. Then, at the end of the call, we set up the game recording to show the game at their time of choosing.
 
@@ -24,22 +24,22 @@ Maybe this scenario feels a bit contrived? While it is, bear with me. TLDR we've
 
 ◊h3{A Beginning}
 
-Ok, so regardles of whatever fun survey we prompt the agent to help the user hone in on their preffered team, the agent needs to be able to work with the user to establish at what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
+Ok, so regardless of whatever fun survey we prompt the agent to run to help the user home in on their preferred team, the agent needs to be able to work with the user to establish what time they'd like to watch the game. Prompting this isn't inherently hard, nor particularly interesting. The hard bit is taking the time the user agreed upon and then scheduling the recording to be available at that time.
 
 For starters, let's say the call itself is a one-shot, black box call and all we get at the end is a transcript. This was more or less where I started my journey at work.
 
 Well, now we have the issue of taking a transcript and figuring out how to interpret at what time the user and the agent agreed upon.
 
 ◊code-block{
-    let transcripts = call(agent, patient)
+    let transcript = call(agent, user)
     let time_to_schedule_recording = extract_time(transcript)
 }
 
 At the beginning, the call volume was low enough that we just had a human operator handle the calls and extract the time manually. All worked well.
 
 ◊h3{Scaling}
 
-However, our service was so awesome that Telemundo picked it up and started to grow the call volume. Furthermore our Spanish speaker base was growing rapidly and none of our human operators were billingual.
+However, our service was so awesome that Telemundo picked it up and the call volume started to grow. Furthermore, our Spanish-speaking user base was growing rapidly and none of our human operators were bilingual.
 
 So we started to automate the call handling process using AI models.
 
@@ -54,13 +54,13 @@ Operating the call for a few weeks like this, we started to experience problems:
 ◊ol{
   ◊li{Misinterpretations of the time and the team from what the user actually selected.}
   ◊li{Evals non-deterministically failing with false positives and false negatives.}
-  ◊li{Pure hallucinations around the user's preferences. For some reason the model interpretted the user's preference as England when no such team was selected.}
-  ◊li{Randomly interpretting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
+  ◊li{Pure hallucinations around the user's preferences. For some reason the model interpreted the user's preference as England when no such team was selected.}
+  ◊li{Randomly interpreting the result as "no time selected" if a user said they were super busy at the end of the call, even though they actually wanted to watch the game and in the conversation had already agreed upon a time; misinterpretation of intent.}
 }
 
-And thus we enterered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
+And thus we entered a frustrating period of whack-a-mole, made worse by the fact that we did not feel we could trust our evals. Before long we were back to reading transcripts...
 
-◊h3{Seeing the Forrest for the Trees}
+◊h3{Seeing the Forest for the Trees}
 
 Now going back to our pseudocode, consider the system we'd built:
 
@@ -80,12 +80,12 @@ Now going back to our pseudocode, consider the system we'd built:
     run_evals(evals)
 }
 
-Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. And that in itself wasn't necesarily an issue, it was the fact that we completely relied on the model to interpret the user's intent in order for the system to function as intended.
+Running through this, we realized that every single time we relied on a model for interpretation, we were introducing non-determinism into our system. That in itself wasn't necessarily an issue; the issue was that we relied completely on the model to interpret the user's intent in order for the system to function as intended.
 
 Yet this was hardly realistic... right? Consider the above but annotated with the potential failures.
 
 ◊ol{
-  ◊li{Original call: most discuss team preference and time with the user}
+  ◊li{Original call: must discuss team preference and time with the user}
   ◊li{Recording time extraction: re-interprets the user's intent to extract the recording time}
   ◊li{Team preference extraction: re-interprets the user's intent to extract the team preference}
   ◊li{Recording time eval: re-interprets the model's output to verify the recording was scheduled correctly}
@@ -112,25 +112,25 @@ Let's just say, in order to assign some numbers, correctness looks like:
     team_preference_eval = 99%
     odds_of_correct_evals = 97 * 99 = 96.0%
 
-    odds_of_total_correctnes = 97.2 * 96.0 = 93.3%
+    odds_of_total_correctness = 97.2 * 96.0 = 93.3%
 }
 
 So in this case, the system is correct 93.3% of the time; for the 2,000 calls we make a week, roughly 134 will either be incorrect or have evals that are false positives/negatives. That is untenable. Not only will your users lose confidence in your feature, but your internal ops team will lose faith in the evals.
 
-And this, for all intent and purposes, is a pretty simple example.
+And this, for all intents and purposes, is a pretty simple example.
 
 ◊h3{Stop Re-Interpreting}
 
 A critical insight is that within the call we have two things at the time of scheduling the recording: (1) we have already interpreted the time and (2) we have the ability to go back and forth with the user, live, in case of a booking failure or the need to adjust. Thus, if we give the agent a tool, we can take advantage of those two things to fundamentally stop re-interpreting.
 
 Furthermore, since we control the tool to schedule, we can put safety guards there, ensuring the booked time and booked team are both valid.
 
-Thus, we can actually drammatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
+Thus, we can actually dramatically simplify the non-determinism creep! And honestly we don't need these specific evals anymore as we can trust the tool to schedule correctly (within reason).
 
 ◊code-block{
     original_call = 99.9%
 
-    odds_of_total_correctnes = 99.9%
+    odds_of_total_correctness = 99.9%
 }
 
 In the context of the 2,000 call volume, we now expect only 2 incorrect calls per week! Now we're starting to talk about a reasonably reliable system.
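
The arithmetic in the post can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, the stage rates are the illustrative figures used in the post, and multiplying them assumes each stage fails independently of the others, which the post's numbers imply but a real pipeline may not satisfy.

```python
from math import prod

WEEKLY_CALLS = 2000  # call volume used in the post


def compound_correctness(stage_rates):
    """Chance that every stage is right, ASSUMING stages fail independently."""
    return prod(stage_rates)


def expected_failures(weekly_calls, correctness):
    """Expected number of calls per week where at least one stage is wrong."""
    return weekly_calls * (1.0 - correctness)


# Re-interpreting pipeline: call + extractions (97.2% combined, per the post),
# then the recording-time eval (97%) and the team-preference eval (99%).
reinterpreting = compound_correctness([0.972, 0.97, 0.99])

# Tool-based pipeline: only the original call can go wrong (99.9%).
tool_based = compound_correctness([0.999])

for label, rate in [("re-interpreting", reinterpreting), ("tool-based", tool_based)]:
    failures = expected_failures(WEEKLY_CALLS, rate)
    print(f"{label}: {rate:.1%} correct, ~{failures:.0f} bad calls/week")
```

Every stage appended to the list can only lower the product, which is the "creep": removing a re-interpretation step helps more than tuning any single stage by a fraction of a percent.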
diff --git a/docs/pollen.rkt b/docs/pollen.rkt
@@ -5,7 +5,16 @@
 
 (module setup racket/base
   (provide (all-defined-out))
-  (define poly-targets '(html)))
+  (require racket/runtime-path racket/path (for-syntax racket/base))
+  (define poly-targets '(html))
+  ;; Pollen's cache keys off each page's mtime, its template, and pollen.rkt —
+  ;; but NOT modules pulled in via ◊(require ...). Watch every project card so
+  ;; editing projects/*.rkt invalidates the cache and the page re-renders.
+  (define-runtime-path projects-dir "projects")
+  (define cache-watchlist
+    (for/list ([f (in-directory projects-dir)]
+               #:when (path-has-extension? f #".rkt"))
+      f)))
 
 ;; Root function — auto-wraps paragraphs
 (define (root . elements)
@@ -38,13 +47,27 @@
 (define (link url . text)
   `(a ((href ,url)) ,@text))
 
-;; Project card
-(define (project url name status img lang lang-url . description)
+;; Project card. Each project lives in its own file under projects/ and calls
+;; this to produce its card; index.html.pm requires those files and drops the
+;; cards in. Description is the rest arg so markup (em, link, …) still works.
+(define (project #:url url
+                 #:name name
+                 #:status status
+                 #:img img
+                 #:lang lang
+                 #:lang-url lang-url
+                 #:adoption [adoption #f]
+                 . description)
   `(div ((class "project"))
      (img ((src ,img) (alt ,name) (class "project-logo")))
      (h3 (a ((href ,url)) ,name))
      (p ((class "project-desc")) ,@description)
-     (p ((class "project-status")) (span ((class "pink")) "Status: ") (em ,status) " · Written in " (a ((href ,lang-url)) ,lang))))
+     (p ((class "project-status"))
+        (span ((class "pink")) "Status: ") (em ,status)
+        ,@(if adoption
+              `(" · " (span ((class "pink")) "Adoption: ") (em ,adoption))
+              '())
+        " · Written in " (a ((href ,lang-url)) ,lang))))
 
 ;; Datadog dashboard embed
 (define (datadog-embed src)
diff --git a/docs/projects/caffeine.rkt b/docs/projects/caffeine.rkt
@@ -0,0 +1,21 @@
+#lang racket/base
+
+;; Caffeine project card. Edit the fields below in isolation.
+(require "../pollen.rkt")
+(provide caffeine)
+
+(define caffeine
+  (project
+   #:url      "https://caffeine-lang.run/"
+   #:name     "Caffeine"
+   #:status   "Active Development"
+   #:img      "https://caffeine-lang.run/images/temp_caffeine_icon.png"
+   #:lang     "Gleam"
+   #:lang-url "https://gleam.run/"
+   #:adoption "Production (industry)"
+   ;; Description — plain strings, with markup as txexprs / link helper:
+   "A programming language for generating reliability artifacts from service "
+   "expectation definitions. Grounded in assume/guarantee contracts, Caffeine "
+   "helps developers and AI agents assert reasonable system construction at "
+   "design time " '(em "and") " in production. We "
+   (link "/status" "dogfood it") " here."))
diff --git a/docs/projects/manatee.rkt b/docs/projects/manatee.rkt
@@ -0,0 +1,17 @@
+#lang racket/base
+
+;; Manatee project card. Edit the fields below in isolation.
+(require "../pollen.rkt")
+(provide manatee)
+
+(define manatee
+  (project
+   #:url      "https://github.com/Brickell-Research/manatee"
+   #:name     "Manatee"
+   #:status   "Active Development"
+   #:img      "/manatee_logo.png"
+   #:lang     "Gleam"
+   #:lang-url "https://gleam.run/"
+   #:adoption "Experimental"
+   ;; Description — plain strings, with markup as txexprs / link helper:
+   "A small runtime and DSL for analyzing the reliability of agentic systems. It focuses specifically on non-determinism creep and compounding uncertainty (a.k.a. tolerance stacking in more traditional mechanical engineering)."))