final dspy

acompa · acompa · commit 94a7759e7e9d · 2026-04-06T00:10:39.000-04:00
diff --git a/src/src/content/blog/dspy-production-challenges.md b/src/src/content/blog/dspy-production-challenges.md
@@ -1,39 +1,22 @@
 ---
-title: "DSPy solves prompting problems and creates production ones"
-description: "DSPy is an invaluable tool for AI workflow prototypes which creates new headaches in production."
-pubDate: 2026-04-02
+title: "DSPy's abstractions worked for us, but its runtime didn't"
+description: "DSPy is a useful AI workflow development tool, but its runtime configuration introduced issues for our production workflows."
+pubDate: 2026-04-06
 ---
 
-I felt a pang of jealousy after enjoying [Skylar Payne's write-up on DSPy](https://skylarbpayne.com/posts/dspy-engineering-patterns/). Fantastic post! It deserves a read for its all-too-accurate description of the path taken by [DSPy](https://dspy.ai) (or any half-baked implementation of its concepts) into production, as well as the centrality of 
-- typed signatures,
-- composable modules, and
-- prompt optimizers
-in successful AI workflows.
+I greatly enjoyed Skylar Payne's write-up on [DSPy](https://skylarbpayne.com/posts/dspy-engineering-patterns/). I felt it would be helpful to share my experiences building production workflows with DSPy for two separate projects in 2025 at [Not Diamond](https://www.notdiamond.ai/).[^1] 
 
-My jealousy stems from procrastinating on a similar post describing my experiences building production workflows with DSPy for two separate projects in 2025 at [Not Diamond](https://www.notdiamond.ai/).[^1] This is that post. Thanks, Skylar, for the well-written kick-to-the-butt.
+This is that post. Thanks, Skylar, for the well-written kick-to-the-butt.
 
-## The concepts don't block adoption
+## Learning the library vs. using it
 
-Anyway, we diverge on his conclusion. Skylar writes:
+I agree with the criticality of signatures, modules and optimizers in modern AI workflows. DSPy uses these concepts to great effect; between that, its comprehensive library of prompt-optimizing algorithms, and its excellent documentation, we quickly learned the library.
 
-> DSPy has adoption problems because it asks you to think differently before you’ve actually felt the pain of thinking the same way everyone else does.
-
-> The patterns DSPy embodies aren’t optional. If your AI system gets complex enough, you will reinvent them. The only question is whether you do it deliberately or accidentally.
-
-We agree that AI engineers absolutely must understand signatures and interfaces, both exemplified by the ubiquity of structured outputs. Optimizers - as a component of workflow evals - are similarly critical today. 
-
-But these concepts are not too esoteric to hinder DSPy’s adoption. Engineers can use DSPy's heaps of excellent documentation to quickly learn its concepts, which align with common practices for evals and workflow optimization. Let's give folks the benefit of the doubt, and assume a new user can quickly reach competency in DSPy.
-
-My view is that Skylar has mixed up the causality here. Engineers don't struggle with DSPy because of its novel concepts. Instead, **DSPy's limitations hinder the creation of modular, optimizable AI workflows**.
-
-During our testing I observed three concrete limitations in DSPy:
-- Its optimized programs have limited portability,
-- It takes a strongly-opinionated approach to concurrency, and
-- It has an LLM client dependency which can be challenging to work with and tedious to replace.
+That said, I tried to apply DSPy to a specific flavor of distributed system: one with async-optimized throughput, cross-provider targets, and multi-workflow stakeholders in enterprise settings. Its runtime didn't work with our production system, however, so we pulled it.
 
 ## Conflicting concurrency
 
-Both of my team's attempts at integrating DSPy into live workflows involved distributed, asynchronous training and dynamic model-switching via Celery with `gevent`. Users switch LMs via `dspy.configure`, but we encountered [this](https://github.com/stanfordnlp/dspy/blob/main/dspy/dsp/utils/settings.py#L159-L163) configuration issue:
+Both of my team's attempts at integrating DSPy into live workflows involved distributed, asynchronous training and dynamic model-switching via Celery with `gevent`. We abandoned DSPy shortly after encountering [this](https://github.com/stanfordnlp/dspy/blob/main/dspy/dsp/utils/settings.py#L159-L163) configuration issue:
 
 ```python
 if not in_ipython and config_owner_async_task != asyncio.current_task():
@@ -43,7 +26,7 @@ if not in_ipython and config_owner_async_task != asyncio.current_task():
 	)
 ```
 
-As suggested, we tried `dspy.context` to avoid async task conflicts. But every model switch required context managers, [as demonstrated by this example snippet](https://github.com/stanfordnlp/dspy/blob/main/docs/docs/tutorials/cache/index.md?plain=1#L178-L181):
+So we used `dspy.context`, as suggested, to avoid async task conflicts. As mentioned before, though, we wanted to switch models, which required a context manager around every switch, as in [this example snippet](https://github.com/stanfordnlp/dspy/blob/main/docs/docs/tutorials/cache/index.md?plain=1#L178-L181):
 
 ```python
 with dspy.context(lm=dspy.LM("openai/gpt-5-mini")):
@@ -53,18 +36,21 @@ with dspy.context(lm=dspy.LM("openai/gpt-5-nano")):
     result2 = predict(question="Who do *you* think is the GOAT of soccer?")
 ```
 
-In a Markdown doc or Jupyter notebook this reads fine. In a production codebase this practically _invites_ bugs. We abandoned both efforts shortly thereafter.
-
 While DSPy 3.0 shipped async improvements, [limitations in configurability](https://github.com/stanfordnlp/dspy/issues/8197) and async support still persist:
-- program training still [requires a synchronous context](https://github.com/stanfordnlp/dspy/issues/9075)
+- program training seems to [require a synchronous context](https://github.com/stanfordnlp/dspy/issues/9075)
 - `dspy.settings` has [mixed support](https://github.com/stanfordnlp/dspy/issues/8197) across coroutines
 - multi-model workflows have limited support across [different concurrency models](https://github.com/stanfordnlp/dspy/issues/8797)
 
-More broadly, **this anti-modular design emerges from DSPy's tightly-coupled inference-time story**. By optimizing your program with DSPy, you've also bought into the library's architectural decisions for concurrency. This is no more evident than [in the FastAPI tutorial](https://dspy.ai/tutorials/deployment/#deploying-with-fastapi), where the user defers concurrency management to DSPy via `dspy.asyncify`. But production deployments require performance optimizations tailored to the needs of *your* workflows - not those of a library dependency.
+More broadly, **this design tightly couples DSPy's runtime to your production setting**. By optimizing your program with DSPy, you've also bought into the library's architectural decisions for concurrency. This is nowhere more evident than [in the FastAPI tutorial](https://dspy.ai/tutorials/deployment/#deploying-with-fastapi), where the user defers concurrency management to DSPy via `dspy.asyncify`.
+
+Our production deployment requires performance optimizations tailored to the needs of *our* workflows, instead of those of a library dependency. To avoid conflicts between Celery and DSPy's concurrency management, we essentially distributed prompt optimization by model and ran each optimizer serially within a task. 
 
+This architecture wrecked the system's overall throughput. DSPy's optimizers were now the slowest part of our optimization stack. 
 ## Pulling prompts
 
-What did I mean earlier by a "tightly-coupled inference-time story?" Let's say you want to use your optimized prompts outside of DSPy. You'd have to extract them [like so](https://github.com/stanfordnlp/dspy/issues/7830):
+Once optimization completed, we encountered additional challenges when porting optimized prompts out of the distributed tasks. These prompts needed to pass human review for use in workflows which did not depend on DSPy, or existed outside of automated pipelines altogether (like AI chat interfaces).
+
+Let's say you want to use your optimized prompts outside of DSPy. You'd have to extract prompts [like this](https://github.com/stanfordnlp/dspy/issues/7830):
 
 ```python
 {
@@ -77,55 +63,33 @@ What did I mean earlier by a "tightly-coupled inference-time story?" Let's say y
 }
 ```
 
-Omar Khattab[^2] has [explained the intention](https://github.com/stanfordnlp/dspy/issues/8042#issuecomment-2773833904) behind this choice:
-
-> DSPy does a lot of heavy-lifting indeed, and it's very common that people try to extract the optimized prompt but end up hurting quality in the process. We do not usually advise that you extract anything for this reason, since the optimized prompt assumes a lot of DSPy behavior like the way the inference calls are made. 
-
-> That said, if you really want to get a prompt you can apply this process. Note that it gives you a list of messages.
-
-He's right, of course. Many of DSPy's optimizers[^3] treat signatures as optimizable parameters. If the user decouples those components from the underlying prompts, they can inadvertently unwind performance gains. 
-
-But...that's _my_ prompt. Now, though, it has limited applicability. Concretely: **this inference-time requirement creates a golden-path dependency on DSPy.** 
-
-This bleeds into other concerns. For eg. observability, we faced a choice between forking the library to add spans or adopting [MLflow](https://dspy.ai/tutorials/observability/#tracing). If your production service doesn't use Python, you're instead relying on unofficial ports[^4]. If your prompt needs to land in a non-software workflow handled by Claude Cowork or Codex? You're back to Omar's earlier snippet.
-
+Note that many of DSPy's optimizers[^2] treat signatures as optimizable parameters. If the user decouples those components from the underlying prompts, they can inadvertently unwind performance gains. Users should preserve DSPy's programs as-is.
 
+**But these programs do not willingly exit the DSPy runtime**. We attempted to extract programs using the above snippet, and by applying various scripts to DSPy's JSON I/O options and its `Signature` and `ChatAdapter` classes. None of these approaches ultimately met our end users' requirements for workflow compatibility and human understanding.
 ## Dealing with dependencies
 
-DSPy also relies on leaky abstractions thanks to its underlying LLM client library, LiteLLM. Several years ago, LiteLLM's cross-provider compatibility and ease of implementation made it the de-facto LLM client for workflow prototyping. Since then the AI ecosystem aligned onto OpenAI-style messages, and Skylar’s post demonstrates common patterns in client usage and request retries.
+Our production runtime deploys into environments with tightly constrained requirements, so we require control over our supply chain. DSPy, however, executes all of its LLM client requests via LiteLLM. Introducing DSPy meant accepting LiteLLM into our carefully curated environments and observability stack.
 
-LiteLLM poses its own issues in production contexts, however, such as poor logging configuration. It produces verbose, poorly-formatted logs at `INFO` level, which polluted Datadog, Sentry and our local testing tools:[^5]
+I've found that LiteLLM presents significant issues in non-trivial production contexts. The library introduces verbose, poorly-formatted logs at `INFO` level, which polluted Datadog and our local testing tools:[^3]
 
-![Commonly seen in logs across Jupyter notebooks and dev environments](/images/dspy-litellm-logs.png)
+![A common sight if you forgot to set the LiteLLM root logger to `WARN`.](public/images/dspy-litellm-logs.png)
 
-LiteLLM also recently experienced [a supply chain vulnerability](https://docs.litellm.ai/blog/security-update-march-2026) after a core maintainer's PyPI credentials were compromised. This risk exists for any OSS project, but if LiteLLM is not proxying your requests then your alternatives can be as simple as `openai.OpenAI` and `tenacity`. [^6]
+Our supply chain requirements extend to security concerns, so it's worth noting that LiteLLM recently experienced [a supply chain vulnerability](https://docs.litellm.ai/blog/security-update-march-2026), in which a core maintainer's PyPI credentials were seemingly compromised.
 
-In DSPy, users can swap out the underlying client library with some effort. [There's no documentation on this capability](https://dspy.ai/learn/programming/language_models/#advanced-building-custom-lms-and-writing-your-own-adapters). A DSPy contributor is working on extracting LiteLLM into an extra dependency [while redesigning  `dspy.LM` ](https://github.com/stanfordnlp/dspy/issues/9514), which significantly helps production users once the change lands.
+We overcame these challenges by building our own client logic with `openai.OpenAI` and `tenacity`, which extends to structured outputs and function calling.[^4] As of today there's broad support for `chat.completions` across AI providers. Our in-house client library works comprehensively and reliably, thanks to strong compatibility with our observability stack.
 
 ## What it is and what it is not
 
-Admittedly, our experience with DSPy could confirm "Khattab's Law" as proposed by Skylar:
-
-> Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.
+If you need to manage your own concurrency, port your prompts to different settings, or keep an orderly deployment environment, you might find that DSPy's dependency stack and runtime don't fit your needs.
 
-Our prompting layer leverages typed signatures, a separate eval harness and first-party provider SDKs to optimize prompts over eval datasets. 
-
-But none of the facts I've laid out above dilute the usefulness of DSPy. It's still a fantastic exploratory tool for prompt optimization. Our broader AI ecosystem benefits from research produced by labs like Stanford NLP and MIT CSAIL, whose researchers primarily maintain DSPy.
-
-That said, it's worth acknowledging difficulties integrating it into production systems with optimized concurrency, comprehensive monitoring, and tight requirements for environment builds. 
-
-As Skylar noted, DSPy’s concepts form the basis of any well-designed AI workflow. I believe that applying those concepts inward - to the library’s architecture and interfaces - would broaden its impact far beyond prototyping and research.
+For other scenarios, though, DSPy can help launch and optimize your workflow prototypes very quickly, and its concepts will endure well into the production phase of your project.
 
 *Thanks to Devansh Jain and James Kunstle for input on this write-up*.
 
-[^1]: Astute readers will note that Not Diamond offers [Prompt Optimization](https://www.notdiamond.ai/#prompt-optimization). If you haven't closed this tab yet, I *solemnly vow* to avoid any sales in this post.
-
-[^2]: Lead author on [the original DSPy paper](https://arxiv.org/abs/2310.03714) and core contributor to the open-source library.
-
-[^3]: Notably, this is not true of GEPA, where users can [seamlessly extract prompts](https://x.com/LakshyAAAgrawal/status/1978198977985069348) from the optimized adapter. While GEPA is available in DSPy, this capability is primarily available in the underlying `gepa-ai` [library](https://gepa-ai.github.io/gepa/).
+[^1]: Astute readers will note that one of our products at Not Diamond is [Prompt Optimization](https://www.notdiamond.ai/#prompt-optimization). If you haven't closed this tab yet, I *solemnly vow* to avoid any sales in this post.
 
-[^4]: There are community-built libraries for languages like [TypeScript](https://github.com/ruvnet/dspy.ts) and [Rust](https://github.com/krypticmouse/DSRs). They lack feature parity or ongoing support.
+[^2]: Notably, this is not true of GEPA, where users can [seamlessly extract prompts](https://x.com/LakshyAAAgrawal/status/1978198977985069348) from the optimized adapter. While GEPA is available in DSPy, this capability is primarily available in the underlying `gepa-ai` [library](https://gepa-ai.github.io/gepa/).
 
-[^5]: Honestly, "Local Engineer Complains About Dependency Logging Configs" is the least surprising headline of the day! But LiteLLM's usage of `INFO` for low-signal logs feels disconnected from common logging practices in production systems. It's symptomatic of LiteLLM's other issues, a discussion of which falls well outside of this post's scope.
+[^3]: Honestly, "Local Engineer Complains About Dependency Logging Configs" is the least surprising headline of the day! But LiteLLM's usage of `INFO` for low-signal logs feels disconnected from common logging practices in production systems.
 
-[^6]: I expect most AI engineers writing Python have battle scars with `openai.OpenAI` and `tenacity.retry` at this point.
+[^4]: I expect most AI engineers writing Python have battle scars with `openai.OpenAI` and `tenacity.retry` at this point. `RETRYABLE_HTTP_ERROR_CODES` should be a permanent fixture of Pythonic AI project templates.