Commit 7beadf5
[Blog] How Toffee streamlines inference and cut GPU costs with dstack (#3345)
1 parent: faa68e6

2 files changed: +89 −1 lines changed
docs/blog/posts/ea-gtc25.md (1 addition, 1 deletion)

@@ -7,7 +7,7 @@ image: https://dstack.ai/static-assets/static-assets/images/dstack-ea-slide-2-ba
 categories:
 - Case studies
 links:
-- NVIDIA GTC 2025: https://www.nvidia.com/en-us/on-demand/session/gtc25-s73667/
+- NVIDIA GTC 2025: https://www.nvidia.com/en-us/on-demand/session/gtc25-s73667/
 ---

 # How EA uses dstack to fast-track AI development

docs/blog/posts/toffee.md (88 additions, 0 deletions)
---
title: "How Toffee streamlines inference and cuts GPU costs with dstack"
date: 2025-12-05
description: "TBA"
slug: toffee
image: https://dstack.ai/static-assets/static-assets/images/dstack-toffee.png
categories:
- Case studies
links:
- Toffee's research blog: https://research.toffee.ai/blog/how-we-use-dstack-at-toffee
---
# How Toffee streamlines inference and cuts GPU costs with dstack
In a recent engineering [blog post](https://research.toffee.ai/blog/how-we-use-dstack-at-toffee), Toffee shared how they use `dstack` to run large-language and image-generation models across multiple GPU clouds, while keeping their core backend on AWS. This case study summarizes key insights and highlights how `dstack` became the backbone of Toffee’s multi-cloud inference stack.
<img src="https://dstack.ai/static-assets/static-assets/images/dstack-toffee.png" width="630" />
<!-- more -->
[Toffee](https://toffee.ai) builds AI-powered experiences backed by LLMs and image-generation models. To serve these workloads efficiently, they combine:
- **GPU neoclouds** such as [RunPod](https://www.runpod.io/) and [Vast.ai](https://vast.ai/) for flexible, cost-efficient GPU capacity
- **AWS** for core, non-AI services and backend infrastructure
- **dstack** as the orchestration layer that provisions GPU resources and exposes AI models via `dstack` [services](../../docs/concepts/services.md) and [gateways](../../docs/concepts/gateways.md)
Most user-facing logic lives in AWS. The backend communicates with AI services through `dstack` gateways, each running on an EC2 instance inside Toffee’s AWS perimeter and exposed via Route 53 private hosted zones. `dstack`, in turn, manages GPU workloads on GPU clouds, abstracting away provider differences.
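For illustration, a `dstack` gateway is itself defined declaratively. The sketch below is a minimal, hypothetical example based on dstack's gateway configuration format; the name, region, and domain are placeholder assumptions, not Toffee's actual values:

```yaml
# Hypothetical gateway definition (e.g. gateway.dstack.yml).
# Name, region, and domain are illustrative placeholders.
type: gateway
name: aws-gateway

# Provision the gateway on an EC2 instance in AWS
backend: aws
region: us-east-1

# Services published through the gateway become reachable
# under this domain, e.g. <service-name>.example.com
domain: example.com
```

Once applied (for example with `dstack apply -f gateway.dstack.yml`), services are exposed under the configured domain; in Toffee's setup, DNS resolution additionally goes through Route 53 private hosted zones so that traffic stays inside their AWS perimeter.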
Unlike the major hyperscalers (AWS, GCP, Azure), GPU neoclouds have historically offered more limited infrastructure-as-code (IaC) support, so teams often had to build their own tooling to provision and manage workloads at scale.
Toffee ran LLM and image-generation workloads across several GPU providers, but:
- Each provider had its own APIs and quirks
- Maintaining custom scripts and Terraform modules became increasingly painful as they scaled
They needed **a unified orchestration layer** that:
- Worked across their GPU providers
- Didn’t require Toffee to build and maintain its own orchestration platform
`dstack` became the core of Toffee’s infrastructure by providing a declarative, cloud-agnostic way to provision GPUs and run services across multiple providers.
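To give a flavor of what such a declarative config looks like, here is a minimal, hypothetical service definition in dstack's YAML format. The model, image, and resource numbers are illustrative assumptions, not Toffee's production configuration:

```yaml
# Hypothetical service definition (e.g. llm.dstack.yml).
# Model, image, and sizes are illustrative, not Toffee's values.
type: service
name: llm-endpoint

image: vllm/vllm-openai:latest
commands:
  - vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
port: 8000

# dstack matches these requirements against offers from any
# configured backend (e.g. RunPod or Vast.ai), so no
# provider-specific code is needed
resources:
  gpu: 24GB

# Scale between 1 and 4 replicas based on requests per second
replicas: 1..4
scaling:
  metric: rps
  target: 16
```

A single `dstack apply -f llm.dstack.yml` then provisions the GPUs, starts the replicas, and publishes the endpoint through the gateway, regardless of which provider the capacity comes from.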
> *Since we switched to `dstack`, we’ve cut the overhead of GPU-cloud orchestration by more than 50%. What used to take hours of custom Terraform + CLI scripting now deploys in minutes with a single declarative config — freeing us to focus on modelling, not infrastructure.*
>
> *[Nikita Shupeyko](https://www.linkedin.com/in/nikita-shupeyko/), AI/ML & Cloud Infrastructure Architect at Toffee*
Toffee primarily uses these `dstack` components:
- [**Services**](../../docs/concepts/services.md) – to define and run inference endpoints for LLM and image-generation models, including replica counts and resource requirements
- [**Gateways**](../../docs/concepts/gateways.md) – EC2-based entry points inside AWS that expose `dstack` services to the Toffee backend as secure and auto-scalable model endpoints
- **Dashboard UI** – to manage active workloads, see where services are running, and track usage and cost across providers
This architecture lets Toffee:
- Deploy new AI services via declarative configs instead of hand-rolled scripts
- Switch between GPU cloud providers without changing service code
- Keep all AI traffic flowing through their AWS network perimeter
<div style="text-align: center">
<img src="https://dstack.ai/static-assets/static-assets/images/toffee-diagram.svg" width="630" />
</div>
Beyond orchestration, Toffee relies on `dstack`’s UI as a central observability hub for their GPU workloads across providers. From the `dstack` UI, they can:
- See all active runs with resource allocations, costs, and current status across providers
- Inspect service-level dashboards for each AI endpoint
- Drill down into replica-level metrics, including GPU and CPU utilization, memory consumption, and instance-level logs and configuration details
<img src="https://dstack.ai/static-assets/static-assets/images/toffee-metrics-dark.png" width="750" />
> *Thanks to dstack’s seamless integration with GPU neoclouds like RunPod and Vast.ai, we’ve been able to shift most workloads off hyperscalers — reducing our effective GPU spend by roughly 2–3× without changing a single line of model code.*
>
> *[Nikita Shupeyko](https://www.linkedin.com/in/nikita-shupeyko/), AI/ML & Cloud Infrastructure Architect at Toffee*
Before adopting `dstack`, the home-grown approach had serious drawbacks:
- Significant **maintenance overhead** as they scaled to more services and providers
- Limited support for **zero-downtime deployments** and **autoscaling**
- Additional engineering effort required to build features that platforms like `dstack` already provided
As Toffee’s user base and model footprint grew, investing further in home-grown orchestration stopped making sense. With `dstack` in place, Toffee’s model and product teams spend more time on experimentation and user experience, and less time firefighting and maintaining brittle tooling.
*Huge thanks to Kamran and Nikita from Toffee’s team for sharing these insights. For more details, including the diagrams and some of the open-source code, check out the original post on Toffee's [research blog](https://research.toffee.ai/blog/how-we-use-dstack-at-toffee).*
!!! info "What's next?"

    1. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    2. Follow [Quickstart](../../docs/quickstart.md)
    3. Browse [Examples](../../examples.md)
