---
title: "How Toffee streamlines inference and cuts GPU costs with dstack"
date: 2025-12-05
description: "TBA"
slug: toffee
image: https://dstack.ai/static-assets/static-assets/images/dstack-toffee.png
categories:
  - Case studies
links:
  - Toffee's research blog: https://research.toffee.ai/blog/how-we-use-dstack-at-toffee
---

# How Toffee streamlines inference and cuts GPU costs with dstack

In a recent engineering [blog post](https://research.toffee.ai/blog/how-we-use-dstack-at-toffee), Toffee shared how they use `dstack` to run large-language and image-generation models across multiple GPU clouds, while keeping their core backend on AWS. This case study summarizes key insights and highlights how `dstack` became the backbone of Toffee’s multi-cloud inference stack.

<img src="https://dstack.ai/static-assets/static-assets/images/dstack-toffee.png" width="630" />

<!-- more -->

[Toffee](https://toffee.ai) builds AI-powered experiences backed by LLMs and image-generation models. To serve these workloads efficiently, they combine:

- **GPU neoclouds** such as [RunPod](https://www.runpod.io/) and [Vast.ai](https://vast.ai/) for flexible, cost-efficient GPU capacity
- **AWS** for core, non-AI services and backend infrastructure
- **dstack** as the orchestration layer that provisions GPU resources and exposes AI models via `dstack` [services](../../docs/concepts/services.md) and [gateways](../../docs/concepts/gateways.md)

Most user-facing logic lives in AWS. The backend communicates with AI services through `dstack` gateways, each running on an EC2 instance inside Toffee’s AWS perimeter and exposed via Route 53 private hosted zones. `dstack`, in turn, manages GPU workloads on GPU clouds, abstracting away provider differences.
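
A gateway of this kind is itself defined declaratively. A minimal sketch of what such a configuration could look like (the name, region, and domain below are illustrative, not taken from Toffee's setup):

```yaml
# Hypothetical gateway config (gateway.dstack.yml); name, region, and domain are illustrative
type: gateway
name: inference-gw
# Provision the gateway instance in AWS, inside the existing network perimeter
backend: aws
region: us-east-1
# Services are published under subdomains of this domain
domain: inference.example.com
```

Applying this once (e.g. with `dstack apply -f gateway.dstack.yml`) gives the backend a stable entry point, while the GPU workloads behind it can move freely between providers.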

Unlike the major hyperscalers (AWS, GCP, Azure), GPU neoclouds have historically offered more limited infrastructure-as-code (IaC) support, so teams often had to build their own tooling to provision and manage workloads at scale.

Toffee ran LLM and image-generation workloads across several GPU providers, but:

- Each provider had its own APIs and quirks
- Maintaining custom scripts and Terraform modules became increasingly painful as they scaled

They needed **a unified orchestration layer** that:

- Worked across their GPU providers
- Didn’t require Toffee to build and maintain its own orchestration platform

`dstack` became the core of Toffee’s infrastructure by providing a declarative, cloud-agnostic way to provision GPUs and run services across multiple providers.
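
In `dstack`, such a service is a single YAML file. A hedged sketch of what an inference endpoint along these lines could look like (the model, image, and sizing are illustrative assumptions, not details from the post):

```yaml
# Hypothetical service config (service.dstack.yml); model, image, and sizing are illustrative
type: service
name: llm-endpoint
image: vllm/vllm-openai:latest
commands:
  - vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
port: 8000
# Resource requirements; dstack matches these against offers from configured backends
resources:
  gpu: 24GB
# Scale between 1 and 4 replicas based on requests per second
replicas: 1..4
scaling:
  metric: rps
  target: 10
```

A single `dstack apply -f service.dstack.yml` then provisions GPUs on whichever configured provider has matching capacity and exposes the endpoint through the gateway.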

> *Since we switched to `dstack`, we’ve cut the overhead of GPU-cloud orchestration by more than 50%. What used to take hours of custom Terraform + CLI scripting now deploys in minutes with a single declarative config — freeing us to focus on modelling, not infrastructure.*
>
> *— [Nikita Shupeyko](https://www.linkedin.com/in/nikita-shupeyko/), AI/ML & Cloud Infrastructure Architect at Toffee*

Toffee primarily uses these `dstack` components:

- [**Services**](../../docs/concepts/services.md) – to define and run inference endpoints for LLM and image-generation models, including replica counts and resource requirements
- [**Gateways**](../../docs/concepts/gateways.md) – EC2-based entry points inside AWS that expose `dstack` services to the Toffee backend as secure and auto-scalable model endpoints
- **Dashboard UI** – to manage active workloads, see where services are running, and track usage and cost across providers

This architecture lets Toffee:

- Deploy new AI services via declarative configs instead of hand-rolled scripts
- Switch between GPU cloud providers without changing service code
- Keep all AI traffic flowing through their AWS network perimeter
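
The provider switching in the second point comes down to configuration rather than code. For example, a run configuration can list the backends it may use, so moving a workload between clouds is a one-line edit (backend names below follow `dstack`'s spelling; treat the exact list as an assumption):

```yaml
# Restrict where this workload may be provisioned; editing this list
# moves it to a different cloud with no changes to the service itself
backends: [runpod, vastai]
```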

<div style="text-align: center">
  <img src="https://dstack.ai/static-assets/static-assets/images/toffee-diagram.svg" width="630" />
</div>

Beyond orchestration, Toffee relies on `dstack`’s UI as a central observability hub for their GPU workloads across GPU clouds. From the `dstack` UI, they can:

- See all active runs with resource allocations, costs, and current status across providers
- Inspect service-level dashboards for each AI endpoint
- Drill down into replica-level metrics, including GPU and CPU utilization, memory consumption, and instance-level logs and configuration details

<img src="https://dstack.ai/static-assets/static-assets/images/toffee-metrics-dark.png" width="750" />

> *Thanks to dstack’s seamless integration with GPU neoclouds like RunPod and Vast.ai, we’ve been able to shift most workloads off hyperscalers — reducing our effective GPU spend by roughly 2–3× without changing a single line of model code.*
>
> *— [Nikita Shupeyko](https://www.linkedin.com/in/nikita-shupeyko/), AI/ML & Cloud Infrastructure Architect at Toffee*

Before adopting `dstack`, Toffee’s home-grown tooling had serious drawbacks:

- Significant **maintenance overhead** as they scaled to more services and providers
- Limited support for **zero-downtime deployments** and **autoscaling**
- Additional engineering effort required to build features that platforms like `dstack` already provided
As Toffee’s user base and model footprint grew, investing further in home-grown orchestration stopped making sense. With `dstack` in place, Toffee’s model and product teams spend more time on experimentation and user experience, and less time firefighting and maintaining brittle tooling.

*Huge thanks to Kamran and Nikita from Toffee’s team for sharing these insights. For more details, including the diagrams and some of the open-source code, check out the original post on Toffee's [research blog](https://research.toffee.ai/blog/how-we-use-dstack-at-toffee).*

!!! info "What's next?"
    1. Check [dev environments](../../docs/concepts/dev-environments.md), [tasks](../../docs/concepts/tasks.md), [services](../../docs/concepts/services.md), and [fleets](../../docs/concepts/fleets.md)
    2. Follow [Quickstart](../../docs/quickstart.md)
    3. Browse [Examples](../../examples.md)