From fc388a70a408a298148bca8be9816e8e4c3386d7 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Wed, 15 Oct 2025 10:22:13 -0500 Subject: [PATCH 1/9] Add blog post: Why We Recommend Managed Node Groups Over Fargate for EKS Add-Ons MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit This post explains the practical challenges of running EKS add-ons on Fargate-only clusters and why a small managed node group provides better reliability, cost efficiency, and automation for production environments. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- ...5-10-15-fargate-vs-managed-node-groups.mdx | 118 ++++++++++++++++++ 1 file changed, 118 insertions(+) create mode 100644 blog/2025-10-15-fargate-vs-managed-node-groups.mdx diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx new file mode 100644 index 000000000..a7e66563e --- /dev/null +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -0,0 +1,118 @@ +--- +title: "Why We Recommend Managed Node Groups Over Fargate for EKS Add-Ons" +description: "For production EKS clusters, a small managed node group provides reliability, cost efficiency, and automation—without Fargate's hidden complexity and bootstrap deadlock." +tags: [eks, kubernetes, karpenter, fargate, managed node groups, aws, best practices] +date: 2025-10-15 +authors: [osterman] +--- +import FeatureList from '@site/src/components/FeatureList'; +import Intro from '@site/src/components/Intro'; + + +When simplicity meets automation, sometimes it's the hidden complexity that bites back. + + +For a while, running Karpenter on AWS Fargate sounded like a perfect solution. No nodes to manage, automatic scaling, and no EC2 lifecycle headaches. Even the Karpenter team showcased it — after all, what better way to demonstrate an autoscaler's power than by letting it manage everything from a blank slate? 
+ +But in practice, that setup started to cause problems for certain EKS add-ons. Over time, those lessons led us — and our customers — to recommend using a small managed node group (MNG) instead of relying solely on Fargate. Here's why. + +--- + +## The Problem with "No Nodes" + +EKS cluster creation with Terraform requires certain managed add-ons — like CoreDNS or the EBS CSI driver — to become active before Terraform considers the cluster complete. + +But Fargate pods don't exist until there's a workload that needs them. That means when Terraform tries to deploy add-ons, there are no compute nodes for the add-ons to run on. Terraform waits… and waits… until the cluster creation fails. + +You can manually retry or patch things later, but that defeats the purpose of automation. We build for repeatability — not babysitting. + +--- + +## The Hidden Cost of "Serverless Nodes" + +Even after getting past cluster creation, the Fargate-only model creates subtle but serious issues with high availability. + +By AWS and Cloud Posse best practices, production-grade clusters should span three availability zones, with cluster-critical services distributed across them. + +However, during initial scheduling, Karpenter might spin up just one node large enough to fit all your add-on pods — even if they request three replicas with anti-affinity rules. Kubernetes will happily co-locate them all on that single node. + +Once they're running, those pods don't move automatically, even as the cluster grows. The result? + +**A deceptively healthy cluster with all your CoreDNS replicas living on the same node in one AZ — a single point of failure disguised as a distributed system.** + +--- + +## The Terraform Catch-22 + +Terraform enforces a strict dependency model: it won't complete a resource until it's ready. + +So without a static node group, Terraform can't successfully create the cluster (because the add-ons can't start). 
+ +And without those add-ons running, Karpenter can't launch its first node (because Karpenter itself is waiting on the cluster to stabilize). + +This circular dependency means your beautiful "fully automated" Fargate-only cluster gets stuck in the most ironic place: **bootstrap deadlock**. + +--- + +## The Solution: A Minimal Managed Node Pool + +Our solution is simple: + +**Deploy a tiny managed node group — one node per availability zone — as part of your base cluster.** + + + - This provides a home for cluster-critical add-ons during creation + - It ensures that CoreDNS, EBS CSI, and other vital components are naturally distributed across AZs + - It gives Karpenter a stable platform to run on + - And it eliminates the bootstrap deadlock problem entirely + + +You can even disable autoscaling for this node pool. One node per AZ is enough. + +Think of it as your cluster's heartbeat — steady, predictable, and inexpensive. + +--- + +## Cost and Flexibility + +Fargate offers convenience, but at a premium. A pod requesting 2 vCPUs and 4 GiB of memory costs about **$0.098/hour**, compared to **$0.076/hour** for an equivalent EC2 c6a.large instance. + +And because Fargate bills in coarse increments, you often overpay for partial capacity. + +By contrast, a managed node group gives you flexibility: + + + - Use Graviton-based instances (c7g.medium) to cut costs nearly in half + - Mix On-Demand nodes for reliability and Spot nodes (via Karpenter) for efficiency + - Keep the static node pool minimal while letting Karpenter handle the dynamic scale-out + + +The result: **predictable cost floor, flexible scale ceiling**. + +--- + +## Lessons Learned + +At Cloud Posse, we love automation — but we love reliability through simplicity even more. + +Running Karpenter on Fargate works for proof-of-concepts or ephemeral clusters. 
+ +But for production systems where uptime and high availability matter, a hybrid model is the clear winner: + + + - Static MNG for cluster-critical add-ons + - Karpenter for dynamic workloads + - Fargate only when you truly need "no nodes" + + +It's not about Fargate being bad — it's about knowing where it fits in your architecture. + +--- + +> "Karpenter doesn't use voting. Leader election uses Kubernetes leases. There's no strict technical requirement to have three pods — unless you actually care about staying up." +> +> — Ihor Urazov, Cloud Posse + +That's the key. If you care about staying up, give your cluster something to stand on. + +A small, stable managed node pool does exactly that. From 4f50c0f71db8ee65f4327cbb941dfb4d6e5180df Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Wed, 15 Oct 2025 10:26:13 -0500 Subject: [PATCH 2/9] Update attribution to SweetOps Slack --- blog/2025-10-15-fargate-vs-managed-node-groups.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index a7e66563e..c5006b315 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -111,7 +111,7 @@ It's not about Fargate being bad — it's about knowing where it fits in your ar > "Karpenter doesn't use voting. Leader election uses Kubernetes leases. There's no strict technical requirement to have three pods — unless you actually care about staying up." > -> — Ihor Urazov, Cloud Posse +> — SweetOps Slack That's the key. If you care about staying up, give your cluster something to stand on. 
From 5a8220d101a89f740c3c29cf6a4d9e12940c1340 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Wed, 15 Oct 2025 10:27:00 -0500 Subject: [PATCH 3/9] Update attribution to Ihor Urazov, SweetOps Slack --- blog/2025-10-15-fargate-vs-managed-node-groups.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index c5006b315..32a07d356 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -111,7 +111,7 @@ It's not about Fargate being bad — it's about knowing where it fits in your ar > "Karpenter doesn't use voting. Leader election uses Kubernetes leases. There's no strict technical requirement to have three pods — unless you actually care about staying up." > -> — SweetOps Slack +> — Ihor Urazov, SweetOps Slack That's the key. If you care about staying up, give your cluster something to stand on. From 94234ee2818125f9bf65bc01953ae6ed4947d1a8 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Wed, 15 Oct 2025 10:27:52 -0500 Subject: [PATCH 4/9] Remove horizontal rule dividers (let CSS handle formatting) --- blog/2025-10-15-fargate-vs-managed-node-groups.mdx | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index 32a07d356..ebde633fd 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -16,8 +16,6 @@ For a while, running Karpenter on AWS Fargate sounded like a perfect solution. N But in practice, that setup started to cause problems for certain EKS add-ons. Over time, those lessons led us — and our customers — to recommend using a small managed node group (MNG) instead of relying solely on Fargate. Here's why. 
---- - ## The Problem with "No Nodes" EKS cluster creation with Terraform requires certain managed add-ons — like CoreDNS or the EBS CSI driver — to become active before Terraform considers the cluster complete. @@ -26,8 +24,6 @@ But Fargate pods don't exist until there's a workload that needs them. That mean You can manually retry or patch things later, but that defeats the purpose of automation. We build for repeatability — not babysitting. ---- - ## The Hidden Cost of "Serverless Nodes" Even after getting past cluster creation, the Fargate-only model creates subtle but serious issues with high availability. @@ -40,8 +36,6 @@ Once they're running, those pods don't move automatically, even as the cluster g **A deceptively healthy cluster with all your CoreDNS replicas living on the same node in one AZ — a single point of failure disguised as a distributed system.** ---- - ## The Terraform Catch-22 Terraform enforces a strict dependency model: it won't complete a resource until it's ready. @@ -52,8 +46,6 @@ And without those add-ons running, Karpenter can't launch its first node (becaus This circular dependency means your beautiful "fully automated" Fargate-only cluster gets stuck in the most ironic place: **bootstrap deadlock**. ---- - ## The Solution: A Minimal Managed Node Pool Our solution is simple: @@ -71,8 +63,6 @@ You can even disable autoscaling for this node pool. One node per AZ is enough. Think of it as your cluster's heartbeat — steady, predictable, and inexpensive. ---- - ## Cost and Flexibility Fargate offers convenience, but at a premium. A pod requesting 2 vCPUs and 4 GiB of memory costs about **$0.098/hour**, compared to **$0.076/hour** for an equivalent EC2 c6a.large instance. @@ -89,8 +79,6 @@ By contrast, a managed node group gives you flexibility: The result: **predictable cost floor, flexible scale ceiling**. ---- - ## Lessons Learned At Cloud Posse, we love automation — but we love reliability through simplicity even more. 
@@ -107,8 +95,6 @@ But for production systems where uptime and high availability matter, a hybrid m It's not about Fargate being bad — it's about knowing where it fits in your architecture. ---- - > "Karpenter doesn't use voting. Leader election uses Kubernetes leases. There's no strict technical requirement to have three pods — unless you actually care about staying up." > > — Ihor Urazov, SweetOps Slack From 59362c145dd2b507b6dca7e613e78ea1d32e1280 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Thu, 16 Oct 2025 00:56:14 -0500 Subject: [PATCH 5/9] Improve Fargate vs MNG blog post with citations and balanced perspective MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Combine "The Terraform Catch-22" with "The Problem with No Nodes" to eliminate duplication - Fix logical inconsistency: clarify co-location issue is with MNG, not Fargate - Add acknowledgment that recommendation diverges from official AWS guidance - Add citations to AWS EKS Best Practices, Karpenter docs, and Fargate configuration docs - Add context about why Fargate was initially attractive - Document additional Fargate architectural constraints - Note evolution of Karpenter's own defaults to MNG - Add "Your Mileage May Vary" section acknowledging teams that successfully use Fargate - Clarify that frequently-rebuilt dev clusters are worse candidates for Fargate - Strengthen conclusion to focus on operational requirements determining choice 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- ...5-10-15-fargate-vs-managed-node-groups.mdx | 83 +++++++++++++++---- 1 file changed, 66 insertions(+), 17 deletions(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index ebde633fd..074e8033c 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -12,39 +12,50 @@ import Intro from 
'@site/src/components/Intro'; When simplicity meets automation, sometimes it's the hidden complexity that bites back. -For a while, running Karpenter on AWS Fargate sounded like a perfect solution. No nodes to manage, automatic scaling, and no EC2 lifecycle headaches. Even the Karpenter team showcased it — after all, what better way to demonstrate an autoscaler's power than by letting it manage everything from a blank slate? +For a while, running Karpenter on AWS Fargate sounded like a perfect solution. No nodes to manage, automatic scaling, and no EC2 lifecycle headaches. The [AWS EKS Best Practices Guide](https://aws.github.io/aws-eks-best-practices/karpenter/#run-the-karpenter-controller-on-eks-fargate-or-on-a-worker-node-that-belongs-to-a-node-group) and [Karpenter's official documentation](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/) both present Fargate as a viable option for running the Karpenter controller. -But in practice, that setup started to cause problems for certain EKS add-ons. Over time, those lessons led us — and our customers — to recommend using a small managed node group (MNG) instead of relying solely on Fargate. Here's why. +But in practice, that setup started to cause problems for certain EKS add-ons. Over time, those lessons led us — and our customers — to recommend using a small managed node group (MNG) instead of relying solely on Fargate. -## The Problem with "No Nodes" +**This recommendation diverges from some official AWS guidance**, and we acknowledge that. Here's why we made this decision. 
+ +## Why Fargate Was Attractive (and Still Is, Sometimes) + +The appeal of Fargate for Karpenter is understandable: + + + - No need to bootstrap a managed node group before deploying Karpenter + - Simpler initial setup for teams not using Infrastructure-as-Code frameworks + - Karpenter's early versions had limited integration with managed node pools + - It showcased Karpenter's capabilities in the most dramatic way possible + + +For teams deploying clusters manually or with basic tooling, Fargate eliminates several complex setup steps. But when you're using sophisticated Infrastructure-as-Code like [Cloud Posse's Terraform components](https://docs.cloudposse.com/components/), that initial complexity is already handled—and the operational benefits of a managed node group become far more valuable. + +## The Problem with "No Nodes" (and the Terraform Catch-22) EKS cluster creation with Terraform requires certain managed add-ons — like CoreDNS or the EBS CSI driver — to become active before Terraform considers the cluster complete. But Fargate pods don't exist until there's a workload that needs them. That means when Terraform tries to deploy add-ons, there are no compute nodes for the add-ons to run on. Terraform waits… and waits… until the cluster creation fails. +Terraform enforces a strict dependency model: it won't complete a resource until it's ready. Without a static node group, Terraform can't successfully create the cluster (because the add-ons can't start). And without those add-ons running, Karpenter can't launch its first node (because Karpenter itself is waiting on the cluster to stabilize). + +This circular dependency means your beautiful "fully automated" Fargate-only cluster gets stuck in the most ironic place: **bootstrap deadlock**. + You can manually retry or patch things later, but that defeats the purpose of automation. We build for repeatability — not babysitting. 
## The Hidden Cost of "Serverless Nodes" -Even after getting past cluster creation, the Fargate-only model creates subtle but serious issues with high availability. +Even after getting past cluster creation, there are subtle but serious issues with high availability. By AWS and Cloud Posse best practices, production-grade clusters should span three availability zones, with cluster-critical services distributed across them. -However, during initial scheduling, Karpenter might spin up just one node large enough to fit all your add-on pods — even if they request three replicas with anti-affinity rules. Kubernetes will happily co-locate them all on that single node. +However, during initial scheduling with **managed node groups**, Karpenter might spin up just one node large enough to fit all your add-on pods — even if they request three replicas with anti-affinity rules. Kubernetes will happily co-locate them all on that single node. Once they're running, those pods don't move automatically, even as the cluster grows. The result? **A deceptively healthy cluster with all your CoreDNS replicas living on the same node in one AZ — a single point of failure disguised as a distributed system.** -## The Terraform Catch-22 - -Terraform enforces a strict dependency model: it won't complete a resource until it's ready. - -So without a static node group, Terraform can't successfully create the cluster (because the add-ons can't start). - -And without those add-ons running, Karpenter can't launch its first node (because Karpenter itself is waiting on the cluster to stabilize). - -This circular dependency means your beautiful "fully automated" Fargate-only cluster gets stuck in the most ironic place: **bootstrap deadlock**. 
+While [topologySpreadConstraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/) can help encourage multi-AZ distribution, they don't guarantee it during the critical cluster bootstrap phase when Karpenter is creating its first nodes. ## The Solution: A Minimal Managed Node Pool @@ -63,11 +74,24 @@ You can even disable autoscaling for this node pool. One node per AZ is enough. Think of it as your cluster's heartbeat — steady, predictable, and inexpensive. +## Additional Fargate Constraints + +Beyond the HA challenges, [Fargate has architectural constraints](https://docs.aws.amazon.com/eks/latest/userguide/fargate-pod-configuration.html) that can affect cluster add-ons: + + + - Each Fargate pod runs on its own isolated compute resource (one pod per node) + - No dynamic persistent volume provisioning + - Fixed CPU and memory configurations with coarse granularity + - 256 MB memory overhead for Kubernetes components + + +While these constraints don't necessarily prevent Fargate from working, they add complexity when running cluster-critical infrastructure that needs precise resource allocation and high availability guarantees. + ## Cost and Flexibility Fargate offers convenience, but at a premium. A pod requesting 2 vCPUs and 4 GiB of memory costs about **$0.098/hour**, compared to **$0.076/hour** for an equivalent EC2 c6a.large instance. -And because Fargate bills in coarse increments, you often overpay for partial capacity. +And because [Fargate bills in coarse increments](https://docs.aws.amazon.com/eks/latest/userguide/fargate-pod-configuration.html), you often overpay for partial capacity. By contrast, a managed node group gives you flexibility: @@ -79,6 +103,12 @@ By contrast, a managed node group gives you flexibility: The result: **predictable cost floor, flexible scale ceiling**. +## The Evolution of Karpenter's Recommendations + +Interestingly, the Karpenter team's own guidance has evolved over time. 
[Karpenter's current getting started guide](https://karpenter.sh/docs/getting-started/getting-started-with-karpenter/) now defaults to using **EKS Managed Node Groups** in its example configurations, with Fargate presented as an alternative that requires uncommenting configuration sections. + +While we can't pinpoint exactly when this shift occurred, it suggests the Karpenter team recognized that managed node groups provide a more reliable foundation for most production use cases. + ## Lessons Learned At Cloud Posse, we love automation — but we love reliability through simplicity even more. @@ -95,10 +125,29 @@ But for production systems where uptime and high availability matter, a hybrid m It's not about Fargate being bad — it's about knowing where it fits in your architecture. +## When Fargate-Only Might Still Work + +To be fair, there are scenarios where running Karpenter on Fargate might make sense: + + + - Long-lived development environments where the $120/month MNG baseline cost matters more than availability + - Clusters deployed manually (not via Terraform) where bootstrap automation isn't critical + - Proof-of-concept deployments demonstrating Karpenter's capabilities + - Organizations that have accepted the operational trade-offs and built workarounds + + +**However**, be aware that development clusters that are frequently rebuilt will hit the Terraform bootstrap deadlock problem more often—making automation failures a regular occurrence rather than a one-time setup issue. + +## Your Mileage May Vary + +It's worth noting that [experienced practitioners in the SweetOps community](https://sweetops.slack.com/) have successfully run Karpenter on Fargate for years across multiple production clusters. Their setups work, and they've built processes around the constraints. + +This proves our recommendation isn't absolute—some teams make Fargate work through careful configuration and accepted trade-offs. 
However, these same practitioners acknowledged they'd likely choose MNG if starting fresh today with modern tooling. + > "Karpenter doesn't use voting. Leader election uses Kubernetes leases. There's no strict technical requirement to have three pods — unless you actually care about staying up." > > — Ihor Urazov, SweetOps Slack -That's the key. If you care about staying up, give your cluster something to stand on. +That's the key insight. The technical requirements are flexible—it's your operational requirements that determine the right choice. -A small, stable managed node pool does exactly that. +If staying up matters, if automation matters, if avoiding manual intervention matters, then give your cluster something solid to stand on. A small, stable managed node pool does exactly that. From 6f71ac0e28636f129ea36bb9ccf395c6e53b6669 Mon Sep 17 00:00:00 2001 From: "Erik Osterman (CEO @ Cloud Posse)" Date: Thu, 16 Oct 2025 11:32:03 -0500 Subject: [PATCH 6/9] Update blog/2025-10-15-fargate-vs-managed-node-groups.mdx Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --- blog/2025-10-15-fargate-vs-managed-node-groups.mdx | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index 074e8033c..537dbe40e 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -80,7 +80,7 @@ Beyond the HA challenges, [Fargate has architectural constraints](https://docs.a - Each Fargate pod runs on its own isolated compute resource (one pod per node) - - No dynamic persistent volume provisioning + - No support for EBS-backed dynamic PVCs; only EFS CSI volumes are supported - Fixed CPU and memory configurations with coarse granularity - 256 MB memory overhead for Kubernetes components From ca1de72d87a30893552011c31fdcdf6dc20d7e02 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Thu, 16 Oct 
2025 11:32:54 -0500 Subject: [PATCH 7/9] Clarify MNG uses On-Demand instances, Karpenter uses Spot MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Fix confusion about which component uses which instance type: - Static MNG runs On-Demand instances for reliability of cluster-critical add-ons - Karpenter provisions Spot instances for dynamic application workloads - Update "Cost and Flexibility" section to clearly distinguish the two - Update "Lessons Learned" section to specify instance types per component This addresses the concern that mixing Spot instances in the static MNG would undermine the reliability we're advocating for. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- ...025-10-15-fargate-vs-managed-node-groups.mdx | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index 537dbe40e..f3083a15a 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -93,15 +93,16 @@ Fargate offers convenience, but at a premium. A pod requesting 2 vCPUs and 4 GiB And because [Fargate bills in coarse increments](https://docs.aws.amazon.com/eks/latest/userguide/fargate-pod-configuration.html), you often overpay for partial capacity. 
-By contrast, a managed node group gives you flexibility: +By contrast, a managed node group for your baseline cluster infrastructure gives you flexibility: - - Use Graviton-based instances (c7g.medium) to cut costs nearly in half - - Mix On-Demand nodes for reliability and Spot nodes (via Karpenter) for efficiency - - Keep the static node pool minimal while letting Karpenter handle the dynamic scale-out + - Use Graviton-based instances (c7g.medium) for the MNG to cut costs nearly in half + - Run the static MNG with On-Demand instances for reliability + - Let Karpenter provision Spot instances for dynamic workloads, maximizing cost savings + - Keep the baseline small and predictable while Karpenter handles burst capacity -The result: **predictable cost floor, flexible scale ceiling**. +The result: **reliable foundation with On-Demand nodes, cost-efficient scaling with Spot**. ## The Evolution of Karpenter's Recommendations @@ -118,9 +119,9 @@ Running Karpenter on Fargate works for proof-of-concepts or ephemeral clusters. But for production systems where uptime and high availability matter, a hybrid model is the clear winner: - - Static MNG for cluster-critical add-ons - - Karpenter for dynamic workloads - - Fargate only when you truly need "no nodes" + - Static MNG with On-Demand instances for cluster-critical add-ons (CoreDNS, Karpenter, etc.) + - Karpenter provisioning Spot instances for dynamic application workloads + - Fargate only when you truly need pod-level isolation It's not about Fargate being bad — it's about knowing where it fits in your architecture. 
From c13546a4360430c0d65a5017f2aae7f10a906b72 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Thu, 16 Oct 2025 11:35:07 -0500 Subject: [PATCH 8/9] Clarify Spot instances are only for application workloads, not add-ons MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The previous wording "dynamic workloads" was ambiguous and could be misread as including cluster add-ons. This explicitly states: - MNG with On-Demand instances = cluster add-ons (stable foundation) - Karpenter with Spot instances = application workloads only (cost savings) This distinction is critical to the stability argument. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- blog/2025-10-15-fargate-vs-managed-node-groups.mdx | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index f3083a15a..3212cee02 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -93,16 +93,16 @@ Fargate offers convenience, but at a premium. A pod requesting 2 vCPUs and 4 GiB And because [Fargate bills in coarse increments](https://docs.aws.amazon.com/eks/latest/userguide/fargate-pod-configuration.html), you often overpay for partial capacity. 
-By contrast, a managed node group for your baseline cluster infrastructure gives you flexibility: +By contrast, the hybrid approach unlocks significant advantages: - - Use Graviton-based instances (c7g.medium) for the MNG to cut costs nearly in half - - Run the static MNG with On-Demand instances for reliability - - Let Karpenter provision Spot instances for dynamic workloads, maximizing cost savings - - Keep the baseline small and predictable while Karpenter handles burst capacity + - Static MNG with On-Demand instances provides a stable foundation for cluster add-ons + - Use cost-effective Graviton instances (c7g.medium) to reduce baseline costs + - Karpenter provisions Spot instances exclusively for application workloads (not add-ons) + - Achieve cost savings on application pods while maintaining reliability for cluster infrastructure -The result: **reliable foundation with On-Demand nodes, cost-efficient scaling with Spot**. +The result: **stable cluster services on On-Demand, cost-optimized applications on Spot**. ## The Evolution of Karpenter's Recommendations From 66a5d4d3d3d3dd38bd9e9d8464804f4f59e8ee66 Mon Sep 17 00:00:00 2001 From: Erik Osterman Date: Thu, 16 Oct 2025 11:37:45 -0500 Subject: [PATCH 9/9] Add section on EKS Auto Mode as alternative solution MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit EKS Auto Mode (announced December 2024) solves the bootstrap deadlock problem by running Karpenter and other cluster components off-cluster as AWS-managed services. This eliminates the chicken-and-egg dependency entirely. Added balanced coverage noting: - How Auto Mode sidesteps the bootstrap problem - Trade-offs: 12-15% cost premium, CNI lock-in, less control - When it makes sense vs when MNG + Karpenter approach is still relevant This provides readers with awareness of all current options. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude --- ...025-10-15-fargate-vs-managed-node-groups.mdx | 17 +++++++++++++++++ 1 file changed, 17 insertions(+) diff --git a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx index 3212cee02..e8f51861f 100644 --- a/blog/2025-10-15-fargate-vs-managed-node-groups.mdx +++ b/blog/2025-10-15-fargate-vs-managed-node-groups.mdx @@ -152,3 +152,20 @@ This proves our recommendation isn't absolute—some teams make Fargate work thr That's the key insight. The technical requirements are flexible—it's your operational requirements that determine the right choice. If staying up matters, if automation matters, if avoiding manual intervention matters, then give your cluster something solid to stand on. A small, stable managed node pool does exactly that. + +## What About EKS Auto Mode? + +It's worth mentioning that AWS introduced [EKS Auto Mode](https://docs.aws.amazon.com/eks/latest/userguide/automode.html) in December 2024, which takes a fundamentally different approach to solving these problems. + +EKS Auto Mode runs Karpenter and other critical cluster components (like the EBS CSI driver and Load Balancer Controller) **off-cluster** as AWS-managed services. This elegantly sidesteps the bootstrap deadlock problem entirely—there's no chicken-and-egg dependency because the control plane components don't need to run inside your cluster. + +The cluster starts with zero nodes and automatically provisions compute capacity as workloads are scheduled. 
While this solves the technical bootstrap challenge we've discussed, it comes with trade-offs: + + + - Additional 12-15% cost premium on top of EC2 instance costs + - Lock-in to AWS VPC CNI (can't use alternatives like Cilium or Calico) + - Less control over cluster infrastructure configuration + - Available only for Kubernetes 1.29+ and not in all AWS regions + + +For organizations willing to accept these constraints in exchange for fully managed operations, EKS Auto Mode may address many of the concerns raised in this post. However, for teams requiring fine-grained control, cost optimization, or running on older Kubernetes versions, the MNG + Karpenter approach remains highly relevant.
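
The hourly figures quoted in "Cost and Flexibility" and the Auto Mode premium above can be compared with a little arithmetic. The following is a minimal sketch, not a pricing tool: the rates are the ones stated in this post, the 730-hour month is an assumption, and the helper names (`monthly_cost`, `premium_pct`) are hypothetical. Actual prices vary by region and change over time, so check current AWS pricing before relying on any of these numbers.

```python
# Rough cost comparison using the figures quoted in this post.
# Assumption: 730 hours per month (a common billing approximation).
HOURS_PER_MONTH = 730

# Hourly rates as stated in the post (not fetched from AWS).
FARGATE_HOURLY = 0.098  # Fargate pod requesting 2 vCPU / 4 GiB
EC2_HOURLY = 0.076      # c6a.large On-Demand instance

# EKS Auto Mode premium range as stated in the post (12-15% on top of EC2).
AUTO_MODE_PREMIUM_LOW = 0.12
AUTO_MODE_PREMIUM_HIGH = 0.15


def monthly_cost(hourly_rate: float, hours: int = HOURS_PER_MONTH) -> float:
    """Return the cost of running one unit continuously for a month."""
    return hourly_rate * hours


def premium_pct(rate: float, baseline: float) -> float:
    """Return how much more `rate` costs than `baseline`, in percent."""
    return (rate / baseline - 1.0) * 100.0


fargate_monthly = monthly_cost(FARGATE_HOURLY)
ec2_monthly = monthly_cost(EC2_HOURLY)
fargate_premium = premium_pct(FARGATE_HOURLY, EC2_HOURLY)

# Effective hourly rate of the same EC2 instance under Auto Mode.
auto_mode_low = EC2_HOURLY * (1.0 + AUTO_MODE_PREMIUM_LOW)
auto_mode_high = EC2_HOURLY * (1.0 + AUTO_MODE_PREMIUM_HIGH)

if __name__ == "__main__":
    print(f"Fargate pod:  ${fargate_monthly:.2f}/month")
    print(f"EC2 instance: ${ec2_monthly:.2f}/month")
    print(f"Fargate premium over EC2: {fargate_premium:.1f}%")
    print(f"Auto Mode effective rate: ${auto_mode_low:.4f} to ${auto_mode_high:.4f}/hour")
```

Under these assumptions the comparison covers a single 2 vCPU / 4 GiB unit only. A three-node baseline (one per availability zone) multiplies the EC2 figure by three, and a smaller Graviton instance type would change it again, so treat the output as an order-of-magnitude check rather than a budget.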