[FEA] [deploy] Replace Docker Compose deployment with Helm charts for Kubernetes #106

@antoniomtz

Is your feature request related to a problem? Please describe.

The project is deployable today through Docker Compose only. That works for local and single-node demos, but it does not provide a Kubernetes-native deployment path for teams that need declarative releases, configurable persistence, ingress-based routing, secret management, and optional GPU-backed local NVIDIA NIM services.

The current deployment model is split across:

  • docker-compose.yml for the app stack: nginx, ui, merchant, psp, apps-sdk, 4 NAT agents, and a one-shot milvus-seeder
  • docker-compose.infra.yml for milvus-etcd, milvus-minio, milvus-standalone, and phoenix
  • docker-compose-nim.yml for optional self-hosted nemotron-nano and embedqa

This leaves Kubernetes users without a supported way to deploy the system end-to-end while preserving the existing ACP/UCP flows, Apps SDK routing, Milvus-backed recommendation/search path, Phoenix observability, and optional in-cluster NIM inference.

Describe the solution you'd like

Add Helm charts and Kubernetes deployment documentation that can deploy the platform end-to-end.

Scope:

  • Core app services: ui, merchant, psp, apps-sdk
  • NAT agent services: promotion-agent, post-purchase-agent, recommendation-agent, search-agent
  • Supporting services: milvus, etcd, minio, phoenix
  • One-shot Milvus initialization job
  • Ingress/service routing that preserves current external behavior now handled by nginx
  • Optional self-hosted NVIDIA NIM services: nemotron-nano, embedqa

Implementation expectations:

  • Add a top-level Helm chart under deploy/helm/retail-agentic-commerce/ with values-driven enablement of optional components
  • Use Kubernetes Ingress plus Service resources instead of deploying the current standalone nginx container
  • Preserve the current route behavior:
    • / -> ui
    • /api/webhooks/, /api/agents/, /api/proxy/ -> ui
    • /api/ -> merchant (everything under /api/ that is not matched by the more specific prefixes above)
    • /psp/ -> psp
    • /apps-sdk/ -> apps-sdk
  • Preserve current service-to-service environment wiring from Compose:
    • merchant -> promotion/post-purchase/recommendation agents
    • apps-sdk -> merchant/psp/recommendation/search
    • ui -> merchant/psp/phoenix and UCP profile URL
    • agents -> phoenix/milvus/NIM endpoints
  • Replace Compose health checks with Kubernetes startupProbe, readinessProbe, and livenessProbe
  • Convert milvus-seeder into a Kubernetes Job or Helm hook that runs after Milvus is reachable
  • Ship values files for:
    • default mode using public NVIDIA API endpoints
    • optional self-hosted NIM mode using in-cluster nemotron-nano and embedqa
  • Document secrets/config needed for:
    • NVIDIA_API_KEY
    • NGC_API_KEY
    • MERCHANT_API_KEY
    • PSP_API_KEY
    • WEBHOOK_SECRET
    • NIM base/model overrides
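The route list above only behaves as intended if the more specific /api/webhooks/, /api/agents/, and /api/proxy/ prefixes win over the general /api/ prefix. The following is a minimal sketch of that longest-prefix rule, which could serve as a reference when checking rendered Ingress rules. The function name `resolve_backend` is hypothetical, and plain string-prefix matching is a simplification of how an actual ingress controller matches path elements.

```python
# Route table copied from the issue; values are the target service names.
ROUTES = {
    "/": "ui",
    "/api/webhooks/": "ui",
    "/api/agents/": "ui",
    "/api/proxy/": "ui",
    "/api/": "merchant",
    "/psp/": "psp",
    "/apps-sdk/": "apps-sdk",
}


def resolve_backend(path, routes=ROUTES):
    """Return the backend for `path`, preferring the longest matching prefix.

    Hypothetical helper: it models the expected precedence only and is not
    how nginx or any ingress controller is implemented.
    """
    matches = [prefix for prefix in routes if path.startswith(prefix)]
    if not matches:
        raise ValueError(f"no route matches {path!r}")
    return routes[max(matches, key=len)]


if __name__ == "__main__":
    for sample in ["/", "/api/webhooks/order", "/api/health", "/psp/health"]:
        print(sample, "->", resolve_backend(sample))
```

With this table, `/api/webhooks/order` resolves to `ui` while `/api/health` resolves to `merchant`, which is the behavior the Ingress paths would need to reproduce.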

Persistence requirements:

  • Model the current persistent state explicitly:
    • shared SQLite data currently mounted as acp-data
    • Milvus data
    • MinIO data
    • etcd data
    • Phoenix working directory
  • Initial Kubernetes support may assume a storage class that can satisfy the current shared SQLite requirement (for example, ReadWriteMany access if the pods that share acp-data can be scheduled on different nodes). If that is not feasible in a target cluster, document it in this issue as a deployment prerequisite or limitation rather than silently redesigning persistence.

Deliverables:

  • Helm chart(s) committed to the repo
  • Values files for public-NIM and self-hosted-NIM deployment modes
  • Kubernetes deployment docs replacing or complementing deploy/docker-deployment.md
  • Validation steps with exact helm and kubectl commands

Acceptance criteria:

  • helm template succeeds for default and NIM-enabled values
  • helm install deploys the core stack successfully on a local Kubernetes target for non-GPU mode
  • UI is reachable through ingress and can talk to backend services through the expected routes
  • Merchant, PSP, Apps SDK, and all 4 NAT agents expose healthy pods/services
  • Recommendation and search agents can reach Milvus
  • Phoenix is reachable
  • Milvus seeding completes automatically
  • Public NIM mode works without in-cluster GPU services
  • Self-hosted NIM mode can be enabled through values and includes GPU scheduling/resource configuration for nemotron-nano and embedqa
  • Docs include install, upgrade, rollback, and verification commands
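For the "Milvus seeding completes automatically" criterion, the seeding Job needs some way to hold off until Milvus is reachable. One possible approach, sketched below, is a small TCP wait loop run before the seeder starts. The helper name `wait_for_tcp` and the host `milvus-standalone` are assumptions (the host is borrowed from the Compose service name); 19530 is Milvus's default gRPC port. A successful TCP connect only shows the port is open, so the real seeder may still need its own retry logic.

```python
import socket
import time


def wait_for_tcp(host, port, timeout_s=120.0, interval_s=2.0):
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration; not part of the existing seeder.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)


if __name__ == "__main__":
    # Assumed in-cluster service name; adjust to whatever the chart defines.
    ok = wait_for_tcp("milvus-standalone", 19530)
    raise SystemExit(0 if ok else 1)
```

Exiting non-zero on timeout lets Kubernetes treat the wait as a failed step and retry it according to the Job's backoff settings.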

Suggested verification commands for the eventual implementation:

  • helm template deploy/helm/retail-agentic-commerce -f deploy/helm/retail-agentic-commerce/values.yaml
  • helm template deploy/helm/retail-agentic-commerce -f deploy/helm/retail-agentic-commerce/values-nim.yaml
  • kubectl get pods,svc,ingress
  • kubectl logs job/<milvus-seeder-job>
  • smoke checks for /, /api/health, /psp/health, /apps-sdk/health
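The smoke checks in the last item could be scripted along these lines. This is a sketch under assumptions: the names `http_status` and `run_smoke_checks` are hypothetical, the base URL depends on how the ingress is exposed in the target cluster, and treating HTTP 200 as "healthy" is an assumption about the health endpoints rather than something the issue specifies.

```python
import urllib.error
import urllib.request

# Paths taken from the suggested smoke checks in this issue.
SMOKE_PATHS = ["/", "/api/health", "/psp/health", "/apps-sdk/health"]


def http_status(url, timeout_s=5.0):
    """Return the HTTP status code of a GET request, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses still carry a status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def run_smoke_checks(base_url, fetch=http_status):
    """Map each smoke path to True when its status is 200 (assumed healthy)."""
    base = base_url.rstrip("/")
    return {path: fetch(base + path) == 200 for path in SMOKE_PATHS}


if __name__ == "__main__":
    # Hypothetical ingress address; replace with the cluster's actual host.
    results = run_smoke_checks("http://localhost")
    for path, ok in results.items():
        print(("PASS" if ok else "FAIL"), path)
    raise SystemExit(0 if all(results.values()) else 1)
```

The `fetch` parameter is injectable so the check logic can be exercised without a running cluster.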

Describe alternatives you've considered

  • Keep Docker Compose as the only supported deployment path: simple, but not suitable for Kubernetes environments
  • Use raw Kubernetes manifests instead of Helm: possible, but harder to maintain and configure across environments
  • Split this immediately into multiple issues: likely useful later, but a single implementation issue is a better first step as long as scope stays bounded and explicit

Additional context

Relevant repo files and behavior the implementation should mirror:

  • deploy/docker-deployment.md
  • deploy/local-development.md
  • docs/architecture.md
  • docker-compose.yml
  • docker-compose.infra.yml
  • docker-compose-nim.yml
  • nginx.conf
  • src/merchant/Dockerfile
  • src/payment/Dockerfile
  • src/apps_sdk/Dockerfile
  • src/ui/Dockerfile
  • src/agents/Dockerfile

Out of scope:

  • Changing ACP/UCP protocol semantics
  • Re-architecting service boundaries
  • Replacing the current data model as part of this issue
  • Production-grade autoscaling or multi-cluster design beyond what is required to run reliably on Kubernetes

Labels: enhancement
