This document captures architectural decisions and design patterns for the ToolHive Operator.
When building operators, the question of when to expose a podTemplateSpec and when to add a dedicated CRD attribute is always disputed. For the ToolHive Operator we follow a defined rule of thumb, illustrated in the sketch after the questions below.

Use a dedicated CRD attribute for:
- Business logic that affects your operator's behavior
- Validation requirements (ranges, formats, constraints)
- Cross-resource coordination (affects Services, ConfigMaps, etc.)
- Operator decision making (triggers different reconciliation paths)

Use PodTemplateSpec for:
- Infrastructure concerns (node selection, resources, affinity)
- Sidecar containers
- Standard Kubernetes pod configuration
- Things a cluster admin would typically configure

Key questions:
- "Does this affect my operator's reconciliation logic?" -> Dedicated attribute
- "Is this standard Kubernetes pod configuration?" -> PodTemplateSpec
- "Do I need to validate this beyond basic Kubernetes validation?" -> Dedicated attribute
Decision: Use standard Kubernetes workload status pattern matching MCPServer — flat Phase + Ready condition + ReadyReplicas + URL.
Rationale:
- Consistency with MCPServer and standard Kubernetes workload patterns
- Enables `kubectl wait --for=condition=Ready` and standard monitoring
- The operator only needs to track deployment readiness, not internal registry server state
- Tracking internal sync/API states would require the operator to call the registry server, which is not feasible when authentication is enabled
Implementation: Controller sets Phase, Message, URL, ReadyReplicas, and a Ready condition directly based on the API deployment's readiness. The latest resource version is refetched before status updates to avoid conflicts.
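A minimal sketch of that refetch-then-update flow, assuming the usual controller-runtime setup; the import path, phase strings, and exact status field shapes are assumptions based on the description above, not the actual code.

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Assumed import path for the operator's API types.
	mcpv1alpha1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1alpha1"
)

// updateStatus derives the flat status fields from the API deployment's
// readiness, refetching the resource first so the write carries the
// latest resourceVersion and avoids update conflicts.
func (r *MCPRegistryReconciler) updateStatus(
	ctx context.Context,
	registry *mcpv1alpha1.MCPRegistry,
	deployment *appsv1.Deployment,
	url string,
) error {
	latest := &mcpv1alpha1.MCPRegistry{}
	if err := r.Get(ctx, client.ObjectKeyFromObject(registry), latest); err != nil {
		return err
	}

	latest.Status.ReadyReplicas = deployment.Status.ReadyReplicas
	latest.Status.URL = url

	cond := metav1.Condition{Type: "Ready"}
	if deployment.Status.ReadyReplicas > 0 {
		latest.Status.Phase = "Ready" // hypothetical phase value
		latest.Status.Message = "registry API deployment is ready"
		cond.Status = metav1.ConditionTrue
		cond.Reason = "DeploymentReady"
	} else {
		latest.Status.Phase = "Pending"
		latest.Status.Message = "waiting for registry API deployment"
		cond.Status = metav1.ConditionFalse
		cond.Reason = "DeploymentNotReady"
	}
	cond.Message = latest.Status.Message
	meta.SetStatusCondition(&latest.Status.Conditions, cond)

	return r.Status().Update(ctx, latest)
}
```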
History: The original design used a StatusCollector pattern (mcpregistrystatus package) that batched status changes from multiple independent sources — an APIStatusCollector for deployment state and originally a sync collector — then applied them atomically via a single Status().Update(). A StatusDeriver computed the overall phase from sub-phases (SyncPhase + APIPhase → MCPRegistryPhase). This was removed because with sync operations moved to the registry server itself, only one status source remained (deployment readiness), making the batching/derivation indirection unnecessary. The new approach produces the same number of API server calls with less abstraction.
Decision: Deploy individual API service per MCPRegistry rather than shared service.
Rationale:
- Isolation: Each registry has independent lifecycle and scaling
- Security: Per-registry access control possible
- Reliability: Failure of one registry doesn't affect others
- Lifecycle Management: Automatic cleanup via owner references
Trade-offs: Higher resource consumption, in exchange for better isolation and security.
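A sketch of the per-registry Service creation with owner-reference cleanup, assuming controller-runtime's controllerutil helpers; the naming scheme, port, and selector label are illustrative assumptions.

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

	mcpv1alpha1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1alpha1" // assumed path
)

// ensureAPIService creates or updates a dedicated Service for a single
// MCPRegistry. r.Client and r.Scheme are the usual controller-runtime
// reconciler fields (assumed here).
func (r *MCPRegistryReconciler) ensureAPIService(
	ctx context.Context,
	registry *mcpv1alpha1.MCPRegistry,
) error {
	svc := &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      registry.Name + "-api", // one Service per registry, never shared
			Namespace: registry.Namespace,
		},
	}
	_, err := controllerutil.CreateOrUpdate(ctx, r.Client, svc, func() error {
		svc.Spec.Selector = map[string]string{
			"app.kubernetes.io/instance": registry.Name, // illustrative label
		}
		svc.Spec.Ports = []corev1.ServicePort{{
			Name:       "http",
			Port:       8080, // illustrative port
			TargetPort: intstr.FromString("http"),
		}}
		// The controller owner reference ties the Service's lifecycle to
		// the registry, giving the automatic cleanup described above.
		return controllerutil.SetControllerReference(registry, svc, r.Scheme)
	})
	return err
}
```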
Decision: Structured error types (`registryapi.Error`) with condition metadata.
Rationale:
- Different error types need different handling strategies
- Structured errors carry `ConditionReason` for setting Kubernetes conditions with specific failure reasons (e.g., `ConfigMapFailed`, `DeploymentFailed`)
- Enables better observability via condition reasons
Implementation: `registryapi.Error` carries `ConditionReason` and `Message`. The controller uses `errors.As` to extract structured fields when available, falling back to a generic `NotReady` reason for unstructured errors.
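A sketch of what this can look like. The `ConditionReason` and `Message` fields come from the description above; the wrapped-cause `Err`/`Unwrap` plumbing is an illustrative assumption.

```go
package registryapi

// Error is a structured error that carries condition metadata, so the
// controller can set a specific failure reason on the MCPRegistry.
type Error struct {
	ConditionReason string // e.g. "ConfigMapFailed", "DeploymentFailed"
	Message         string
	Err             error // wrapped cause; illustrative assumption
}

// Error implements the error interface.
func (e *Error) Error() string {
	if e.Err != nil {
		return e.Message + ": " + e.Err.Error()
	}
	return e.Message
}

// Unwrap exposes the cause to errors.Is/errors.As chains.
func (e *Error) Unwrap() error { return e.Err }
```

On the controller side, the extraction might look like the following hypothetical helper:

```go
package controllers

import (
	"errors"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	// Assumed import path for the structured error type.
	"github.com/stacklok/toolhive/pkg/registryapi"
)

// readyConditionFor extracts structured fields when available and
// falls back to a generic NotReady reason for unstructured errors.
func readyConditionFor(err error) metav1.Condition {
	reason, message := "NotReady", err.Error()
	var apiErr *registryapi.Error
	if errors.As(err, &apiErr) {
		reason, message = apiErr.ConditionReason, apiErr.Message
	}
	return metav1.Condition{
		Type:    "Ready",
		Status:  metav1.ConditionFalse,
		Reason:  reason,
		Message: message,
	}
}
```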
Performance considerations:
- Status Updates: Single refetch-then-update per reconciliation cycle
- API Deployment: Lazy creation only when needed (implemented; see the sketch below)
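A sketch of the lazy-creation check; the resource naming scheme and the `desiredAPIDeployment` helper are hypothetical.

```go
package controllers

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"

	mcpv1alpha1 "github.com/stacklok/toolhive/cmd/thv-operator/api/v1alpha1" // assumed path
)

// ensureAPIDeployment creates the registry's API Deployment only when
// it does not already exist, rather than rebuilding it on every
// reconcile. desiredAPIDeployment is a hypothetical helper that
// renders the desired Deployment for this registry.
func (r *MCPRegistryReconciler) ensureAPIDeployment(
	ctx context.Context,
	registry *mcpv1alpha1.MCPRegistry,
) error {
	existing := &appsv1.Deployment{}
	key := types.NamespacedName{
		Name:      registry.Name + "-api", // illustrative naming scheme
		Namespace: registry.Namespace,
	}
	err := r.Get(ctx, key, existing)
	if apierrors.IsNotFound(err) {
		// Lazy creation: build and submit the Deployment on first need only.
		return r.Create(ctx, r.desiredAPIDeployment(registry))
	}
	return err
}
```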
Minimal required permissions, following the principle of least privilege (kubebuilder markers are sketched after this list):
- ConfigMaps: For storage management
- Services/Deployments: For API service management
- MCPRegistry: For status updates
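In kubebuilder terms, these permissions might be declared with markers like the sketch below; the API group name and exact verb sets are assumptions to be checked against the operator's actual manifests.

```go
// RBAC markers (kubebuilder syntax), placed above the Reconcile method
// and scoped to what the controller actually touches. The group name
// below is an assumption; use the operator's real API group.

// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpregistries,verbs=get;list;watch
// +kubebuilder:rbac:groups=toolhive.stacklok.dev,resources=mcpregistries/status,verbs=get;update;patch
// +kubebuilder:rbac:groups="",resources=configmaps,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups="",resources=services,verbs=get;list;watch;create;update;patch;delete
// +kubebuilder:rbac:groups=apps,resources=deployments,verbs=get;list;watch;create;update;patch;delete
```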
Optional network policies can restrict registry API access in security-sensitive environments.