2 changes: 2 additions & 0 deletions config/_default/hugo.yaml
Original file line number Diff line number Diff line change
@@ -16,6 +16,8 @@ taxonomies:
   tag: tags
   category: categories
 params:
+  drawio:
+    enable: true
   taxonomy:
     taxonomyCloud:
       - tags
2 changes: 1 addition & 1 deletion content/en/_index.md
@@ -65,7 +65,7 @@ Capsule is completely declarative and GitOps ready.

{{% blocks/lead color="dark" %}}

-## Capsule is a CNCF Incubating Project { class="text-center mb-4" }
+## Capsule is a CNCF Sandbox Project { class="text-center mb-4" }

---

194 changes: 190 additions & 4 deletions content/en/docs/operating/monitoring.md
@@ -1,17 +1,203 @@
---
title: Monitoring
weight: 5
-description: "Monitoring Capsule Controller and Tenants"
+description: "Monitoring Capsule ResourcePools, Quotas, and Tenants"
---

The Capsule dashboard allows you to track the health and performance of Capsule manager and tenants, with particular attention to resources saturation, server responses, and latencies. Prometheus and Grafana are requirements for monitoring Capsule.

-## Metrics
+## ResourcePools

-### Quotas
+Instrumentation for [ResourcePools](../resourcepools/).

-### Custom
+### Dashboards

Dashboards can be deployed via the Helm chart by enabling the following values:

```yaml
monitoring:
  dashboards:
    enabled: true
```
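For example, the value can be set when installing or upgrading the Capsule Helm chart (a sketch — the release name and the `projectcapsule` repository name are assumptions, adjust to your setup):

```shell
# Enable the Grafana dashboards shipped with the chart
# (repository/release names are assumptions; the value path
# matches the values.yaml snippet above)
helm upgrade --install capsule projectcapsule/capsule \
  --namespace capsule-system --create-namespace \
  --set monitoring.dashboards.enabled=true
```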

#### Capsule / ResourcePools

A dashboard which provides a detailed overview of the ResourcePools.

![Resourcepool Dashboard](/images/content/monitoring/dashboard-resourcepools-1.png)

---

### Rules

Example rules to give you an idea of what's possible.

1. Alert on [ResourcePools](../resourcepools/) usage
   ```yaml
   apiVersion: monitoring.coreos.com/v1
   kind: PrometheusRule
   metadata:
     name: resourcepool-usage-alert
   spec:
     groups:
       - name: capsule-pool-usage.rules
         rules:
           - alert: CapsulePoolHighUsageWarning
             expr: |
               capsule_pool_usage_percentage > 90
             for: 10m
             labels:
               severity: warning
             annotations:
               summary: High resource usage in Resourcepool
               description: |
                 Resource {{ $labels.resource }} in pool {{ $labels.pool }} is at {{ $value }}% usage for the last 10 minutes.

           - alert: CapsulePoolHighUsageCritical
             expr: |
               capsule_pool_usage_percentage > 95
             for: 10m
             labels:
               severity: critical
             annotations:
               summary: Critical resource usage in Resourcepool
               description: |
                 Resource {{ $labels.resource }} in pool {{ $labels.pool }} has exceeded 95% usage for the last 10 minutes.
   ```
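If you prefer to alert on the raw gauges instead of the pre-computed percentage, an equivalent expression can be derived from `capsule_pool_usage` and `capsule_pool_limit` (a sketch — both gauges carry the same `pool` and `resource` labels, so the division matches sample-by-sample; in practice the pre-computed metric is simpler):

```yaml
expr: |
  100 * capsule_pool_usage / capsule_pool_limit > 90
```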

---

### Metrics

The following metrics are exposed and can be used for monitoring:

```shell
# HELP capsule_claim_condition The current condition status of a claim.
# TYPE capsule_claim_condition gauge
capsule_claim_condition{condition="Bound",name="compute",pool="solar-compute",reason="Succeeded",target_namespace="solar-prod"} 1
capsule_claim_condition{condition="Bound",name="compute-10",pool="solar-compute",reason="PoolExhausted",target_namespace="solar-prod"} 0
capsule_claim_condition{condition="Bound",name="compute-2",pool="solar-compute",reason="Succeeded",target_namespace="solar-prod"} 1
capsule_claim_condition{condition="Bound",name="compute-3",pool="solar-compute",reason="Succeeded",target_namespace="solar-prod"} 1
capsule_claim_condition{condition="Bound",name="compute-4",pool="solar-compute",reason="Succeeded",target_namespace="solar-test"} 1
capsule_claim_condition{condition="Bound",name="compute-5",pool="solar-compute",reason="PoolExhausted",target_namespace="solar-test"} 0
capsule_claim_condition{condition="Bound",name="compute-6",pool="solar-compute",reason="PoolExhausted",target_namespace="solar-test"} 0
capsule_claim_condition{condition="Bound",name="pods",pool="solar-size",reason="Succeeded",target_namespace="solar-test"} 1

# HELP capsule_claim_resource The given amount of resources from the claim
# TYPE capsule_claim_resource gauge
capsule_claim_resource{name="compute",resource="limits.cpu",target_namespace="solar-prod"} 0.375
capsule_claim_resource{name="compute",resource="limits.memory",target_namespace="solar-prod"} 4.02653184e+08
capsule_claim_resource{name="compute",resource="requests.cpu",target_namespace="solar-prod"} 0.375
capsule_claim_resource{name="compute",resource="requests.memory",target_namespace="solar-prod"} 4.02653184e+08
capsule_claim_resource{name="compute-10",resource="limits.memory",target_namespace="solar-prod"} 1.073741824e+10
capsule_claim_resource{name="compute-2",resource="limits.cpu",target_namespace="solar-prod"} 0.5
capsule_claim_resource{name="compute-2",resource="limits.memory",target_namespace="solar-prod"} 5.36870912e+08
capsule_claim_resource{name="compute-2",resource="requests.cpu",target_namespace="solar-prod"} 0.5
capsule_claim_resource{name="compute-2",resource="requests.memory",target_namespace="solar-prod"} 5.36870912e+08
capsule_claim_resource{name="compute-3",resource="requests.cpu",target_namespace="solar-prod"} 0.5
capsule_claim_resource{name="compute-4",resource="requests.cpu",target_namespace="solar-test"} 0.5
capsule_claim_resource{name="compute-5",resource="requests.cpu",target_namespace="solar-test"} 0.5
capsule_claim_resource{name="compute-6",resource="requests.cpu",target_namespace="solar-test"} 5
capsule_claim_resource{name="pods",resource="pods",target_namespace="solar-test"} 3

# HELP capsule_pool_available Current resource availability for a given resource in a resource pool
# TYPE capsule_pool_available gauge
capsule_pool_available{pool="solar-compute",resource="limits.cpu"} 1.125
capsule_pool_available{pool="solar-compute",resource="limits.memory"} 1.207959552e+09
capsule_pool_available{pool="solar-compute",resource="requests.cpu"} 0.125
capsule_pool_available{pool="solar-compute",resource="requests.memory"} 1.207959552e+09
capsule_pool_available{pool="solar-size",resource="pods"} 4

# HELP capsule_pool_exhaustion Resources become exhausted, when there's not enough available for all claims and the claims get queued
# TYPE capsule_pool_exhaustion gauge
capsule_pool_exhaustion{pool="solar-compute",resource="limits.memory"} 1.073741824e+10
capsule_pool_exhaustion{pool="solar-compute",resource="requests.cpu"} 5.5

# HELP capsule_pool_exhaustion_percentage Resources become exhausted, when there's not enough available for all claims and the claims get queued (Percentage)
# TYPE capsule_pool_exhaustion_percentage gauge
capsule_pool_exhaustion_percentage{pool="solar-compute",resource="limits.memory"} 788.8888888888889
capsule_pool_exhaustion_percentage{pool="solar-compute",resource="requests.cpu"} 4300

# HELP capsule_pool_limit Current resource limit for a given resource in a resource pool
# TYPE capsule_pool_limit gauge
capsule_pool_limit{pool="solar-compute",resource="limits.cpu"} 2
capsule_pool_limit{pool="solar-compute",resource="limits.memory"} 2.147483648e+09
capsule_pool_limit{pool="solar-compute",resource="requests.cpu"} 2
capsule_pool_limit{pool="solar-compute",resource="requests.memory"} 2.147483648e+09
capsule_pool_limit{pool="solar-size",resource="pods"} 7

# HELP capsule_pool_namespace_usage Current resources claimed on namespace basis for a given resource in a resource pool for a specific namespace
# TYPE capsule_pool_namespace_usage gauge
capsule_pool_namespace_usage{pool="solar-compute",resource="limits.cpu",target_namespace="solar-prod"} 0.875
capsule_pool_namespace_usage{pool="solar-compute",resource="limits.memory",target_namespace="solar-prod"} 9.39524096e+08
capsule_pool_namespace_usage{pool="solar-compute",resource="requests.cpu",target_namespace="solar-prod"} 1.375
capsule_pool_namespace_usage{pool="solar-compute",resource="requests.cpu",target_namespace="solar-test"} 0.5
capsule_pool_namespace_usage{pool="solar-compute",resource="requests.memory",target_namespace="solar-prod"} 9.39524096e+08
capsule_pool_namespace_usage{pool="solar-size",resource="pods",target_namespace="solar-test"} 3

# HELP capsule_pool_namespace_usage_percentage Current resources claimed on namespace basis for a given resource in a resource pool for a specific namespace (percentage)
# TYPE capsule_pool_namespace_usage_percentage gauge
capsule_pool_namespace_usage_percentage{pool="solar-compute",resource="limits.cpu",target_namespace="solar-prod"} 43.75
capsule_pool_namespace_usage_percentage{pool="solar-compute",resource="limits.memory",target_namespace="solar-prod"} 43.75
capsule_pool_namespace_usage_percentage{pool="solar-compute",resource="requests.cpu",target_namespace="solar-prod"} 68.75
capsule_pool_namespace_usage_percentage{pool="solar-compute",resource="requests.cpu",target_namespace="solar-test"} 25
capsule_pool_namespace_usage_percentage{pool="solar-compute",resource="requests.memory",target_namespace="solar-prod"} 43.75
capsule_pool_namespace_usage_percentage{pool="solar-size",resource="pods",target_namespace="solar-test"} 42.857142857142854

# HELP capsule_pool_resource Type of resource being used in a resource pool
# TYPE capsule_pool_resource gauge
capsule_pool_resource{pool="solar-compute",resource="limits.cpu"} 1
capsule_pool_resource{pool="solar-compute",resource="limits.memory"} 1
capsule_pool_resource{pool="solar-compute",resource="requests.cpu"} 1
capsule_pool_resource{pool="solar-compute",resource="requests.memory"} 1
capsule_pool_resource{pool="solar-size",resource="pods"} 1

# HELP capsule_pool_usage Current resource usage for a given resource in a resource pool
# TYPE capsule_pool_usage gauge
capsule_pool_usage{pool="solar-compute",resource="limits.cpu"} 0.875
capsule_pool_usage{pool="solar-compute",resource="limits.memory"} 9.39524096e+08
capsule_pool_usage{pool="solar-compute",resource="requests.cpu"} 1.875
capsule_pool_usage{pool="solar-compute",resource="requests.memory"} 9.39524096e+08
capsule_pool_usage{pool="solar-size",resource="pods"} 3

# HELP capsule_pool_usage_percentage Current resource usage for a given resource in a resource pool (percentage)
# TYPE capsule_pool_usage_percentage gauge
capsule_pool_usage_percentage{pool="solar-compute",resource="limits.cpu"} 43.75
capsule_pool_usage_percentage{pool="solar-compute",resource="limits.memory"} 43.75
capsule_pool_usage_percentage{pool="solar-compute",resource="requests.cpu"} 93.75
capsule_pool_usage_percentage{pool="solar-compute",resource="requests.memory"} 43.75
capsule_pool_usage_percentage{pool="solar-size",resource="pods"} 42.857142857142854
```
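The samples above are in the standard Prometheus text exposition format. As an illustration (not part of Capsule — a real scraper would use a client library), a minimal pure-Python sketch that parses such output, e.g. to cross-check a percentage metric against the raw gauges:

```python
import re

# Matches one sample line, e.g.:
#   capsule_pool_usage{pool="solar-compute",resource="limits.cpu"} 0.875
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_metrics(text):
    """Return a list of (metric name, label dict, float value) samples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blanks and HELP/TYPE comments
        m = LINE_RE.match(line)
        if m:
            labels = dict(LABEL_RE.findall(m.group('labels') or ''))
            samples.append((m.group('name'), labels, float(m.group('value'))))
    return samples

# A small excerpt of the exposition shown above
sample = '''
# HELP capsule_pool_usage Current resource usage for a given resource in a resource pool
# TYPE capsule_pool_usage gauge
capsule_pool_usage{pool="solar-compute",resource="limits.cpu"} 0.875
capsule_pool_limit{pool="solar-compute",resource="limits.cpu"} 2
'''

by_key = {(n, l.get('pool'), l.get('resource')): v for n, l, v in parse_metrics(sample)}
pct = 100 * by_key[('capsule_pool_usage', 'solar-compute', 'limits.cpu')] \
          / by_key[('capsule_pool_limit', 'solar-compute', 'limits.cpu')]
print(f"limits.cpu usage: {pct:.2f}%")  # 0.875 / 2 -> 43.75%
```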


## Quotas

Instrumentation for [Quotas](../tenants/quotas/).

### Metrics

The following metrics are exposed and can be used for monitoring:

```shell
# HELP capsule_tenant_resource_limit Current resource limit for a given resource in a tenant
# TYPE capsule_tenant_resource_limit gauge
capsule_tenant_resource_limit{resource="limits.cpu",resourcequotaindex="0",tenant="solar"} 2
capsule_tenant_resource_limit{resource="limits.memory",resourcequotaindex="0",tenant="solar"} 2.147483648e+09
capsule_tenant_resource_limit{resource="pods",resourcequotaindex="1",tenant="solar"} 7
capsule_tenant_resource_limit{resource="requests.cpu",resourcequotaindex="0",tenant="solar"} 2
capsule_tenant_resource_limit{resource="requests.memory",resourcequotaindex="0",tenant="solar"} 2.147483648e+09

# HELP capsule_tenant_resource_usage Current resource usage for a given resource in a tenant
# TYPE capsule_tenant_resource_usage gauge
capsule_tenant_resource_usage{resource="limits.cpu",resourcequotaindex="0",tenant="solar"} 0
capsule_tenant_resource_usage{resource="limits.memory",resourcequotaindex="0",tenant="solar"} 0
capsule_tenant_resource_usage{resource="namespaces",resourcequotaindex="",tenant="solar"} 2
capsule_tenant_resource_usage{resource="pods",resourcequotaindex="1",tenant="solar"} 0
capsule_tenant_resource_usage{resource="requests.cpu",resourcequotaindex="0",tenant="solar"} 0
capsule_tenant_resource_usage{resource="requests.memory",resourcequotaindex="0",tenant="solar"} 0
```
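The alerting pattern shown for ResourcePools also applies to tenant quotas. A sketch assuming the metric names above — there is no pre-computed percentage metric here, so the ratio is derived from the two gauges (both carry the `tenant` and `resourcequotaindex` labels, so the division matches sample-by-sample):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: tenant-quota-usage-alert
spec:
  groups:
    - name: capsule-tenant-usage.rules
      rules:
        - alert: CapsuleTenantQuotaHighUsage
          expr: |
            100 * capsule_tenant_resource_usage / capsule_tenant_resource_limit > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: High quota usage in tenant
            description: |
              Resource {{ $labels.resource }} in tenant {{ $labels.tenant }} is above 90% of its quota.
```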

## Custom Metrics

You can gather additional information based on the status of the tenants. These can be scraped via [Kube-State-Metrics CustomResourceState Metrics](https://github.com/kubernetes/kube-state-metrics/blob/main/docs/customresourcestate-metrics.md), which lets you create custom metrics from the tenant status.
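A minimal sketch of such a kube-state-metrics `CustomResourceState` configuration — the group/version and the `status.size` field path are assumptions, check the Tenant CRD of your Capsule version for the exact paths:

```yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        # assumed API group/version of the Tenant CRD
        group: capsule.clastix.io
        version: v1beta2
        kind: Tenant
      metrics:
        - name: tenant_size
          help: Number of namespaces belonging to the tenant
          each:
            type: Gauge
            gauge:
              # assumed status field; verify against your CRD
              path: [status, size]
```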

2 changes: 1 addition & 1 deletion content/en/docs/overview/_index.md
@@ -10,7 +10,7 @@ With capsule you have an ecosystem which addresses the challenges when it comes

<br>

-![capsule-workflow](/images/content/tenants.gif)
+![capsule-workflow](/images/content/capsule-architecture.drawio.png)

As shown, we can create a new boundary between Kubernetes (Cluster) Administrators and Tenant Audiences. While the Kubernetes Administrators define the boundaries of a Tenant, the Tenant Audience acts within the namespaces of that Tenant. Within the Tenant Audience we differentiate between **Tenant Owners** and **Tenant Users**. The main perk Tenant Owners have is the creation of namespaces within the tenants they own. This enables them to act within the tenant, achieving a shift left: responsibility moves from the Kubernetes Administrator to the Tenant Owners.

4 changes: 2 additions & 2 deletions content/en/docs/overview/architecture.md
@@ -28,7 +28,7 @@ If your cluster architecture prevents any of these capabilities, or if certain a

Strong tenant isolation ensures that any noisy-neighbor effects remain confined within individual tenants (tenant responsibility). This approach may involve higher administrative overhead and costs compared to shared compute. It also provides enhanced security by dedicating nodes to a single customer or application. It is recommended, at a minimum, to separate the cluster’s operator workload from customer workloads.

-![Dedicated Nodepool](/images/content/node-schedule-dedicated.gif)
+![Dedicated Nodepool](/images/content/scheduling-dedicated.drawio.png)

### Shared

@@ -41,7 +41,7 @@ With this approach you share the nodes amongst all Tenants, therefore giving you
- ❌ Not ideal for applications that are not cloud-native ready, as they may adversely affect the operation of other applications or the maintenance of node pools.
- ❌ Not ideal if strong isolation is required

-![Shared Nodepool](/images/content/node-schedule-shared.gif)
+![Shared Nodepool](/images/content/scheduling-shared.drawio.png)

There are some further aspects you must consider with shared approaches:
