Commit a8b819a

feat: add enterprise sizing knobs and sizing guide
- OpenSearch: persistence.size, OPENSEARCH_JAVA_OPTS, storageClass
- Data Prepper: number_of_shards/number_of_replicas on opensearch sinks
- Prometheus: server.retention (15d default), persistentVolume options
- README: sizing guide with storage formula, shard rules, quick-reference profiles (dev/small team/enterprise)
1 parent 74c4cf6 commit a8b819a

2 files changed

Lines changed: 78 additions & 1 deletion

charts/observability-stack/README.md

Lines changed: 55 additions & 0 deletions
@@ -102,6 +102,61 @@ Configured via `scrapeConfigs` in `values.yaml`. Default K8s scrape jobs are dis
102102

103103
> **Note:** Targets use the helm release name as prefix. The values in `values.yaml` are hardcoded to `obs-stack-*` — update them if you change the release name.
104104

105+
## Sizing Guide
106+
107+
The default values are tuned for development/demo (single-node OpenSearch, minimal resources). For production or enterprise-scale deployments, adjust the following knobs.
108+
109+
### OpenSearch Cluster
110+
111+
| Knob | Default | Production Guidance |
112+
|------|---------|---------------------|
113+
| `opensearch.replicas` | `1` | 3 or more data nodes for HA |
114+
| `opensearch.singleNode` | `true` | Set `false` for multi-node |
115+
| `opensearch.resources.requests.memory` | `2Gi` | 8–64Gi per node (JVM gets 50%) |
116+
| `opensearch.persistence.size` | `8Gi` | Size per formula below |
117+
| `opensearch.extraEnvs[OPENSEARCH_JAVA_OPTS]` | `-Xms1g -Xmx1g` | 50% of node RAM, max 31g |
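
The heap rule in the last row (50% of node RAM, capped at 31g) can be sketched as a small helper. This is an illustration only: `java_opts_for` is a hypothetical name that is not part of the chart, and it assumes the memory request is given in whole GiB.

```python
def java_opts_for(node_memory_gib: int, cap_gib: int = 31) -> str:
    """Return an OPENSEARCH_JAVA_OPTS value: half of node memory, capped at cap_gib."""
    if node_memory_gib < 2:
        raise ValueError("need at least 2 GiB of node memory for a 1g heap")
    heap_gib = min(node_memory_gib // 2, cap_gib)
    # Xms and Xmx are set to the same value, as in the chart default.
    return f"-Xms{heap_gib}g -Xmx{heap_gib}g"


print(java_opts_for(2))   # the default 2Gi memory request
print(java_opts_for(64))  # a large node, where the cap applies
```

With the default 2Gi request this yields `-Xms1g -Xmx1g`, which matches the default in the table; a 64Gi node is capped at 31g.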
118+
119+
**Storage formula:**
120+
```
121+
storage_per_node = (daily_ingest_GB × 1.45 × (replicas + 1) × retention_days) / node_count
122+
```
123+
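The 1.45 multiplier follows the AWS sizing guidance cited below: 10% indexing overhead, 5% of disk reserved by Linux, and 20% reserved by OpenSearch Service for segment merges and other internal operations, which gives 1.10 / 0.95 / 0.80 ≈ 1.45. In the formula, `replicas` is the index-level `number_of_replicas` (replica copies per primary shard), not the `opensearch.replicas` node count.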
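
As a worked example, the storage formula can be written as a short function. This is a minimal sketch: the function name and the sample workload (10 GB/day, 1 replica, 7 days, 3 nodes) are assumptions chosen for illustration, not values from the chart.

```python
def storage_per_node_gb(daily_ingest_gb, replicas, retention_days, node_count,
                        overhead=1.45):
    """Apply the storage formula; the result is GB of disk per data node."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Total cluster storage: every primary is stored (replicas + 1) times.
    total_gb = daily_ingest_gb * overhead * (replicas + 1) * retention_days
    return total_gb / node_count


# Hypothetical workload: 10 GB/day, 1 replica, 7 days retention, 3 data nodes.
per_node = storage_per_node_gb(10, replicas=1, retention_days=7, node_count=3)
print(f"{per_node:.1f} GB per node")
```

For this workload the formula gives roughly 68 GB per node, which is the value to compare against `opensearch.persistence.size`.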
124+
125+
**Shard sizing:**
126+
- Logs/traces (write-heavy): 30–50 GB per primary shard
127+
- Search (latency-sensitive): 10–30 GB per primary shard
128+
- Total shards should be a multiple of data node count
129+
- Max 25 shards per GB of JVM heap
130+
131+
Shard count is configurable per Data Prepper pipeline sink via `number_of_shards` and `number_of_replicas` (commented out in `values.yaml`).
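
The shard rules above can be combined into a rough calculation for `number_of_shards`. The sketch below is one possible reading of those rules, not a definitive method: the function names and the 300 GB example index are hypothetical.

```python
import math


def primary_shard_count(index_size_gb, target_shard_gb, data_nodes):
    """Primary shards needed to keep each shard at or under target_shard_gb,
    rounded up to a multiple of the data node count."""
    needed = max(1, math.ceil(index_size_gb / target_shard_gb))
    return math.ceil(needed / data_nodes) * data_nodes


def max_shards_for_heap(heap_gb, shards_per_gb=25):
    """Upper bound on shards per node from the 25-shards-per-GB-of-heap rule."""
    return heap_gb * shards_per_gb


# Hypothetical log index: 300 GB, 40 GB target shard size, 3 data nodes.
shards = primary_shard_count(300, 40, 3)
print(shards, "primary shards; heap limit per node:", max_shards_for_heap(4))
```

Here 300 GB at 40 GB per shard needs 8 primaries, which rounds up to 9 so that shards spread evenly across 3 nodes. Note that rounding up to the node count can push small indices below the recommended shard size range.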
132+
133+
### Data Prepper Pipeline Tuning
134+
135+
| Knob | Default | Description |
136+
|------|---------|-------------|
137+
| `data-prepper.pipelineConfig.config.otel-logs-pipeline.workers` | `5` | Parallel log processing threads |
138+
| `...opensearch.number_of_shards` | (OS default: 1) | Primary shards per index |
139+
| `...opensearch.number_of_replicas` | (OS default: 1) | Replica shards per primary |
140+
| `...opensearch.bulk_size` | `5` (MiB) | Bulk request size to OpenSearch |
141+
142+
### Prometheus
143+
144+
| Knob | Default | Description |
145+
|------|---------|-------------|
146+
| `prometheus.server.retention` | `15d` | How long metrics are kept |
147+
| `prometheus.server.persistentVolume.enabled` | `false` | Enable for production |
148+
| `prometheus.server.persistentVolume.size` | `8Gi` | Disk for metrics TSDB |
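
Retention and volume size are linked. The Prometheus storage documentation gives a rough estimate of required disk as retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample on average. A sketch of that estimate follows; the function name and the 10,000 samples/s ingest rate are made-up values for illustration.

```python
def prometheus_disk_bytes(retention_days, samples_per_second, bytes_per_sample=2):
    """Rough TSDB disk estimate: retention seconds x ingest rate x bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


# Default 15d retention with an assumed ingest rate of 10,000 samples/s.
needed = prometheus_disk_bytes(15, 10_000)
print(f"{needed / 1e9:.1f} GB")
```

Under these assumptions the default 15d retention needs about 26 GB, so the `persistentVolume.size` should be raised together with `retention`.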
149+
150+
### Quick Reference: Sizing Profiles
151+
152+
| Profile | OpenSearch Nodes | OpenSearch Memory (per node) | OpenSearch Disk (per node) | Prometheus Retention |
153+
|---------|----------|-----------|---------|---------------------|
154+
| **Dev/Demo** (default) | 1 | 2Gi | 8Gi | 15d |
155+
| **Small team** (~10 GB/day) | 3 | 8Gi | 100Gi | 30d |
156+
| **Enterprise** (~100 GB/day) | 6+ | 32Gi | 500Gi+ | 90d |
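
To sanity-check a profile, the storage formula can be inverted to estimate how many days of OpenSearch data the disk holds. This is a sketch under stated assumptions: disk is per node, each index has 1 replica, Gi is treated as roughly equal to GB, and the function name is hypothetical.

```python
def retention_days_supported(disk_per_node_gb, node_count, daily_ingest_gb,
                             replicas=1, overhead=1.45):
    """Invert the storage formula: days of data that fit in the cluster's disk."""
    total_disk_gb = disk_per_node_gb * node_count
    return total_disk_gb / (daily_ingest_gb * overhead * (replicas + 1))


# (disk per node, node count, daily ingest) taken from the profile table.
profiles = {"Small team": (100, 3, 10), "Enterprise": (500, 6, 100)}
for name, (disk, nodes, ingest) in profiles.items():
    days = retention_days_supported(disk, nodes, ingest)
    print(f"{name}: about {days:.1f} days")
```

Under these assumptions both profiles hold roughly 10 days of logs and traces. Longer OpenSearch retention needs more disk or more nodes; the Prometheus retention column applies to metrics only.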
157+
158+
Sources: [OpenSearch shard sizing](https://opensearch.org/blog/optimize-opensearch-index-shard-size/), [AWS sizing guide](https://docs.aws.amazon.com/prescriptive-guidance/latest/opensearch-service-migration/sizing.html), [AWS shard best practices](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/bp-sharding.html)
159+
105160
## Key Values
106161
107162
See `values.yaml` for all options. Notable settings:

charts/observability-stack/values.yaml

Lines changed: 23 additions & 1 deletion
@@ -2,6 +2,12 @@
22
# Mirrors the docker-compose setup for Kubernetes deployment
33

44
# -- OpenSearch
5+
# Sizing guide:
6+
# Storage per node: (daily_ingest_GB × 1.45 × (replicas + 1) × retention_days) / node_count
7+
# Shards: 30–50 GB per shard for logs/traces, 10–30 GB for search
8+
# JVM: 50% of node RAM, max ~31 GB (set via OPENSEARCH_JAVA_OPTS)
9+
# Nodes: minimum 3 for production, 1 for dev/demo
10+
# Heap-to-shard ratio: max 25 shards per GB of JVM heap
511
opensearch:
612
enabled: true
713
singleNode: true
@@ -13,9 +19,16 @@ opensearch:
1319
requests:
1420
memory: "2Gi"
1521
cpu: "500m"
22+
persistence:
23+
enabled: true
24+
size: 8Gi # Increase for production (e.g. 100Gi, 500Gi)
25+
# storageClass: "gp3" # Uncomment for AWS gp3 (better IOPS/$ than gp2)
1626
extraEnvs:
1727
- name: OPENSEARCH_INITIAL_ADMIN_PASSWORD
1828
value: "My_password_123!@#"
29+
# JVM heap — set to 50% of resources.requests.memory, max 31g
30+
- name: OPENSEARCH_JAVA_OPTS
31+
value: "-Xms1g -Xmx1g"
1932
config:
2033
opensearch.yml: |
2134
plugins.query.datasources.encryption.masterkey: "BTqK4Ytdz67La1kShIKV3Pu9"
@@ -135,6 +148,10 @@ data-prepper:
135148
password: "My_password_123!@#"
136149
insecure: true
137150
index_type: log-analytics-plain
151+
# Shard tuning — adjust for ingest volume:
152+
# 1 shard handles ~30-50 GB for logs. Scale shards with data node count.
153+
# number_of_shards: 1
154+
# number_of_replicas: 1
138155

139156
otel-traces-pipeline:
140157
delay: 100
@@ -160,6 +177,8 @@ data-prepper:
160177
password: "My_password_123!@#"
161178
insecure: true
162179
index_type: trace-analytics-plain-raw
180+
# number_of_shards: 1
181+
# number_of_replicas: 1
163182

164183
service-map-pipeline:
165184
delay: 100
@@ -277,8 +296,11 @@ opentelemetry-collector:
277296
prometheus:
278297
enabled: true
279298
server:
299+
# Retention — how long Prometheus keeps metrics. Increase for longer history.
300+
retention: "15d"
280301
persistentVolume:
281-
enabled: false
302+
enabled: false # Enable for production so metrics survive pod restarts
303+
# size: 50Gi
282304
extraFlags:
283305
- "web.enable-remote-write-receiver"
284306
- "web.enable-otlp-receiver"
