Commit 7077693

gorakhnathy7 and Gorakh Nath Yadav authored
fix: revamped remaining top traffic pages (#370)
* changes
* changed default global retention

---------

Co-authored-by: Gorakh Nath Yadav <gnyadav@openobserve.ai>
1 parent 26e7015 commit 7077693

18 files changed

Lines changed: 585 additions & 380 deletions

docs/administration/maintenance/storage-management/storage.md

Lines changed: 34 additions & 24 deletions
@@ -1,7 +1,10 @@
11
---
2+
title: Storage Management | OpenObserve
23
description: >-
3-
Learn how OpenObserve stores ingested stream data and the metadata for ingested date using disk, SQLite, Postgres, or S3-compatible object storage.
4+
Learn how OpenObserve stores ingested stream data and the metadata for ingested data using disk, SQLite, Postgres, or S3-compatible object storage.
45
---
6+
# Storage
7+
58
This guide explains how to configure data and metadata storage in OpenObserve. The information applies to both the open-source and enterprise versions.
69

710
## Overview
@@ -13,22 +16,21 @@ There are 2 primary items that need to be stored in OpenObserve.
1316
By default:
1417

1518
- Metadata is always stored on disk using **SQLite** in **Local mode**.
16-
- Metadata is always stored on disk using **postgres** in **Cluster mode**.
17-
- Stream data can be stored on disk or object storage such as Amazon S3, minIO, Google GCS, Alibaba OSS, or Tencent COS.
19+
- Metadata is always stored on disk using **PostgreSQL** in **Cluster mode**.
20+
- Stream data can be stored on disk or object storage such as Amazon S3, MinIO, Google GCS, Alibaba OSS, or Tencent COS.
1821

1922
## Storage Modes
2023

2124
- OpenObserve runs in **Local mode** by default.
22-
- To enable **Cluster mode**, set the environment variable `LOCAL_MODE=false`.
25+
- To enable **Cluster mode**, set the environment variable `ZO_LOCAL_MODE=false`.
2326
- In **Local mode**, stream data can be stored in S3 by setting `ZO_LOCAL_MODE_STORAGE=s3`.
24-
- GCS and OSS support the S3 SDK and can be treated as S3-compatible storages. Azure Blob storage is also supported.
25-
26-
## Data Storage Format
27+
- GCS and OSS support the S3 SDK and can be treated as S3-compatible storages.
28+
- Azure Blob storage is supported via `ZO_S3_PROVIDER=azure`.
2729

28-
Stream data is stored in Parquet format. Parquet is columnar storage format optimized for storage efficiency and query performance.
30+
### Data Storage Format
2931

32+
Stream data is stored in **Parquet** format, a columnar storage format optimized for storage efficiency and query performance.
3033
## Stream Data Storage Options
31-
3234
### Disk
3335

3436
Disk is default storage place for stream data. **Ensure that sufficient disk space is available for storing stream data.**
@@ -61,7 +63,7 @@ Then set the following environment variables:
6163
| ZO_S3_PROVIDER | minio | ... |
6264

6365

64-
### Openstack Swift
66+
### OpenStack Swift
6567
To use OpenStack Swift for storing stream data, first create the bucket in Swift.
6668
Then set the following environment variables:
6769

@@ -72,15 +74,15 @@ Then set the following environment variables:
7274
| ZO_S3_ACCESS_KEY | - | Access key |
7375
| ZO_S3_SECRET_KEY | - | Secret key |
7476
| ZO_S3_BUCKET_NAME | - | Bucket name |
75-
| ZO_S3_FEATURE_HTTP1_ONLY | true | Enables compatibility with Swift |
77+
| ZO_S3_FEATURE_HTTP1_ONLY | true | Enables compatibility with Swift |
7678
| ZO_S3_PROVIDER | s3 | Enables S3-compatible API |
7779
| AWS_EC2_METADATA_DISABLED | true | Disables EC2 metadata access, which is not supported by Swift |
7880

7981

8082
### Google GCS
8183
To use GCS for storing stream data, first create the bucket in GCS.
8284

83-
**Using the S3-compatible API:**
85+
#### Using the S3-compatible API
8486

8587
| Environment Variable | Value | Description |
8688
| ------------------------ | -------| --------------------------------------------------------------- |
@@ -94,7 +96,7 @@ To use GCS for storing stream data, first create the bucket in GCS.
9496

9597
Refer to [GCS AWS migration documentation](https://cloud.google.com/storage/docs/aws-simple-migration) for more information.
9698

97-
**Using GCS directly:**
99+
#### Using GCS directly
98100

99101
| Environment Variable | Value | Description |
100102
| ------------------------ | -------| ----------------------------------------------------------------------- |
@@ -106,7 +108,7 @@ Refer to [GCS AWS migration documentation](https://cloud.google.com/storage/docs
106108

107109
OpenObserve uses the [object_store crate](https://docs.rs/object_store/0.10.1/object_store/gcp/struct.GoogleCloudStorageBuilder.html) to initialize the storage configuration. It calls the with_env() function by default. If the ZO_S3_ACCESS_KEY variable is set, OpenObserve additionally uses the with_service_account_path() function to load the GCP service account key.
108110

109-
### Alibaba OSS (aliyun)
111+
### Alibaba OSS (Aliyun)
110112
To use Alibaba OSS for storing stream data, first create the bucket in Alibaba Cloud.
111113
Then set the following environment variables:
112114

@@ -164,15 +166,15 @@ Refer to [Baidu BOS documentation](https://cloud.baidu.com/doc/BOS/s/xjwvyq9l4).
164166

165167
### Azure Blob
166168

167-
OpenObserve can use azure blob for storing stream data. Following environment variables needs to be setup:
169+
OpenObserve can use Azure Blob for storing stream data. The following environment variables need to be set:
168170

169171
| Environment Variable | Value | Description |
170172
| -------------------------- | -------------------- | -------------------------------------------- |
171-
| ZO_S3_PROVIDER | azure | Enables Azure Blob storage support |
173+
| ZO_S3_PROVIDER | azure | Enables Azure Blob storage support |
172174
| ZO_LOCAL_MODE_STORAGE | s3 | Required only if running in single node mode |
173-
| AZURE_STORAGE_ACCOUNT_NAME | Storage account name | Need to provide mandatorily |
174-
| AZURE_STORAGE_ACCOUNT_KEY | Access key | Need to provide mandatorily |
175-
| ZO_S3_BUCKET_NAME | Blob Container name | Need to provide mandatorily |
175+
| AZURE_STORAGE_ACCOUNT_NAME | Storage account name | Required |
176+
| AZURE_STORAGE_ACCOUNT_KEY | Access key | Required |
177+
| ZO_S3_BUCKET_NAME | Blob Container name | Required |
176178

177179
### Hetzner Cloud Object Storage
178180

@@ -215,17 +217,25 @@ OpenObserve supports multiple metadata store backends, configurable using the `Z
215217
### PostgreSQL
216218
- Set `ZO_META_STORE=postgres`.
217219
- Recommended for production deployments due to reliability and scalability.
218-
- The default Helm chart (after February 23, 2024) uses [cloudnative-pg](https://cloudnative-pg.io/) to create a postgres cluster (primary + replica) which is used as the meta store. These instances provide high availability and backup support.
220+
- The default Helm chart (after February 23, 2024) uses [cloudnative-pg](https://cloudnative-pg.io/) to create a PostgreSQL cluster (primary + replica) which is used as the meta store. These instances provide high availability and backup support.
219221

220222
### etcd (Removed)
221223

222224
!!! warning "Removal notice"
223-
Etcd support has been removed. Use NATS instead.
224-
225-
- Set `ZO_META_STORE=etcd`.
226-
- While etcd is used as the cluster coordinator, it was also the default metadata store in Helm charts released before 23 February 2024. This configuration is now deprecated. Helm charts released after 23 February 2024 use PostgreSQL as the default metadata store.
225+
Etcd support has been removed. Use NATS as the cluster coordinator and PostgreSQL (or MySQL) as the metadata store. Helm charts released after 23 February 2024 already use PostgreSQL by default.
227226

228227
### MySQL (Deprecated)
229228
- Set `ZO_META_STORE=mysql`.
230229
- Deprecated.
231230
- Use PostgreSQL instead.
231+
232+
## Next steps
233+
234+
- [HA deployment](../../deployment/ha-deployment.md): configure object storage and metadata store in a production cluster.
235+
- [Environment variables](../../configuration/environment-variables.md): full reference for `ZO_S3_*` and `ZO_META_*` settings.
236+
- [Capacity planning](../../../enterprise-setup/capacity-planning.md): sizing storage, compute, and memory for each component.
237+
238+
**Need some help?**
239+
240+
- Join our [Community Slack](https://short.openobserve.ai/community)
241+
- Or [Contact support](https://openobserve.ai/contactus/)
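
Reviewer note on the Azure Blob table in this file: the required settings can be checked before starting a node. The following is a minimal sketch, not part of OpenObserve. The helper name `missing_azure_vars` is hypothetical, and the way `ZO_LOCAL_MODE` is parsed here (anything other than `false`, case-insensitive, counts as local mode) is an assumption. The variable names and required values come from the tables in this diff.

```python
# Hypothetical helper for illustration; not part of OpenObserve.
# Variable names are taken from the Azure Blob table in this diff.
AZURE_REQUIRED = (
    "AZURE_STORAGE_ACCOUNT_NAME",
    "AZURE_STORAGE_ACCOUNT_KEY",
    "ZO_S3_BUCKET_NAME",
)


def missing_azure_vars(env):
    """Return the names of Azure Blob settings that are unset or inconsistent."""
    problems = []
    if env.get("ZO_S3_PROVIDER") != "azure":
        problems.append("ZO_S3_PROVIDER")
    # Local mode is the default. Assumption: only the literal "false" disables it.
    local_mode = env.get("ZO_LOCAL_MODE", "true").lower() != "false"
    if local_mode and env.get("ZO_LOCAL_MODE_STORAGE") != "s3":
        problems.append("ZO_LOCAL_MODE_STORAGE")
    # Every setting marked "Required" in the table must be non-empty.
    problems.extend(name for name in AZURE_REQUIRED if not env.get(name))
    return problems
```

Calling `missing_azure_vars(dict(os.environ))` after `import os` returns an empty list when the environment matches the table, and otherwise lists the settings to fix.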

docs/enterprise-setup/performance.md

Lines changed: 30 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -45,13 +45,16 @@ If you have very high ingestion speed requirements (e.g. 100s of thousands of ev
4545

4646
OpenObserve does not do full-text indexing like Elasticsearch. This results in very high compression ratio of ingested data. Coupled with object storage this can give you ~140x lower storage cost. However, this also means that search performance for full text queries in absence of full-text indexes might suffer. However log data has some unique properties that can be leveraged to improve search performance significantly. OpenObserve uses following techniques to improve search performance:
4747

48-
### Column pruning
48+
### Column pruning
49+
4950
OpenObserve uses columnar storage format (parquet) which allows it to read only the columns that are required for a query. This reduces the amount of data that needs to be read from disk and improves search performance. This technique is called column pruning. It reduces the amount of data that needs to be read from disk. You must switch to SQL query mode for this and specify only the columns that you want to be returned.
5051

51-
### Predicate pushdown:
52+
### Predicate pushdown
5253

53-
#### Standard Partitioning (KeyValue partitions)
54-
>Note: Use For low cardinality fields
54+
#### Standard Partitioning (KeyValue partitions)
55+
56+
!!! note
57+
Use for low cardinality fields.
5558

5659
OpenObserve uses a technique called predicate pushdown to further reduce the amount of data that needs to be read from disk. This is done by pushing down the filters to the storage layer. By default OpenObserve will partition data by `org/stream/year/month/day/hour`. So when searching, if you know the time range for which you are searching for data you should specify it and OpenObserve will skip data not following in date range and will search across much less data. This will improve search performance and will utilize predicate pushdown. You can also enable additional partitioning for fields on any stream by going to stream settings. Some good candidates for partition keys are host and kubernetes namespace. You can have multiple partition keys for a stream. You can then specify partition keys in your query. e.g. `host='xyz' and kubernetes_namespace='abc'`. This will improve search performance and will utilize predicate pushdown.*** `DO NOT enable partitioning on all/many fields as it may result in many small underlying parquet files which will result in low compression, extremely poor search performance and high s3 storage costs` ***. As a rule of thumb you would want the size of each stored parquet file to be above 5 MB. Order of partitions does not matter. You can partition by `namespace, pod` or `pod, namespace`.
5760
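
Reviewer note on the partitioning paragraph above: the benefit of a narrow time range can be illustrated by listing which hourly partitions a query would need to touch. This is a minimal sketch under stated assumptions, not OpenObserve's implementation. The function name `hour_prefixes` is hypothetical, and the zero-padded path layout is assumed from the `org/stream/year/month/day/hour` scheme described in the text.

```python
from datetime import datetime, timedelta


def hour_prefixes(org, stream, start, end):
    """List the hourly partition prefixes that overlap the range [start, end)."""
    # Round the start down to the beginning of its hour.
    current = start.replace(minute=0, second=0, microsecond=0)
    prefixes = []
    while current < end:
        # Assumed layout: org/stream/YYYY/MM/DD/HH with zero padding.
        prefixes.append(f"{org}/{stream}/{current:%Y/%m/%d/%H}")
        current += timedelta(hours=1)
    return prefixes
```

A 90-minute window that crosses midnight maps to three prefixes, while a full-day search maps to 24. Everything outside the returned prefixes can be skipped without being read.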

@@ -116,12 +119,16 @@ For the above scenario, you can enable hash partitioning on namespace field with
116119
You can specify the number of buckets (8, 16, 32, 64, 128) in the index in stream setting when setting up hash partitioning for a particular field.
117120

118121
#### Time range partition
119-
>Note: Enabled by default and cannot be disabled
122+
123+
!!! note
124+
Enabled by default and cannot be disabled.
120125

121126
OpenObserve partitions all data by time range by default in addition to any other partitions that you may have defined. It always makes sense to specify the shortest time range to search for. e.g. if you know that you are looking for data for last 15 minutes, you should specify that in your query by selecting it from the top right corner. This will improve search performance and will utilize predicate pushdown.
122127

123-
### Bloom filter (available starting v0.8.0) (For high cardinality fields)
124-
>Note: Use For high cardinality fields
128+
### Bloom filter
129+
130+
!!! note
131+
Use for high cardinality fields. Available starting v0.8.0.
125132

126133
A bloom filter is a space efficient probabilistic data structure that allow you to check if a value exists in a set. It solves proverbial `needle in a haystack` problem. OpenObserve uses bloom filters to check if a value exists in a column. This allows OpenObserve to skip reading the data from disk if the value does not exist in the column. This improves search performance by reducing `search space`. You must specify bloom filter for the specific fields that you want to search. Fields that are well suited for bloom filter are of very high cardinality .e.g. UUID, request_id, trace_id, device_id, etc. You can specify bloom filter for a field by going to stream settings. You can specify multiple fields for bloom filter. e.g. `request_id` and `trace_id`. You can then use the fields in your query that will utilize bloom filter. e.g. `request_id='abc' and trace_id='xyz'`. Enabling bloom filter on a field with low cardinality will not result in any performance improvement.
127134
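
Reviewer note on the bloom filter paragraph above: the "skip reading if the value cannot be present" behavior can be shown with a toy filter. This is a minimal sketch of the general data structure, not the filter OpenObserve or Parquet actually uses. The class name, bit-array size, and SHA-256 based hashing are all assumptions chosen for readability.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions by salting the value with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means the value was definitely never added, so the file can be skipped.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(value)
        )
```

A query for `request_id='abc'` would consult one such filter per file and read only the files where `might_contain("abc")` returns `True`. On a low cardinality field almost every file contains every value, so the filter rarely rules anything out, which matches the paragraph's closing remark.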

@@ -133,7 +140,10 @@ Log search involves full text search. When you try to do a full text search it e
133140
1. Do not use `match_all` directly on full data set, but always use it in combination with one or more filters which can themselves be optimized by partitions or bloom filters. e.g. `host='host1' and match_all('error') ` or `k8s_namespace_name='ns1' and match_all('error')` or `bank_account_number='653456-54654-65' and match_all('error')`. In all of these examples using the filter reduces search space for `match_all`. Additionally if `host` and `k8s_namespace_name` fields are partitioned then you have reduced search space very well and will gain the improvements in full text search. `bank_account_number`, `request_id`, `trace_id`, `device_id` are good candidates for bloom filter and should be used together with `match_all` to improve full text search performance.
134141
1. Enable full text search only on the fields that you need. e.g. body, log, message etc. Fields like hostname, ip address, etc. are not good candidates for full text search and you should not enable full text search on these fields. You can enable full text search on a field by going to stream settings. You can specify multiple fields for full text search. e.g. `body` and `message`. You can then use the fields in your query that will utilize full text search. e.g. `host='host1' and match_all('error')`.
135142

136-
### Inverted Index (available starting v0.10.0)
143+
### Inverted Index
144+
145+
!!! note
146+
Available starting v0.10.0.
137147

138148
Above mentioned partitioning schemes and bloom filters are good for fields where you are doing equality based searches. e.g. `request_id='abc'`. For full text search in fields that contain longer log lines, OpenObserve in its earlier releases relied on brute force search (how [grep](https://www.gnu.org/software/grep/manual/grep.html) works) which works well for most of the scenarios. However, for very large data sets this can be slow. You can enable inverted index to improve full text search performance for such fields. Do not enable inverted index for fields that are not used for full text search but are used for equality based searches. Bloom filters and hash partitions are better suited for equality based searches.
139149

@@ -325,3 +335,15 @@ By enabling User-Defined Schemas (via `ZO_ALLOW_USER_DEFINED_SCHEMAS=true`), you
325335
326336
If you later need one of the fields from the `_raw` data to be searchable, simply add it to the UDS in the stream’s settings. After doing so, this field will become searchable going forward.
327337
338+
## Next steps
339+
340+
- [Capacity planning](capacity-planning.md): sizing CPU, memory, and storage for each component.
341+
- [HA deployment](../administration/deployment/ha-deployment.md): production-grade cluster setup.
342+
- [Architecture](../architecture.md): understand how the components interact.
343+
344+
**Need some help?**
345+
346+
- Join our [Community Slack](https://short.openobserve.ai/community)
347+
- Or [Contact support](https://openobserve.ai/contactus/)
348+
349+
