Commit 937bac6

why we are not using prometheus/grafana
1 parent 0275b54 commit 937bac6

1 file changed

Lines changed: 41 additions & 14 deletions

hadoop-hdds/docs/content/design/storage-distribution.md

@@ -69,11 +69,16 @@ Summarized cluster statistics:
 ## Approach 1: Recon-based Implementation

 Leverage the existing Recon service to build the dashboard with centralized and efficient data collection.
+Recon currently maintains synchronization with the OM database and constructs the NSSummary tree, providing established calculation logic for metrics such as openKeysBytes and committedBytes.
+Additionally, Recon already possesses a comprehensive physical and logical capacity to break down information through its OM DB insights component.
+These existing capabilities can be effectively leveraged to minimize development effort and ensure consistency.
+While certain enhancements to OM are required regardless of the chosen implementation approach—whether CLI-based or Prometheus-driven—the foundational data processing infrastructure is already in place.
+The modifications outlined for OM, SCM, and DataNode components remain mandatory across all proposed approaches to ensure complete and accurate storage distribution reporting.

 ### Benefits

-- **Unified Data Source**: All metrics aggregated centrally in Recon
-- **Performance Optimization**: Incremental sync reduces load
+- **Unified Data Source**: All metrics aggregated centrally in Recon.
+- **Performance Optimization**: Incremental sync reduces a load
 - **Reduced Overhead**: Avoids redundant calculations across services
 - **Code Reusability**: Built on top of existing Recon infrastructure and endpoints

@@ -90,10 +95,10 @@ Leverage the existing Recon service to build the dashboard with centralized and

 #### **Storage Container Manager (SCM)**

-- **Current Gap**: No block size tracking in block deletion process
+- **Current Gap**: No block size tracking in a block deletion process
 - **Enhancement**:
-- Track block sizes when OM issues deletion request
-- Send deletion command to DN along with block size and replication factor
+- Track block sizes when OM issues a deletion request
+- Send a deletion command to DN along with block size and replication factor

 ```
 OM → SCM: block deletion request + block size
@@ -107,7 +112,6 @@ Leverage the existing Recon service to build the dashboard with centralized and

 - **Enhancement**:
 - Compute block sizes during deletion
-- Extend Recon sync process to extract logical metrics
 - **Responsibilities**:
 - Expose logical storage metrics: committed keys, open keys, namespace usage

@@ -124,8 +128,36 @@ Leverage the existing Recon service to build the dashboard with centralized and
 - DN BlockDeletingService metrics.

 ---
+## Approach 2: Prometheus + Grafana Implementation (Not Recommended)

-## Approach 2: CLI-based (Not Proceeding)
+### Overview
+
+This approach would involve publishing storage distribution metrics directly from individual components (OM, SCM, DataNodes) to Prometheus, with visualization handled entirely through Grafana dashboards.
+
+### Why This Approach Is Not Recommended
+
+#### **1. Customer Adoption and User Experience**
+- **Current Reality**: Customers are already actively using Recon for storage analysis and monitoring
+- **Existing Feedback**: Users have specifically identified gaps in Recon's current calculations and requested improvements within the existing interface
+- **User Workflow Disruption**: Introducing a completely separate monitoring stack would fragment the user experience
+- **Training and Adoption Overhead**: Teams would need to learn new tools and workflows, creating adoption barriers
+
+#### **2. Incomplete Current State**
+The primary driver for this enhancement is that **customers have identified that Recon's existing calculations are incomplete or incorrect**. Key issues include:
+- Inconsistent storage usage calculations across different views
+- Missing pending deletion visibility at granular levels
+- Lack of real-time correlation between logical and physical storage metrics
+- Incomplete breakdown of storage distribution across cluster components
+
+Moving to Prometheus/Grafana would not address these calculation issues. It would simply relocate them to a different platform while requiring significant additional implementation effort.
+
+#### **3. Recon's Existing Infrastructure Advantages**
+- **Data Access**: Recon already has optimized access to OM DB, SCM metadata, and DN reports
+- **Calculation Engine**: Existing framework for cross-component metric aggregation and correlation
+- **Web Interface**: Established a UI framework for complex data visualization and drill-down capabilities
+- **User Base**: Active user community familiar with Recon's interface and capabilities
+
+## Approach 3: CLI-based (Not Proceeding)

 A CLI-based approach was evaluated to compute detailed usage and pending deletion breakdown by analyzing offline OM and SCM database checkpoints and querying DataNodes.
 While it offers precise, up-to-date results and independence from Recon, it introduces significant operational overhead.
@@ -137,12 +169,7 @@ Given its complexity, dependency on manual execution, and high resource consumpt

 # Summary

-The proposed dashboard enhances visibility into cluster storage dynamics, enabling better debugging and decision-making. Recon is the ideal location for this feature due to its existing role as the observability hub in Ozone.
-
-This enhancement lays the foundation for future innovations like:
-
-- Storage heatmaps
-- Auto-balancing recommendations
-- UI-based debugging for deletion backlogs
+The proposed dashboard improves visibility into cluster storage dynamics, providing deeper insights for effective debugging and informed decision-making.
+Recon is the ideal place to host this feature, given its established role as the central storage overview in Ozone.

 ---
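
Note on the SCM hunk: the enhancement described there (track block sizes when OM issues a deletion request, then send the deletion command to the DN along with block size and replication factor) can be illustrated with a minimal Python sketch. Everything in it is hypothetical and not taken from Ozone: the class `PendingDeletionTracker`, its method names, the acknowledgement step, and the assumption that physical bytes equal block size times replication factor. Erasure-coded data would not follow that simple multiplication.

```python
from dataclasses import dataclass


@dataclass
class PendingDeletionTracker:
    """Hypothetical SCM-side tally of bytes awaiting deletion (not Ozone code)."""

    logical_bytes: int = 0   # sum of block sizes reported by OM
    physical_bytes: int = 0  # assumed: block size * replication factor

    def on_deletion_request(self, block_size: int, replication_factor: int) -> dict:
        """Record an OM deletion request and build the command sent to the DN."""
        if block_size < 0 or replication_factor < 1:
            raise ValueError("block_size must be >= 0 and replication_factor >= 1")
        self.logical_bytes += block_size
        self.physical_bytes += block_size * replication_factor
        # The command carries the size and replication factor, as the diff describes.
        return {"block_size": block_size, "replication_factor": replication_factor}

    def on_deletion_complete(self, block_size: int, replication_factor: int) -> None:
        """Subtract a block once its deletion is confirmed (assumed acknowledgement step)."""
        self.logical_bytes -= block_size
        self.physical_bytes -= block_size * replication_factor


tracker = PendingDeletionTracker()
command = tracker.on_deletion_request(block_size=256, replication_factor=3)
```

Keeping logical and physical totals side by side is one way a dashboard could correlate the two views, which is among the gaps the diff lists for Recon's current calculations.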
