why we are not using prometheus/grafana

priyeshkaratha · priyeshkaratha · commit 937bac611c55 · 2025-09-14T13:07:07.000+05:30
diff --git a/hadoop-hdds/docs/content/design/storage-distribution.md b/hadoop-hdds/docs/content/design/storage-distribution.md
@@ -69,11 +69,16 @@ Summarized cluster statistics:
 ## Approach 1: Recon-based Implementation
 
 Leverage the existing Recon service to build the dashboard with centralized and efficient data collection.
+Recon currently maintains synchronization with the OM database and constructs the NSSummary tree, providing established calculation logic for metrics such as openKeysBytes and committedBytes. 
+Additionally, Recon already provides a comprehensive breakdown of physical and logical capacity through its OM DB insights component.
+These existing capabilities can be effectively leveraged to minimize development effort and ensure consistency. 
+While certain enhancements to OM are required regardless of the chosen implementation approach—whether CLI-based or Prometheus-driven—the foundational data processing infrastructure is already in place.
+The modifications outlined for OM, SCM, and DataNode components remain mandatory across all proposed approaches to ensure complete and accurate storage distribution reporting.
 
 ### Benefits
 
-- **Unified Data Source**: All metrics aggregated centrally in Recon
-- **Performance Optimization**: Incremental sync reduces load
+- **Unified Data Source**: All metrics aggregated centrally in Recon. 
+- **Performance Optimization**: Incremental sync reduces load
 - **Reduced Overhead**: Avoids redundant calculations across services
 - **Code Reusability**: Built on top of existing Recon infrastructure and endpoints
 
@@ -90,10 +95,10 @@ Leverage the existing Recon service to build the dashboard with centralized and
 
 #### **Storage Container Manager (SCM)**
 
-- **Current Gap**: No block size tracking in block deletion process
+- **Current Gap**: No block size tracking in the block deletion process
 - **Enhancement**:
-  - Track block sizes when OM issues deletion request
-  - Send deletion command to DN along with block size and replication factor
+  - Track block sizes when OM issues a deletion request
+  - Send a deletion command to DN along with block size and replication factor
 
   ```
   OM → SCM: block deletion request + block size  
@@ -107,7 +112,6 @@ Leverage the existing Recon service to build the dashboard with centralized and
 
 - **Enhancement**:
   - Compute block sizes during deletion
-  - Extend Recon sync process to extract logical metrics
 - **Responsibilities**:
   - Expose logical storage metrics: committed keys, open keys, namespace usage
 
@@ -124,8 +128,36 @@ Leverage the existing Recon service to build the dashboard with centralized and
   - DN BlockDeletingService metrics.
 
 ---
+## Approach 2: Prometheus + Grafana Implementation (Not Recommended)
 
-## Approach 2: CLI-based (Not Proceeding)
+### Overview
+
+This approach would involve publishing storage distribution metrics directly from individual components (OM, SCM, DataNodes) to Prometheus, with visualization handled entirely through Grafana dashboards.
+
+### Why This Approach Is Not Recommended
+
+#### **1. Customer Adoption and User Experience**
+- **Current Reality**: Customers are already actively using Recon for storage analysis and monitoring
+- **Existing Feedback**: Users have specifically identified gaps in Recon's current calculations and requested improvements within the existing interface
+- **User Workflow Disruption**: Introducing a completely separate monitoring stack would fragment the user experience
+- **Training and Adoption Overhead**: Teams would need to learn new tools and workflows, creating adoption barriers
+
+#### **2. Incomplete Current State**
+The primary driver for this enhancement is that **customers have identified that Recon's existing calculations are incomplete or incorrect**. Key issues include:
+- Inconsistent storage usage calculations across different views
+- Missing pending deletion visibility at granular levels
+- Lack of real-time correlation between logical and physical storage metrics
+- Incomplete breakdown of storage distribution across cluster components
+
+Moving to Prometheus/Grafana would not address these calculation issues. It would simply relocate them to a different platform while requiring significant additional implementation effort.
+
+#### **3. Recon's Existing Infrastructure Advantages**
+- **Data Access**: Recon already has optimized access to OM DB, SCM metadata, and DN reports
+- **Calculation Engine**: Existing framework for cross-component metric aggregation and correlation
+- **Web Interface**: Established UI framework for complex data visualization and drill-down capabilities
+- **User Base**: Active user community familiar with Recon's interface and capabilities
+
+## Approach 3: CLI-based (Not Proceeding)
 
 A CLI-based approach was evaluated to compute detailed usage and pending deletion breakdown by analyzing offline OM and SCM database checkpoints and querying DataNodes. 
 While it offers precise, up-to-date results and independence from Recon, it introduces significant operational overhead.
@@ -137,12 +169,7 @@ Given its complexity, dependency on manual execution, and high resource consumpt
 
 # Summary
 
-The proposed dashboard enhances visibility into cluster storage dynamics, enabling better debugging and decision-making. Recon is the ideal location for this feature due to its existing role as the observability hub in Ozone.
-
-This enhancement lays the foundation for future innovations like:
-
-- Storage heatmaps
-- Auto-balancing recommendations
-- UI-based debugging for deletion backlogs
+The proposed dashboard improves visibility into cluster storage dynamics, providing deeper insights for effective debugging and informed decision-making. 
+Recon is the ideal place to host this feature, given its established role as the central storage overview in Ozone.
 
 ---
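
A minimal sketch of the SCM enhancement described in the diff above (tracking block sizes when OM issues a deletion request), offered for illustration only and not part of this commit. The names `DeletionRequest` and `PendingDeletionTracker` are hypothetical and do not come from Ozone's code. The sketch also assumes plain replication, where physical size is block size multiplied by replication factor; erasure-coded data would need a different formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeletionRequest:
    """Hypothetical OM -> SCM deletion request that carries size information."""
    block_id: str
    block_size_bytes: int
    replication_factor: int


class PendingDeletionTracker:
    """Hypothetical SCM-side accumulator of bytes awaiting deletion."""

    def __init__(self) -> None:
        self.logical_bytes = 0
        self.physical_bytes = 0

    def on_request(self, req: DeletionRequest) -> None:
        # Logical size counts the block once; physical size counts every replica.
        # Assumption: plain replication (physical = size * factor), not erasure coding.
        self.logical_bytes += req.block_size_bytes
        self.physical_bytes += req.block_size_bytes * req.replication_factor


tracker = PendingDeletionTracker()
tracker.on_request(DeletionRequest("blk-1", 256, 3))
tracker.on_request(DeletionRequest("blk-2", 100, 1))
```

Carrying the size on the request itself is what allows the pending-deletion totals to be reported without looking the block up again after OM has already dropped the key.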