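Problem overview
While checking the current V4.4.0 branch of oceanbase/ocp-doc, we found that the monitoring documentation still contains many inconsistent metric names and suspicious metric references.
Here, a "suspicious metric" specifically means:
- The documentation explicitly lists the metric, or uses it directly in a monitoring or alert expression
- But in a real verification environment running OB 4.2.5 + OCP 4.3.x, the metric was not returned by the specified endpoints exposed by OCP Agent
- In other words: the metric is in the documentation but not visible at the specified endpoints
These problems affect users directly:
- When configuring monitoring or alerts from the documentation, queries for the metric may return nothing
- The Chinese and English documentation may lead to different conclusions
- Within the same repository, the "metric list" is inconsistent with the "metric reference / alert reference" pages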
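Verification environment
Environment:
- OceanBase: 4.2.5
- OCP: 4.3.x
Verification method:
- Observe the actual metric output at the monitoring endpoints exposed by OCP Agent / exporter
- Compare the metrics visible at those endpoints with the metric names in the current V4.4.0 documentation
Endpoints checked: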
:62889/metrics/ob/basic
:62889/metrics/ob/extra
:62889/metrics/node/host
:62889/metrics/node/ob
:62889/metrics/node/obproxy
:62889/metrics/obproxy
:62888/metrics/stat
:62889/metrics/stat
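The comparison described above can be sketched in Python. This is only a minimal illustration, not the tooling used for this issue. It assumes that the endpoints return the Prometheus text exposition format and are reachable over plain HTTP; the function names are hypothetical, and fetching (including any authentication the agent may require) is left out.

```python
import re

# Ports and paths copied from the endpoint list in this issue.
ENDPOINTS = [
    (62889, "/metrics/ob/basic"),
    (62889, "/metrics/ob/extra"),
    (62889, "/metrics/node/host"),
    (62889, "/metrics/node/ob"),
    (62889, "/metrics/node/obproxy"),
    (62889, "/metrics/obproxy"),
    (62888, "/metrics/stat"),
    (62889, "/metrics/stat"),
]

# A sample line starts with the metric name, optionally followed by {labels}.
_SAMPLE_NAME = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)")


def endpoint_urls(host):
    """Build the URLs to check for one agent host (http scheme is an assumption)."""
    return [f"http://{host}:{port}{path}" for port, path in ENDPOINTS]


def exposed_metric_names(exposition_text):
    """Collect metric names from exposition-format text, skipping # comment lines."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_NAME.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_from_endpoint(documented, exposed):
    """Return documented metric names that do not appear in the exposed set."""
    return sorted(set(documented) - set(exposed))
```

Merging the name sets from all eight endpoints before calling `missing_from_endpoint` matches the definition used here: a metric counts as suspicious only if none of the checked endpoints returns it. Because the comparison is on exact sample names, histogram-style suffixes would need separate handling.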
I. Partially fixed, but not fixed consistently across the repository
1. Inconsistent session metric names
The custom metric list in the repository already lists:
ob_session_active_num
ob_session_all_num
For example:
en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
zh-CN/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
But many reference pages still use the old names:
ob_active_session_num
ob_all_session_num
For example:
en-US/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/500.number-of-sessions.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/400.sessions.md
zh-CN/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/500.number-of-sessions.md
zh-CN/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/400.sessions.md
The alert pages are also inconsistent:
- The Chinese alert page has been changed to ob_session_active_num
- The English alert page still uses ob_active_session_num
2. SQL RT percentile metric names not unified
Some Chinese pages have already been changed to:
ob_query_rt_total_cumulative_count
For example:
zh-CN/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/200.response-time.md
But the corresponding English page, as well as several OBKV pages in both Chinese and English, still use the old name:
ob_query_rt_cumulative_count
For example:
en-US/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/200.query-response-time.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/200.response-time.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/500.obkv-table/200.obkv-table-response-time.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/600.obkv-hbase/200.obkv-hbase-response-time.md
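To find the remaining pages that still use the old names, a small scanner along these lines might help. It is a sketch only: the old-to-new mapping is taken from the names reported in this section, on the assumption that the names in the custom metric list are the intended ones, and `find_legacy_names` is a hypothetical helper.

```python
import re

# Old name -> name already used in the custom metric list / updated zh-CN pages,
# as reported in this issue. The direction of the rename is an assumption.
LEGACY_TO_CURRENT = {
    "ob_active_session_num": "ob_session_active_num",
    "ob_all_session_num": "ob_session_all_num",
    "ob_query_rt_cumulative_count": "ob_query_rt_total_cumulative_count",
}

# Match a legacy name only as a whole identifier, not as part of a longer one.
_LEGACY = re.compile(
    r"(?<![A-Za-z0-9_])("
    + "|".join(re.escape(name) for name in LEGACY_TO_CURRENT)
    + r")(?![A-Za-z0-9_])"
)


def find_legacy_names(page_text):
    """Return (line_number, legacy_name, suggested_name) for each hit in a page."""
    hits = []
    for lineno, line in enumerate(page_text.splitlines(), start=1):
        for match in _LEGACY.finditer(line):
            legacy = match.group(1)
            hits.append((lineno, legacy, LEGACY_TO_CURRENT[legacy]))
    return hits
```

Running this over each `.md` file under `en-US/` and `zh-CN/` would give a per-page list of lines to review. The output is meant for review rather than automatic replacement, since the correct name still has to be confirmed against a real endpoint.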
II. Suspicious metrics still explicitly present in the current documentation
In the lists below, "suspicious" does not mean that a metric is semantically unclear within the repository. It means:
In a real verification environment running OB 4.2.5 + OCP 4.3.x, the metrics named in the documentation were not returned by the endpoints listed above that OCP Agent exposes.
1. Host / partition / MemStore
The following metrics are still referenced in the current V4.4.0 documentation but were not returned by the endpoints in the verification environment above:
node_ntp_offset_seconds
partition_leader_absent_count
partition_replica_absent_count
ob_partition_frozen_memstore_count
Example locations:
en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
en-US/1900.reference-guide/100.alarm-reference/300.application-alert/2000.host_ntp_offset_too_large.md
en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/8100.ob_tenant_partition_leader_absent.md
en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/8000.ob_tenant_partition_replica_absent.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/300.storage-and-cache/800.frozen-memstore.md
2. Binlog
The following metrics are still referenced in the current V4.4.0 documentation but were not returned by the endpoints in the verification environment above:
binlog_instance_dump_delay
binlog_instance_dump_rps
binlog_instance_convert_delay
binlog_instance_cpu_used_ratio
binlog_instance_mem_used_ratio
Example locations:
en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/100.subscription-connection/100.binlog_dump_delay.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/100.subscription-connection/200.binlog_instance_dump_rps.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/200.binlog-performance-monitoring/100.binlog_delay.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/300.binlog-resource-monitoring/100.cpu.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/300.binlog-resource-monitoring/300.memory_ratio.md
3. OBProxy
The following metrics are still referenced in the current V4.4.0 documentation but were not returned by the endpoints in the verification environment above:
odp_sql_request_total
odp_current_session
odp_sql_cost_total
odp_entry_total
odp_request_byte
obproxy_client_max_connections
obproxy_server_max_connections
obproxy_memory_limit_bytes
Example locations:
en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/7000.obproxy_client_connections_usage_over_threshold.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/200.requests-per-second.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/300.client-connections.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/400.server-connections.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/500.average-response-time-for-each-sql-statement.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/600.obproxy_mem.md
en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/700.average-route-table-queries-per-second.md
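III. Inconsistencies that can be confirmed directly within the current repository
- Within the same repository, the custom metric list and the reference pages use different names for the same metric
- The Chinese and English pages have not been fixed to the same extent
- Naming is also inconsistent among pages of the same type
Typical examples:
ob_session_active_num vs ob_active_session_num
ob_query_rt_total_cumulative_count vs ob_query_rt_cumulative_count
odp_request_byte vs odp_request_byte_total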
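IV. Suggested priorities
- Unify the session metric names
- Unify the SQL RT percentile metric names
- Review all OBProxy metric pages
- Check whether the host / partition / MemStore / Binlog metrics are still actually exposed in the current version
- Complete a Chinese/English consistency check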
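The Chinese/English consistency check could start from a per-page comparison like the one below. This is a rough sketch under stated assumptions: the caller supplies the text of a paired en-US and zh-CN page plus a list of metric names to look for, and both function names are hypothetical.

```python
import re

# Identifier-like tokens; CJK text and full-width punctuation are not matched.
_TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def metric_mentions(page_text, known_names):
    """Return the subset of known_names that appears in the page as whole tokens."""
    return set(_TOKEN.findall(page_text)) & set(known_names)


def language_mismatch(en_text, zh_text, known_names):
    """Report metric names mentioned in only one language version of a page."""
    en = metric_mentions(en_text, known_names)
    zh = metric_mentions(zh_text, known_names)
    return {"en_only": sorted(en - zh), "zh_only": sorted(zh - en)}
```

A page pair whose result has both lists empty mentions the same known metric names in both languages. Any non-empty list marks a pair where one language was updated and the other was not.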
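Suggested attachments
We suggest attaching the following to the issue:
- Links to the pages in the current documentation that reference these metrics
- Actual sampling results or raw output excerpts from the endpoints above
- A list of replacement metric names confirmed to be actually visible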