Some monitoring metrics in the OCP V4.4.0 documentation do not match the actual OCP Agent endpoints #85

@boathell

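Overview

While reviewing the current V4.4.0 branch of oceanbase/ocp-doc, I found that the monitoring documentation still contains many inconsistent metric names and suspicious metric references.

Here, a "suspicious metric" specifically means:

  • The documentation explicitly lists the metric, or uses it directly in a monitoring or alert expression
  • But in a real verification environment running OB 4.2.5 + OCP 4.3.x, the metric was not found on the specified endpoints exposed by OCP Agent
  • In other words: the metric is in the documentation but not visible on the specified endpoints

These problems directly affect users:

  • Queries may return no data when users configure monitoring or alerts based on the documentation
  • The Chinese and English documentation may give different conclusions
  • Within the same repository, the "metric list" is inconsistent with the "metric reference" and "alert reference" pages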

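Verification environment

Environment:

  • OceanBase: 4.2.5
  • OCP: 4.3.x

Verification method:

  • Observe the actual metric output on the monitoring endpoints exposed by OCP Agent / exporter
  • Compare the metrics visible on the endpoints with the metric names in the current V4.4.0 documentation

Endpoints checked: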

  • :62889/metrics/ob/basic
  • :62889/metrics/ob/extra
  • :62889/metrics/node/host
  • :62889/metrics/node/ob
  • :62889/metrics/node/obproxy
  • :62889/metrics/obproxy
  • :62888/metrics/stat
  • :62889/metrics/stat
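
The comparison described under "Verification method" can be sketched roughly as follows. This is a minimal sketch, assuming the endpoints return the Prometheus text exposition format (the issue does not state this explicitly). The function names `exposed_metric_names` and `missing_from_endpoints` are hypothetical, and `SAMPLE` is an invented fragment for illustration, not real output from OCP Agent.

```python
import re

# Matches the metric name at the start of a sample line in the
# Prometheus text exposition format, e.g. `name{label="x"} 1.0`.
_SAMPLE_RE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)")


def exposed_metric_names(exposition_text):
    """Return the set of metric names found in one endpoint's output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        match = _SAMPLE_RE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_from_endpoints(documented, exposition_texts):
    """Return documented metric names not seen in any endpoint output."""
    exposed = set()
    for text in exposition_texts:
        exposed |= exposed_metric_names(text)
    return sorted(set(documented) - exposed)


# Invented sample output (NOT captured from a real OCP Agent).
SAMPLE = '''\
# TYPE ob_session_active_num gauge
ob_session_active_num{tenant="t1"} 3
ob_session_all_num{tenant="t1"} 12
'''

if __name__ == "__main__":
    documented = ["ob_session_active_num", "ob_active_session_num"]
    print(missing_from_endpoints(documented, [SAMPLE]))
```

In practice the text passed in would be the saved output of each endpoint listed above, and `documented` would be the metric names collected from the V4.4.0 pages.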

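I. Partially fixed, but not fixed consistently across the repository

1. Inconsistent session metric names

The custom metric list in the repository already lists:

  • ob_session_active_num
  • ob_session_all_num

For example: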

  • en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
  • zh-CN/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md

But many reference pages still use the old names:

  • ob_active_session_num
  • ob_all_session_num

For example:

  • en-US/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/500.number-of-sessions.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/400.sessions.md
  • zh-CN/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/500.number-of-sessions.md
  • zh-CN/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/400.sessions.md

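The alert pages are also inconsistent:

  • The Chinese alert page has been changed to ob_session_active_num
  • The English alert page still uses ob_active_session_num

2. SQL RT percentile metric names are not unified

Some Chinese pages have already been changed to:

  • ob_query_rt_total_cumulative_count

For example: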

  • zh-CN/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/200.response-time.md

But the corresponding English page, as well as several OBKV pages in both Chinese and English, still use the old name:

  • ob_query_rt_cumulative_count

For example:

  • en-US/1900.reference-guide/300.monitoring-indicator-reference/300.ob-cluster/100.database-performance/200.query-response-time.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/100.performance-and-sql/200.response-time.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/500.obkv-table/200.obkv-table-response-time.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/600.obkv-hbase/200.obkv-hbase-response-time.md
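
To locate the remaining old names from items 1 and 2 above, a scan along the following lines may help. It is a rough sketch: the `RENAMES` mapping is taken from the names quoted in this issue, the helper `find_legacy_names` is hypothetical, and the sketch assumes the newer names are the intended ones.

```python
import re

# Old name -> newer name, as quoted in this issue.
RENAMES = {
    "ob_active_session_num": "ob_session_active_num",
    "ob_all_session_num": "ob_session_all_num",
    "ob_query_rt_cumulative_count": "ob_query_rt_total_cumulative_count",
}

# re.ASCII keeps \b ASCII-based, so a metric name directly adjacent to
# Chinese text (as in zh-CN pages) still matches as a whole word.
_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b", re.ASCII
)


def find_legacy_names(markdown_text):
    """Return (line_number, old_name, new_name) for each old name found."""
    hits = []
    for line_no, line in enumerate(markdown_text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            old = match.group(1)
            hits.append((line_no, old, RENAMES[old]))
    return hits
```

Running this over the text of every `.md` file under `en-US/` and `zh-CN/` (for example via `pathlib.Path.rglob("*.md")`) would give a per-file list of lines that still need updating.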

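II. Suspicious metrics still explicitly present in the current documentation

Here, "suspicious metric" does not mean that the semantics are unclear within the repository. It means:

In a real verification environment running OB 4.2.5 + OCP 4.3.x, the metrics listed in the documentation were not found on the above endpoints exposed by OCP Agent.

1. Host / partition / MemStore

The following metrics are still referenced in the current V4.4.0 documentation, but were not found on the endpoints in the verification environment above: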

  • node_ntp_offset_seconds
  • partition_leader_absent_count
  • partition_replica_absent_count
  • ob_partition_frozen_memstore_count

Example locations:

  • en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
  • en-US/1900.reference-guide/100.alarm-reference/300.application-alert/2000.host_ntp_offset_too_large.md
  • en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/8100.ob_tenant_partition_leader_absent.md
  • en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/8000.ob_tenant_partition_replica_absent.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/400.oceanbase-database-tenant/300.storage-and-cache/800.frozen-memstore.md

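2. Binlog

The following metrics are still referenced in the current V4.4.0 documentation, but were not found on the endpoints in the verification environment above: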

  • binlog_instance_dump_delay
  • binlog_instance_dump_rps
  • binlog_instance_convert_delay
  • binlog_instance_cpu_used_ratio
  • binlog_instance_mem_used_ratio

Example locations:

  • en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/100.subscription-connection/100.binlog_dump_delay.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/100.subscription-connection/200.binlog_instance_dump_rps.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/200.binlog-performance-monitoring/100.binlog_delay.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/300.binlog-resource-monitoring/100.cpu.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/450.binlog-service/300.binlog-resource-monitoring/300.memory_ratio.md

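3. OBProxy

The following metrics are still referenced in the current V4.4.0 documentation, but were not found on the endpoints in the verification environment above: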

  • odp_sql_request_total
  • odp_current_session
  • odp_sql_cost_total
  • odp_entry_total
  • odp_request_byte
  • obproxy_client_max_connections
  • obproxy_server_max_connections
  • obproxy_memory_limit_bytes

Example locations:

  • en-US/880.manage-performance-monitoring/200.manage-custom-monitoring/500.ocp-monitoring-indicator-items.md
  • en-US/1900.reference-guide/100.alarm-reference/200.ob-alert/7000.obproxy_client_connections_usage_over_threshold.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/200.requests-per-second.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/300.client-connections.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/400.server-connections.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/500.average-response-time-for-each-sql-statement.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/600.obproxy_mem.md
  • en-US/1900.reference-guide/300.monitoring-indicator-reference/500.obproxy-cluster/700.average-route-table-queries-per-second.md

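III. Inconsistencies that can be confirmed directly within the repository

  • Within the same repository, the custom metric list and the reference pages use different names for the same metric
  • The Chinese and English pages are out of sync in which fixes have been applied
  • Naming is also inconsistent within pages of the same type

Typical examples: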

  • ob_session_active_num vs ob_active_session_num
  • ob_query_rt_total_cumulative_count vs ob_query_rt_cumulative_count
  • odp_request_byte vs odp_request_byte_total

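IV. Suggested priorities

  • Unify the session metric names
  • Unify the SQL RT percentile metric names
  • Thoroughly review the OBProxy metric pages
  • Verify whether the host / partition / MemStore / Binlog metrics are still actually exposed in the current version
  • Run a full Chinese/English consistency check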
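
The Chinese/English consistency check could start with a heuristic like the one below. It is only a sketch: the prefix list in `_METRIC_RE` is an assumption based on the metric names quoted in this issue, and `metric_mentions` and `locale_diff` are hypothetical helper names.

```python
import re

# Heuristic: identifiers starting with a prefix seen in this issue.
# The prefix list is an assumption and may need extending.
# re.ASCII keeps \b ASCII-based, so names adjacent to Chinese text match.
_METRIC_RE = re.compile(
    r"\b(?:ob|odp|obproxy|node|binlog|partition)_[a-z0-9_]+\b", re.ASCII
)


def metric_mentions(page_text):
    """Return the set of metric-like identifiers mentioned on a page."""
    return set(_METRIC_RE.findall(page_text))


def locale_diff(en_text, zh_text):
    """Compare an en-US page with its zh-CN counterpart."""
    en, zh = metric_mentions(en_text), metric_mentions(zh_text)
    return {"only_en": sorted(en - zh), "only_zh": sorted(zh - en)}
```

Pairing pages is straightforward here because the en-US and zh-CN trees use the same relative paths, as the example locations in this issue show. Any non-empty `only_en` or `only_zh` list marks a page pair worth reviewing by hand.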

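Suggested attachments

It is recommended to attach the following to the issue:

  • Links to the documentation pages that reference these metrics
  • Actual samples or raw output excerpts from the endpoints above
  • A list of replacement metric names confirmed to be actually visible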
