
Commit 90ec223

Merge pull request #32 from InfiniTensor/feat/add-dashboard-md
docs: add finalized InfiniMetrics Dashboard user guide (Chinese)
2 parents 96daee6 + 516075b commit 90ec223

7 files changed: +307 -0 lines changed

docs/dashboard.md

Lines changed: 150 additions & 0 deletions
@@ -0,0 +1,150 @@
# InfiniMetrics Dashboard User Guide

## 1. Dashboard Overview

InfiniMetrics Dashboard provides a unified interface to visualize benchmark and evaluation results of AI accelerators across the following scenarios:

- Communication (NCCL / Collective Communication)
- Training (Training / Distributed Training)
- Inference (Direct / Service Inference)
- Operator (Core Operator Performance)

The benchmark framework produces two types of outputs:

```
JSON -> configuration / environment / scalar metrics
CSV -> curves / time-series data
```
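
As a rough illustration of how the two output types fit together, the sketch below reads scalar metadata from JSON and a curve from CSV using only the Python standard library. The field names (`run_id`, `framework`, `device_count`, `step`, `loss`) and the in-memory sample data are assumptions made for this example, not the framework's actual schema.

```python
import csv
import io
import json

# Hypothetical sample outputs; the real schema may differ.
SAMPLE_JSON = '{"run_id": "train_demo_001", "framework": "demo", "device_count": 8}'
SAMPLE_CSV = "step,loss\n1,2.50\n2,2.10\n3,1.80\n"


def load_run(json_text: str, csv_text: str) -> dict:
    """Combine scalar metadata (JSON) with curve data (CSV) for one run."""
    meta = json.loads(json_text)
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    # CSV values arrive as strings, so convert them explicitly.
    curve = [(int(r["step"]), float(r["loss"])) for r in rows]
    return {"meta": meta, "curve": curve}


run = load_run(SAMPLE_JSON, SAMPLE_CSV)
print(run["meta"]["run_id"], len(run["curve"]))
```

Keeping scalars and curves in separate structures mirrors the JSON / CSV split: scalars suit filtering and tables, while curves suit plotting.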

The Dashboard automatically loads test results and provides unified analysis capabilities, including:

- Run ID fuzzy search: locate specific test runs using partial Run IDs
- General filters: filter results by framework, model, device count, etc.
- Multi-run comparison: select multiple runs to compare performance
- Performance visualization: display curves such as latency / throughput / loss
- Statistics and configuration view: inspect throughput statistics, runtime configuration, and environment details

For example, entering either of the following terms performs fuzzy matching on Run IDs:

```
allreduce
service
```

Example screenshot:

![Run ID search](./images/runid_research.jpg)
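
The Run ID search described above can be approximated as follows. This sketch assumes that "fuzzy" means a case-insensitive substring match on the Run ID; the Dashboard's actual matching rule may differ, and the Run IDs listed here are invented for illustration.

```python
def search_run_ids(run_ids, query):
    """Return the Run IDs that contain the query, ignoring case."""
    needle = query.strip().lower()
    if not needle:
        return list(run_ids)  # an empty query matches every run
    return [rid for rid in run_ids if needle in rid.lower()]


# Hypothetical Run IDs, for illustration only.
RUN_IDS = [
    "comm_allreduce_8gpu_001",
    "infer_service_demo_002",
    "train_demo_003",
]

print(search_run_ids(RUN_IDS, "allreduce"))
print(search_run_ids(RUN_IDS, "SERVICE"))
```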

## 2. Running the Dashboard

### 2.1 Environment Requirements

Before using the Dashboard, install the following dependencies:

```
streamlit
plotly
pandas
```
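
One way to confirm that these packages are available before starting the Dashboard is a small check like the one below. It is a sketch that relies only on the standard-library `importlib.util.find_spec`; the helper name `missing_packages` is made up for this example and is not part of InfiniMetrics.

```python
import importlib.util

REQUIRED = ("streamlit", "plotly", "pandas")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]


missing = missing_packages()
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All Dashboard dependencies are importable.")
```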

### 2.2 Start the Dashboard

Run the following command in the project root directory:

```
python -m streamlit run dashboard/app.py
```

After a successful start, the following access URLs are displayed:

```
Local URL: http://localhost:8501
Network URL: http://<server-ip>:8501
```

Explanation:

- Local URL: accessible only from the local machine
- Network URL: accessible from other machines on the same network
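
If the Dashboard needs to be started from a script, for example on a different port, the launch command can be assembled programmatically. The sketch below only builds the argument list. `--server.port` and `--server.address` are standard Streamlit options, while the helper name and its defaults are assumptions for this example.

```python
import sys


def build_launch_command(app_path="dashboard/app.py", port=8501, address=None):
    """Build the argument list for launching the Dashboard with Streamlit."""
    cmd = [sys.executable, "-m", "streamlit", "run", app_path,
           "--server.port", str(port)]
    if address is not None:
        # For example "0.0.0.0" to listen on all network interfaces.
        cmd += ["--server.address", address]
    return cmd


print(" ".join(build_launch_command()))
```

Passing the resulting list to `subprocess.run(..., check=True)` would start the server; run it from the project root so that the relative path `dashboard/app.py` resolves.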

## 3. Communication Test Analysis

Path:

```
Dashboard → Communication Performance Test
```

Supported features:

```
Bandwidth analysis curve - peak bandwidth

Latency analysis curve - average latency

Test duration

GPU memory usage

Communication configuration analysis
```

Example screenshot:

![Communication Test](./images/dashboard_communication.jpg)
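
The two summary values named above, peak bandwidth and average latency, can be derived from a curve file along these lines. The column names (`size_bytes`, `bandwidth_gb_s`, `latency_us`) and the sample numbers are assumptions for illustration; the real CSV layout may differ.

```python
import csv
import io
from statistics import mean

# Hypothetical curve: message size (bytes), bandwidth (GB/s), latency (microseconds).
SAMPLE_CSV = """size_bytes,bandwidth_gb_s,latency_us
1024,0.5,20.0
1048576,45.0,30.0
1073741824,180.0,6000.0
"""


def summarize_comm(csv_text):
    """Return (peak bandwidth, average latency) from a communication curve."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    peak_bw = max(float(r["bandwidth_gb_s"]) for r in rows)
    avg_lat = mean(float(r["latency_us"]) for r in rows)
    return peak_bw, avg_lat


peak, avg = summarize_comm(SAMPLE_CSV)
print(f"peak bandwidth: {peak} GB/s, average latency: {avg:.1f} us")
```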

## 4. Inference Test Analysis

Path:

```
Dashboard → Inference Performance Test
```

Modes:

```
Direct Inference
Service Inference
```

Displayed metrics:

```
TTFT

Latency

Throughput

GPU memory usage

Inference configuration analysis
```

Example screenshot:

![Inference Test](./images/dashboard_inference.jpg)
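
For readers unfamiliar with these metrics, the sketch below shows one common way to derive them for a single request. It takes TTFT to be the time to the first output token, latency to be the time to the last one, and throughput to be output tokens per second. These are widely used definitions, but the Dashboard's exact definitions are not stated in this guide and may differ.

```python
def inference_metrics(request_time, token_times):
    """Derive TTFT, end-to-end latency, and throughput for one request.

    request_time: time the request was sent, in seconds.
    token_times: arrival time of each output token, in seconds, ascending.
    """
    if not token_times:
        raise ValueError("at least one output token is required")
    ttft = token_times[0] - request_time
    latency = token_times[-1] - request_time
    throughput = len(token_times) / latency  # output tokens per second
    return {"ttft_s": ttft, "latency_s": latency, "tokens_per_s": throughput}


# Hypothetical timestamps: four tokens arriving every 0.5 s.
m = inference_metrics(0.0, [0.5, 1.0, 1.5, 2.0])
print(m)
```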

## 5. Training Test Analysis

Path:

```
Dashboard → Training Performance Test
```

Supported features:

```
Loss curve

Perplexity curve

Throughput curve

GPU memory usage

Training configuration analysis
```

Example screenshot:

![Training Test](./images/dashboard_training.jpg)
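
The loss and perplexity curves are closely related: for a mean cross-entropy loss measured in nats, perplexity is the exponential of the loss. The sketch below applies that standard relation to a made-up loss curve. Whether the Dashboard computes perplexity this way or reads it directly from the CSV is not stated in this guide.

```python
import math


def perplexity(loss):
    """Perplexity for a mean cross-entropy loss measured in nats."""
    return math.exp(loss)


losses = [2.5, 2.1, 1.8]  # hypothetical loss curve
ppl = [perplexity(x) for x in losses]
print([round(p, 2) for p in ppl])
```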

## 6. Operator Test Analysis

Path:

```
Dashboard → Operator Performance Test
```

Supported metrics:

```
latency

flops

bandwidth
```

Example screenshot:

![Operator Test](./images/dashboard_operators.jpg)
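
To show how the three operator metrics relate, the sketch below derives a compute rate and a memory bandwidth from a measured latency, using matrix multiplication as the example operator. It assumes that "flops" means floating-point operations per second and that each matrix is read or written exactly once. The function name, the operator choice, and the numbers are illustrative rather than taken from InfiniMetrics.

```python
def matmul_metrics(m, n, k, latency_s, bytes_per_element=2):
    """Derive compute rate and memory bandwidth for an (m x k) @ (k x n) matmul."""
    flop_count = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)  # read A and B, write C
    return {
        "tflops": flop_count / latency_s / 1e12,
        "bandwidth_gb_s": bytes_moved / latency_s / 1e9,
    }


# Hypothetical measurement: a 4096 x 4096 x 4096 half-precision matmul taking 1 ms.
r = matmul_metrics(4096, 4096, 4096, latency_s=0.001)
print(r)
```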
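
Binary image added (file name not shown): 231 KB
Binary image added (file name not shown): 117 KB
Binary image added (file name not shown): 164 KB

docs/images/dashboard_training.jpg

Binary image added: 147 KB

docs/images/runid_research.jpg

Binary image added: 124 KB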

docs/zh/dashboard.md

Lines changed: 157 additions & 0 deletions
@@ -0,0 +1,157 @@
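# InfiniMetrics Dashboard User Guide

## 1. Dashboard Overview

InfiniMetrics Dashboard presents, in one place, the test and evaluation results of AI accelerator cards in the following scenarios:

- Communication (NCCL / collective communication)
- Training (Training / distributed training)
- Inference (Direct / Service inference)
- Operator (core operator performance)

The test framework outputs two types of data:

```
JSON -> configuration / environment / scalar metrics
CSV -> curves / time-series data
```

The Dashboard automatically loads test results and provides unified analysis features, including:

- Run ID fuzzy search: quickly locate test runs using a partial Run ID
- General filters: filter by framework, model, device count, and other criteria
- Multi-run comparison: select multiple test runs at once to compare performance
- Performance visualization: display performance curves such as latency / throughput / loss
- Statistics and configuration view: inspect throughput statistics, run configuration, and environment information

For example, entering either of the following terms performs a fuzzy-match search on Run IDs:

```
allreduce
service
```

Example screenshot:

![Run ID search](../images/runid_research.jpg)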

## 2. Running the Dashboard

### 2.1 Environment Requirements

Before using the Dashboard, install the following dependencies:

```
streamlit
plotly
pandas
```

### 2.2 Start the Dashboard

Run the following command in the project root directory:

```
python -m streamlit run dashboard/app.py
```

After a successful start, the following access URLs are displayed:

```
Local URL: http://localhost:8501
Network URL: http://<server-ip>:8501
```

Explanation:

- Local URL: accessible only from the local machine
- Network URL: accessible from other machines on the same network

## 3. Communication Test Analysis

Path:

```
Dashboard → Communication Performance Test
```

Supported features:

```
Bandwidth analysis curve - peak bandwidth

Latency analysis curve - average latency

Test duration

GPU memory usage

Communication configuration analysis
```

Example screenshot:

![Communication Test](../images/dashboard_communication.jpg)

## 4. Inference Test Analysis

Path:

```
Dashboard → Inference Performance Test
```

Modes:

```
Direct Inference
Service Inference
```

Displayed metrics:

```
TTFT

Latency

Throughput

GPU memory usage

Inference configuration analysis
```

Example screenshot:

![Inference Test](../images/dashboard_inference.jpg)

## 5. Training Test Analysis

Path:

```
Dashboard → Training Performance Test
```

Supported features:

```
Loss curve

Perplexity curve

Throughput curve

GPU memory usage

Training configuration analysis
```

Example screenshot:

![Training Test](../images/dashboard_training.jpg)

## 6. Operator Test Analysis

Path:

```
Dashboard → Operator Performance Test
```

Supported metrics:

```
latency

flops

bandwidth
```

Example screenshot:

![Operator Test](../images/dashboard_operators.jpg)
