Commit aab4db8: Merge pull request #100 from rafvasq/clean-docs ([Docs] Update docs)
2 parents 714297b + c8f4b52; 3 files changed, 193 additions and 108 deletions

File: tests/LAYERS.md (52 additions, 35 deletions)
# Layer Metrics Generation

This guide explains how to use the [`generate_layers_metrics.py`](../scripts/generate_layers_metrics.py) script to generate metrics by layer for validating models and debugging.

1. [Generate metrics by layer](./LAYERS.md#1-generate-metrics-by-layer)
2. [Get thresholds](./LAYERS.md#2-get-thresholds)
3. [Apply thresholds](./LAYERS.md#3-apply-the-thresholds)

## 1. Generate Metrics by Layer

The goal is to run prompts through the model with pre- and post-hooks added, allowing us to capture output metrics at each layer. This approach lets us establish a CPU/GPU baseline to define failure thresholds for AIU tests, similar to [test_decoders.py](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/tests/models/test_decoders.py), but applied at each layer. This helps measure output discrepancies and provides per-layer thresholds for detailed debugging of problems on AIU.

![metrics generation by layer](./resources/assets/metrics_generation_layers.png)
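To illustrate the hook mechanism (a minimal sketch only, not the script's actual implementation; the toy model and layer names here are hypothetical), PyTorch forward hooks can capture each leaf layer's output for later comparison:

```python
import torch
import torch.nn as nn

# Toy stand-in model; the real script attaches hooks to the FMS model's layers.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 8))

captured = {}  # layer name -> output tensor

def make_hook(name):
    def hook(module, inputs, output):
        # Detach so stored tensors don't keep the autograd graph alive.
        captured[name] = output.detach()
    return hook

# Register a post-forward hook on every leaf module.
for name, module in model.named_modules():
    if len(list(module.children())) == 0:
        module.register_forward_hook(make_hook(name))

with torch.no_grad():
    model(torch.randn(1, 8))

print(sorted(captured))  # one entry per leaf layer
```

Running the same capture on CPU and on the device under test yields the pairs of per-layer outputs that the metrics below are computed from.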

### Script Usage

```console
usage: generate_layers_metrics.py [-h]
                                  [--architecture ARCHITECTURE]
                                  [--variant VARIANT]
                                  [--model_path MODEL_PATH]
                                  --mode {generate,model-forward}
                                  --batch_sizes BATCH_SIZES
                                  --seq_lengths SEQ_LENGTHS
                                  --max_new_tokens MAX_NEW_TOKENS
                                  [--output_path OUTPUT_PATH]
                                  [--sharegpt_path SHAREGPT_PATH]

Script to generate the model's metrics by layer

options:
  -h, --help            show this help message and exit
  --architecture ARCHITECTURE
                        The model architecture, e.g. hf_pretrained
  --variant VARIANT     The model variant (configuration) to benchmark, e.g. ibm-granite/granite-3.3-8b-instruct
  --model_path MODEL_PATH
                        Path to the directory containing the model's weights (.pth files sharded by tensor parallel rank, not HF weights)
  --mode {generate,model-forward}
  ...
  --sharegpt_path SHAREGPT_PATH
                        Path to sharegpt data json
```

The only required argument is `--mode`, which sets the type of generation to be used. The options are `generate` or `model-forward`.

- `generate` uses [FMS' generate](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/fms/utils/generation.py) function, a high-level API that wraps many operations (e.g. forward pass, KV cache logic, decoding, post-processing).

```python
result = generate(
    model,
    ...
    extra_kwargs={},
)
```

- `model-forward` calls `model.forward` directly, avoiding noise introduced by sampling, past key-value caching, etc.

```python
result = model.forward(
    ids,
    use_cache=use_cache
)
```

#### How to Run

To run the script and generate CSV metrics for each layer of the model, first create a directory to hold the output files:

```bash
cd aiu-fms-testing-utils/tests/resources
mkdir /tmp/output
```

Then, run the script:

```bash
python3 generate_layers_metrics.py --mode model-forward --variant ibm-granite/granite-3.3-8b-instruct --architecture hf_pretrained --batch_sizes 1 --seq_lengths 64 --max_new_tokens 128
```

CSV files will be generated under `/tmp/output`, unless `--output_path` is specified:

```bash
ibm-granite--granite-3.3-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-64_dtype-float16--model.base_model.layers7.ln.abs_diff.csv
ibm-granite--granite-3.3-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-64_dtype-float16--model.base_model.layers7.ln.cos_sim.csv
ibm-granite--granite-3.3-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-64_dtype-float16--model.base_model.layers8.attn.dense.abs_diff.csv
ibm-granite--granite-3.3-8b-instruct_max-new-tokens-128_batch-size-1_seq-length-64_dtype-float16--model.base_model.layers8.attn.dense.cos_sim.csv
```
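For intuition about what these files contain (a hedged sketch; the script's exact computation may differ), `abs_diff` and `cos_sim` style metrics compare a reference (CPU) layer output against a device output, roughly like this:

```python
import math

def abs_diff(ref, out):
    """Element-wise absolute differences between two flat output vectors."""
    return [abs(r - o) for r, o in zip(ref, out)]

def cos_sim(ref, out):
    """Cosine similarity between two flat output vectors."""
    dot = sum(r * o for r, o in zip(ref, out))
    norm_ref = math.sqrt(sum(r * r for r in ref))
    norm_out = math.sqrt(sum(o * o for o in out))
    return dot / (norm_ref * norm_out)

# Hypothetical layer outputs: CPU baseline vs. a slightly perturbed device result.
cpu = [0.10, -0.25, 0.40, 0.05]
dev = [0.10, -0.25, 0.41, 0.05]

diffs = abs_diff(cpu, dev)
sim = cos_sim(cpu, dev)
print(max(diffs))  # small absolute difference
print(sim)         # close to 1.0
```

A well-matched layer shows tiny absolute differences and cosine similarity near 1.0; layers that deviate stand out in their CSV files.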

## 2. Get Thresholds

Once the layer-wise metrics are generated, you can compute the thresholds for each layer to serve as baseline metrics.

Run the [`get_thresholds.py`](./resources/get_thresholds.py) script:

```bash
cd aiu-fms-testing-utils/tests/resources

python3 get_thresholds.py --models ibm-granite/granite-3.3-8b-instruct --metrics abs_diff cos_sim_avg cos_sim_mean --file_base /tmp/output --layer_io
```
You'll see output like this, showing the computed metrics per layer:

```bash
2025-07-09 19:02:40,657 found 484 layers metric files
2025-07-09 19:02:40,674 Layer model.base_model.embedding abs_diff_linalg_norm = 1.7258892434335918e-07
...
2025-07-09 19:03:27,055 Layer model.base_model.layers0.ff_ln cos_sim_mean = 0.9999961135908961
```
A `JSON` summary file containing these thresholds is also saved in the same output directory. An example of this file can be found here: [sample_layer_th.json](./resources/sample_layer_th.json).

## 3. Apply the Thresholds

The thresholds serve as bounds to determine whether AIU outputs diverge from CPU: AIU debugging tools compare AIU outputs against CPU results and assert that the differences are within the generated thresholds. This helps catch operations and layers that have issues in their enablement for AIU hardware.

**TODO:** Add integration architecture
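As a sketch of how such a check might look (the JSON structure, layer names, and values below are hypothetical assumptions; the actual format is whatever `get_thresholds.py` writes, e.g. sample_layer_th.json):

```python
import io
import json

# Hypothetical thresholds file; this flat layer -> bounds mapping is an
# assumption, not the documented schema.
thresholds_json = """
{
  "model.base_model.layers0.ln": {"abs_diff_linalg_norm": 1e-05, "cos_sim_mean": 0.999}
}
"""
thresholds = json.load(io.StringIO(thresholds_json))

# Hypothetical per-layer measurements from an AIU-vs-CPU comparison.
measured = {
    "model.base_model.layers0.ln": {"abs_diff_linalg_norm": 3e-06, "cos_sim_mean": 0.99997},
}

failures = []
for layer, bounds in thresholds.items():
    m = measured[layer]
    # Absolute difference must stay below its bound; cosine similarity above its bound.
    if m["abs_diff_linalg_norm"] > bounds["abs_diff_linalg_norm"]:
        failures.append((layer, "abs_diff_linalg_norm"))
    if m["cos_sim_mean"] < bounds["cos_sim_mean"]:
        failures.append((layer, "cos_sim_mean"))

print(failures)  # empty when all layers are within thresholds
```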
