# Generate metrics by layer to be used in tests and model enablement debugging
This guide explains how to use the [`generate_layers_metrics.py`](../scripts/generate_layers_metrics.py) script to generate metrics by layer for validating models and debugging.
1. [Generate metrics by layer on GPU](./LAYERS.md#1-generate-metrics-by-layer)
2. [Get Thresholds](./LAYERS.md#2-get-thresholds)
3. [Apply the Thresholds](./LAYERS.md#3-apply-the-thresholds)
The goal is to run prompts through the model with pre- and post-hooks added, allowing us to capture output metrics at each layer. This approach lets us establish a CPU/GPU baseline to define failure thresholds for AIU tests, similar to [test_decoders.py](https://github.com/foundation-model-stack/aiu-fms-testing-utils/blob/main/tests/models/test_decoders.py), but applied at each layer. This helps to measure the output discrepancies and use the thresholds for debugging problems on AIU.
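
To make the hook mechanism concrete, here is a minimal sketch of intercepting per-layer outputs with PyTorch forward hooks. This is an illustration only, not the script's actual implementation; the helper name, the leaf-module filter, and the mean-absolute-difference metric are assumptions:

```python
import torch

def attach_capture_hooks(model: torch.nn.Module, captured: dict):
    """Register a forward hook on every leaf module so each layer's output
    tensor is recorded under the module's qualified name."""
    handles = []
    for name, module in model.named_modules():
        if list(module.children()):
            continue  # hook only leaf modules to avoid duplicate captures
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                # detach and move to CPU so the metric math is device-agnostic
                captured[name] = output.detach().cpu()
        handles.append(module.register_forward_hook(hook))
    return handles

# Usage sketch: run the same prompt through the model on each device,
# then compare the captured tensors layer by layer, e.g.:
#   cpu_captured, gpu_captured = {}, {}
#   handles = attach_capture_hooks(model, cpu_captured)
#   model(ids)          # or the generate() path shown below
#   for h in handles:
#       h.remove()
#   diff = (cpu_captured[layer] - gpu_captured[layer]).abs().mean()
```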
The script [`generate_layers_metrics.py`](../scripts/generate_layers_metrics.py) requires the following arguments to run:

```
options:
  --variant VARIANT     The model variants (configuration) to benchmark. E.g. ibm-granite/granite-3.3-8b-instruct
  --model_path MODEL_PATH
                        Paths to the directory containing model's weights (.pth files sharded by tensor parallel rank, not HF weights)
  --mode {generate,model-forward}
  ...
                        Path to sharegpt data json
```
These variables accept either a single value or an array of values.
The only required argument is `--mode`, which sets the type of generation to be used. The options are `generate` or `model-forward`.
- `generate` uses FMS' [`generate`](https://github.com/foundation-model-stack/foundation-model-stack/blob/main/fms/utils/generation.py) function, a high-level API that wraps many operations (e.g. forward pass, KV cache logic, decoding, post-processing).
```python
result = generate(
    model,
    # ... other generation arguments ...
    extra_kwargs={},
)
```
- `model-forward` calls `model.forward` directly, avoiding noise from sampling, past key caching, etc.
```python
result = model.forward(
    ids,
    use_cache=use_cache
)
```
#### How to Run
To run the script and generate CSV metrics for each layer of the model, first create a directory to hold the output files:
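
For example, a minimal invocation might look like the following; the output directory and the argument values are illustrative, not prescribed by the script:

```bash
# directory that will hold the generated CSV (and JSON) files; path is an example
mkdir -p /tmp/output

# example run; --mode is the only required argument, the rest are illustrative
python scripts/generate_layers_metrics.py \
    --mode generate \
    --variant ibm-granite/granite-3.3-8b-instruct \
    --model_path /path/to/sharded/weights
```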
## 2. Get Thresholds

Once the metrics CSVs have been generated, thresholds can be derived from them. A `JSON` summary file containing these thresholds is saved in the same output directory; an example of this file can be found here: [sample_layer_th.json](./resources/sample_layer_th.json).
## 3. Apply the Thresholds
The thresholds serve as bounds for determining whether AIU outputs diverge from CPU: in the AIU debugging tools, AIU outputs are compared with CPU references, and the differences are asserted to fall within the generated thresholds.
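
As a sketch of what that assertion could look like in a test (the JSON schema, key names, and the mean-absolute-difference metric are assumptions based on the sample file linked above, not a documented API):

```python
import json

import torch

def assert_within_threshold(layer_name: str, cpu_out: torch.Tensor,
                            aiu_out: torch.Tensor, thresholds: dict) -> None:
    """Assert one layer's AIU output stays within its per-layer bound
    from the generated JSON summary (flat name->float layout assumed)."""
    diff = (cpu_out - aiu_out.to(cpu_out.device)).abs().mean().item()
    bound = thresholds[layer_name]
    assert diff <= bound, (
        f"{layer_name}: mean abs diff {diff:.6f} exceeds threshold {bound:.6f}"
    )

# example: load the thresholds generated in step 2 (path is illustrative)
with open("/tmp/output/sample_layer_th.json") as f:
    thresholds = json.load(f)
```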
In the full integration architecture, the `deepview layer debug` component shows how the model's layer outputs are generated and compared against the CPU results. This matters because it lets the debug tools catch operations and layers that have enablement issues on AIU hardware.