Commit 4fe41a8

Add microbenchmark tutorial to docs
1 parent 87d8115 commit 4fe41a8

2 files changed

Lines changed: 85 additions & 0 deletions

docs/tutorials/microbenchmark.rst

Lines changed: 84 additions & 0 deletions
@@ -0,0 +1,84 @@
.. SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna
..
.. SPDX-License-Identifier: Apache-2.0

Per-Layer Microbenchmarking on PULPOpen
=======================================

Deeploy can wrap each layer in the generated ``RunNetwork`` with PULP performance-counter instrumentation, producing per-layer reports of cycles, instructions, stalls, instruction-cache misses, branch behaviour, and external/TCDM memory traffic. This is intended for profiling individual layers of a deployed network on real hardware or in GVSoC, without modifying any kernel source.

The instrumentation is **off by default** and adds zero overhead unless explicitly enabled.

Enabling
--------

Pass ``--profileMicrobenchmark`` to any of the runner entry points:

.. code-block:: bash

   python testMVP.py ... --profileMicrobenchmark
   python generateNetwork.py ... --profileMicrobenchmark
   python deeployRunner_siracusa.py -t Tests/Kernels/FP32/Add/Regular --profileMicrobenchmark

The flag flows through :py:attr:`Deeploy.DeeployTypes.CodeGenVerbosity.microbenchmarkProfiling`
into the :py:class:`Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPMicrobenchmark.PULPMicrobenchmark`
code-transformation pass, which is registered at the outermost position of the PULPOpen
``ForkTransformer`` and ``ClusterTransformer`` chains. Because it runs last, the wrapped region
covers the full per-layer body, including all tiling, DMA, and memory-management code.

Output Format
-------------

Each layer emits one block of statistics on ``core 0``:

.. code-block:: text

   === Performance Statistics: Add_0 ===
   Cycles: 1442
   Instructions: 149
   IPC: 0.103

   --- Instruction Mix ---
   Loads: 24 (16.11%)
   Stores: 27 (18.12%)
   Branches: 5 (3.36%)
   Taken Branches: 2 (40.00%)
   Compressed (RVC): 0 (0.00%)

   --- Stalls & Hazards ---
   Load Stalls: 0
   Jump Stalls: 0
   I-cache Misses: 724
   TCDM Contentions: 0

   --- Memory Hierarchy ---
   External Loads: 0 (0.00%)
   External Stores: 0 (0.00%)
   Ext Load Cycles: 0 (avg: 0.00)
   Ext Store Cycles: 0 (avg: 0.00)
   ========================================

61+
Underlying Helpers
62+
------------------
63+
64+
The C-side helpers live in ``TargetLibraries/PULPOpen/inc/perf_utils.h`` and are included by
65+
default in PULPOpen builds via ``Platform.py``. The pass injects:
66+
67+
- ``perf_bench_init()`` / ``perf_bench_start()`` / ``perf_bench_read(&start)`` before the layer body
68+
- ``perf_bench_stop()`` / ``perf_bench_read(&end)`` / ``perf_bench_diff(&total, &end, &start)`` /
69+
``perf_bench_print("<layer>", &total)`` after it
70+
71+
All counters listed in ``perf_stats_t`` are configured at once in ``pi_perf_conf``, so a single
72+
wrap captures the full event set.
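
The injected call order listed above can be illustrated with a small generator that wraps a layer body in the helper calls. This is a sketch only and not the actual ``PULPMicrobenchmark`` template: ``wrap_layer`` and the ``run_add_layer()`` body are hypothetical, and declaring ``start``, ``end``, and ``total`` as ``perf_stats_t`` is an assumption based on the type name mentioned above.

```python
def wrap_layer(layer_name: str, body: str) -> str:
    """Hypothetical helper: return C source for `body` wrapped in perf_bench_* calls."""
    prologue = [
        # Assumption: the three snapshots are perf_stats_t values.
        "perf_stats_t start, end, total;",
        "perf_bench_init();",
        "perf_bench_start();",
        "perf_bench_read(&start);",
    ]
    epilogue = [
        "perf_bench_stop();",
        "perf_bench_read(&end);",
        # total = end - start, i.e. the events attributed to the wrapped region.
        "perf_bench_diff(&total, &end, &start);",
        f'perf_bench_print("{layer_name}", &total);',
    ]
    return "\n".join(prologue + [body.rstrip()] + epilogue)


# Placeholder layer body; a real body would contain tiling, DMA, and kernel calls.
wrapped = wrap_layer("Add_0", "run_add_layer();")
print(wrapped)
```

Because the start snapshot is taken immediately before the body and the end snapshot immediately after it, everything inside the body (including DMA and memory-management code) is attributed to the layer.
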
Notes & Caveats
---------------

- **External memory counters** (``LD_EXT``, ``ST_EXT``, ``LD_EXT_CYC``, ``ST_EXT_CYC``) only show
  non-zero values when the wrapped region performs L2/L3 traffic. Untiled tests that fit in L1/TCDM
  will report zero.
- **TCDM contention** depends on the access pattern: regular, bank-friendly kernels (e.g. element-wise
  Add) can legitimately report zero contention even with all 8 cores active.
- Some events may not be modelled by GVSoC; verify on a tiled test (e.g. Siracusa-tiled GEMM) before
  concluding a counter is broken.
- Output is printed by ``core 0`` only to keep logs readable.
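
The caveats above can be turned into a quick triage step when reading a report. The sketch below is illustrative only: ``explain_zero_counters`` is a hypothetical helper, and the mapping from report labels (e.g. ``External Loads``) to the event names listed above (e.g. ``LD_EXT``) is an assumption based on the example output.

```python
# Report labels assumed to correspond to LD_EXT, ST_EXT, LD_EXT_CYC, ST_EXT_CYC.
EXTERNAL_COUNTERS = (
    "External Loads",
    "External Stores",
    "Ext Load Cycles",
    "Ext Store Cycles",
)


def explain_zero_counters(stats: dict, tiled: bool) -> list[str]:
    """Hypothetical helper: comment on zero-valued counters using the caveats above."""
    notes = []
    if all(stats.get(name, 0) == 0 for name in EXTERNAL_COUNTERS):
        if tiled:
            # A tiled test should move data through L2/L3, so zeros are suspicious.
            notes.append(
                "external counters are zero on a tiled test: "
                "check whether the simulator models these events"
            )
        else:
            notes.append(
                "external counters are zero: expected for an untiled test "
                "that fits in L1/TCDM"
            )
    if stats.get("TCDM Contentions", 0) == 0:
        notes.append(
            "TCDM contention is zero: plausible for regular, "
            "bank-friendly access patterns"
        )
    return notes


# Counter values taken from the Add_0 example report (untiled element-wise Add).
add_stats = {
    "External Loads": 0,
    "External Stores": 0,
    "Ext Load Cycles": 0,
    "Ext Store Cycles": 0,
    "TCDM Contentions": 0,
}
for note in explain_zero_counters(add_stats, tiled=False):
    print("-", note)
```
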

docs/tutorials/overview.rst

Lines changed: 1 addition & 0 deletions
@@ -14,5 +14,6 @@ Each tutorial covers a specific topic and includes code examples to illustrate t

  introduction
  debugging
+ microbenchmark
