|
| 1 | +.. SPDX-FileCopyrightText: 2025 ETH Zurich and University of Bologna |
| 2 | +.. |
| 3 | +.. SPDX-License-Identifier: Apache-2.0 |
| 4 | +
|
| 5 | +Per-Layer Microbenchmarking on PULPOpen |
| 6 | +======================================= |
| 7 | + |
| 8 | +Deeploy can wrap each layer in the generated ``RunNetwork`` with PULP performance-counter instrumentation, producing per-layer reports of cycles, instructions, stalls, instruction-cache misses, branch behaviour, and external/TCDM memory traffic. This is intended for profiling individual layers of a deployed network on real hardware or in GVSoC, without modifying any kernel source. |
| 9 | + |
| 10 | +The instrumentation is **off by default** and adds zero overhead unless explicitly enabled. |
| 11 | + |
| 12 | +Enabling |
| 13 | +-------- |
| 14 | + |
| 15 | +Pass ``--profileMicrobenchmark`` to any of the runner entry points: |
| 16 | + |
| 17 | +.. code-block:: bash |
| 18 | +
|
| 19 | + python testMVP.py ... --profileMicrobenchmark |
| 20 | + python generateNetwork.py ... --profileMicrobenchmark |
| 21 | + python deeployRunner_siracusa.py -t Tests/Kernels/FP32/Add/Regular --profileMicrobenchmark |
| 22 | +
|
| 23 | +The flag flows through :py:attr:`Deeploy.DeeployTypes.CodeGenVerbosity.microbenchmarkProfiling` |
| 24 | +into the :py:class:`Deeploy.Targets.PULPOpen.CodeTransformationPasses.PULPMicrobenchmark.PULPMicrobenchmark` |
| 25 | +code-transformation pass, which is registered at the outermost position of the PULPOpen |
| 26 | +``ForkTransformer`` and ``ClusterTransformer`` chains. Because it runs last, the wrapped region |
| 27 | +covers the full per-layer body, including all tiling, DMA, and memory-management code. |
| 28 | + |
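The effect of running last can be illustrated with a toy model of a transformation chain. This is a minimal sketch, not Deeploy's actual pass API: the ``Pass`` type, ``make_wrapper``, ``run_chain``, and the comment strings standing in for tiling and DMA code are all hypothetical. It only shows that the pass applied last ends up enclosing everything the earlier passes emitted.

```python
from typing import Callable, List

# A "pass" here is just a function that receives code and returns new code
# wrapped around it. All names below are hypothetical, not Deeploy's API.
Pass = Callable[[str], str]


def make_wrapper(prefix: str, suffix: str) -> Pass:
    """Build a toy pass that surrounds the incoming code with two lines."""

    def apply(body: str) -> str:
        return f"{prefix}\n{body}\n{suffix}"

    return apply


def run_chain(body: str, passes: List[Pass]) -> str:
    """Apply passes in order; each later pass encloses all earlier output."""
    for p in passes:
        body = p(body)
    return body


chain = [
    make_wrapper("/* tiling loop begin */", "/* tiling loop end */"),
    make_wrapper("/* dma setup */", "/* dma teardown */"),
    make_wrapper("/* perf start */", "/* perf stop + print */"),  # runs last
]

wrapped = run_chain("kernel_call();", chain)
lines = wrapped.splitlines()
print(wrapped)
```

In the printed result the perf markers are the first and last lines, so the measured region includes the stand-ins for tiling and DMA code as well as the kernel call.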
| 29 | +Output Format |
| 30 | +------------- |
| 31 | + |
| 32 | +For each layer, ``core 0`` prints one block of statistics: |
| 33 | + |
| 34 | +.. code-block:: text |
| 35 | +
|
| 36 | + === Performance Statistics: Add_0 === |
| 37 | + Cycles: 1442 |
| 38 | + Instructions: 149 |
| 39 | + IPC: 0.103 |
| 40 | +
|
| 41 | + --- Instruction Mix --- |
| 42 | + Loads: 24 (16.11%) |
| 43 | + Stores: 27 (18.12%) |
| 44 | + Branches: 5 (3.36%) |
| 45 | + Taken Branches: 2 (40.00%) |
| 46 | + Compressed (RVC): 0 (0.00%) |
| 47 | +
|
| 48 | + --- Stalls & Hazards --- |
| 49 | + Load Stalls: 0 |
| 50 | + Jump Stalls: 0 |
| 51 | + I-cache Misses: 724 |
| 52 | + TCDM Contentions: 0 |
| 53 | +
|
| 54 | + --- Memory Hierarchy --- |
| 55 | + External Loads: 0 (0.00%) |
| 56 | + External Stores: 0 (0.00%) |
| 57 | + Ext Load Cycles: 0 (avg: 0.00) |
| 58 | + Ext Store Cycles: 0 (avg: 0.00) |
| 59 | + ======================================== |
| 60 | +
|
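For post-processing, a report like the one above can be parsed into per-layer dictionaries. The following is a sketch that assumes the log contains the lines exactly as shown in the sample (one ``Name: integer`` pair per line, with an optional parenthesised suffix); ``parse_reports`` and the regular expressions are illustrative helpers, not part of Deeploy. Derived values such as IPC are recomputed from the raw counters instead of being parsed.

```python
import re

# Excerpt of the sample report above, used as test input.
SAMPLE = """\
=== Performance Statistics: Add_0 ===
Cycles: 1442
Instructions: 149
IPC: 0.103

--- Instruction Mix ---
Loads: 24 (16.11%)
Stores: 27 (18.12%)
Branches: 5 (3.36%)
Taken Branches: 2 (40.00%)
Compressed (RVC): 0 (0.00%)

--- Stalls & Hazards ---
I-cache Misses: 724
TCDM Contentions: 0
"""

HEADER = re.compile(r"^\s*=== Performance Statistics: (.+?) ===\s*$")
# Integer counters only; the lookahead rejects floats such as "IPC: 0.103".
COUNTER = re.compile(r"^\s*([A-Za-z][\w &()\-]*?):\s+(\d+)(?![\d.])")


def parse_reports(log: str) -> dict:
    """Return {layer_name: {counter_name: int}} for every block in the log."""
    reports = {}
    current = None
    for line in log.splitlines():
        header = HEADER.match(line)
        if header:
            current = reports.setdefault(header.group(1), {})
            continue
        counter = COUNTER.match(line)
        if counter and current is not None:
            current[counter.group(1)] = int(counter.group(2))
    return reports


reports = parse_reports(SAMPLE)
add0 = reports["Add_0"]
ipc = add0["Instructions"] / add0["Cycles"]
load_share = 100.0 * add0["Loads"] / add0["Instructions"]
print(f"IPC: {ipc:.3f}, loads: {load_share:.2f}%")
```

Recomputing the ratios this way reproduces the sample's own figures (IPC 0.103, loads 16.11%), which suggests the percentages in the instruction mix are relative to the instruction count.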
| 61 | +Underlying Helpers |
| 62 | +------------------ |
| 63 | + |
| 64 | +The C-side helpers live in ``TargetLibraries/PULPOpen/inc/perf_utils.h`` and are included by |
| 65 | +default in PULPOpen builds via ``Platform.py``. The pass injects: |
| 66 | + |
| 67 | +- ``perf_bench_init()`` / ``perf_bench_start()`` / ``perf_bench_read(&start)`` before the layer body |
| 68 | +- ``perf_bench_stop()`` / ``perf_bench_read(&end)`` / ``perf_bench_diff(&total, &end, &start)`` / |
| 69 | + ``perf_bench_print("<layer>", &total)`` after it |
| 70 | + |
| 71 | +All counters listed in ``perf_stats_t`` are configured together through ``pi_perf_conf``, so one |
| 72 | +wrapped region captures the full event set. |
| 73 | + |
| 74 | +Notes & Caveats |
| 75 | +--------------- |
| 76 | + |
| 77 | +- **External memory counters** (``LD_EXT``, ``ST_EXT``, ``LD_EXT_CYC``, ``ST_EXT_CYC``) only show |
| 78 | + non-zero values when the wrapped region performs L2/L3 traffic. Untiled tests that fit in L1/TCDM |
| 79 | + will report zero. |
| 80 | +- **TCDM contention** depends on the access pattern: regular, bank-friendly kernels (e.g. element-wise |
| 81 | + Add) can legitimately report zero contention even with all 8 cores active. |
| 82 | +- Some events may not be modelled by GVSoC; verify on a tiled test (e.g. Siracusa-tiled GEMM) before |
| 83 | + concluding a counter is broken. |
| 84 | +- Output is printed by ``core 0`` only to keep logs readable. |
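
The caveats above can be turned into a rough triage helper for zero-valued counters. This is a sketch only: ``triage`` is hypothetical, the mapping from the ``*_EXT`` events to the printed labels is assumed from the sample report, and the "worth investigating" branch is a suggestion rather than documented behaviour.

```python
# Printed labels assumed to correspond to LD_EXT, ST_EXT, LD_EXT_CYC, ST_EXT_CYC.
EXTERNAL = ("External Loads", "External Stores",
            "Ext Load Cycles", "Ext Store Cycles")


def triage(stats: dict, tiled: bool, on_gvsoc: bool) -> list:
    """Return human-readable notes explaining zero-valued counters."""
    notes = []
    if all(stats.get(name, 0) == 0 for name in EXTERNAL):
        if not tiled:
            notes.append("external counters are zero: expected for an "
                         "untiled test that fits in L1/TCDM")
        elif on_gvsoc:
            notes.append("external counters are zero on a tiled test: "
                         "the events may not be modelled by GVSoC")
        else:
            notes.append("external counters are zero on a tiled test "
                         "on hardware: worth investigating")
    if stats.get("TCDM Contentions", 0) == 0:
        notes.append("zero TCDM contention can be legitimate for "
                     "regular, bank-friendly kernels")
    return notes


untiled_notes = triage({"External Loads": 0, "TCDM Contentions": 0},
                       tiled=False, on_gvsoc=True)
busy_notes = triage({"External Loads": 12, "TCDM Contentions": 3},
                    tiled=True, on_gvsoc=False)
print(untiled_notes)
```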