Fix/tiling stack scoping and tiling information corruption #177

Merged
runwangdl merged 3 commits into pulp-platform:devel from pauloohaha:fix/tiling-stack-scoping
Mar 26, 2026

@pauloohaha pauloohaha commented Mar 23, 2026

Contributor

Describe the intent of your PR here.

Added

Changed

Fixed

This pull request addresses two issues:

1) Large stack usage in C code generated by Deeploy
When RunNetwork is called, every variable defined in the function lives on the stack. This includes the tiling pointers, such as uint8_t bu_DeeployNetwork_TILING_CODEGEN_L1_<layer name>_tileIdxPtr, and the L3 closure call arguments, such as DeeployNetwork__<layer name>_closure_L3_args. Originally, all tiling pointers and L3 closure call arguments were scoped to the entire RunNetwork function, so the tiling pointers and arguments of every layer stayed on the stack for the whole call. This causes heavy stack usage on entry to RunNetwork, and the usage scales with the number of layers: the more layers, the more tiling pointers and call arguments, and the higher the stack usage (5KB in L1 for a 90-layer network).

The fix is simple. First, move the tiling pointer declarations into the layer code. Second, wrap each layer in {}. Each layer in the generated RunNetwork then looks like:

{
Tiling pointer declaration
Call arg declaration
L3 closure call
}

{
Next layer
}

This way, the scope of the tiling pointers and call arguments is limited to their respective layer, and the tiling pointers and call arguments of all layers no longer need to live on the stack at once.

In my network with 93 layers, wrapping the call arguments in {} reduces the stack increment on entering RunNetwork from 5KB to 0.7KB. Putting the tiling pointers inside the {} as well further reduces the increment to 0.6KB.
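
The scoping idea can be illustrated with a minimal sketch. The names below (layer0_args_t, layer1_args_t, layer0_closure, layer1_closure, RunNetworkSketch) are hypothetical stand-ins, not the identifiers Deeploy emits, and the layer bodies are toy arithmetic. Each layer's argument struct is declared inside its own block, so its lifetime ends at the closing brace; a compiler may then reuse the same stack slot for the next layer, although the C standard does not guarantee that it will.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the per-layer closure argument structs. */
typedef struct {
  const int8_t *in;
  int8_t *out;
  size_t len;
} layer0_args_t;

typedef struct {
  const int8_t *in;
  int8_t *out;
  size_t len;
} layer1_args_t;

/* Toy layer 0: adds 1 to every element. */
static void layer0_closure(void *raw) {
  layer0_args_t *args = raw;
  assert(args != NULL);
  for (size_t i = 0; i < args->len; i++) {
    args->out[i] = (int8_t)(args->in[i] + 1);
  }
}

/* Toy layer 1: doubles every element. */
static void layer1_closure(void *raw) {
  layer1_args_t *args = raw;
  assert(args != NULL);
  for (size_t i = 0; i < args->len; i++) {
    args->out[i] = (int8_t)(args->in[i] * 2);
  }
}

void RunNetworkSketch(const int8_t *input, int8_t *scratch, int8_t *output,
                      size_t len) {
  { /* layer 0: args exists only inside this block */
    layer0_args_t args = {input, scratch, len};
    layer0_closure(&args);
  } /* lifetime of layer 0's args ends here */

  { /* layer 1: a compiler is free to reuse layer 0's stack slot */
    layer1_args_t args = {scratch, output, len};
    layer1_closure(&args);
  }
}
```

Whether the slots are actually shared depends on the compiler and optimization level, so the stack savings are best confirmed by measurement on the target, as done above.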

2) The tiling information is corrupted after the first RunNetwork run

Originally, the L1 tiling pointer code updated the tiling pointer with:

*DeeployNetwork_TILING_CODEGEN_L1_<layer name>_size_ref = DeeployNetwork_TILING_CODEGEN_L1_<name>_size[TILING_I];

This assigns the next value from the tiling information array DeeployNetwork_TILING_CODEGEN_L1_<name>_size[TILING_I] through the reference. However, the reference is defined as a pointer to the first element of the tiling information array. The next element of the tiling array is therefore copied into the first element, corrupting the tiling information.

The fix is easy: point the reference at the next element instead of assigning a value through it.

DeeployNetwork_TILING_CODEGEN_L1_<layer name>_size_ref = &DeeployNetwork_TILING_CODEGEN_L1_<name>_size[TILING_I];
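
The difference between the two update styles can be reproduced with a small sketch. The function names walk_tiles_buggy and walk_tiles_fixed, the NUM_TILES constant, and the summing of tile sizes are assumptions made for illustration only; they are not the generated Deeploy code. The buggy variant writes through a pointer that aliases the first table entry, while the fixed variant only re-points the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_TILES 4 /* hypothetical table length */

/* Buggy update: size_ref aliases sizes[0], so every write through it
 * overwrites the first entry of the tiling table. */
uint32_t walk_tiles_buggy(uint16_t sizes[NUM_TILES]) {
  assert(sizes != NULL);
  uint16_t *size_ref = &sizes[0];
  uint32_t total = *size_ref;
  for (int i = 1; i < NUM_TILES; i++) {
    *size_ref = sizes[i]; /* copies the next entry into sizes[0] */
    total += *size_ref;
  }
  return total;
}

/* Fixed update: the pointer itself moves; the table is never written. */
uint32_t walk_tiles_fixed(const uint16_t sizes[NUM_TILES]) {
  assert(sizes != NULL);
  const uint16_t *size_ref = &sizes[0];
  uint32_t total = *size_ref;
  for (int i = 1; i < NUM_TILES; i++) {
    size_ref = &sizes[i]; /* re-point instead of assigning a value */
    total += *size_ref;
  }
  return total;
}
```

With a table of {10, 20, 30, 40}, the first buggy pass reads the right values but leaves sizes[0] holding 40, so a second pass starts from a corrupted table. The fixed variant leaves the table unchanged across repeated passes.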

PR Merge Checklist

  1. The PR is rebased on the latest devel commit and pointing to devel.
  2. Your PR is reviewed and approved.
  3. All checks are passing.
  4. The CHANGELOG.md file has been updated.
  5. If the docker was modified, change back its link after review.

…e and after L3 code for each layer to reduce stack usage

- Previously the tiling information was corrupted after each run, because the generated code wrote the next element of the tiling information array into the current location. After one RunNetwork run, the last element of the tiling array ends up in the first location, corrupting the tiling information. The fix in tilingvariablereplacement resolves this by pointing the reference at the new location instead of assigning a value through the reference.

- The C code generated by Deeploy has been causing large stack usage. This is because all variables defined in RunNetwork live on the stack, including all the tiling pointers and call arguments. Adding braces before and after each layer in RunNetwork makes the call arguments live for only one layer, which significantly reduces the stack usage. The tiling pointers still live on the stack and need to be moved as well, but this requires more changes.
@pauloohaha pauloohaha changed the title Fix/tiling stack scoping Fix/tiling stack scoping and tiling information corruption Mar 23, 2026
@pauloohaha pauloohaha added the Bug Something isn't working label Mar 23, 2026
@pauloohaha pauloohaha self-assigned this Mar 23, 2026

@Victor-Jung Victor-Jung left a comment

Member

I like it, can you quickly check that it runs well with the --profileUntiled flag on please?

@runwangdl runwangdl merged commit 13113de into pulp-platform:devel Mar 26, 2026
47 checks passed
runwangdl added a commit to runwangdl/Deeploy that referenced this pull request Apr 10, 2026
…nceCode

Upstream PR pulp-platform#177 (13113de, "Fix/tiling stack scoping and tiling
information corruption", Pu DENG) wraps each layer's emitted code in a
C block:

    layerCode = reduce(lambda a, b: a + b, sections, "")
    callStack += "{\n" + layerCode + "\n}\n"

so that per-layer call args become short-lived stack variables and
RunNetwork's overall stack footprint goes down.  cc1f68b silently
reverted this hunk during a merge from devel — the training-platform
branch was based on a pre-pulp-platform#177 snapshot and the conflict resolution
went the wrong way.  Restore the wrapping verbatim.

Verified on Siracusa: simplemlp_train passes 0/4 (diff=0.000000 at every
step) in both non-tiled and tiled runs.
runwangdl pushed a commit to runwangdl/Deeploy that referenced this pull request Apr 29, 2026
…form#177)

* [Deeploy PR] put the tiling information into layer code as well

* [Deeploy PR] Fix the tiling information corruption. Add bracket before and after L3 code for each layer to reduce stack usage

- Previously the tiling information was corrupted after each run, because the generated code wrote the next element of the tiling information array into the current location. After one RunNetwork run, the last element of the tiling array ends up in the first location, corrupting the tiling information. The fix in tilingvariablereplacement resolves this by pointing the reference at the new location instead of assigning a value through the reference.

- The C code generated by Deeploy has been causing large stack usage. This is because all variables defined in RunNetwork live on the stack, including all the tiling pointers and call arguments. Adding braces before and after each layer in RunNetwork makes the call arguments live for only one layer, which significantly reduces the stack usage. The tiling pointers still live on the stack and need to be moved as well, but this requires more changes.

* Update CHANGELOG.md