# Arduino C Code Generation

SKaiNET provides a specialized compiler backend for exporting trained neural networks to highly optimized, standalone C99 code suitable for microcontroller platforms such as Arduino.

## Overview

The Arduino C code generation process transforms a high-level Kotlin model into a memory-efficient C implementation. It prioritizes static memory allocation, minimal overhead, and numerical consistency with the original model.
### Codegen Pipeline

```mermaid
graph TD
    A[Kotlin Model] --> B[Recording Pass]
    B --> C[Execution Tape]
    C --> D[Compute Graph]
    D --> E[Graph Validation]
    E --> F[Memory Layout Calculation]
    F --> G[C Code Emission]
    G --> H[Arduino Library Packaging]
    H --> I[Generated .h/.c files]
```

## Technical Deep Dive

### 1. Tape-based Tracing
Instead of statically analyzing the Kotlin source, SKaiNET uses a dynamic tracing mechanism. When you call `exportToArduinoLibrary`, the framework executes a single forward pass of your model using a specialized `RecordingContext`.
- Every operation (Dense, ReLU, etc.) is recorded onto an **Execution Tape**.
- This approach handles Kotlin's language features (loops, conditionals) naturally, because it records only the operations that actually executed.

### 2. Compute Graph Construction
The execution tape is converted into a directed acyclic graph (DAG) called `ComputeGraph`.
- Nodes represent operations (Ops).
- Edges represent data flow (Tensors).
- During this phase, the compiler performs **Shape Inference** to ensure every tensor has a fixed, known size.
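To make the node/edge and shape-inference ideas concrete, here is a minimal sketch in C. The real `ComputeGraph` is a Kotlin class, so all names here (`graph_node`, `infer_shape`, the `MAX_DIMS` limit) are hypothetical and only illustrate the principle: a Dense node maps `[batch, in_features]` to `[batch, units]`, while element-wise ops propagate their input shape unchanged.

```c
/* Illustrative compute-graph node; names and layout are hypothetical,
 * not SKaiNET's actual (Kotlin) implementation. */
#define MAX_DIMS 2

typedef enum { OP_DENSE, OP_RELU } op_kind;

typedef struct {
    op_kind kind;
    int in_shape[MAX_DIMS];   /* shape of the incoming tensor          */
    int out_shape[MAX_DIMS];  /* filled in by shape inference          */
    int units;                /* Dense only: number of output features */
} graph_node;

/* Shape inference: Dense produces [batch, units]; element-wise ops
 * such as ReLU simply copy the input shape through. */
void infer_shape(graph_node *n) {
    n->out_shape[0] = n->in_shape[0];
    n->out_shape[1] = (n->kind == OP_DENSE) ? n->units : n->in_shape[1];
}
```

Because every shape is resolved at export time, the generated C code never has to size anything at runtime.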

### 3. Static Memory Management
Microcontrollers typically have very limited RAM and lack robust heap management. SKaiNET uses a **Ping-Pong Buffer Strategy** to eliminate dynamic memory allocation (`malloc`/`free`) during inference.

#### Ping-Pong Buffer Strategy
The compiler calculates the maximum size required by any intermediate tensor in the graph and allocates exactly two static buffers of that size.

```mermaid
sequenceDiagram
    participant I as Input
    participant B1 as Buffer A
    participant B2 as Buffer B
    participant O as Output

    I->>B1: Layer 1 (Input -> A)
    B1->>B2: Layer 2 (A -> B)
    B2->>B1: Layer 3 (B -> A)
    B1->>O: Layer 4 (A -> Output)
```

- **Buffer Reuse**: Instead of allocating space for every layer's output, buffers are reused.
- **Direct Output Optimization**: The first layer reads from the input pointer, and the last layer writes directly to the output pointer, avoiding unnecessary copies.
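The scheduling in the diagram above can be sketched as a plain C skeleton. This is not SKaiNET's actual emitted code; the symbol names (`run_pipeline`, `layer_copy`, `MAX_TENSOR_SIZE`) are illustrative, and a trivial copy kernel stands in for real layers so the buffer alternation is easy to follow.

```c
#include <stddef.h>

/* Two static scratch buffers, each sized to the largest
 * intermediate tensor in the graph (a hypothetical bound here). */
#define MAX_TENSOR_SIZE 16

static float buf_a[MAX_TENSOR_SIZE];
static float buf_b[MAX_TENSOR_SIZE];

/* Stand-in kernel: a real generator would emit Dense, ReLU, etc. */
static void layer_copy(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) out[i] = in[i];
}

void run_pipeline(const float *input, float *output, size_t n) {
    layer_copy(input, buf_a, n);   /* layer 1: input -> A             */
    layer_copy(buf_a, buf_b, n);   /* layer 2: A -> B                 */
    layer_copy(buf_b, buf_a, n);   /* layer 3: B -> A                 */
    layer_copy(buf_a, output, n);  /* layer 4: A -> output (no copy)  */
}
```

Note that no layer ever reads from the buffer it is writing to, and the first and last layers touch the caller's pointers directly.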

### 4. Code Generation (Emission)
The `CCodeGenerator` emits C99-compatible code using templates.
- **Weights & Biases**: Extracted from the trained Kotlin model and serialized as `static const float` arrays. On many microcontrollers this keeps them in Flash memory (via `PROGMEM` on AVR targets), saving precious RAM.
- **Kernel Implementation**: Operations like `Dense` (Linear) are implemented as optimized nested loops.
- **Header Generation**: Produces a clean API for the user:
  ```c
  int model_inference(const float* input, float* output);
  ```

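As a rough sketch of what an emitted kernel might look like, here is a 3-input, 2-output Dense layer fused with ReLU. The weights, the symbol names (`w0`, `b0`, `dense_relu_0`), and the layer sizes are all made up for illustration; only the shape of the code (const weight tables plus two nested loops) reflects the strategy described above.

```c
/* Hypothetical generated weights: const data that can live in Flash. */
static const float w0[2][3] = { {1.0f, 0.0f, -1.0f},
                                {0.5f, 0.5f,  0.5f} };
static const float b0[2]    = { 0.0f, 1.0f };

/* y = relu(W x + b), emitted as two nested loops with no heap use. */
static void dense_relu_0(const float *x, float *y) {
    for (int o = 0; o < 2; ++o) {
        float acc = b0[o];
        for (int i = 0; i < 3; ++i) {
            acc += w0[o][i] * x[i];
        }
        y[o] = acc > 0.0f ? acc : 0.0f;  /* fused ReLU */
    }
}
```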
### 5. Validation
The generator performs post-generation validation:
- **Static Allocation Check**: Ensures no dynamic allocation is present in the generated source.
- **Buffer Alternation Check**: Verifies that the ping-pong strategy is correctly implemented, with no layer reading from the buffer it is currently overwriting.

## Performance and Constraints
- **Floating Point**: Currently optimized for `FP32`.
- **Supported Ops**: `Dense`, `ReLU`, `Sigmoid`, `Tanh`, `Add`, `MatMul`.
- **Memory**: Total memory consumption is `TotalWeights + 2 * MaxIntermediateTensor`.
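As a worked instance of the memory formula, consider a hypothetical 8 -> 16 -> 4 MLP in `FP32` (4 bytes per value); the network sizes are invented purely for the arithmetic.

```c
/* Memory footprint of a hypothetical 8 -> 16 -> 4 fully connected net. */
enum {
    /* weights + biases: 8*16 + 16 + 16*4 + 4 = 212 floats */
    WEIGHT_FLOATS    = 8 * 16 + 16 + 16 * 4 + 4,
    /* largest intermediate activation is the 16-wide hidden layer */
    MAX_INTERMEDIATE = 16,
    /* TotalWeights + 2 * MaxIntermediateTensor, in floats and bytes */
    TOTAL_FLOATS = WEIGHT_FLOATS + 2 * MAX_INTERMEDIATE,
    TOTAL_BYTES  = TOTAL_FLOATS * 4
};
```

For this example the whole model needs under 1 KB of data memory, well within reach of small microcontrollers.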