Advanced Training Techniques

Introduction

Tinymind provides several training policies that can be composed via template parameters to customize how neural networks learn. All policies are optional -- existing code that doesn't use them compiles unchanged with null/no-op defaults. Policies are extracted from the TransferFunctionsPolicy via SFINAE traits.

Why Training Policies Matter for Fixed-Point

Training neural networks with fixed-point arithmetic is harder than with floating-point. The limited dynamic range of Q-format values means that gradients, weight updates, and accumulated errors can easily overflow, producing garbage values that destroy the network's learned state. On hardware without an FPU, training in fixed-point is often the only practical option -- and without the right guardrails, training can diverge.

The training policies on this page exist specifically to make fixed-point training robust:

Gradient clipping prevents a single large gradient from overflowing the Q-format range -- this is the single most important policy for fixed-point training
L2 weight decay keeps weights bounded, preventing the slow drift toward overflow that accumulates over thousands of training steps
Learning rate scheduling starts with larger updates (faster convergence) and reduces them over time (fine-grained precision without overflow risk)
Early stopping detects convergence and halts training, saving compute cycles on battery-powered devices where every milliamp-hour counts
Adam and RMSprop provide adaptive per-parameter learning rates that naturally scale to the Q-format range, and both reuse existing connection storage so they add zero memory overhead
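
To make the overflow risk concrete, here is a minimal sketch of Q8.8 multiplication done by hand. The helper names (q8_8_mul_wide, fits_in_q8_8, saturate_q8_8) are hypothetical and are not part of tinymind; they only illustrate how quickly a product of two moderate values leaves the 16-bit range.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical Q8.8 helpers for illustration only (not tinymind's QValue).
// A Q8.8 raw value is the real value multiplied by 256, stored in 16 bits.
typedef std::int16_t q8_8_raw;

// Multiply two Q8.8 raw values in 32 bits and rescale by 2^8.
// The wide result lets the caller check whether it still fits in 16 bits.
inline std::int32_t q8_8_mul_wide(q8_8_raw a, q8_8_raw b)
{
    return (static_cast<std::int32_t>(a) * static_cast<std::int32_t>(b)) / 256;
}

// True when the wide result is representable as a Q8.8 raw value.
inline bool fits_in_q8_8(std::int32_t wide)
{
    return wide >= std::numeric_limits<q8_8_raw>::min() &&
           wide <= std::numeric_limits<q8_8_raw>::max();
}

// Clamp a wide result to the representable Q8.8 range.
inline q8_8_raw saturate_q8_8(std::int32_t wide)
{
    if (wide > std::numeric_limits<q8_8_raw>::max())
    {
        return std::numeric_limits<q8_8_raw>::max();
    }
    if (wide < std::numeric_limits<q8_8_raw>::min())
    {
        return std::numeric_limits<q8_8_raw>::min();
    }
    return static_cast<q8_8_raw>(wide);
}
```

In this sketch, 2.0 * 3.0 (raw 512 * 768) fits comfortably, while 16.0 * 16.0 (raw 4096 * 4096) yields 256.0, which is outside the Q8.8 range of roughly [-128, 128). Without a clamp, the truncated result would be unrelated to the true product.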

This page covers:

Adam optimizer -- adaptive per-parameter learning rates
RMSprop optimizer -- running average of squared gradients (preferred for RNNs)
Gradient clipping -- prevents exploding gradients (critical for fixed-point)
L2 weight decay -- ridge regularization to prevent overflow
Learning rate scheduling -- step decay over training
Early stopping -- convergence detection to save compute

Configuring Training Policies

Training policies are specified as template parameters of the FixedPointTransferFunctions (or floating-point equivalent) policy class. Here is the full set of configurable policies:

typedef tinymind::FixedPointTransferFunctions<
    ValueType,                                          // Q-format or float type
    RandomNumberGeneratorPolicy,                        // weight initialization RNG
    HiddenNeuronActivationPolicy,                       // e.g. TanhActivationPolicy
    OutputNeuronActivationPolicy,                       // e.g. SigmoidActivationPolicy
    NumberOfOutputNeurons,                              // default: 1
    NetworkInitializationPolicy,                        // default: DefaultNetworkInitializer
    ErrorCalculatorPolicy,                              // default: MeanSquaredErrorCalculator
    ZeroTolerancePolicy,                                // default: ZeroToleranceCalculator
    GradientClippingPolicy,                             // default: NullGradientClippingPolicy
    WeightDecayPolicy,                                  // default: NullWeightDecayPolicy
    LearningRateSchedulePolicy,                         // default: FixedLearningRatePolicy
    OptimizerPolicy                                     // default: NullOptimizerPolicy (SGD)
> TransferFunctionsType;

The last four parameters (gradient clipping, weight decay, learning rate schedule, and optimizer) are the new training policies. Each has a null/no-op default, so you only need to specify the ones you want.
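
The "optional with a no-op default" behavior can be pictured with a small SFINAE trait. The sketch below is an assumption about the general technique, not tinymind's actual trait code; NullOptimizer, FancyOptimizer, OptimizerOf, and the two transfer-function structs are hypothetical names. Only the nested typedef name OptimizerPolicyType is taken from this page.

```cpp
#include <type_traits>

// Hypothetical stand-ins for policy classes.
struct NullOptimizer {};
struct FancyOptimizer {};

// C++11-compatible equivalent of std::void_t.
template<typename...>
struct make_void
{
    typedef void type;
};

// Primary template: selected when T has no nested OptimizerPolicyType.
template<typename T, typename = void>
struct OptimizerOf
{
    typedef NullOptimizer type;
};

// Partial specialization: selected only when T::OptimizerPolicyType names a type.
template<typename T>
struct OptimizerOf<T, typename make_void<typename T::OptimizerPolicyType>::type>
{
    typedef typename T::OptimizerPolicyType type;
};

// Older policy class that knows nothing about optimizers.
struct LegacyTransferFunctions {};

// Newer policy class that opts in by declaring the nested typedef.
struct TunedTransferFunctions
{
    typedef FancyOptimizer OptimizerPolicyType;
};

static_assert(std::is_same<OptimizerOf<LegacyTransferFunctions>::type,
                           NullOptimizer>::value,
              "missing typedef falls back to the null policy");
static_assert(std::is_same<OptimizerOf<TunedTransferFunctions>::type,
                           FancyOptimizer>::value,
              "declared typedef is picked up");
```

The design point is that code written before a policy existed never has to mention it: the trait quietly substitutes the null policy when the typedef is absent.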

Adam Optimizer

Adam (Adaptive Moment Estimation) maintains per-parameter running averages of the first moment (mean) and second moment (uncentered variance) of the gradient. This provides adaptive learning rates that work well across a wide range of problems.

Template Declaration

template<typename ValueType,
         int Beta1Int = 0, unsigned Beta1Frac = 230,
         int Beta2Int = 0, unsigned Beta2Frac = 255,
         int EpsilonInt = 0, unsigned EpsilonFrac = 1>
struct AdamOptimizer

Default hyperparameters (Q8.8):

beta1 = 230/256 ~ 0.898 (exponential decay rate for first moment)
beta2 = 255/256 ~ 0.996 (exponential decay rate for second moment)
epsilon = 1/256 ~ 0.004 (numerical stability)
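
The decimal values above follow from dividing the fractional template argument by 2^8. A minimal sketch of that conversion, using a hypothetical helper q_to_real that is not part of tinymind:

```cpp
// Hypothetical helper: real value encoded by an integer part plus a raw
// fractional value with the given number of fractional bits.
constexpr double q_to_real(int integerPart, unsigned fractionalRaw, unsigned fractionalBits)
{
    return static_cast<double>(integerPart) +
           static_cast<double>(fractionalRaw) / static_cast<double>(1u << fractionalBits);
}

static_assert(q_to_real(0, 230, 8) == 0.8984375, "beta1 default in Q8.8");
static_assert(q_to_real(0, 255, 8) == 0.99609375, "beta2 default in Q8.8");
static_assert(q_to_real(0, 1, 8) == 0.00390625, "epsilon default in Q8.8");
```

If the fractional argument is interpreted as a raw value in the network's own Q-format (an assumption, suggested by the Q16.16 weight decay example later on this page), the same raw value 230 corresponds to 230/65536 ~ 0.0035 in Q16.16, so it may be worth checking whether the defaults need rescaling for wider formats.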

For floating-point, use AdamOptimizerFloat<ValueType>, which uses the conventional floating-point defaults (beta1=0.9, beta2=0.999, epsilon=1e-8).

Update Rule

m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
m_hat = m / (1 - beta1^t)   [bias correction]
v_hat = v / (1 - beta2^t)   [bias correction]
weight += lr * m_hat / (sqrt(v_hat) + epsilon)

Adam reuses the existing mDeltaWeight and mPreviousDeltaWeight storage in trainable connections, so it requires no additional memory beyond standard SGD.
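
The update rule can be sketched for a single weight in plain double-precision C++. This is an illustration of the formulas as written above, not tinymind's implementation; AdamState, AdamParams, and adamStep are hypothetical names, and the += sign is copied from the rule above, which assumes the gradient already points in the direction the weight should move.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical per-weight state: first moment, second moment, step count.
struct AdamState
{
    double m = 0.0;
    double v = 0.0;
    unsigned t = 0;
};

// Hypothetical hyperparameter bundle.
struct AdamParams
{
    double lr;
    double beta1;
    double beta2;
    double epsilon;
};

// Apply one Adam step to a single weight and return the updated weight.
inline double adamStep(double weight, double gradient, AdamState& s, const AdamParams& p)
{
    // beta values of 1.0 would make the bias correction divide by zero.
    assert(p.beta1 >= 0.0 && p.beta1 < 1.0);
    assert(p.beta2 >= 0.0 && p.beta2 < 1.0);

    ++s.t;
    s.m = p.beta1 * s.m + (1.0 - p.beta1) * gradient;
    s.v = p.beta2 * s.v + (1.0 - p.beta2) * gradient * gradient;

    // Bias correction compensates for m and v starting at zero.
    const double mHat = s.m / (1.0 - std::pow(p.beta1, static_cast<double>(s.t)));
    const double vHat = s.v / (1.0 - std::pow(p.beta2, static_cast<double>(s.t)));

    return weight + p.lr * mHat / (std::sqrt(vHat) + p.epsilon);
}
```

On the very first step the bias correction cancels the (1 - beta) factors, so m_hat equals the gradient and v_hat equals its square. The first update therefore has magnitude close to lr regardless of the gradient's scale, which is one reason Adam is usually run with a smaller learning rate than plain SGD.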

Example: Adam with Fixed-Point Q16.16

typedef tinymind::QValue<16, 16, true, tinymind::RoundUpPolicy> ValueType;

typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    UniformRealRandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,
    tinymind::DefaultNetworkInitializer<ValueType>,
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>,
    tinymind::ZeroToleranceCalculator<ValueType>,
    tinymind::GradientClipByValue<ValueType>,
    tinymind::NullWeightDecayPolicy<ValueType>,
    tinymind::FixedLearningRatePolicy<ValueType>,
    tinymind::AdamOptimizer<ValueType>> TransferFunctionsType;

typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, TransferFunctionsType> NNType;
NNType nn;

// Adam typically needs a lower learning rate than SGD
nn.setLearningRate(ValueType(0, 655)); // ~ 0.01 in Q16.16

Example: Adam with Floating-Point

typedef double ValueType;

struct AdamTF : public FloatingPointTransferFunctions<
    ValueType, RandomNumberGenerator,
    tinymind::TanhActivationPolicy,
    tinymind::TanhActivationPolicy>
{
    typedef tinymind::AdamOptimizerFloat<ValueType> OptimizerPolicyType;
};

typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, AdamTF> NNType;

RMSprop Optimizer

RMSprop maintains only the second moment (running average of squared gradients) -- it's simpler and lighter than Adam. RMSprop is often preferred for recurrent networks (LSTM, GRU) where it provides more stable training.

Template Declaration

template<typename ValueType,
         int DecayInt = 0, unsigned DecayFrac = 230,
         int EpsilonInt = 0, unsigned EpsilonFrac = 1>
struct RmsPropOptimizer

Default hyperparameters (Q8.8):

decay = 230/256 ~ 0.898 (exponential decay rate for squared gradients)
epsilon = 1/256 ~ 0.004 (numerical stability)

For floating-point, use RmsPropOptimizerFloat<ValueType> (decay=0.9, epsilon=1e-8).

Update Rule

v = decay * v + (1 - decay) * gradient^2
weight += lr * gradient / (sqrt(v) + epsilon)
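
As with Adam, the rule can be sketched for one weight in double precision. The function name rmsPropStep and its signature are hypothetical, and the sign convention simply mirrors the rule above rather than tinymind's source.

```cpp
#include <cassert>
#include <cmath>

// Apply one RMSprop step to a single weight.
// v is the running average of squared gradients and is updated in place.
inline double rmsPropStep(double weight, double gradient, double& v,
                          double lr, double decay, double epsilon)
{
    assert(decay >= 0.0 && decay < 1.0);

    v = decay * v + (1.0 - decay) * gradient * gradient;
    return weight + lr * gradient / (std::sqrt(v) + epsilon);
}
```

Unlike the Adam sketch, there is no bias correction here, so early steps divide by a v that is still close to zero and can be larger than later ones.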

Example: RMSprop with Fixed-Point

typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> ValueType;

typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    UniformRealRandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,
    tinymind::DefaultNetworkInitializer<ValueType>,
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>,
    tinymind::ZeroToleranceCalculator<ValueType>,
    tinymind::GradientClipByValue<ValueType>,
    tinymind::NullWeightDecayPolicy<ValueType>,
    tinymind::FixedLearningRatePolicy<ValueType>,
    tinymind::RmsPropOptimizer<ValueType>> TransferFunctionsType;

typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, TransferFunctionsType> NNType;

Gradient Clipping

Gradient clipping prevents exploding gradients by clamping gradient values to a fixed range. This is especially critical for fixed-point arithmetic where large gradients can cause overflow and corrupt training.

Template Declaration

template<typename ValueType, int IntegerPart = 1, unsigned FractionalPart = 0>
struct GradientClipByValue

The default clips gradients to the range [-1.0, 1.0]. You can customize this by specifying different values for IntegerPart and FractionalPart.
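
Clip-by-value amounts to a symmetric clamp. The sketch below shows the idea on a double; clipByValue is a hypothetical free function, whereas the library applies the same idea through the policy class.

```cpp
#include <cassert>

// Clamp a gradient to the symmetric range [-limit, limit].
inline double clipByValue(double gradient, double limit)
{
    assert(limit >= 0.0);

    if (gradient > limit)
    {
        return limit;
    }
    if (gradient < -limit)
    {
        return -limit;
    }
    return gradient;
}
```

Values inside the range pass through unchanged, so clipping only alters the occasional outlier gradient rather than the typical update.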

Usage

// Clip gradients to [-1.0, 1.0] (default)
typedef tinymind::GradientClipByValue<ValueType> ClipPolicy;

// Clip gradients to [-2.0, 2.0]
typedef tinymind::GradientClipByValue<ValueType, 2, 0> WiderClipPolicy;

// No clipping (null policy)
typedef tinymind::NullGradientClippingPolicy<ValueType> NoClipPolicy;

Gradient clipping is specified as a template parameter of the transfer functions policy (see the full configuration example above).

L2 Weight Decay

L2 weight decay (ridge regularization) penalizes large weights by pulling them toward zero on every update. This prevents weights from growing unboundedly, which is especially important for fixed-point where large values cause overflow.

Template Declaration

template<typename ValueType, int IntegerPart = 0, unsigned FractionalPart = 1>
struct L2WeightDecay

The default lambda is 1/256 ~ 0.004 for Q8.8. The decay is applied as: w_new = w * (1 - lr * lambda).
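
The decay formula can be sketched in floating-point as follows; applyL2Decay is a hypothetical helper used only to illustrate the arithmetic.

```cpp
// Shrink a weight toward zero: w_new = w * (1 - lr * lambda).
inline double applyL2Decay(double weight, double lr, double lambda)
{
    return weight * (1.0 - lr * lambda);
}
```

Because the weight is scaled rather than shifted, the pull toward zero is proportional to the weight's magnitude and works the same way for negative weights. Note that with a small lr and lambda the product lr * lambda can fall below the resolution of a narrow Q-format (1/256 for Q8.8), in which case a fixed-point implementation may need to order the multiplications differently; how tinymind handles this is not shown here.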

Usage

// Default lambda (~ 0.004 for Q8.8)
typedef tinymind::L2WeightDecay<ValueType> DecayPolicy;

// Custom lambda for Q16.16 (256/65536 ~ 0.004)
typedef tinymind::L2WeightDecay<ValueType, 0, 256> CustomDecayPolicy;

// No weight decay (null policy)
typedef tinymind::NullWeightDecayPolicy<ValueType> NoDecayPolicy;

Learning Rate Scheduling

Step decay reduces the learning rate by a multiplicative factor at regular intervals. This allows the network to make large updates early in training and fine-tune with smaller updates later.

Template Declaration

template<typename ValueType, size_t StepInterval = 1000,
         int DecayIntegerPart = 0, unsigned DecayFractionalPart = 230>
struct StepDecaySchedule

Defaults (Q8.8):

StepInterval = 1000 steps between decays
Decay factor = 230/256 ~ 0.898

Every StepInterval training steps, the learning rate is multiplied by the decay factor.
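
A minimal sketch of the resulting schedule, assuming the decay is applied once for every completed interval (whether tinymind decays exactly at the boundary step is not stated here). The function steppedLearningRate is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Learning rate after `step` training steps, given an initial rate and a
// multiplicative decay applied once per completed interval.
inline double steppedLearningRate(double initialLr, double decayFactor,
                                  std::size_t stepInterval, std::size_t step)
{
    assert(stepInterval > 0);

    double lr = initialLr;
    const std::size_t completedIntervals = step / stepInterval;
    for (std::size_t k = 0; k < completedIntervals; ++k)
    {
        lr *= decayFactor;
    }
    return lr;
}
```

With an initial rate of 0.1, a decay factor of 0.5, and an interval of 1000, the rate stays at 0.1 through step 999, drops to 0.05 at step 1000, and reaches 0.025 after step 2000.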

Usage

// Decay by ~0.9 every 5000 steps
typedef tinymind::StepDecaySchedule<ValueType, 5000> LRSchedule;

// Decay by ~0.5 every 1000 steps
typedef tinymind::StepDecaySchedule<ValueType, 1000, 0, 128> AggressiveSchedule;

// Fixed learning rate (null policy)
typedef tinymind::FixedLearningRatePolicy<ValueType> FixedLR;

Early Stopping

Early stopping monitors the training error and halts training when no improvement has been seen for a configurable number of steps (patience). This saves compute cycles on embedded devices by avoiding unnecessary training iterations.

Template Declaration

template<typename ValueType, size_t Patience = 100>
struct EarlyStopping

Key Methods

bool shouldStop(const ValueType& error) - Report current error. Returns true when patience is exhausted.
void reset() - Reset the monitor to initial state.
ValueType getBestError() const - Return the best error observed so far.
size_t getPatienceCounter() const - Return the current patience counter.

Usage

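// values and output hold the inputs and targets for the current sample (declared elsewhere)
tinymind::EarlyStopping<ValueType, 200> stopper;
ValueType error;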

for (int i = 0; i < 10000; ++i)
{
    nn.feedForward(&values[0]);
    error = nn.calculateError(&output[0]);

    if (stopper.shouldStop(error))
    {
        break; // no improvement for 200 steps, stop
    }

    nn.trainNetwork(&output[0]);
}

The first call sets the baseline error. Subsequent calls compare against the best error seen so far. If the error improves (new error < best error), the patience counter resets. If not, the counter increments. When the counter reaches Patience, shouldStop() returns true.
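
The counter logic described above can be sketched as a small stand-alone class. EarlyStoppingSketch is a hypothetical name and this is not tinymind's source; it only follows the behavior described in the preceding paragraph and reuses the method names listed under Key Methods.

```cpp
#include <cstddef>

// Illustrative patience-based monitor; ValueType only needs operator<.
template<typename ValueType, std::size_t Patience>
class EarlyStoppingSketch
{
public:
    EarlyStoppingSketch() : mBestError(), mCounter(0), mHasBaseline(false) {}

    // Report the current error; returns true once patience is exhausted.
    bool shouldStop(const ValueType& error)
    {
        if (!mHasBaseline)
        {
            // First call only records the baseline.
            mBestError = error;
            mHasBaseline = true;
            return false;
        }

        if (error < mBestError)
        {
            mBestError = error; // improvement: remember it and start over
            mCounter = 0;
        }
        else
        {
            ++mCounter;         // no improvement: use up one unit of patience
        }

        return mCounter >= Patience;
    }

    void reset()
    {
        mBestError = ValueType();
        mCounter = 0;
        mHasBaseline = false;
    }

    ValueType getBestError() const { return mBestError; }
    std::size_t getPatienceCounter() const { return mCounter; }

private:
    ValueType mBestError;
    std::size_t mCounter;
    bool mHasBaseline;
};
```

With a patience of 2 and the error sequence 1.0, 0.9, 0.95, 0.96, the first two calls return false (baseline, then improvement), the third returns false with the counter at 1, and the fourth returns true. An error equal to the best so far counts as no improvement in this sketch, because the comparison is strict.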

Combining Policies

Here is a complete example combining gradient clipping, L2 weight decay, step decay learning rate, and Adam optimizer:

typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> ValueType;

typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    RandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,                                                    // NumberOfOutputNeurons
    tinymind::DefaultNetworkInitializer<ValueType>,       // initializer
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>,   // error calculator
    tinymind::ZeroToleranceCalculator<ValueType>,         // zero tolerance
    tinymind::GradientClipByValue<ValueType>,             // clip to [-1, 1]
    tinymind::L2WeightDecay<ValueType>,                   // L2 regularization
    tinymind::StepDecaySchedule<ValueType, 5000>,         // decay LR every 5000 steps
    tinymind::AdamOptimizer<ValueType>                    // Adam optimizer
> TransferFunctionsType;

typedef tinymind::NeuralNetwork<ValueType, 2, tinymind::HiddenLayers<5>, 1,
    TransferFunctionsType> RegularizedNetwork;

RegularizedNetwork nn;
tinymind::EarlyStopping<ValueType, 500> stopper;
ValueType error; // values and output (inputs and targets) are declared elsewhere

for (int i = 0; i < 50000; ++i)
{
    nn.feedForward(&values[0]);
    error = nn.calculateError(&output[0]);

    if (stopper.shouldStop(error))
    {
        break;
    }

    if (!TransferFunctionsType::isWithinZeroTolerance(error))
    {
        nn.trainNetwork(&output[0]);
    }
}

This gives you a network with:

Gradients clamped to [-1, 1] to prevent overflow
Weights pulled toward zero to prevent unbounded growth
Learning rate that decays over time for fine-tuning
Adaptive per-parameter learning rates via Adam
Automatic convergence detection via early stopping
