Advanced Training Techniques
Tinymind provides several training policies that can be composed via template parameters to customize how neural networks learn. All policies are optional -- existing code that doesn't use them compiles unchanged with null/no-op defaults. Policies are extracted from the TransferFunctionsPolicy via SFINAE traits.
Training neural networks with fixed-point arithmetic is fundamentally harder than with floating-point. The limited dynamic range of Q-format values means that gradients, weight updates, and accumulated errors can easily overflow, producing garbage values that destroy the network's learned state. On hardware without an FPU, you have no choice but to train in fixed-point -- and without the right guardrails, training will diverge.
The training policies on this page exist specifically to make fixed-point training robust:
- Gradient clipping prevents a single large gradient from overflowing the Q-format range -- this is the single most important policy for fixed-point training
- L2 weight decay keeps weights bounded, preventing the slow drift toward overflow that accumulates over thousands of training steps
- Learning rate scheduling starts with larger updates (faster convergence) and reduces them over time (fine-grained precision without overflow risk)
- Early stopping detects convergence and halts training, saving compute cycles on battery-powered devices where every milliamp-hour counts
- Adam and RMSprop provide adaptive per-parameter learning rates that naturally scale to the Q-format range, and both reuse existing connection storage so they add zero memory overhead
This page covers:
- Adam optimizer -- adaptive per-parameter learning rates
- RMSprop optimizer -- running average of squared gradients (preferred for RNNs)
- Gradient clipping -- prevents exploding gradients (critical for fixed-point)
- L2 weight decay -- ridge regularization to prevent overflow
- Learning rate scheduling -- step decay over training
- Early stopping -- convergence detection to save compute
Training policies are specified as template parameters of the FixedPointTransferFunctions (or floating-point equivalent) policy class. Here is the full set of configurable policies:
typedef tinymind::FixedPointTransferFunctions<
    ValueType,                    // Q-format or float type
    RandomNumberGeneratorPolicy,  // weight initialization RNG
    HiddenNeuronActivationPolicy, // e.g. TanhActivationPolicy
    OutputNeuronActivationPolicy, // e.g. SigmoidActivationPolicy
    NumberOfOutputNeurons,        // default: 1
    NetworkInitializationPolicy,  // default: DefaultNetworkInitializer
    ErrorCalculatorPolicy,        // default: MeanSquaredErrorCalculator
    ZeroTolerancePolicy,          // default: ZeroToleranceCalculator
    GradientClippingPolicy,       // default: NullGradientClippingPolicy
    WeightDecayPolicy,            // default: NullWeightDecayPolicy
    LearningRateSchedulePolicy,   // default: FixedLearningRatePolicy
    OptimizerPolicy               // default: NullOptimizerPolicy (SGD)
> TransferFunctionsType;
The last four parameters (gradient clipping, weight decay, learning rate schedule, and optimizer) are the new training policies. Each has a null/no-op default, so you only need to specify the ones you want.
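To illustrate how a no-op default can be picked when a policy class does not name an optimizer, here is a minimal sketch of a SFINAE trait. It assumes the trait looks for a nested typedef called OptimizerPolicyType (the name used in the floating-point example on this page). The trait OptimizerPolicyOf and the stand-in types NullOptimizer, AdamLike, LegacyTransferFunctions, and AdamTransferFunctions are hypothetical and are not Tinymind's actual implementation.

```cpp
#include <type_traits>

// Hypothetical stand-ins for real policies; these are not Tinymind's classes.
struct NullOptimizer { static constexpr bool isAdaptive = false; };
struct AdamLike      { static constexpr bool isAdaptive = true;  };

// C++11-compatible equivalent of std::void_t.
template<typename...> struct MakeVoid { typedef void type; };

// Primary template: no nested OptimizerPolicyType, so fall back to the no-op default.
template<typename TransferFunctions, typename = void>
struct OptimizerPolicyOf
{
    typedef NullOptimizer type;
};

// Specialization selected by SFINAE when TransferFunctions::OptimizerPolicyType exists.
template<typename TransferFunctions>
struct OptimizerPolicyOf<TransferFunctions,
    typename MakeVoid<typename TransferFunctions::OptimizerPolicyType>::type>
{
    typedef typename TransferFunctions::OptimizerPolicyType type;
};

// A policy class written before optimizer policies existed.
struct LegacyTransferFunctions {};

// A policy class that opts in to an optimizer through a nested typedef.
struct AdamTransferFunctions
{
    typedef AdamLike OptimizerPolicyType;
};
```

With this shape, OptimizerPolicyOf<LegacyTransferFunctions>::type resolves to NullOptimizer and OptimizerPolicyOf<AdamTransferFunctions>::type resolves to AdamLike, which is one way older code could keep compiling unchanged.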
Adam (Adaptive Moment Estimation) maintains per-parameter running averages of the first moment (mean) and second moment (variance) of the gradient. This provides adaptive learning rates that work well across a wide range of problems.
template<typename ValueType,
         int Beta1Int = 0, unsigned Beta1Frac = 230,
         int Beta2Int = 0, unsigned Beta2Frac = 255,
         int EpsilonInt = 0, unsigned EpsilonFrac = 1>
struct AdamOptimizer
Default hyperparameters (Q8.8):
- beta1 = 230/256 ~ 0.898 (exponential decay rate for first moment)
- beta2 = 255/256 ~ 0.996 (exponential decay rate for second moment)
- epsilon = 1/256 ~ 0.004 (numerical stability)
For floating-point, use AdamOptimizerFloat<ValueType> which uses standard double-precision defaults (beta1=0.9, beta2=0.999, epsilon=1e-8).
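The fractional template arguments above are raw numerators over 2^(fractional bits), so 230 in Q8.8 means 230/256. The following sketch shows one way to derive such numerators from a real-valued hyperparameter. The helper names (maxRawFraction, roundedRaw, toRawFraction, fromRawFraction) are hypothetical and not part of Tinymind; the helpers assume a value in [0, 1) and fewer than 32 fractional bits.

```cpp
#include <cstdint>

// Largest numerator that still represents a value below 1.0.
constexpr std::uint32_t maxRawFraction(unsigned fractionalBits)
{
    return (1u << fractionalBits) - 1u;
}

// Round value * 2^fractionalBits to the nearest integer (value assumed >= 0).
constexpr std::uint32_t roundedRaw(double value, unsigned fractionalBits)
{
    return static_cast<std::uint32_t>(
        value * static_cast<double>(1u << fractionalBits) + 0.5);
}

// Convert a real value in [0, 1) to a raw fractional numerator, clamping so
// that values very close to 1.0 do not spill into the integer part.
constexpr std::uint32_t toRawFraction(double value, unsigned fractionalBits)
{
    return roundedRaw(value, fractionalBits) > maxRawFraction(fractionalBits)
        ? maxRawFraction(fractionalBits)
        : roundedRaw(value, fractionalBits);
}

// Convert a raw numerator back to the real value it represents.
constexpr double fromRawFraction(std::uint32_t raw, unsigned fractionalBits)
{
    return static_cast<double>(raw) / static_cast<double>(1u << fractionalBits);
}
```

For example, toRawFraction(0.9, 8) gives 230, and toRawFraction(0.999, 8) clamps to 255 because 0.999 is closer to 1.0 than to any Q8.8 fraction below it. This is consistent with the beta2 default of 255/256 listed above.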
m = beta1 * m + (1 - beta1) * gradient
v = beta2 * v + (1 - beta2) * gradient^2
m_hat = m / (1 - beta1^t) [bias correction]
v_hat = v / (1 - beta2^t) [bias correction]
weight += lr * m_hat / (sqrt(v_hat) + epsilon)
Adam reuses the existing mDeltaWeight and mPreviousDeltaWeight storage in trainable connections, so it requires no additional memory beyond standard SGD.
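The update rule above can be sketched in plain double-precision C++ as follows. This is an illustration of the formulas only, not Tinymind's code: AdamState, AdamParams, and adamStep are hypothetical names, and the sketch keeps the += sign convention as written on this page, which assumes the gradient term already points in the direction of improvement.

```cpp
#include <cmath>

// Per-weight optimizer state (hypothetical layout, not Tinymind's storage).
struct AdamState
{
    double m = 0.0;  // running first-moment estimate
    double v = 0.0;  // running second-moment estimate
};

// Hyperparameters, using the floating-point defaults quoted on this page.
struct AdamParams
{
    double beta1 = 0.9;
    double beta2 = 0.999;
    double epsilon = 1e-8;
};

// Apply one Adam step and return the updated weight.
// t is the 1-based count of steps taken so far, used for bias correction.
inline double adamStep(double weight, double gradient, double learningRate,
                       unsigned t, AdamState& state,
                       const AdamParams& params = AdamParams())
{
    state.m = params.beta1 * state.m + (1.0 - params.beta1) * gradient;
    state.v = params.beta2 * state.v + (1.0 - params.beta2) * gradient * gradient;

    // Bias correction compensates for m and v starting at zero.
    const double mHat = state.m / (1.0 - std::pow(params.beta1, static_cast<double>(t)));
    const double vHat = state.v / (1.0 - std::pow(params.beta2, static_cast<double>(t)));

    return weight + learningRate * mHat / (std::sqrt(vHat) + params.epsilon);
}
```

On the first step the bias-corrected ratio m_hat / sqrt(v_hat) has magnitude close to 1, so the weight moves by roughly the learning rate regardless of the gradient's size. That bounded step size is one reason an Adam-style update can be easier to keep inside a narrow numeric range.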
typedef tinymind::QValue<16, 16, true, tinymind::RoundUpPolicy> ValueType;
typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    UniformRealRandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,
    tinymind::DefaultNetworkInitializer<ValueType>,
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>,
    tinymind::ZeroToleranceCalculator<ValueType>,
    tinymind::GradientClipByValue<ValueType>,
    tinymind::NullWeightDecayPolicy<ValueType>,
    tinymind::FixedLearningRatePolicy<ValueType>,
    tinymind::AdamOptimizer<ValueType>> TransferFunctionsType;
typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, TransferFunctionsType> NNType;
NNType nn;
// Adam typically needs a lower learning rate than SGD
nn.setLearningRate(ValueType(0, 655)); // ~ 0.01 in Q16.16
The floating-point version selects the optimizer through a nested typedef instead:
typedef double ValueType;
struct AdamTF : public FloatingPointTransferFunctions<
    ValueType, RandomNumberGenerator,
    tinymind::TanhActivationPolicy,
    tinymind::TanhActivationPolicy>
{
    typedef tinymind::AdamOptimizerFloat<ValueType> OptimizerPolicyType;
};
typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, AdamTF> NNType;
RMSprop maintains only the second moment (a running average of squared gradients), which makes it simpler and lighter than Adam. RMSprop is often preferred for recurrent networks (LSTM, GRU), where it provides more stable training.
template<typename ValueType,
         int DecayInt = 0, unsigned DecayFrac = 230,
         int EpsilonInt = 0, unsigned EpsilonFrac = 1>
struct RmsPropOptimizer
Default hyperparameters (Q8.8):
- decay = 230/256 ~ 0.898 (exponential decay rate for squared gradients)
- epsilon = 1/256 ~ 0.004 (numerical stability)
For floating-point, use RmsPropOptimizerFloat<ValueType> (decay=0.9, epsilon=1e-8).
v = decay * v + (1 - decay) * gradient^2
weight += lr * gradient / (sqrt(v) + epsilon)
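As a rough illustration of the two-line rule above, here is a double-precision sketch. The function name rmsPropStep and its parameter layout are hypothetical, and the += sign convention is copied from the rule as written on this page.

```cpp
#include <cmath>

// Apply one RMSprop step and return the updated weight.
// v is the running average of squared gradients for this weight and is
// updated in place; it should start at 0.
inline double rmsPropStep(double weight, double gradient, double learningRate,
                          double& v, double decay = 0.9, double epsilon = 1e-8)
{
    v = decay * v + (1.0 - decay) * gradient * gradient;
    return weight + learningRate * gradient / (std::sqrt(v) + epsilon);
}
```

Unlike the Adam sketch there is no bias correction, so the first few steps are larger than the learning rate while v is still warming up from zero. With a gradient of 0.5 and a learning rate of 0.01, the first step is about 0.0316.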
typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> ValueType;
typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    UniformRealRandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,
    tinymind::DefaultNetworkInitializer<ValueType>,
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>,
    tinymind::ZeroToleranceCalculator<ValueType>,
    tinymind::GradientClipByValue<ValueType>,
    tinymind::NullWeightDecayPolicy<ValueType>,
    tinymind::FixedLearningRatePolicy<ValueType>,
    tinymind::RmsPropOptimizer<ValueType>> TransferFunctionsType;
typedef tinymind::MultilayerPerceptron<ValueType, 2, 1, 5, 1, TransferFunctionsType> NNType;
Gradient clipping prevents exploding gradients by clamping gradient values to a fixed range. This is especially critical for fixed-point arithmetic, where large gradients can cause overflow and corrupt training.
template<typename ValueType, int IntegerPart = 1, unsigned FractionalPart = 0>
struct GradientClipByValue
The default clips gradients to the range [-1.0, 1.0]. You can customize this by specifying different values for IntegerPart and FractionalPart.
// Clip gradients to [-1.0, 1.0] (default)
typedef tinymind::GradientClipByValue<ValueType> ClipPolicy;
// Clip gradients to [-2.0, 2.0]
typedef tinymind::GradientClipByValue<ValueType, 2, 0> WiderClipPolicy;
// No clipping (null policy)
typedef tinymind::NullGradientClippingPolicy<ValueType> NoClipPolicy;
Gradient clipping is specified as a template parameter of the transfer functions policy (see the full configuration example above).
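To make the clamping concrete, here is a small sketch of clip-by-value on raw Q8.8 integers. It assumes the gradient is first computed in a wider 32-bit intermediate and then narrowed to 16 bits; Q8_8, kOne, and clipByValue are hypothetical names, and this is not how Tinymind's GradientClipByValue is necessarily implemented.

```cpp
#include <cstdint>

// Q8.8 value stored as a raw 16-bit integer: real value = raw / 256.
typedef std::int16_t Q8_8;

// Raw representation of 1.0 in Q8.8.
constexpr std::int32_t kOne = 1 << 8;

// Clamp a wide intermediate gradient to [-rawLimit, +rawLimit] before it is
// narrowed to 16 bits. rawLimit is assumed to be positive and to fit in Q8_8.
inline Q8_8 clipByValue(std::int32_t rawGradient, std::int32_t rawLimit = kOne)
{
    if (rawGradient > rawLimit)
    {
        return static_cast<Q8_8>(rawLimit);
    }
    if (rawGradient < -rawLimit)
    {
        return static_cast<Q8_8>(-rawLimit);
    }
    return static_cast<Q8_8>(rawGradient);
}
```

A raw gradient of 40000 (about 156.25) does not fit in 16 bits, so narrowing it without the clamp would typically wrap to an unrelated value. Clamping first yields 256, the raw form of 1.0.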
L2 weight decay (ridge regularization) penalizes large weights by pulling them toward zero on every update. This prevents weights from growing unboundedly, which is especially important for fixed-point where large values cause overflow.
template<typename ValueType, int IntegerPart = 0, unsigned FractionalPart = 1>
struct L2WeightDecay
The default lambda is 1/256 ~ 0.004 for Q8.8. The decay is applied as: w_new = w * (1 - lr * lambda).
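The decay formula can be sketched in double precision as shown here. The function names applyL2Decay and decayRepeatedly are hypothetical and only illustrate the arithmetic of w_new = w * (1 - lr * lambda).

```cpp
#include <cstddef>

// One application of the decay formula: w_new = w * (1 - lr * lambda).
inline double applyL2Decay(double weight, double learningRate, double lambda)
{
    return weight * (1.0 - learningRate * lambda);
}

// Apply the decay for a number of consecutive steps with no gradient update,
// to show how a weight shrinks toward zero over time.
inline double decayRepeatedly(double weight, double learningRate, double lambda,
                              std::size_t steps)
{
    for (std::size_t i = 0; i < steps; ++i)
    {
        weight = applyL2Decay(weight, learningRate, lambda);
    }
    return weight;
}
```

Because the factor (1 - lr * lambda) is slightly below 1, each step shrinks the magnitude of the weight while preserving its sign. With lr = 0.1 and lambda = 0.5, a weight of 2.0 becomes 1.9 after one step and 1.805 after two.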
// Default lambda (~ 0.004 for Q8.8)
typedef tinymind::L2WeightDecay<ValueType> DecayPolicy;
// Custom lambda for Q16.16: 256/65536 ~ 0.004
typedef tinymind::L2WeightDecay<ValueType, 0, 256> CustomDecayPolicy;
// No weight decay (null policy)
typedef tinymind::NullWeightDecayPolicy<ValueType> NoDecayPolicy;
Step decay reduces the learning rate by a multiplicative factor at regular intervals. This allows the network to make large updates early in training and fine-tune with smaller updates later.
template<typename ValueType, size_t StepInterval = 1000,
         int DecayIntegerPart = 0, unsigned DecayFractionalPart = 230>
struct StepDecaySchedule
Defaults (Q8.8):
- StepInterval = 1000 steps between decays
- Decay factor = 230/256 ~ 0.898
Every StepInterval training steps, the learning rate is multiplied by the decay factor.
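The schedule described above can be sketched as a small counter object. StepDecay and its members are hypothetical names used for illustration, the sketch works in double precision rather than Q-format, and the interval is assumed to be greater than zero.

```cpp
#include <cstddef>

// Minimal step-decay schedule: every stepInterval calls, the learning rate
// is multiplied by decayFactor. stepInterval must be greater than zero.
struct StepDecay
{
    double learningRate;
    double decayFactor;
    std::size_t stepInterval;
    std::size_t stepCount;

    StepDecay(double initialRate, double factor, std::size_t interval)
        : learningRate(initialRate), decayFactor(factor),
          stepInterval(interval), stepCount(0)
    {
    }

    // Call once per training step. Returns the rate to use for this step,
    // then decays the stored rate if an interval boundary was reached.
    double next()
    {
        const double rate = learningRate;
        ++stepCount;
        if (stepCount % stepInterval == 0)
        {
            learningRate *= decayFactor;
        }
        return rate;
    }
};
```

With StepDecay schedule(0.1, 0.5, 3), the first three calls to next() return 0.1 and the fourth returns 0.05. In fixed-point, repeated multiplication eventually rounds the rate down to the smallest representable step or to zero, so the interval and factor are worth choosing with the Q-format resolution in mind.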
// Decay by ~0.9 every 5000 steps
typedef tinymind::StepDecaySchedule<ValueType, 5000> LRSchedule;
// Decay by ~0.5 every 1000 steps
typedef tinymind::StepDecaySchedule<ValueType, 1000, 0, 128> AggressiveSchedule;
// Fixed learning rate (null policy)
typedef tinymind::FixedLearningRatePolicy<ValueType> FixedLR;
Early stopping monitors the training error and halts training when no improvement has been seen for a configurable number of steps (patience). This saves compute cycles on embedded devices by avoiding unnecessary training iterations.
template<typename ValueType, size_t Patience = 100>
struct EarlyStopping
- bool shouldStop(const ValueType& error) - Report the current error. Returns true when patience is exhausted.
- void reset() - Reset the monitor to its initial state.
- ValueType getBestError() const - Return the best error observed so far.
- size_t getPatienceCounter() const - Return the current patience counter.
tinymind::EarlyStopping<ValueType, 200> stopper;
// values, output, and error are assumed to be declared by the surrounding training code
for (int i = 0; i < 10000; ++i)
{
    nn.feedForward(&values[0]);
    error = nn.calculateError(&output[0]);
    if (stopper.shouldStop(error))
    {
        break; // no improvement for 200 steps, stop
    }
    nn.trainNetwork(&output[0]);
}
The first call sets the baseline error. Subsequent calls compare against the best error seen so far. If the error improves (new error < best error), the patience counter resets. If not, the counter increments. When the counter reaches Patience, shouldStop() returns true.
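The counter logic just described can be sketched as a standalone class. EarlyStoppingSketch is a hypothetical re-creation written from that description and the method list on this page; it is not Tinymind's source and the member names are assumptions.

```cpp
#include <cstddef>

// Patience-based convergence monitor, reconstructed from the description above.
// ValueType only needs a default constructor, copy assignment, and operator<.
template<typename ValueType, std::size_t Patience = 100>
class EarlyStoppingSketch
{
public:
    EarlyStoppingSketch() : mBestError(), mCounter(0), mHasBaseline(false) {}

    // Report the current error; returns true once Patience consecutive
    // reports have failed to improve on the best error seen so far.
    bool shouldStop(const ValueType& error)
    {
        if (!mHasBaseline || error < mBestError)
        {
            mBestError = error;   // first call or an improvement
            mCounter = 0;
            mHasBaseline = true;
            return false;
        }
        ++mCounter;
        return mCounter >= Patience;
    }

    // Forget the baseline and start over.
    void reset()
    {
        mBestError = ValueType();
        mCounter = 0;
        mHasBaseline = false;
    }

    ValueType getBestError() const { return mBestError; }
    std::size_t getPatienceCounter() const { return mCounter; }

private:
    ValueType mBestError;
    std::size_t mCounter;
    bool mHasBaseline;
};
```

Note that an equal error does not count as an improvement in this sketch, which follows the strict "new error < best error" comparison given above. With coarse Q-format values, long plateaus of identical errors are common, so this choice directly affects how soon training halts.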
Here is a complete example combining gradient clipping, L2 weight decay, step decay learning rate, and Adam optimizer:
typedef tinymind::QValue<8, 8, true, tinymind::RoundUpPolicy> ValueType;
typedef tinymind::FixedPointTransferFunctions<
    ValueType,
    RandomNumberGenerator<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    tinymind::TanhActivationPolicy<ValueType>,
    1,                                                  // NumberOfOutputNeurons
    tinymind::DefaultNetworkInitializer<ValueType>,     // initializer
    tinymind::MeanSquaredErrorCalculator<ValueType, 1>, // error calculator
    tinymind::ZeroToleranceCalculator<ValueType>,       // zero tolerance
    tinymind::GradientClipByValue<ValueType>,           // clip to [-1, 1]
    tinymind::L2WeightDecay<ValueType>,                 // L2 regularization
    tinymind::StepDecaySchedule<ValueType, 5000>,       // decay LR every 5000 steps
    tinymind::AdamOptimizer<ValueType>                  // Adam optimizer
> TransferFunctionsType;
typedef tinymind::NeuralNetwork<ValueType, 2, tinymind::HiddenLayers<5>, 1,
TransferFunctionsType> RegularizedNetwork;
RegularizedNetwork nn;
tinymind::EarlyStopping<ValueType, 500> stopper;
for (int i = 0; i < 50000; ++i)
{
    nn.feedForward(&values[0]);
    error = nn.calculateError(&output[0]);
    if (stopper.shouldStop(error))
    {
        break;
    }
    if (!TransferFunctionsType::isWithinZeroTolerance(error))
    {
        nn.trainNetwork(&output[0]);
    }
}
This gives you a network with:
- Gradients clamped to [-1, 1] to prevent overflow
- Weights pulled toward zero to prevent unbounded growth
- Learning rate that decays over time for fine-tuning
- Adaptive per-parameter learning rates via Adam
- Automatic convergence detection via early stopping