I would like to propose an updated set of baseline hyperparameters optimized for training our roughly 6 million parameter model. These settings have been adjusted to balance memory efficiency, training stability, and convergence speed.
# Model Architecture
block_size = 256 # Increased from 32 so the model sees a much longer context window
n_embd = 192 # Scaled up for higher representational capacity
n_head = 6 # 192 embedding dim / 6 heads = 32 dims per head (divides evenly)
n_layer = 6 # Deepened to 6 layers for better sequence learning
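As a rough sanity check on model size, here is a minimal sketch that estimates the parameter count from the values above. It assumes a standard GPT-style decoder with learned position embeddings and a 4x MLP expansion, and it ignores biases and LayerNorm weights. The helper `estimate_params` and the `vocab_size` values are hypothetical, since the tokenizer is not specified here.

```python
def estimate_params(n_embd, n_layer, block_size, vocab_size, tied_head=False):
    """Approximate parameter count for a GPT-style decoder.

    Assumption: biases and LayerNorm weights are omitted, so this slightly
    undercounts the real total.
    """
    # Per block: 4*C^2 for attention (Q, K, V, output projection)
    # plus 8*C^2 for an MLP with a 4x hidden expansion.
    blocks = n_layer * 12 * n_embd ** 2
    tok_emb = vocab_size * n_embd
    pos_emb = block_size * n_embd
    # An untied output head adds another vocab_size * n_embd weights.
    lm_head = 0 if tied_head else vocab_size * n_embd
    return blocks + tok_emb + pos_emb + lm_head


n_embd, n_head, n_layer, block_size = 192, 6, 6, 256

# The embedding width must split evenly across the attention heads.
assert n_embd % n_head == 0
head_dim = n_embd // n_head  # 32

# Hypothetical vocabulary sizes: 65 (character-level), 50257 (GPT-2 BPE).
for vocab_size in (65, 50257):
    total = estimate_params(n_embd, n_layer, block_size, vocab_size)
    print(f"vocab_size={vocab_size}: ~{total / 1e6:.2f}M parameters")
```

Under these assumptions the transformer blocks alone contribute about 2.65M parameters (6 x 12 x 192^2), so how close the total lands to the 6 million figure depends mostly on the vocabulary size and on whether the output head shares weights with the token embedding.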
# Training Dynamics & Stability
batch_size = 64 # Increased from 16 to make better use of GPU parallelism
max_iters = 5000 # With the larger batch, 5k steps process ~82M tokens (64 x 256 per step), which should be enough to converge
eval_interval = 250 # Evaluate every 250 steps (20 evaluations over 5000 steps)
learning_rate = 6e-4 # Slightly lower than before; a commonly used AdamW rate for small models
eval_iters = 200 # More batches averaged per evaluation, for less noisy loss estimates
dropout = 0.1 # Common default; light regularization against overfitting
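To make the training and evaluation cost of these settings concrete, here is a small sketch of the arithmetic. The helper `training_budget` is hypothetical, and `n_splits=2` assumes the evaluation loop averages the loss over both a train and a validation split, which may not match our actual loop.

```python
def training_budget(batch_size, block_size, max_iters,
                    eval_interval, eval_iters, n_splits=2):
    """Rough token and evaluation-batch counts for one training run."""
    tokens_per_iter = batch_size * block_size
    total_tokens = tokens_per_iter * max_iters
    # Number of evaluation passes, ignoring any extra pass at step 0 or at the end.
    n_evals = max_iters // eval_interval
    # Each pass runs eval_iters forward-only batches per split (assumed: train + val).
    eval_batches = n_evals * eval_iters * n_splits
    return {
        "tokens_per_iter": tokens_per_iter,
        "total_tokens": total_tokens,
        "n_evals": n_evals,
        "eval_batches": eval_batches,
    }


budget = training_budget(
    batch_size=64, block_size=256, max_iters=5000,
    eval_interval=250, eval_iters=200,
)
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

With these values the run would execute 8,000 forward-only evaluation batches alongside 5,000 training steps. If evaluation time turns out to be noticeable, lowering `eval_iters` or raising `eval_interval` are the obvious knobs to adjust.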