Optimized Hyperparameters for ~6M Parameter Model #86

@Eamon2009

I would like to propose an updated set of baseline hyperparameters for training our roughly 6-million-parameter model. These settings are intended to balance memory efficiency, training stability, and convergence speed.

# Model Architecture
block_size = 256     # Increased from 32 to give the model a longer context window
n_embd = 192         # Scaled up for higher-capacity representations
n_head = 6           # 192 embedding dim / 6 heads = 32 dims per head
n_layer = 6          # Deepened to 6 layers for better sequence learning
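As a sanity check on the "~6M" figure, here is a rough parameter count for these architecture settings. This is only a sketch: it assumes a GPT-style decoder with bias-free q/k/v projections, a 4x-wide MLP, two LayerNorms per block, learned position embeddings, and an untied output layer, which may not match the repo's model exactly. `estimate_gpt_params` and both `vocab_size` values are hypothetical, since the tokenizer is not stated here. Note that `n_head` does not change the count.

```python
def estimate_gpt_params(vocab_size, block_size=256, n_embd=192, n_layer=6):
    """Rough parameter count for a decoder-only transformer (assumed layout)."""
    d = n_embd
    attn = 3 * d * d + (d * d + d)                # bias-free q/k/v + output projection
    mlp = (d * 4 * d + 4 * d) + (4 * d * d + d)   # expand to 4*d, then project back
    norms = 2 * (2 * d)                           # two LayerNorms (scale + shift each)
    blocks = n_layer * (attn + mlp + norms)
    embeddings = vocab_size * d + block_size * d  # token + learned position tables
    head = 2 * d + (d * vocab_size + vocab_size)  # final LayerNorm + untied output layer
    return {
        "blocks": blocks,
        "embeddings": embeddings,
        "head": head,
        "total": blocks + embeddings + head,
    }


# Hypothetical vocabulary sizes: a character-level set and a small subword set.
for vocab_size in (65, 8192):
    counts = estimate_gpt_params(vocab_size)
    print(f"vocab_size={vocab_size}: {counts['total']:,} parameters")
```

Under these assumptions the six blocks contribute about 2.67M parameters. A 65-symbol vocabulary gives about 2.74M in total, and a vocabulary of 8192 gives about 5.87M, so whether the model really sits near 6M depends mostly on the vocabulary size.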

# Training Dynamics & Stability
batch_size = 64      # Increased from 16 to make better use of GPU parallelism
max_iters = 5000     # With the larger batch, 5k steps cover ~82M tokens (64 x 256 x 5000)
eval_interval = 250  # Evaluate every 250 steps so evaluation does not dominate run time
learning_rate = 6e-4 # Slightly lower; a common AdamW setting for small GPT-style models
eval_iters = 200     # More batches per evaluation for less noisy validation metrics
dropout = 0.1        # Common default regularization to limit overfitting
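To make the training and evaluation cost of these settings concrete, here is a small arithmetic sketch. It assumes the training loop evaluates when `iter % eval_interval == 0` and on the last iteration, and that each evaluation averages `eval_iters` batches over two splits (train and val). Both are assumptions about the script, and `training_budget` is a hypothetical helper.

```python
def training_budget(batch_size=64, block_size=256, max_iters=5000,
                    eval_interval=250, eval_iters=200, n_splits=2):
    """Count tokens processed and evaluation batches for one training run."""
    tokens_per_step = batch_size * block_size
    total_tokens = tokens_per_step * max_iters
    # Assumed schedule: evaluate on multiples of eval_interval and on the last step.
    n_evals = sum(
        1 for it in range(max_iters)
        if it % eval_interval == 0 or it == max_iters - 1
    )
    eval_batches = n_evals * eval_iters * n_splits  # forward-only batches
    return {
        "tokens_per_step": tokens_per_step,
        "total_tokens": total_tokens,
        "n_evals": n_evals,
        "eval_batches": eval_batches,
    }


budget = training_budget()
for name, value in budget.items():
    print(f"{name}: {value:,}")
```

With the proposed values this gives 16,384 tokens per step and 81,920,000 tokens over the run. Under the assumed schedule there are 21 evaluations totalling 8,400 forward-only batches, which is worth weighing against the 5,000 training steps when choosing `eval_interval` and `eval_iters`.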

Labels

enhancement: New feature or request
