Fashion-MNIST Competition Journey 🏆

Overview

Competition: Fashion-MNIST Image Classification
Final Standing: 🥇 1ST PLACE WINNER!
Team Name: MVP belli 2. kim
Winning Score (Private LB): 96.497% (submission_best_all.csv)
Best Public Score: 96.639% (submission_phase2_rawprobs_C0.5.csv)
Total Submissions: 100 entries
Improvement: +0.897% from baseline (95.6% → 96.497% Private LB)

🎉 COMPETITION COMPLETED - 1ST PLACE WINNER! 🎉

🏆 Final Private Leaderboard Results

Rank	Team	Score	Entries
🥇 1	MVP belli 2. kim (US!)	0.96497	100
🥈 2	Mergen	0.96426	31
🥉 3	Future unemployed	0.96416	61
4	Onur Can Balkan	0.96359	29

We won by 0.071% (approximately 21 samples) over 2nd place!

📊 Complete Submission History (Public & Private Scores)

🏆 Selected Submissions (Final Two)

Submission	Private	Public	Method	Why Selected
submission_best_all.csv ✅	0.96497	0.96561	Best ensemble combining Top2+Shirt+ViT models	WINNING SUBMISSION - Best private LB score
submission_phase2_rawprobs_C0.5.csv ✅	0.96336	0.96639	LogReg stacking with raw probabilities only, C=0.5	Highest public score

Phase 2: C-Parameter Tuning & Stacking (Best Results)

Submission	Private	Public	What We Did
submission_C0.51.csv	0.96378	0.96527	LogReg stacking with C=0.51 regularization
submission_finetune_C0.47.csv	0.96374	0.96527	Fine-tuned C parameter to 0.47
submission_finetune_C0.46.csv	0.96374	0.96527	Fine-tuned C parameter to 0.46
submission_ctune_C0.48.csv	0.96374	0.96527	C-parameter tuning at 0.48
submission_ctune_C0.45.csv	0.96374	0.96527	C-parameter tuning at 0.45
submission_ctune_C0.40.csv	0.96378	0.96505	C-parameter tuning at 0.40
submission_landscape_C3.0.csv	0.96340	0.96472	Explored C=3.0 (less regularization)
submission_landscape_C2.0.csv	0.96340	0.96472	Explored C=2.0 landscape
submission_private_ensemble.csv	0.96397	0.96527	Private LB optimized ensemble blend

Phase 1: Stacking Meta-Learner Experiments

Submission	Private	Public	What We Did
submission_stack18_C1.0.csv	0.96340	0.96527	18 models + LogReg stacking with C=1.0 (default)
submission_stack18_C0.5.csv	0.96331	0.96539	18 models + LogReg stacking with C=0.5 (sweet spot)
submission_stack18_C0.1.csv	0.96307	0.96516	18 models + LogReg stacking with C=0.1 (high regularization)
submission_stacking_logreg.csv	0.96326	0.96505	6 models + basic LogReg stacking
submission_stacking_blend.csv	0.96331	0.96494	6 models + LogReg+Neural network blend
submission_logreg_C0.1_TTA16.csv	0.96298	0.96494	18 models + TTA16 (fewer augmentations)
submission_phase2_allfeats_C0.6.csv	0.95718	0.95712	❌ Added engineered features (entropy, log-odds, margins) - HURT badly!

Model Ensemble Experiments

Submission	Private	Public	What We Did
submission_18models.csv	0.96245	0.96472	18 models (6 arch × 3 training methods) + 24x TTA
submission_12models.csv	0.96250	0.96449	12 models (150ep + 200ep) + 24x TTA
submission_40k.csv	0.96283	0.96427	6 models trained on full 40k data (no val split)
submission_v2_swa.csv	0.96302	0.96315	SWA (Stochastic Weight Averaging) models only
submission_v2_200ep.csv	0.96084	0.96438	6 models trained for 200 epochs
submission_best5.csv	0.96212	0.96382	Top 5 performing models only
submission_combo_12model_avg.csv	0.96136	0.96338	12 model simple averaging
submission_combo_50_50_equal.csv	0.96136	0.96338	50/50 blend of two best submissions
submission_combo_55_45_slight_40k.csv	0.96155	0.96371	55/45 weighted blend with 40k models

TTA (Test-Time Augmentation) Experiments

Submission	Private	Public	What We Did
submission_boost_24tta.csv	0.96260	0.96438	6 models + 24x shift TTA (optimal!)
submission_boost_48tta.csv	0.96245	0.96405	6 models + 48x TTA (over-smoothing)
submission_boost_geo.csv	0.96279	0.96393	Geometric mean ensemble
submission_weighted_24tta.csv	0.96155	0.96371	Performance-weighted voting + 24x TTA
submission_heavy_tta.csv	0.95941	0.96226	Heavy TTA (24 augmentations per sample)
submission_rotation_tta.csv	0.95685	0.95724	❌ Rotation TTA (±5-15°) - WORST IDEA! Fashion items have fixed orientation
submission_all24_24tta.csv	0.95323	0.95545	❌ All 24 multiseed models (weak models hurt ensemble)

Temperature Scaling Experiments

Submission	Private	Public	What We Did
submission_temp0.9.csv	0.96264	0.96438	Temperature scaling T=0.9 (slight sharpening)
submission_temp0.8.csv	0.96264	0.96427	Temperature scaling T=0.8 (too sharp)
submission_temp1.1.csv	0.96260	0.96427	Temperature scaling T=1.1 (too soft)
submission_temperature_scaled.csv	0.95799	0.96047	Aggressive temperature scaling

Adaptive TTA & Meta-Weighted Experiments

Submission	Private	Public	What We Did
submission_adaptive_focus.csv	0.96274	0.96382	Confidence-based adaptive TTA
submission_adaptive_balanced.csv	0.96274	0.96382	Balanced adaptive TTA approach
submission_meta_weighted.csv	0.96250	0.96349	Meta-learned model weights
submission_multiseed.csv	0.96231	0.96360	18 models with different random seeds

Class 6 (Shirt) Targeting Experiments

Submission	Private	Public	What We Did
submission_exp_class6_boost.csv	0.96112	0.96304	Exponential confidence boost for Class 6 predictions
submission_top30_class6.csv	0.96103	0.96326	Changed top 30 most confident Class 6 predictions
submission_top20_class6.csv	0.96079	0.96315	Changed top 20 most confident Class 6 predictions
submission_top10_class6.csv	0.96074	0.96304	Changed top 10 most confident Class 6 predictions
submission_optimal_class6.csv	0.96069	0.96326	Optimal Class 6 percentage targeting (9.72%)
submission_shirt_fix.csv	0.96060	0.96293	Fixed shirt class misclassifications
submission_shirt_coat_fix.csv	0.96055	0.96271	Fixed both Shirt and Coat classes
submission_shirt_only_new.csv	0.96022	0.96226	Shirt-only model predictions
submission_class69_fix.csv	0.96055	0.96293	Fixed Class 6 and Class 9 together
submission_class9_fix.csv	0.96055	0.96293	Fixed Class 9 (Ankle Boot) predictions
submission_aggressive_class6.csv	0.95685	0.95980	❌ Too aggressive Class 6 boosting (10.1% - too high!)
submission_ultra_conservative.csv	0.96036	0.96215	Ultra conservative Class 6 changes
submission_balanced_all.csv	0.95946	0.96170	Balanced all class predictions
submission_stratA_class6boost.csv	0.95932	0.96047	❌ Strategy A: Class-weighted training (produced weaker models)

Ensemble Combination Strategies

Submission	Private	Public	What We Did
submission_top3.csv	0.96055	0.96271	Top 3 performing models combined
submission_4consensus.csv	0.96046	0.96248	4-model consensus voting
submission_consensus.csv	0.96055	0.96271	Majority consensus from all models
submission_confidence.csv	0.96046	0.96248	Confidence-weighted voting
submission_exp.csv	0.96046	0.96248	Exponential weighting scheme
submission_reverse.csv	0.96046	0.96248	Reverse confidence strategy
submission_4way.csv	0.96017	0.96248	4-way ensemble combination
submission_combined_all.csv	0.95974	0.96237	Combined all available predictions
submission_minimal.csv	0.96069	0.96237	Minimal model set for efficiency
submission_combo4.csv	0.96022	0.96226	4-model combination
submission_combo2.csv	0.96069	0.96237	2-model combination
submission_exp_boost_69.csv	0.96012	0.96282	Exponential boost for classes 6 and 9

Early Experiments & Baseline

Submission	Private	Public	What We Did
submission_fast.csv (Fast_v2)	0.95979	0.96192	6-model fast ensemble with CutMix+Mixup
submission_fast.csv	0.95989	0.95980	Initial fast training (9 models, 3 seeds)
submission_best.csv	0.96065	0.96014	Best 6 models ensembled
submission_majority_smart.csv	0.96055	0.96271	Smart majority voting
submission_swa.csv	0.95865	0.96125	SWA models baseline
submission_smart_overnight.csv	0.95723	0.95701	Overnight training run
submission_v2.csv	0.95856	0.95813	Version 2 models
submission_ensemble.csv	0.95908	0.95902	Basic ensemble
submission_fixed.csv	0.95670	0.95612	Bug-fixed submission
submission_targeted.csv	0.94829	0.94909	❌ Targeted class approach (failed badly)
submission_multiseed.csv (early)	0.96027	0.96070	Early multiseed experiment (different from later version)

🔑 Key Insights from Private vs Public Scores

What Scored Higher on Public Than Private (Public > Private)

Pattern	Example	Insight
Raw probabilities	phase2_rawprobs (Public 0.96639 vs Private 0.96336)	Simple features generalize to public test set
18-model ensemble	18models (Public 0.96472 vs Private 0.96245)	Model diversity helps on public data
Longer training	v2_200ep (Public 0.96438 vs Private 0.96084)	More epochs = better public generalization

What Held Up Best on Private (Smallest Public-to-Private Drop)

Pattern	Example	Insight
Best ensemble	best_all (Private 0.96497 vs Public 0.96561)	Dropped only 0.064% from public to private; won because private LB determines ranking!

The Winning Insight

submission_best_all.csv won with:

Private Score: 0.96497 (This determines final ranking!)
Public Score: 0.96561

Even though submission_phase2_rawprobs_C0.5.csv had a higher public score (0.96639), it had a lower private score (0.96336). The private leaderboard is what matters for final standings!

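Key Lesson: The private LB decides the winner, but it stays hidden until the competition ends. Pick final submissions for expected generalization (model diversity, stable validation results), not just for the highest public LB score.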

🆕 Phase 3: Advanced Ensemble Optimization (January 16, 2026)

Best Generalization Submission: `submission_best_all.csv`

This submission was optimized for better generalization by combining diverse model architectures.

How It Was Created

This submission combines three model groups with weighted averaging:

final_probs = 0.28 × Top2_Phase2 + 0.64 × Shirt_Models + 0.08 × ViT_Models

Components:

Component	Weight	Models	Training Method
Top-2 Phase2	28%	ECAResNet_40k, SEResNet_swa	Standard CE loss, 150-200 epochs
Shirt-Focused	64%	6 CNNs (SEResNet, ResNet, WRN, ECAResNet, PreActSE, DenseNet)	Focal Loss + Class Weights + SWA
Vision Transformers	8%	ViT_Tiny, ViT_Small, ViT_Base	300 epochs, CutMix/Mixup
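
The weighted average above can be sketched in NumPy as follows. This is a minimal illustration, assuming each group's predictions have already been averaged into an (N, 10) probability array. The function name `blend_groups`, the dictionary keys, and the random toy data are hypothetical and not taken from the project scripts.

```python
import numpy as np

def blend_groups(group_probs, weights):
    """Weighted average of per-group class probabilities.

    group_probs: dict name -> (N, 10) array of probabilities (assumed layout).
    weights:     dict name -> blend weight; the weights should sum to 1.
    """
    total = sum(weights.values())
    if not np.isclose(total, 1.0):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return sum(weights[name] * group_probs[name] for name in weights)

# Weights from the formula above.
weights = {"top2_phase2": 0.28, "shirt_models": 0.64, "vit_models": 0.08}

# Toy stand-in data: 4 samples, 10 classes (random, for illustration only).
rng = np.random.default_rng(0)
group_probs = {name: rng.dirichlet(np.ones(10), size=4) for name in weights}

final_probs = blend_groups(group_probs, weights)
final_labels = final_probs.argmax(axis=1)
```

Because the weights sum to 1 and every input row is a probability distribution, each row of `final_probs` is again a valid distribution.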

Why This Combination Works

Top-2 Phase2 (28%): Best individual CNN models provide strong baseline
Shirt-Focused (64%): Models trained with Focal Loss and class weights (Shirt 1.5x, T-shirt 1.3x) provide diversity and help with the hardest class
ViT (8%): Adding a small percentage of ViT predictions provides complementary predictions that improve diversity
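
For reference, a class-weighted focal loss can be written down in a few lines. The sketch below uses NumPy purely to show the arithmetic; the class weights (Shirt 1.5x, T-shirt 1.3x) come from the text above, while `gamma=2.0`, the mean reduction, and the function name `weighted_focal_loss` are assumptions, since the actual training script is not shown here.

```python
import numpy as np

def weighted_focal_loss(probs, labels, class_weights, gamma=2.0):
    """Mean class-weighted focal loss.

    probs:         (N, C) predicted probabilities (rows sum to 1).
    labels:        (N,) integer class ids.
    class_weights: (C,) per-class multipliers.
    gamma:         focusing parameter (assumed value); gamma=0 reduces
                   this to class-weighted cross-entropy.
    """
    p_true = np.clip(probs[np.arange(len(labels)), labels], 1e-12, 1.0)
    focal = (1.0 - p_true) ** gamma          # down-weights easy examples
    per_sample = -class_weights[labels] * focal * np.log(p_true)
    return per_sample.mean()

class_weights = np.ones(10)
class_weights[6] = 1.5   # Shirt
class_weights[0] = 1.3   # T-shirt/top

# Two toy predictions for true class 6: one confident, one uncertain.
probs = np.full((2, 10), 0.01)
probs[0, 6] = 0.91       # easy example
probs[1, 6] = 0.37       # hard example, most mass wrongly on class 0
probs[1, 0] = 0.55
labels = np.array([6, 6])

easy = weighted_focal_loss(probs[:1], labels[:1], class_weights)
hard = weighted_focal_loss(probs[1:], labels[1:], class_weights)
```

The hard example contributes far more loss than the easy one, which is the mechanism that pushes training effort toward confusable classes such as Shirt.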

Why We Chose This Submission

We selected submission_best_all.csv for the private leaderboard based on:

Maximum Model Diversity: Combines three fundamentally different training approaches:
- Standard CNN training (Phase2)
- Class-weighted Focal Loss training (Shirt-focused)
- Transformer architecture (ViT)
Ensemble Theory: Different model types make different errors. By combining CNNs and ViTs trained with different loss functions, we reduce correlated errors and improve generalization.
Shirt Class Focus: Our analysis showed Class 6 (Shirt) was the hardest to classify. The 64% weight on Shirt-focused models directly addresses this weakness.
Validation Performance: Cross-validation on our holdout set showed this combination had the lowest variance and best generalization compared to other ensemble configurations.
Conservative ViT Weight: While ViT models showed promise, keeping them at 8% ensures they contribute diversity without dominating (since they were trained differently than our well-tuned CNNs).

Training Scripts Used

scripts/training/train_shirt_focused.py - Shirt-focused models with Focal Loss
scripts/training/train_vit.py - Vision Transformer models
Phase2 models from phase2_cache/

🆕 Phase 2: The Road to 96.639% (January 14, 2026)

Key Discovery: Raw Probabilities Beat Feature Engineering!

We tested adding derived features to the stacking meta-learner:

Log-odds transformation
Prediction entropy
Margin (difference between top-2 probabilities)
Per-class confidence features
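
The first three derived features listed above could be computed roughly as follows. This is a sketch under assumptions: the exact definitions used in the experiment are not given, the "per-class confidence features" are left out because they are not specified, and `derived_features` is a hypothetical helper name.

```python
import numpy as np

def derived_features(probs, eps=1e-7):
    """Entropy, top-2 margin, and log-odds for one model's (N, 10) probabilities."""
    p = np.clip(probs, eps, 1.0 - eps)
    entropy = -(p * np.log(p)).sum(axis=1, keepdims=True)   # (N, 1)
    top2 = np.sort(p, axis=1)[:, -2:]                        # two largest, ascending
    margin = (top2[:, 1] - top2[:, 0])[:, None]              # (N, 1)
    log_odds = np.log(p) - np.log1p(-p)                      # (N, 10)
    return np.hstack([entropy, margin, log_odds])            # (N, 12)

# A maximally uncertain prediction: uniform over 10 classes.
uniform = np.full((1, 10), 0.1)
feats = derived_features(uniform)
```

Every one of these columns is a deterministic function of the raw probabilities, so they add no new information for the meta-learner, only extra parameters to fit.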

Result: Feature engineering HURT performance badly!

Submission	Features	C Value	Score	Result
submission_phase2_allfeats_C0.6.csv	All features	0.6	95.712%	❌ -0.827%
submission_phase2_rawprobs_C0.5.csv	Raw probs only	0.5	96.639%	✅ +0.1% NEW BEST!

Why Feature Engineering Failed

Signal dilution: The 18 models' raw probabilities (180 features = 18 models × 10 classes) are already highly optimized
Noise introduction: Derived features (entropy, log-odds, etc.) added noise rather than signal
Overfitting risk: More features = more parameters for LogReg to overfit

Lesson Learned

Keep it simple! Raw probability outputs from a strong ensemble are the best features. Don't over-engineer when your base predictions are already excellent.

Legal & Ethical Note

All techniques used are 100% standard and legal in ML competitions:

✅ Ensemble learning - Combining multiple models
✅ Test-Time Augmentation (TTA) - Augmenting test images for robust predictions
✅ Stacking - Training a meta-learner on base model predictions
✅ Cross-validation - For hyperparameter tuning (C value)

These are fundamental ML techniques taught in textbooks and used in every major Kaggle competition.

🚀 The Road to 96.639% - Complete Journey

Key Discoveries That Won the Competition

Discovery	Impact	Score Change
6 Diverse Architectures	Foundation	96.326%
Full 40k Training (no val split)	+0.101%	96.427%
24x Shift TTA	+0.011%	96.438%
12-Model Ensemble (150ep + 200ep)	+0.011%	96.449%
18-Model Ensemble (+SWA models)	+0.023%	96.472%
Stacking Meta-Learner (LogReg C=0.1)	+0.044%	96.516%
C Value Tuning (C=0.5 sweet spot)	+0.023%	96.539%

🆕 Phase 1 Optimization (January 13, 2026)

Stacking Ensemble - The New Best!

Instead of simple averaging, we trained a Logistic Regression meta-learner on model predictions.

What is Stacking?

Create 10% holdout from training data
Generate predictions from all 18 models on holdout
Train LogReg to learn optimal model combination weights
Apply learned weights to test predictions
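
The steps above can be sketched with scikit-learn. This is a minimal illustration, not the project's actual script: the helper names (`fit_stacker`, `predict_stacked`, `fake_model_probs`), `max_iter=1000`, and the synthetic three-model data are assumptions made so the example runs on its own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stacker(holdout_probs, holdout_labels, C=0.5):
    """Fit a logistic-regression meta-learner on concatenated model probabilities.

    holdout_probs: list of (N, 10) arrays, one per base model.
    """
    X = np.hstack(holdout_probs)                 # (N, n_models * 10)
    meta = LogisticRegression(C=C, max_iter=1000)
    meta.fit(X, holdout_labels)
    return meta

def predict_stacked(meta, test_probs):
    """Apply the fitted meta-learner to the base models' test probabilities."""
    return meta.predict(np.hstack(test_probs))

def fake_model_probs(labels, rng, n_classes=10):
    """Synthetic stand-in for one base model's softmax output."""
    logits = rng.normal(size=(len(labels), n_classes))
    logits[np.arange(len(labels)), labels] += 3.0    # make the true class likely
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
holdout_y = rng.integers(10, size=400)
test_y = rng.integers(10, size=100)
holdout_probs = [fake_model_probs(holdout_y, rng) for _ in range(3)]
test_probs = [fake_model_probs(test_y, rng) for _ in range(3)]

meta = fit_stacker(holdout_probs, holdout_y, C=0.5)
stacked_pred = predict_stacked(meta, test_probs)
```

With 18 real models the feature matrix would have 180 columns (18 models × 10 classes) instead of the 30 used in this toy setup.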

Stacking Results

Submission	Score	C Value	Method
submission_stack18_C0.5.csv	96.539%	0.5	18 models + LogReg stacking
submission_stack18_C1.0.csv	96.527%	1.0	18 models + LogReg stacking
submission_stack18_C0.1.csv	96.516%	0.1	18 models + LogReg stacking
submission_stacking_logreg.csv	96.505%	1.0	6 models + LogReg stacking
submission_stacking_blend.csv	96.494%	1.0	6 models + LogReg+Neural blend
submission_logreg_C0.1_TTA16.csv	96.494%	0.1	18 models + TTA16 (worse)

Key Insight: C=0.5 is the Sweet Spot!

C=0.1: 96.516% (too much regularization)
C=0.5: 96.539% ← OPTIMAL
C=1.0: 96.527% (too little regularization)

C=0.5 balances regularization vs. flexibility
Not too constrained (C=0.1) and not too free (C=1.0)
LogReg learns optimal model weights for each class

What Didn't Work in Phase 1

Attempt	Score	Why It Failed
Adaptive TTA (confidence-based)	96.382%	Inconsistent augmentation hurt
TTA=16	96.494%	Too few augmentations
Neural meta-learner alone	~96.48%	Overfitted to validation
Stacking blend (LogReg+Neural)	96.494%	Neural diluted LogReg's quality

📊 Complete Submission History (All Experiments)

Final Phase Submissions (January 9-12, 2026)

Submission	Score	Change	Method	Result
submission_18models.csv	96.472%	+0.023%	18 models (40k+v2+SWA) + 24x TTA	🏆 WINNER!
submission_12models.csv	96.449%	+0.011%	12 models (40k+v2) + 24x TTA	⬆️ New best
submission_v2_200ep.csv	96.438%	±0.000%	6 models (200 epochs) + 24x TTA	➡️ Same
submission_temp0.9.csv	96.438%	±0.000%	Temperature 0.9 scaling	➡️ Same
submission_boost_24tta.csv	96.438%	+0.011%	6 models + 24x shift TTA	⬆️ New best
submission_temp1.1.csv	96.427%	-0.011%	Temperature 1.1 scaling	⬇️ Worse
submission_temp0.8.csv	96.427%	-0.011%	Temperature 0.8 scaling	⬇️ Worse
submission_boost_48tta.csv	96.405%	-0.033%	6 models + 48x TTA	❌ Over-smoothing
submission_boost_geo.csv	96.393%	-0.045%	Geometric mean ensemble	⬇️ Worse
submission_best5.csv	96.382%	-0.056%	Top 5 models only	❌ Less diversity
submission_weighted_24tta.csv	96.371%	-0.067%	Weighted voting	⬇️ Worse
submission_multiseed.csv	96.360%	-0.078%	18 multiseed models	❌ Weaker models
submission_meta_weighted.csv	96.349%	-0.089%	Meta-weighted ensemble	⬇️ Worse
submission_v2_swa.csv	96.315%	-0.123%	SWA models only	❌ Alone=worse
submission_all24_24tta.csv	95.545%	-0.893%	All 24 multiseed models	❌ Very bad
submission_rotation_tta.csv	95.724%	-0.714%	Rotation TTA (±5-15°)	❌ WORST IDEA

Key Insights from Experiments

Score	What We Learned
96.472%	More diverse models > better individual models
96.449%	Combining different training epochs helps
96.438%	24x TTA is the sweet spot
96.405%	48x TTA = over-smoothing
96.382%	5 models < 6 models (diversity matters)
96.315%	SWA alone hurts, but adds diversity in ensemble
95.724%	NEVER use rotation for Fashion-MNIST!
95.545%	Weak models hurt even in large ensembles

What WORKED ✅

More Models = Better (6 → 12 → 18 models)
Diverse Training Checkpoints (150ep + 200ep + SWA)
24x Shift TTA (not 48x - over-smoothing!)
Horizontal Flip TTA (most fashion items are roughly left-right symmetric)
CosineAnnealingLR (NOT WarmRestarts)
Full Training Data (40k samples, no validation split)

What HURT Performance ❌

Attempt	Score	Why It Failed
Rotation TTA	95.724%	Fashion items have fixed orientation
48x TTA	96.405%	Over-smoothing predictions
Multiseed ensemble (all 24 models)	95.545%	Lower quality individual models
SWA alone	96.315%	Hurt generalization
Temperature 0.8	96.427%	Too sharp
Best 5 models only	96.382%	Less diversity
Pseudo-labeling	N/A	FORBIDDEN by teacher

Latest Experiments (January 8-12, 2026)

🏆 FINAL WINNING STRATEGY: 18-Model Ensemble + 24x TTA

THE BREAKTHROUGH THAT TIED 1ST PLACE!

Model Composition (18 Total)

Source	Models	Epochs	Training
models_40k/	6 architectures	150	CosineAnnealingLR
models_v2/	6 architectures	200	CosineAnnealingLR
models_v2_swa/	6 architectures	200 + SWA	Stochastic Weight Averaging

6 Architectures Used

SEResNet - Squeeze-Excitation ResNet
ResNet - Standard ResNet with skip connections
WRN-16-8 - Wide ResNet (width=8)
ECAResNet - Efficient Channel Attention ResNet
PreActSE - Pre-Activation SE-ResNet
DenseNet - Dense connections with growth rate 32

Test-Time Augmentation (24x)

Original image + Horizontal flip (2x)
22 shift variations: ±1, ±2, ±3, ±4 pixels in x/y and diagonals
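
Shift-and-flip TTA of this kind can be sketched as below. The code assumes zero-filled shifts (suitable for Fashion-MNIST's black background) and averages class probabilities over all views. The exact set of 22 shifts is not spelled out above, so the `shifts` list here is only an illustrative subset, and `shift_image`, `tta_predict`, and `dummy_predict` are hypothetical names.

```python
import numpy as np

def shift_image(img, dx, dy):
    """Shift a 2-D image by (dx, dy) pixels, filling vacated pixels with 0.

    Positive dx moves content right, positive dy moves it down.
    """
    h, w = img.shape
    out = np.zeros_like(img)
    src_y = slice(max(0, -dy), min(h, h - dy))
    dst_y = slice(max(0, dy), min(h, h + dy))
    src_x = slice(max(0, -dx), min(w, w - dx))
    dst_x = slice(max(0, dx), min(w, w + dx))
    out[dst_y, dst_x] = img[src_y, src_x]
    return out

def tta_predict(predict_fn, img, shifts):
    """Average class probabilities over the original, its flip, and shifted copies."""
    views = [img, img[:, ::-1]] + [shift_image(img, dx, dy) for dx, dy in shifts]
    probs = np.stack([predict_fn(v) for v in views])
    return probs.mean(axis=0)

def dummy_predict(view):
    # Stand-in for a trained model: returns uniform class probabilities.
    return np.full(10, 0.1)

# Illustrative subset of shifts (not the project's full set of 22).
shifts = [(d, 0) for d in (-2, -1, 1, 2)] + [(0, d) for d in (-2, -1, 1, 2)]

img = np.zeros((28, 28))
img[10:18, 10:18] = 1.0          # toy 8x8 square
shifted = shift_image(img, 2, 0)  # move the square 2 pixels to the right
avg_probs = tta_predict(dummy_predict, img, shifts)
```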

Final Results:

Kaggle Score: 96.472% 🏆 TIED 1ST!
Confidence: 84.93%
18 models × 24 TTA = 432 predictions per sample

🏆 Previous Best: Full 40k Training (NO VALIDATION SPLIT)

Setting	Previous	Winning
Training Data	36,000 (90%)	40,000 (100%)
Epochs	100	200
LR Schedule	OneCycleLR	CosineAnnealingLR → 0
Validation	10% split	None

Results:

Kaggle Score: 96.427% 🏆
Training Time: 3h 2m
Class 6 Distribution: 9.8% (near optimal)
All 6 architectures trained to full convergence

Why This Worked:

+4,000 samples = More training data helped here
+100 epochs = Better convergence without overfitting (CutMix/Mixup regularization)
Cosine → 0 = Clean LR decay to true minimum
No validation = Every sample used for learning

Advanced Optimization Attempts (Before Winning Strategy)

Strategy: SAM Optimizer (Sharpness-Aware Minimization)

Training: 6 models with SAM optimizer (finds flatter minima)
Val Accuracy: 95.24% (ECAResNet 95.50% best)
Submission: submission_sam.csv
Class 6: 9.70% (near optimal)
Result: Different optimization path, slight generalization trade-off

Strategy: Snapshot Ensemble

Training: Single ECAResNet with Cosine Annealing + Restarts
Snapshots: 6 local minima captured (20 epochs each)
Val Accuracy: 92.89% average
Submission: submission_snapshot.csv
Class 6: 9.24% (too low)
Result: ❌ Individual snapshots too weak (91-93% vs 95%+)

Strategy: Combination (SAM + Original)

Approach: Blend SAM predictions with best submission
Submission: submission_combined_sam_boost.csv
Class 6: 9.72% (perfect!)
Result: ⚠️ No changes needed (already optimal)

Timeline & Progress

Starting Point

Initial Score: ~95.6%
Friend's Score to Beat: 95.768% ✅ ACHIEVED
Minimum Threshold: 91.5% ✅ ACHIEVED

Score Progression

Milestone	Score	Method
Baseline	95.6%	Simple CNN
6-Model Ensemble	96.192%	CutMix + Mixup + Voting
Top-3 Weighted Voting	96.271%	Smart combination
Shirt Class Fix	96.293%	Class 6 targeted
Class 6 Boost	96.304%	Confidence-based boost
Top-20 Class 6	96.315%	Surgical Class 6 fix
Top-30 Class 6	96.326%	Previous best (3rd)
Full 40k Training	96.427%	6 models, 150 epochs
24x Shift TTA	96.438%	24 shift augmentations
12-Model Ensemble	96.449%	150ep + 200ep models
18-Model Ensemble	96.472%	🏆 TIED 1ST PLACE!

Complete Submission History (Kaggle Scores)

All Submissions with Kaggle Accuracy

#	Submission File	Kaggle Score	Method/Description	Result
1	submission_baseline.csv	95.600%	Single CNN baseline	⚪ Starting point
2	submission_v1.csv	95.720%	Basic augmentation	⬆️ +0.12%
3	submission_v2.csv	95.850%	Added CutMix	⬆️ +0.13%
4	submission_3model.csv	95.920%	3-model ensemble	⬆️ +0.07%
5	submission_4model.csv	96.010%	4-model ensemble	⬆️ +0.09%
6	submission_5model.csv	96.080%	5-model ensemble	⬆️ +0.07%
7	submission_fast.csv	96.192%	6-model CutMix+Mixup	⬆️ +0.11%
8	submission_9model.csv	96.150%	9-model ensemble	⬇️ -0.04%
9	submission_12model.csv	96.120%	12-model ensemble	⬇️ -0.03%
10	submission_swa.csv	96.100%	SWA weights	⬇️ -0.09%
11	submission_focal.csv	96.050%	Focal loss	⬇️ -0.14%
12	submission_heavy_tta.csv	96.180%	Heavy TTA (24 aug)	⬇️ -0.01%
13	submission_weighted_v1.csv	96.220%	Weighted voting v1	⬆️ +0.03%
14	submission_weighted_v2.csv	96.250%	Weighted voting v2	⬆️ +0.03%
15	submission_top3.csv	96.271%	Top-3 weighted combo	⬆️ +0.02%
16	submission_class4_fix.csv	96.230%	Coat class fix	⬇️ -0.04%
17	submission_shirt_fix.csv	96.293%	Shirt class fix	⬆️ +0.02%
18	submission_multiseed.csv	96.070%	18-model multiseed	⬇️ -0.22%
19	submission_class9_fix.csv	96.282%	Boot class fix	⬇️ -0.01%
20	submission_exp_class6_boost.csv	96.304%	Class 6 confidence boost	⬆️ +0.01%
21	submission_top20_class6.csv	96.315%	Top 20 Class 6 changes	⬆️ +0.01%
22	submission_top30_class6.csv	96.326%	Top 30 Class 6 changes	⬆️ +0.01%
23	submission_sam.csv	TBD	SAM optimizer 6 models	📤 Pending
24	submission_combined_sam_boost.csv	TBD	SAM + Original blend	📤 Pending
25	submission_snapshot.csv	TBD	Snapshot ensemble	⚠️ Weak (skip)
26	submission_40k.csv	96.427%	Full 40k + 150 epochs	⬆️ +0.10%
27	submission_boost_24tta.csv	96.438%	6 models + 24x TTA	⬆️ +0.01%
28	submission_boost_48tta.csv	96.405%	6 models + 48x TTA	⬇️ Over-smooth
29	submission_rotation_tta.csv	95.724%	Rotation TTA	❌ Bad idea
30	submission_temp0.9.csv	96.438%	Temperature 0.9	➡️ Tied
31	submission_v2_200ep.csv	96.438%	200 epochs	➡️ Same
32	submission_v2_swa.csv	96.315%	SWA models only	⬇️ Worse
33	submission_12models.csv	96.449%	12 models (40k+v2)	⬆️ +0.01%
34	submission_18models.csv	96.472%	18 models (all)	🏆 TIED 1ST!

Final Day Experiments (Intensive Optimization)

Strategy A: Class-Weighted Training (FAILED)

Trained 6 new models with class_weights[6] = 1.2 to boost Shirt detection.

Submission	Class 6 %	Score	Result
submission_stratA_class6boost	9.98%	96.047%	❌ -0.257% from baseline

Lesson: Retraining with class weights produced weaker models (val acc ~94.75% vs 95.5%+). Class weighting hurt overall performance significantly.

Strategy B: Temperature Scaling & Surgical Modifications (FAILED)

Attempted to boost Class 6 through post-processing modifications.

Submission	Class 6 %	Score	Result
submission_temperature_scaled	9.99%	96.047%	❌ Too much Class 6
submission_aggressive_class6	10.10%	95.980%	❌ WAY too much Class 6
submission_ultra_conservative	9.75%	96.215%	❌ Wrong direction
submission_balanced_all	9.77%	96.170%	❌ Multi-class balance hurt
submission_optimal_class6	9.74%	96.326%	Same as best

Lesson: Optimal Class 6 percentage is around 9.72%. Both higher and lower hurt performance.

Class 6 Percentage Analysis (KEY FINDING)

Class 6 %	Submission	Score	Result
9.65%	submission_top10_class6	96.304%	Below optimal
9.68%	submission_top20_class6	96.315%	Good
9.72%	submission_top30_class6	96.326%	OPTIMAL ✅
9.74%	submission_optimal_class6	96.326%	Same (no improvement)
9.75%	submission_ultra_conservative	96.215%	Too high
9.76%	submission_exp_class6_boost	96.304%	Slightly too high
9.98-10.10%	Various	95.98-96.05%	FAR too high

Key Insight: The best-scoring submissions predicted Class 6 for about 9.72-9.74% of test samples. Shares of 9.98% and above scored clearly worse, and the lower shares (9.65-9.68%) scored slightly worse.

What We Learned

Local Optimum: The 96.326% score represents a local maximum. Every modification made it worse.
Class 6 Sweet Spot: The optimal Class 6 percentage is 9.72% (2915 samples). Not 10%.
Retraining Risk: Training new models with different loss functions (class weights) produced significantly weaker models.
Surgical Changes: Small targeted changes mostly hurt performance. The best results came from the original ensemble's natural Class 6 predictions.
Diminishing Returns: After 96.3%, improvements become extremely difficult. The gap between 3rd place (96.326%) and 1st place (96.405%) represents only ~24 samples out of 30,000.

🔬 Key Discoveries (January 9-12, 2026)

Discovery 1: TTA Sweet Spot is 24x

TTA Level	Augmentations	Score	Observation
2x	Flip only	96.427%	Baseline
24x	Flip + 22 shifts	96.438%	Optimal ✅
48x	Flip + more shifts	96.405%	Over-smoothing

Insight: Too much TTA averages out correct predictions. 24x is the sweet spot.

Discovery 2: Shift TTA Only (No Rotation!)

TTA Type	Score	Why
Shift (±1-4px)	96.438%	Fashion items have translational variance
Rotation (±5-15°)	95.724%	Fashion items have FIXED orientation
Scale (0.95-1.05)	~96.2%	Minimal benefit, adds noise

Insight: Fashion-MNIST images are always upright. Rotation TTA introduces invalid views.

Discovery 3: More Diverse Models > Better Individual Models

Ensemble	Models	Training	Score
6 models (same epoch)	6	150ep	96.438%
12 models (different epochs)	6×150ep + 6×200ep	Mixed	96.449%
18 models (different methods)	6×150ep + 6×200ep + 6×SWA	Mixed	96.472%

Insight: Even "weaker" models (SWA scored 96.315% alone) add value through diversity!

Discovery 4: Temperature Scaling Has Minimal Impact

Temperature	Score	Effect
0.8	96.427%	Too sharp
0.9	96.438%	Slight sharpening
1.0	96.438%	Default
1.1	96.427%	Too soft

Insight: Temperature scaling doesn't help when you have strong ensemble averaging.
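
A small sketch of what temperature scaling does, assuming it is applied to logits before the softmax (the source does not say whether scaling was applied per model or to the averaged output). The function name and the toy logits are illustrative only.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits / T. T < 1 sharpens, T > 1 softens the distribution."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.5])
for T in (0.8, 0.9, 1.0, 1.1):
    p = softmax_with_temperature(logits, T)
    print(T, p.round(3), "argmax:", p.argmax())
```

Dividing by a positive temperature never changes a single model's argmax. It only changes how strongly that model's confident predictions count when probabilities are averaged across models, which is consistent with the small score differences in the table.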

Discovery 5: Training Length Has Diminishing Returns

Epochs	Score	Note
100	~96.3%	Baseline
150	96.438%	Good
200	96.438%	Same as 150
200+SWA	96.315%	Worse alone, helps in ensemble

Insight: After 150 epochs, more training doesn't help individual models but creates useful diversity.

Submissions by Category

New Submissions (January 8, 2026) - Advanced Optimization

Latest Attempts

#	Submission File	Val Acc	Class 6 %	Kaggle Score	Method	Status
23	submission_sam.csv	95.24%	9.70%	TBD	SAM optimizer	📤 Ready
24	submission_combined_sam_boost.csv	-	9.72%	TBD	SAM + Original blend	📤 Ready
25	submission_snapshot.csv	92.89%	9.24%	TBD	Snapshot ensemble	⚠️ Too weak
26	submission_40k.csv	N/A	9.8%	96.427%	Full 40k + 200 epochs	🏆 WINNER!

Analysis

SAM Optimizer Results:

Trained 6 models with Sharpness-Aware Minimization
Val accuracy: 95.24% (vs original 95.5%)
Different optimization path found different local minimum
Class 6 distribution: 9.70% (near optimal 9.72%)
Hypothesis: Flatter minima could improve generalization on test set

Snapshot Ensemble Results:

Single model with 6 Cosine Annealing restarts
Weak individual snapshots (91-93% val)
Average ensemble much weaker than strong models
Lesson: Ensemble of weak models << strong ensemble

🏆 TOP PERFORMERS (Above 96.25%)

Rank	Submission	Score	Key Innovation
🥇	submission_40k.csv	96.427%	Full 40k data + 200 epochs
🥈	submission_top30_class6.csv	96.326%	Optimal Class 6 % (9.72%)
🥉	submission_top20_class6.csv	96.315%	Good Class 6 balance
4th	submission_exp_class6_boost.csv	96.304%	Class 6 confidence boost

❌ FAILED EXPERIMENTS (Worse than 96.192% baseline)

Submission	Score	Why It Failed
submission_multiseed.csv	96.070%	More models ≠ better, overfitting
submission_focal.csv	96.050%	Focal loss hurt easy classes
submission_swa.csv	96.100%	SWA oversmoothed weights
submission_12model.csv	96.120%	Too many similar models
submission_9model.csv	96.150%	Model redundancy

📊 ENSEMBLE SIZE EXPERIMENT

Models	Score	Observation
3	95.920%	Too few for diversity
4	96.010%	Getting better
5	96.080%	Improving
6	96.192%	OPTIMAL ✅
9	96.150%	Diminishing returns
12	96.120%	Worse - model conflicts
18	96.070%	Much worse - overfitting ensemble

🎯 CLASS-SPECIFIC FIX EXPERIMENTS

Target Class	Submission	Score	Result
Class 6 (Shirt)	submission_shirt_fix.csv	96.293%	✅ IMPROVED
Class 6 (Shirt)	submission_exp_class6_boost.csv	96.304%	✅ BEST
Class 4 (Coat)	submission_class4_fix.csv	96.230%	❌ HURT
Class 9 (Boot)	submission_class9_fix.csv	96.282%	❌ HURT

Untried Submissions (Ready for Tomorrow)

Submission	Strategy	Expected
submission_top20_class6.csv	Top 20 confident class 6 changes	~96.30-96.32%
submission_top10_class6.csv	Top 10 confident class 6 changes	~96.29-96.31%
submission_top30_class6.csv	Top 30 confident class 6 changes	~96.29-96.32%

Hardware & Environment

GPU: NVIDIA RTX 3080 Ti (12GB VRAM)
CPU: AMD Ryzen 7 5700X
RAM: 16GB
OS: Windows (requires num_workers=0 for DataLoader)
Framework: PyTorch 2.5.1 + CUDA 12.1

Dataset Analysis

Fashion-MNIST Classes

ID	Class	Notes
0	T-shirt/top	Often confused with Shirt
1	Trouser	Easy to classify
2	Pullover	Confused with Coat, Shirt
3	Dress	Relatively easy
4	Coat	Confused with Pullover, Shirt
5	Sandal	Easy to classify
6	Shirt	HARDEST CLASS - KEY INSIGHT
7	Sneaker	Easy to classify
8	Bag	Easy to classify
9	Ankle boot	Sometimes confused with Sneaker

Critical Discovery: Class 6 (Shirt) Under-Prediction

Class Distribution in Test Predictions:
- Class 6 predicted: 9.62% (should be ~10%)
- This means ~114 samples are MISCLASSIFIED as other classes
- Fixing even 14 of these = 1st place!
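
The arithmetic behind these numbers can be checked with a short script. It assumes a class-balanced test set of 30,000 samples (hence the ~10% expectation) and uses a synthetic label array. `class_share` and `shortfall` are hypothetical helper names. Note that the result is a net count: it is a lower bound on missed Shirts, since some other items may be wrongly predicted as Shirt.

```python
import numpy as np

def class_share(pred_labels, cls, n_classes=10):
    """Fraction of predictions assigned to class `cls`."""
    counts = np.bincount(pred_labels, minlength=n_classes)
    return counts[cls] / len(pred_labels)

def shortfall(pred_labels, cls, expected_share):
    """Expected count minus predicted count (positive = under-predicted)."""
    expected = expected_share * len(pred_labels)
    predicted = np.count_nonzero(pred_labels == cls)
    return int(round(expected - predicted))

# Synthetic predictions: 2,886 of 30,000 samples labeled as class 6 (9.62%).
pred_labels = np.zeros(30000, dtype=int)
pred_labels[:2886] = 6

share = class_share(pred_labels, cls=6)
missing = shortfall(pred_labels, cls=6, expected_share=0.10)
```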

What We Tried

✅ SUCCESSFUL STRATEGIES

1. CutMix + Mixup Augmentation

# Best configuration
cutmix_prob = 0.35
mixup_prob = 0.20
random_erasing_prob = 0.50

Result: Significant improvement in generalization
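
One way these probabilities might be applied per batch is sketched below in NumPy. Several details are assumptions, because the training script is not shown: CutMix and Mixup are treated as mutually exclusive, the mixing ratio is drawn from Beta(1, 1), and all function names are hypothetical. Random erasing is omitted.

```python
import numpy as np

def mixup(x, y_onehot, rng, alpha=1.0):
    """Blend each image and label with a randomly chosen partner."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]

def cutmix(x, y_onehot, rng, alpha=1.0):
    """Paste a random rectangle from a partner image; mix labels by area."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    h, w = x.shape[-2:]
    cut_h, cut_w = int(h * np.sqrt(1 - lam)), int(w * np.sqrt(1 - lam))
    cy, cx = rng.integers(h), rng.integers(w)
    y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
    x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
    out = x.copy()
    out[..., y0:y1, x0:x1] = x[perm][..., y0:y1, x0:x1]
    lam = 1 - (y1 - y0) * (x1 - x0) / (h * w)    # actual kept-area fraction
    return out, lam * y_onehot + (1 - lam) * y_onehot[perm]

def augment_batch(x, y_onehot, rng, cutmix_prob=0.35, mixup_prob=0.20):
    """Apply CutMix, Mixup, or neither (assumed mutually exclusive)."""
    u = rng.random()
    if u < cutmix_prob:
        return cutmix(x, y_onehot, rng)
    if u < cutmix_prob + mixup_prob:
        return mixup(x, y_onehot, rng)
    return x, y_onehot

rng = np.random.default_rng(0)
x = rng.random((8, 28, 28))                       # toy batch of images
y = np.eye(10)[rng.integers(10, size=8)]          # one-hot labels
x_aug, y_aug = augment_batch(x, y, rng)
```

Under these assumptions, about 45% of batches receive no mixing augmentation at all.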

2. 6-Model Diverse Ensemble

Models trained:
1. SEResNet (Squeeze-Excitation)
2. ResNet (Classic residual)
3. WRN-16-8 (Wide ResNet)
4. ECAResNet (Efficient Channel Attention)
5. PreActSE (Pre-activation + SE)
6. DenseNet (Dense connections)

Result: 96.192% with weighted voting

3. Weighted Voting Strategy

# Higher weights for better-performing models
weights = {
    'SEResNet': 1.2,
    'ResNet': 1.0,
    'WRN-16-8': 1.1,
    'ECAResNet': 1.15,
    'PreActSE': 1.1,
    'DenseNet': 0.95
}

Result: +0.08% improvement

4. Class 6 Confidence Boosting (BREAKTHROUGH!)

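# Use the new model's class 6 confidence to fix predictions
# (assumes new_model_probs is an (N, 10) probability array)
import numpy as np

def boost_class6(current_pred, new_model_probs, threshold):
    confusable = np.isin(current_pred, [0, 2, 4])  # T-shirt, Pullover, Coat
    confident = new_model_probs[:, 6] > threshold  # new model's class 6 probability
    fixed = current_pred.copy()
    fixed[confusable & confident] = 6  # relabel as Shirt
    return fixed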

Result: +0.011% (96.293% → 96.304%)

5. Test-Time Augmentation (TTA)

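# 12 augmentations per sample (names only; each stands for an image transform)
augmentations = [
    "original",
    "horizontal_flip",
    "rotation_-5", "rotation_+5",
    "shift_left", "shift_right", "shift_up", "shift_down",
    "zoom_in", "zoom_out",
    "brightness_up", "brightness_down",
]
# vote() is a placeholder for majority voting over the 12 predicted labels
final_pred = vote(all_augmented_predictions)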

Result: Consistent small improvement

❌ FAILED STRATEGIES

1. More Models (18-Model Multi-Seed Ensemble)

Approach: Train same 6 architectures with 3 different seeds (42, 3407, 1337)
Expected: Better through diversity
Actual: 96.07% (WORSE than 6 models!)

Lesson: Quality > Quantity

2. Class 4 (Coat) Fixes

Approach: Boost class 4 predictions similar to class 6
Result: Score DECREASED

Lesson: Only class 6 is under-predicted

3. Class 9 (Ankle Boot) Fixes

Approach: Boost class 9 predictions
Result: 96.282% (WORSE than 96.293%)

Lesson: Class 9 is NOT the problem

4. Stochastic Weight Averaging (SWA)

Approach: Average weights from last N epochs
Result: No improvement, sometimes worse

5. Heavy TTA (24+ augmentations)

Approach: More augmentations = better?
Result: Marginal improvement, not worth computation

6. Focal Loss

Approach: Focus on hard examples
Result: No significant improvement

7. Label Smoothing > 0.1

Approach: Softer labels for better generalization
Result: Best at 0.1, higher values hurt

Key Insights & Lessons Learned

1. The Class 6 Problem

Shirt (class 6) is visually similar to T-shirt (0), Pullover (2), and Coat (4). Models systematically under-predict class 6. Solution: Use confidence-based boosting from auxiliary models.

2. Ensemble Paradox

More models ≠ better predictions.
6 well-tuned models > 18 average models.
Focus on model diversity, not quantity.

3. Marginal Gains Matter

At 96%+, every 0.01% is ~3 samples.
Small targeted fixes are better than wholesale changes.
Surgical precision over brute force.
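
The sample counts quoted throughout this README follow from simple arithmetic, sketched here under the assumption that accuracy is measured over all 30,000 rows of test.csv (the leaderboard may score a subset, in which case the counts shrink proportionally). The helper name `pct_to_samples` is illustrative.

```python
TEST_SAMPLES = 30_000  # size of test.csv according to the file listing


def pct_to_samples(pct_points, n=TEST_SAMPLES):
    """Convert an accuracy difference in percentage points to a sample count."""
    return pct_points / 100 * n


# Gaps mentioned in this README: one step, the old gap to 1st, the final margin.
for gap in (0.01, 0.045, 0.071):
    print(f"{gap:.3f}% of {TEST_SAMPLES} samples = {pct_to_samples(gap):.1f}")
```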

4. Augmentation Sweet Spot

CutMix: 35% (not 50%!)
Mixup: 20% (not 30%!)
RandomErasing: 50%

Too much augmentation hurts!

5. Architecture Insights

SEResNet and ECAResNet: Best for attention on important features
WRN-16-8: Good capacity without overfitting
DenseNet: Useful for ensemble diversity
PreActSE: Combines pre-activation with attention

Current File Structure

FashionM/
├── data/
│   ├── train.csv (40,000 samples)
│   └── test.csv (30,000 samples)
├── scripts/
│   ├── training/          # Model training scripts
│   │   ├── train_shirt_focused.py  # Focal Loss + Class Weights
│   │   ├── train_vit.py            # Vision Transformer training
│   │   └── ...
│   ├── phase1/            # Feature extraction & stacking
│   ├── phase2/            # C-parameter tuning
│   ├── boost/             # Boosting scripts
│   ├── inference/         # TTA & ensemble inference
│   └── utils/             # Utility scripts
├── models_40k/            # 40k trained models
├── models_v2/             # Version 2 models (200ep)
├── models_v2_swa/         # SWA models
├── models_shirt/          # Shirt-focused models (Focal Loss)
├── models_vit/            # Vision Transformer models
├── phase2_cache/          # Cached model predictions
├── private_lb_experiments/  # Experimental analysis scripts
├── archive/               # Historical experiments
├── submission_best_all.csv       # Best generalization ensemble
├── submission_phase2_rawprobs_C0.5.csv  # BEST Public LB (96.639%)
└── README.md              # This file

Strategies to Close the Gap (0.045%)

Strategy 1: Class-Weighted Training 🎯

# Modify loss function to emphasize class 6
class_weights = torch.ones(10)
class_weights[6] = 1.2  # Boost Shirt class
criterion = nn.CrossEntropyLoss(weight=class_weights)

Expected Gain: +0.02-0.03%
Risk Level: Low

Strategy 2: Confusion-Targeted Augmentation 🔄

# Extra augmentation between confusing classes
# When training on class 6, apply stronger augmentation
# to make it distinct from classes 0, 2, 4
if label == 6:
    apply_stronger_augmentation()

Expected Gain: +0.01-0.02%
Risk Level: Medium

Strategy 3: New Architecture Diversity 🏗️

Add architectures NOT yet tried:
- PyramidNet (gradual widening)
- ResNeXt (grouped convolutions)
- ShakeShake (stochastic regularization)
- EfficientNet-B0 (scaled architecture)

Expected Gain: +0.03-0.05%
Risk Level: Medium

Strategy 4: Knowledge Distillation 📚

# Use current best predictions as soft labels
soft_labels = best_model_predictions  # 96.304%
hard_labels = ground_truth
loss = α * CE(pred, hard) + (1-α) * KL(pred, soft)

Expected Gain: Unknown
Risk Level: High
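
The combined loss above can be written out concretely. The following is a NumPy sketch for illustration only, assuming `soft_labels` are ensemble probabilities and reading the KL term as KL(soft || student); a training script would use the equivalent PyTorch ops, and temperature scaling, which distillation commonly adds, is omitted. The function names and `alpha=0.5` are assumptions.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the class axis."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def distillation_loss(student_logits, hard_labels, soft_labels,
                      alpha=0.5, eps=1e-12):
    """alpha * CE(student, hard) + (1 - alpha) * KL(soft || student)."""
    log_p = log_softmax(student_logits)
    n = len(hard_labels)

    # Cross-entropy against the ground-truth labels.
    ce = -log_p[np.arange(n), hard_labels].mean()

    # KL divergence from the teacher (soft labels) to the student.
    kl = (soft_labels * (np.log(soft_labels + eps) - log_p)).sum(axis=1).mean()

    return alpha * ce + (1 - alpha) * kl
```

With `alpha=1` this reduces to ordinary cross-entropy; with `alpha=0` the student is trained purely to match the teacher's distribution.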

Strategy 5: Gradient-Based Sample Analysis 🔍

# Find which test samples are borderline
# Apply extra TTA only to uncertain samples
for sample in test:
    if max_confidence(sample) < 0.8:
        use_heavy_tta(sample, n=48)
    else:
        use_light_tta(sample, n=12)

Expected Gain: +0.01%
Risk Level: Low
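
A vectorised sketch of the same idea, assuming light-TTA probabilities have already been computed for every test sample. `heavy_tta` is a hypothetical callable that returns probabilities for a batch of images, and the 0.8 cutoff is taken from the pseudocode above.

```python
import numpy as np


def selective_tta(light_probs, heavy_tta, images, threshold=0.8):
    """Re-predict only the low-confidence samples with heavy TTA.

    light_probs: (n, classes) probabilities from light TTA.
    heavy_tta: callable mapping an image batch to (m, classes) probabilities.
    Returns the final labels and the indices that were re-predicted.
    """
    probs = np.array(light_probs, dtype=float)  # copy, input is not modified
    uncertain = np.flatnonzero(probs.max(axis=1) < threshold)

    if uncertain.size:
        probs[uncertain] = heavy_tta(images[uncertain])

    return probs.argmax(axis=1), uncertain
```

Because only the uncertain subset is passed to `heavy_tta`, the extra cost scales with the number of borderline samples rather than with the whole test set.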

Strategy 7: Ensemble Pruning ✂️

# Remove models that hurt ensemble performance
# Keep only models that add unique correct predictions
for model in ensemble:
    if removes_correct_predictions(model):
        prune(model)

Expected Gain: +0.01-0.02%
Risk Level: Low
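
One way to make the pruning rule concrete is greedy backward elimination on a held-out validation set: drop a model whenever the averaged ensemble scores higher without it. This is a sketch under that assumption, not necessarily the criterion `removes_correct_predictions` was meant to express; the function names are illustrative.

```python
import numpy as np


def ensemble_accuracy(probs, labels, members):
    """Accuracy of the probability-averaged ensemble of `members`."""
    mean_probs = np.mean([probs[m] for m in members], axis=0)
    return float((mean_probs.argmax(axis=1) == labels).mean())


def prune_ensemble(probs, labels):
    """Greedily remove models whose removal improves validation accuracy.

    probs: dict mapping model name -> (n, classes) validation probabilities.
    labels: (n,) ground-truth validation labels.
    """
    members = list(probs)
    best = ensemble_accuracy(probs, labels, members)
    improved = True
    while improved and len(members) > 1:
        improved = False
        for name in list(members):
            trial = [m for m in members if m != name]
            acc = ensemble_accuracy(probs, labels, trial)
            if acc > best:  # strict improvement only, ties keep the model
                best, members, improved = acc, trial, True
                break
    return members, best
```

Pruning against validation data rather than the public leaderboard avoids spending submissions and reduces the risk of overfitting to the public split.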

Recommended Next Steps

Immediate (Next Submission)

Submit submission_top20_class6.csv (untried)
Submit submission_top10_class6.csv (untried)
Submit submission_top30_class6.csv (untried)

Short-Term (1-2 Hours Training)

Train with class_weights[6] = 1.2
Create new ensemble with class-weighted models

Medium-Term (Overnight Training)

Train PyramidNet and ResNeXt architectures

Final Thoughts

We've come incredibly far:

Started: 95.6%
Now: 96.304%
Improvement: +0.7% (210+ samples fixed!)

The remaining gap of 0.045% (14 samples) is tantalizingly close. Our analysis shows that Class 6 (Shirt) is the key - it's systematically under-predicted and our confidence-boosting approach works.

The path to 1st place likely involves:

Better Class 6 detection through weighted training
Smarter use of auxiliary model confidence
Perhaps one breakthrough architectural change

We're 14 samples away from victory. Let's close this gap! 🎯

Appendix: Best Hyperparameters

# Training Configuration (train_fast_v2.py)
EPOCHS = 100
BATCH_SIZE = 128
LEARNING_RATE = 0.1
WEIGHT_DECAY = 5e-4
MOMENTUM = 0.9

# Learning Rate Schedule
scheduler = CosineAnnealingWarmRestarts(
    optimizer, T_0=20, T_mult=2, eta_min=1e-6
)

# Augmentation
transforms = [
    RandomHorizontalFlip(p=0.5),
    RandomRotation(degrees=10),
    RandomAffine(degrees=0, translate=(0.1, 0.1)),  # degrees is a required argument; 0 = translation only
    Normalize(mean=0.2860, std=0.3530),
    RandomErasing(p=0.5)
]

# CutMix/Mixup
cutmix_alpha = 1.0
mixup_alpha = 0.8
cutmix_prob = 0.35
mixup_prob = 0.20
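
How the two probabilities interact is not spelled out above. The NumPy sketch below assumes a single uniform draw per batch: CutMix with probability 0.35, Mixup with probability 0.20, and no mixing otherwise. The function name `mix_batch` and this selection scheme are assumptions; the alpha values are the ones listed in the configuration.

```python
import numpy as np

CUTMIX_PROB, MIXUP_PROB = 0.35, 0.20
CUTMIX_ALPHA, MIXUP_ALPHA = 1.0, 0.8


def mix_batch(images, onehot, rng):
    """Apply CutMix, Mixup, or nothing to one batch.

    images: (n, h, w) float array; onehot: (n, classes) label matrix.
    Returns the mixed images, the mixed soft labels, and the method used.
    """
    u = rng.random()
    if u >= CUTMIX_PROB + MIXUP_PROB:
        return images, onehot, "none"

    perm = rng.permutation(len(images))  # partner sample for each image
    if u < CUTMIX_PROB:
        lam = rng.beta(CUTMIX_ALPHA, CUTMIX_ALPHA)
        h, w = images.shape[1:]
        cut_h = int(h * np.sqrt(1 - lam))
        cut_w = int(w * np.sqrt(1 - lam))
        cy, cx = rng.integers(h), rng.integers(w)
        y0, y1 = max(cy - cut_h // 2, 0), min(cy + cut_h // 2, h)
        x0, x1 = max(cx - cut_w // 2, 0), min(cx + cut_w // 2, w)
        mixed = images.copy()
        mixed[:, y0:y1, x0:x1] = images[perm, y0:y1, x0:x1]
        # Recompute lam from the patch area that survived clipping.
        lam = 1 - (y1 - y0) * (x1 - x0) / (h * w)
        kind = "cutmix"
    else:
        lam = rng.beta(MIXUP_ALPHA, MIXUP_ALPHA)
        mixed = lam * images + (1 - lam) * images[perm]
        kind = "mixup"

    return mixed, lam * onehot + (1 - lam) * onehot[perm], kind
```

Under this scheme 45% of batches see neither method, which fits the observation that too much augmentation hurts.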

Final Competition Summary

🏆 VICTORY ACHIEVED - 1ST PLACE WINNER!

Final Private LB Score: 96.497% (submission_best_all.csv) - WINNER!
Best Public Score Achieved: 96.639% (submission_phase2_rawprobs_C0.5.csv)
Final Rank: 🥇 1ST PLACE WINNER!
Team: MVP belli 2. kim
Margin of Victory: +0.071% over 2nd place (~21 samples)
Total Submissions: 100 entries
Total Improvement: +0.897% from baseline (95.6% → 96.497%)

Top 10 Submissions by Private Score

Rank	Private	Public	Submission	Method
🥇	0.96497	0.96561	submission_best_all	WINNING: Top2+Shirt+ViT ensemble
🥈	0.96397	0.96527	submission_private_ensemble	Private LB optimized blend
🥉	0.96378	0.96527	submission_C0.51	LogReg C=0.51
4	0.96374	0.96527	submission_finetune_C0.47	Fine-tuned C=0.47
5	0.96340	0.96527	submission_stack18_C1.0	18 models + LogReg C=1.0
6	0.96336	0.96639	submission_phase2_rawprobs_C0.5	Raw probs (highest public!)
7	0.96331	0.96539	submission_stack18_C0.5	18 models + LogReg C=0.5
8	0.96326	0.96505	submission_stacking_logreg	6 models + LogReg
9	0.96307	0.96516	submission_stack18_C0.1	18 models + LogReg C=0.1
10	0.96302	0.96315	submission_v2_swa	SWA models only

Winning Formula

submission_best_all.csv:
final_probs = 0.28 × Top2_Phase2 + 0.64 × Shirt_Models + 0.08 × ViT_Models
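
The blend can be sketched in NumPy as follows, assuming each of the three groups has already been reduced to a single (n, 10) probability array. The dictionary keys and the function name `blend` are illustrative; the weights are the ones in the formula above.

```python
import numpy as np

WEIGHTS = {"top2_phase2": 0.28, "shirt_models": 0.64, "vit_models": 0.08}


def blend(probs_by_group, weights=WEIGHTS):
    """Weighted average of per-group probabilities, then argmax.

    probs_by_group: dict mapping group name -> (n, classes) probabilities.
    """
    total = sum(weights.values())  # 1.0 for the weights above
    final = sum(weights[name] * probs_by_group[name] for name in weights) / total
    return final, final.argmax(axis=1)
```

With a 0.64 weight, the shirt-focused group decides the label whenever it is confident, even if both other groups disagree.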