Results Export & Evaluation Guide

📊 Overview

After training completes, ResoMap automatically exports test results and provides tools to aggregate and analyze them.

Data Flow:

Training completes for each model-resolution pair
    ↓
Auto-save JSON: results/test_results/{model}_{resolution}.json
    ↓
Run aggregation script
    ↓
CSV: results/all_results.csv
JSON summary: results/results_summary.json
Text report: results/detailed_report.txt
    ↓
Analyze in Excel, Python, or on the command line

✅ Automatic JSON Export

What Gets Saved Automatically

After each training run completes, ResoMap saves:

File: results/test_results/{model}_{resolution}.json

Example: results/test_results/vgg11_224.json

Contents:

{
  "model": "vgg11",
  "resolution": 224,
  "timestamp": "2026-01-18T14:30:45.123456",
  
  "best_val_loss": 0.5441,
  
  "test_metrics": {
    "loss": 0.5519,
    "accuracy": 0.8934,
    "precision": 0.8901,
    "recall": 0.8834,
    "f1_score": 0.8867
  },
  
  "profiling": {
    "avg_time_sec": 0.0231,
    "throughput_samples_sec": 1234.56,
    "gpu_memory_peak_mb": 4096,
    "cpu_memory_peak_mb": 2048
  }
}
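
Each per-run file can be read back with the standard library. The sketch below parses a document shaped like the example above (embedded as a string so it runs standalone) and pulls out a few fields. The helper name `summarize_result` is illustrative and not part of ResoMap.

```python
import json

# Same shape as the example file above, embedded so the snippet runs standalone.
EXAMPLE = """
{
  "model": "vgg11",
  "resolution": 224,
  "timestamp": "2026-01-18T14:30:45.123456",
  "best_val_loss": 0.5441,
  "test_metrics": {"loss": 0.5519, "accuracy": 0.8934, "precision": 0.8901,
                   "recall": 0.8834, "f1_score": 0.8867},
  "profiling": {"avg_time_sec": 0.0231, "throughput_samples_sec": 1234.56,
                "gpu_memory_peak_mb": 4096, "cpu_memory_peak_mb": 2048}
}
"""


def summarize_result(result: dict) -> str:
    """Format one result dict as a single readable line (hypothetical helper)."""
    metrics = result["test_metrics"]
    profiling = result["profiling"]
    return (
        f"{result['model']}@{result['resolution']}: "
        f"accuracy={metrics['accuracy']:.4f}, "
        f"f1={metrics['f1_score']:.4f}, "
        f"avg_time={profiling['avg_time_sec'] * 1000:.1f} ms"
    )


result = json.loads(EXAMPLE)
line = summarize_result(result)
print(line)
```

For a real run, replace `json.loads(EXAMPLE)` with `json.load(f)` on an opened file such as results/test_results/vgg11_224.json.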

Where Files Are Saved

results/
├── test_results/
│   ├── vgg11_224.json         ✅ Auto-saved
│   ├── vgg11_256.json         ✅ Auto-saved
│   ├── vgg11_320.json         ✅ Auto-saved
│   ├── vgg11_384.json         ✅ Auto-saved
│   ├── vgg11_512.json         ✅ Auto-saved
│   ├── resnet18_224.json      ✅ Auto-saved
│   └── ... (up to 40 files: 8 models × 5 resolutions)
│
├── all_results.csv            ❌ Manual (run aggregation script)
├── results_summary.json       ❌ Manual (run aggregation script)
└── detailed_report.txt        ❌ Manual (run aggregation script)

Automatic Saving Process

In src/experiment.py, after training each model:

def run_experiment(self, model_name, resolution, dataset_path):
    # ... training code ...
    
    # Save test results automatically
    self._save_results_to_json(
        model_name=model_name,
        resolution=resolution,
        test_metrics=test_metrics,
        profiling_data=profiling_data
    )
    # → Saves to results/test_results/{model_name}_{resolution}.json

No manual action needed! Results are saved automatically.
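
The body of `_save_results_to_json` is not shown here. As a rough sketch only, a standalone function taking similar arguments might look like the following. The function name `save_results_to_json`, the `output_dir` parameter, and the `best_val_loss` argument are assumptions made for illustration, and the real method may differ.

```python
import json
import tempfile
from datetime import datetime
from pathlib import Path


def save_results_to_json(model_name, resolution, test_metrics, profiling_data,
                         best_val_loss=None, output_dir="results/test_results"):
    """Write one result file named {model_name}_{resolution}.json (illustrative)."""
    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    payload = {
        "model": model_name,
        "resolution": resolution,
        "timestamp": datetime.now().isoformat(),
        "best_val_loss": best_val_loss,
        "test_metrics": test_metrics,
        "profiling": profiling_data,
    }
    out_path = out_dir / f"{model_name}_{resolution}.json"
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2)
    return out_path


# Demo: write into a temporary directory so the real results/ folder is untouched.
with tempfile.TemporaryDirectory() as tmp:
    path = save_results_to_json(
        "vgg11", 224,
        test_metrics={"loss": 0.5519, "accuracy": 0.8934},
        profiling_data={"avg_time_sec": 0.0231},
        best_val_loss=0.5441,
        output_dir=tmp,
    )
    saved_name = path.name
    saved = json.loads(path.read_text(encoding="utf-8"))

print(saved_name, saved["test_metrics"]["accuracy"])
```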

🔄 Manual Aggregation

Command: Aggregate Results

After training is complete, combine all JSON files into CSV and reports:

python scripts/aggregate_results.py

What happens:

1. Finds all results/test_results/*.json files
2. Combines them into results/all_results.csv
3. Generates results/results_summary.json
4. Creates results/detailed_report.txt
5. Prints a console summary
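
To make the combine step concrete, here is a minimal sketch of flattening nested results into CSV rows using only the standard library. It is not the code of scripts/aggregate_results.py. The helper names and the `test_`/`profile_` prefixing rule are assumptions inferred from the documented CSV, and the real script's column names may differ in places.

```python
import csv
import io
import json


def flatten_result(result: dict) -> dict:
    """Flatten one nested result dict into a single CSV-style row (assumed rule)."""
    row = {k: v for k, v in result.items() if not isinstance(v, dict)}
    for key, value in result.get("test_metrics", {}).items():
        row[f"test_{key}"] = value
    for key, value in result.get("profiling", {}).items():
        row[f"profile_{key}"] = value
    return row


def results_to_csv(results: list) -> str:
    """Combine flattened rows into CSV text, sorted by model then resolution."""
    rows = sorted((flatten_result(r) for r in results),
                  key=lambda r: (r["model"], r["resolution"]))
    fieldnames = list(rows[0].keys()) if rows else []
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Two tiny in-memory results standing in for files under results/test_results/.
raw = [
    '{"model": "vgg11", "resolution": 256, "test_metrics": {"accuracy": 0.8876},'
    ' "profiling": {"avg_time_sec": 0.0312}}',
    '{"model": "vgg11", "resolution": 224, "test_metrics": {"accuracy": 0.8934},'
    ' "profiling": {"avg_time_sec": 0.0231}}',
]
csv_text = results_to_csv([json.loads(text) for text in raw])
print(csv_text)
```

On disk, the `raw` list would be replaced by reading each file matched by `Path("results/test_results").glob("*.json")`.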

Options

Generate all outputs (default):

python scripts/aggregate_results.py

Generate only CSV:

python scripts/aggregate_results.py --csv-only

Generate only summary:

python scripts/aggregate_results.py --summary-only

Generate only text report:

python scripts/aggregate_results.py --report-only

Custom output filenames:

python scripts/aggregate_results.py \
  --csv my_results.csv \
  --summary my_summary.json \
  --report my_report.txt

Skip console output:

python scripts/aggregate_results.py --no-console

📈 CSV File Analysis

File Location

results/all_results.csv

CSV Columns

Column	Type	Example	Meaning
model	str	vgg11	Model architecture name
resolution	int	224	Input image resolution (pixels)
timestamp	str	2026-01-18T14:30:45	When experiment ran
best_val_loss	float	0.5441	Best validation loss during training
test_loss	float	0.5519	Final test loss
test_accuracy	float	0.8934	Test accuracy (0-1)
test_precision	float	0.8901	Precision score (0-1)
test_recall	float	0.8834	Recall score (0-1)
test_f1_score	float	0.8867	F1 score (0-1)
profile_avg_time_sec	float	0.0231	Avg inference time per image (seconds)
profile_throughput_samples_sec	float	1234.56	Images processed per second
profile_gpu_memory_peak_mb	int	4096	Peak GPU memory during inference (MB)
profile_peak_cpu_memory_mb	int	2048	Peak CPU memory during inference (MB)

Example CSV Content

model,resolution,timestamp,best_val_loss,test_loss,test_accuracy,test_precision,test_recall,test_f1_score,profile_avg_time_sec,profile_throughput_samples_sec,profile_gpu_memory_peak_mb,profile_peak_cpu_memory_mb
vgg11,224,2026-01-18T14:30:45,0.5441,0.5519,0.8934,0.8901,0.8834,0.8867,0.0231,1234.56,4096,2048
vgg11,256,2026-01-18T15:15:22,0.5523,0.5612,0.8876,0.8843,0.8776,0.8809,0.0312,956.34,5120,2256
vgg11,320,2026-01-18T16:45:10,0.5634,0.5723,0.8812,0.8779,0.8712,0.8745,0.0534,467.80,7168,2512
resnet18,224,2026-01-18T17:20:33,0.4234,0.4356,0.9123,0.9087,0.9023,0.9055,0.0089,3456.78,2048,1536
resnet18,256,2026-01-18T18:05:44,0.4345,0.4467,0.9067,0.9031,0.8967,0.8999,0.0134,2341.82,2560,1792
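
Before analyzing, it can help to confirm that every model-resolution pair made it into the CSV. The sketch below is one way to do that with the standard library. The function name `missing_pairs` is hypothetical, and the sample text is the five example rows above trimmed to three columns.

```python
import csv
import io
from itertools import product

SAMPLE = """model,resolution,test_accuracy
vgg11,224,0.8934
vgg11,256,0.8876
vgg11,320,0.8812
resnet18,224,0.9123
resnet18,256,0.9067
"""


def missing_pairs(csv_file) -> list:
    """Return (model, resolution) pairs absent from the CSV, in sorted order."""
    rows = list(csv.DictReader(csv_file))
    seen = {(row["model"], int(row["resolution"])) for row in rows}
    models = sorted({model for model, _ in seen})
    resolutions = sorted({res for _, res in seen})
    # Every model is expected at every resolution that appears anywhere in the file.
    return [pair for pair in product(models, resolutions) if pair not in seen]


gaps = missing_pairs(io.StringIO(SAMPLE))
print(gaps)
```

With the real file, pass an opened handle instead: `with open('results/all_results.csv', newline='') as f: missing_pairs(f)`.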

Opening in Excel

1. Open file: results/all_results.csv
2. Format as table:
   - Select all data
   - Format → As Table
3. Create pivot table:
   - Insert → PivotTable
   - Rows: model
   - Columns: resolution
   - Values: test_accuracy (average)
4. Create chart:
   - Select pivot table
   - Insert → Chart
   - X-axis: resolution
   - Y-axis: accuracy
   - Series: different models
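
The same pivot can be built in pandas with `pivot_table`. This sketch uses a small inline sample (values taken from the example rows above) in place of results/all_results.csv.

```python
import io

import pandas as pd

SAMPLE = """model,resolution,test_accuracy
vgg11,224,0.8934
vgg11,256,0.8876
resnet18,224,0.9123
resnet18,256,0.9067
"""

df = pd.read_csv(io.StringIO(SAMPLE))  # swap in 'results/all_results.csv' for real data

# Rows: model, columns: resolution, values: mean test accuracy.
pivot = df.pivot_table(index="model", columns="resolution",
                       values="test_accuracy", aggfunc="mean")
print(pivot)
```

If matplotlib is installed, `pivot.T.plot(marker="o")` draws one accuracy-versus-resolution line per model, which mirrors the Excel chart.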

Python Analysis Examples

Best Overall Model

import pandas as pd

df = pd.read_csv('results/all_results.csv')

# Top 5 best results
top5 = df.nlargest(5, 'test_accuracy')[['model', 'resolution', 'test_accuracy']]
print(top5)

Output:

      model  resolution  test_accuracy
0     vgg16         384           0.8956
1     vgg16         320           0.8923
2     vgg11         384           0.8912
3     resnet50      384           0.8901
4     resnet50      320           0.8876

Accuracy by Model

# Average accuracy per model
model_accuracy = df.groupby('model')['test_accuracy'].agg(['mean', 'min', 'max', 'std'])
print(model_accuracy.sort_values('mean', ascending=False))

Output:

              mean      min      max       std
model                                        
vgg16       0.8812  0.7234  0.8956  0.0512
vgg13       0.8756  0.7156  0.8923  0.0498
resnet50    0.8654  0.6834  0.8901  0.0587
...

Accuracy by Resolution

# Average accuracy per resolution
res_accuracy = df.groupby('resolution')['test_accuracy'].agg(['mean', 'min', 'max', 'std'])
print(res_accuracy.sort_values('resolution'))

Output:

            mean      min      max       std
resolution                                 
224        0.8234  0.5123  0.8956  0.0912
256        0.8456  0.6234  0.8923  0.0834
320        0.8612  0.7123  0.8876  0.0756
384        0.8734  0.7456  0.8945  0.0634
512        0.8756  0.7656  0.8967  0.0567

Speed vs Accuracy Tradeoff

# Rank by accuracy per millisecond of inference time (favors fast models)
df['accuracy_per_ms'] = df['test_accuracy'] / (df['profile_avg_time_sec'] * 1000)

best_tradeoff = df.nlargest(10, 'accuracy_per_ms')[['model', 'resolution', 'test_accuracy', 'profile_avg_time_sec']]
print(best_tradeoff)

Memory-Efficient Models

# Find models with lowest GPU memory
efficient = df.nsmallest(10, 'profile_gpu_memory_peak_mb')[['model', 'resolution', 'test_accuracy', 'profile_gpu_memory_peak_mb']]
print(efficient)

📋 Summary JSON

File Location

results/results_summary.json

Structure

{
  "generated_at": "2026-01-18T19:45:30.123456",
  
  "statistics": {
    "total_experiments": 40,
    "completed_experiments": 40,
    "avg_accuracy": 0.8523,
    "max_accuracy": 0.8956,
    "min_accuracy": 0.5234,
    "avg_inference_time_sec": 0.0345,
    "max_inference_time_sec": 0.0890,
    "min_inference_time_sec": 0.0089
  },
  
  "by_model": {
    "vgg11": {
      "experiments": 5,
      "avg_accuracy": 0.8612,
      "best_accuracy": 0.8923,
      "best_resolution": 320,
      "avg_inference_time": 0.0345
    },
    "resnet18": {
      "experiments": 5,
      "avg_accuracy": 0.8734,
      "best_accuracy": 0.8945,
      "best_resolution": 384,
      "avg_inference_time": 0.0134
    },
    ...
  },
  
  "by_resolution": {
    "224": {
      "experiments": 8,
      "avg_accuracy": 0.8234,
      "best_accuracy": 0.8456,
      "best_model": "resnet50"
    },
    "256": {
      "experiments": 8,
      "avg_accuracy": 0.8456,
      "best_accuracy": 0.8723,
      "best_model": "vgg16"
    },
    ...
  },
  
  "top_5_results": [
    {"model": "vgg16", "resolution": 384, "accuracy": 0.8956},
    {"model": "vgg16", "resolution": 320, "accuracy": 0.8923},
    {"model": "vgg11", "resolution": 384, "accuracy": 0.8912},
    {"model": "resnet50", "resolution": 384, "accuracy": 0.8901},
    {"model": "resnet50", "resolution": 320, "accuracy": 0.8876}
  ]
}
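
The `by_resolution` block can be cross-checked against the CSV. The following pandas sketch recomputes the same fields from a few sample rows. It illustrates one way such a summary could be derived, not how the aggregation script actually builds it.

```python
import io

import pandas as pd

SAMPLE = """model,resolution,test_accuracy
vgg11,224,0.8934
resnet18,224,0.9123
vgg11,256,0.8876
resnet18,256,0.9067
"""

df = pd.read_csv(io.StringIO(SAMPLE))  # swap in 'results/all_results.csv' for real data

by_resolution = {}
for resolution, group in df.groupby("resolution"):
    best_row = group.loc[group["test_accuracy"].idxmax()]  # row with highest accuracy
    by_resolution[str(resolution)] = {
        "experiments": int(len(group)),
        "avg_accuracy": round(float(group["test_accuracy"].mean()), 4),
        "best_accuracy": float(best_row["test_accuracy"]),
        "best_model": str(best_row["model"]),
    }

print(by_resolution)
```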

Reading Summary in Python

import json

with open('results/results_summary.json', 'r') as f:
    summary = json.load(f)

# Overall statistics
print(f"Total experiments: {summary['statistics']['total_experiments']}")
print(f"Average accuracy: {summary['statistics']['avg_accuracy']:.4f}")

# Top 5 results
print("\nTop 5 Results:")
for i, result in enumerate(summary['top_5_results'], 1):
    print(f"{i}. {result['model']}@{result['resolution']} - Accuracy: {result['accuracy']:.4f}")

# Best model per resolution
print("\nBest Model per Resolution:")
for res, data in summary['by_resolution'].items():
    print(f"  {res}×{res}: {data['best_model']} ({data['best_accuracy']:.4f})")

📄 Detailed Text Report

File Location

results/detailed_report.txt

Structure

The report contains:

Header - Generation timestamp, total experiments
Overall Statistics - Min/max/avg accuracy, speed metrics
Top 10 Results - Best performing model-resolution combos
By Model - Summary for each trained model
By Resolution - Summary for each resolution tested
Detailed Table - Complete results table

Example Content

============================================================
ResoMap Experiment Results Report
Generated: 2026-01-18 19:45:30.123456
============================================================

OVERALL STATISTICS
------------------
Total Experiments:    40
Completed:            40
Failed:               0

Accuracy:
  Average:            0.8523
  Best:               0.8956 (vgg16@384)
  Worst:              0.5234 (simple_cnn@64)
  Std Dev:            0.0845

Inference Speed:
  Average Time:       0.0345 sec/sample
  Throughput (avg):   567.89 samples/sec
  Fastest:            0.0089 sec (resnet18@224)
  Slowest:            0.0890 sec (vgg16@512)

Memory Usage:
  GPU Memory (avg):   4567.8 MB
  CPU Memory (avg):   2134.5 MB
  GPU Peak:           8192 MB
  CPU Peak:           4096 MB

============================================================
TOP 10 RESULTS
============================================================
1.  vgg16           @ 384×384 | Accuracy: 0.8956 | F1: 0.8923
2.  vgg16           @ 320×320 | Accuracy: 0.8923 | F1: 0.8890
3.  vgg11           @ 384×384 | Accuracy: 0.8912 | F1: 0.8879
4.  resnet50        @ 384×384 | Accuracy: 0.8901 | F1: 0.8868
5.  resnet50        @ 320×320 | Accuracy: 0.8876 | F1: 0.8843
...

============================================================
RESULTS BY MODEL
============================================================

VGG11:
  Experiments:        5
  Resolutions:        224, 256, 320, 384, 512
  Best Accuracy:      0.8923 @ 320×320
  Worst Accuracy:     0.7456 @ 224×224
  Avg Accuracy:       0.8612
  Avg Inference Time: 0.0345 sec
  Best Speed:         0.0231 sec @ 224×224

VGG16:
  Experiments:        5
  Resolutions:        224, 256, 320, 384, 512
  Best Accuracy:      0.8956 @ 384×384
  Worst Accuracy:     0.7634 @ 224×224
  Avg Accuracy:       0.8734
  Avg Inference Time: 0.0412 sec
  Best Speed:         0.0281 sec @ 224×224

...

============================================================
RESULTS BY RESOLUTION
============================================================

224×224:
  Experiments:        8
  Best Model:         resnet50 (0.8456)
  Worst Model:        simple_cnn (0.5123)
  Avg Accuracy:       0.8234
  Speediest:          resnet18 (0.0089 sec)

256×256:
  Experiments:        8
  Best Model:         vgg16 (0.8723)
  Worst Model:        tiny_cnn (0.5456)
  Avg Accuracy:       0.8456
  Speediest:          resnet18 (0.0123 sec)

...

============================================================
DETAILED RESULTS TABLE
============================================================
Model              Resolution  Accuracy  Precision  Recall    F1-Score  Time(ms)  GPU Mem(MB)
─────────────────────────────────────────────────────────────────────────────────────────────
vgg11              224         0.8456    0.8423     0.8334    0.8378    23.1      4096
vgg11              256         0.8612    0.8579     0.8490    0.8534    31.2      5120
vgg11              320         0.8923    0.8890     0.8801    0.8845    53.4      7168
...

============================================================

Viewing the Report

# Print to console (Windows)
type results\detailed_report.txt

# Print to console (Mac/Linux)
cat results/detailed_report.txt

# Search for a specific model (Mac/Linux)
grep "vgg11" results/detailed_report.txt

# List the best accuracy of each model (Mac/Linux)
grep "Best Accuracy" results/detailed_report.txt

🔍 Analysis Workflows

Workflow 1: Find Best Overall Model

# Show the top entry from the detailed report
grep -A2 "TOP 10" results/detailed_report.txt

Or in Python:

import pandas as pd
df = pd.read_csv('results/all_results.csv')
best = df.loc[df['test_accuracy'].idxmax()]
print(f"Best: {best['model']} @ {best['resolution']}px")
print(f"Accuracy: {best['test_accuracy']:.4f}")

Workflow 2: Compare Models at Fixed Resolution

# All models at 224×224, sorted by test_accuracy (column 6), best first
grep ",224," results/all_results.csv | sort -t, -k6,6 -rn

Or in Python:

import pandas as pd
df = pd.read_csv('results/all_results.csv')
at_224 = df[df['resolution'] == 224].sort_values('test_accuracy', ascending=False)
print(at_224[['model', 'test_accuracy']])

Workflow 3: Resolution Sensitivity per Model

# Plot accuracy vs resolution for VGG11
python -c "
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results/all_results.csv')
vgg11 = df[df['model'] == 'vgg11'].sort_values('resolution')

plt.figure(figsize=(8, 5))
plt.plot(vgg11['resolution'], vgg11['test_accuracy'], marker='o')
plt.xlabel('Resolution (pixels)')
plt.ylabel('Accuracy')
plt.title('VGG11: Accuracy vs Resolution')
plt.grid(True)
plt.savefig('accuracy_vs_resolution.png')
"

Workflow 4: Speed vs Accuracy Tradeoff

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('results/all_results.csv')

plt.figure(figsize=(10, 6))
for model in df['model'].unique():
    model_data = df[df['model'] == model]
    plt.scatter(model_data['profile_avg_time_sec'], 
               model_data['test_accuracy'],
               label=model, s=100)

plt.xlabel('Inference Time (seconds)')
plt.ylabel('Accuracy')
plt.title('Speed vs Accuracy Tradeoff')
plt.legend()
plt.grid(True)
plt.savefig('speed_accuracy_tradeoff.png')
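
A scatter plot shows the tradeoff visually. To list the configurations that are not beaten on both speed and accuracy at once, one option is a Pareto-front filter. This is an addition to the workflow rather than something ResoMap provides, and `pareto_front` is a hypothetical helper name. The inline sample reuses values from the example CSV rows.

```python
import io

import pandas as pd

SAMPLE = """model,resolution,test_accuracy,profile_avg_time_sec
vgg11,224,0.8934,0.0231
vgg11,256,0.8876,0.0312
resnet18,224,0.9123,0.0089
resnet18,256,0.9067,0.0134
"""


def pareto_front(df: pd.DataFrame) -> pd.DataFrame:
    """Keep each row that is strictly more accurate than all faster rows.

    Rows tied on time are ordered most accurate first, so only the best survives.
    """
    ordered = df.sort_values(["profile_avg_time_sec", "test_accuracy"],
                             ascending=[True, False])
    keep = []
    best_accuracy = float("-inf")
    for idx, row in ordered.iterrows():
        # Walking from fastest to slowest, a row survives only if it improves accuracy.
        if row["test_accuracy"] > best_accuracy:
            keep.append(idx)
            best_accuracy = row["test_accuracy"]
    return df.loc[keep]


df = pd.read_csv(io.StringIO(SAMPLE))  # swap in 'results/all_results.csv' for real data
front = pareto_front(df)
print(front[["model", "resolution", "test_accuracy", "profile_avg_time_sec"]])
```

In this small sample a single configuration is both the fastest and the most accurate, so the front has one row. With a full results file it will usually contain several.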

📊 Metrics Explained

Classification Metrics

Accuracy: (TP + TN) / (TP + TN + FP + FN)
- Overall correctness, 0-1 scale
- Best for balanced datasets
Precision: TP / (TP + FP)
- Of positive predictions, how many correct?
- Important when false positives are costly
Recall: TP / (TP + FN)
- Of actual positives, how many found?
- Important when false negatives are costly
F1-Score: 2 × (Precision × Recall) / (Precision + Recall)
- Harmonic mean of precision and recall
- Best for imbalanced data

Example (skin lesion classification):

100 test images total (50 melanoma, 50 benign)
90 correctly classified (45 true positives, 45 true negatives)
5 false positives (predicted melanoma, actually benign)
5 false negatives (predicted benign, actually melanoma)

Accuracy = (45 + 45)/100 = 0.90
Precision = 45/(45 + 5) = 0.90
Recall = 45/(45 + 5) = 0.90
F1 = 2 × (0.90 × 0.90)/(0.90 + 0.90) = 0.90
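
The arithmetic in this example can be checked in a few lines of Python. The counts below assume the 90 correct predictions split evenly into 45 true positives and 45 true negatives, which is what the 45/50 figures imply.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1_score": f1}


# 100 images: 45 true positives, 45 true negatives, 5 false positives, 5 false negatives.
metrics = classification_metrics(tp=45, tn=45, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

These formulas are the binary (one positive class) definitions. For multi-class problems the per-class values are averaged, and which averaging ResoMap uses is not stated here.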

Performance Metrics

Inference Time: How long to classify one image (seconds)
- Lower is better
- Includes data transfer to GPU
Throughput: Images classified per second
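- Reciprocal of per-image inference time when images are processed one at a time; with batched inference, batch size divided by time per batch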
- Higher is better
GPU Memory: Peak memory during inference (MB)
- Lower is better for deployment
- Affects batch sizes possible
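
As an illustration of how inference time and throughput relate, the sketch below times a stand-in inference function over several batches with `time.perf_counter`. Everything here (`measure`, `fake_inference`, the batch size) is hypothetical and does not reproduce ResoMap's profiler. On a GPU you would also need to synchronize the device before reading the clock.

```python
import time


def fake_inference(batch):
    """Stand-in for a model forward pass: just does a little arithmetic per item."""
    return [sum(x * x for x in item) for item in batch]


def measure(fn, batches):
    """Return (avg seconds per batch, samples per second) for fn over batches."""
    total_samples = 0
    start = time.perf_counter()
    for batch in batches:
        fn(batch)
        total_samples += len(batch)
    elapsed = time.perf_counter() - start
    return elapsed / len(batches), total_samples / elapsed


batch_size = 32
batches = [[[float(i)] * 256 for i in range(batch_size)] for _ in range(20)]
avg_batch_time, throughput = measure(fake_inference, batches)

# Throughput equals batch_size / avg_batch_time when all batches are the same size.
print(f"avg batch time: {avg_batch_time:.6f} s, throughput: {throughput:.1f} samples/s")
```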

🎯 Practical Examples

Example 1: Production Model Selection

import pandas as pd

df = pd.read_csv('results/all_results.csv')

# Target: >90% accuracy with <0.02 sec inference time
candidates = df[(df['test_accuracy'] > 0.90) & (df['profile_avg_time_sec'] < 0.02)]
print(candidates[['model', 'resolution', 'test_accuracy', 'profile_avg_time_sec']])

Example 2: Mobile Deployment

# Find the configurations with the lowest peak GPU memory among those with >85% accuracy
# (reuses df from Example 1)
candidates = df[df['test_accuracy'] > 0.85].nsmallest(5, 'profile_gpu_memory_peak_mb')
print(candidates[['model', 'resolution', 'test_accuracy', 'profile_gpu_memory_peak_mb']])

Example 3: Research Report

# Generate for paper
python scripts/aggregate_results.py --report-only
cat results/detailed_report.txt  # Copy to paper

Back: TRAINING_EXECUTION.md - How to run experiments
Home: PROJECT_SUMMARY.md - Project overview
