Model Pairing: GPT-4 vs. Mistral-3B (4 variants)
Tasks: Reasoning + Summarization
Evaluation: Accuracy, stepwise logic, summarization quality
Sample Size: Start small, scale to 500+ for statistical significance
Objective: To compare the accuracy and environmental impact of:
- One commercial LLM (closed-source)
- One open-source LLM in four configurations (as sketched after this list):
  - Original
  - Distilled
  - RAG-enhanced (Retrieval-Augmented Generation)
  - Distilled + RAG
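A minimal sketch of how the four open-model configurations could sit behind one interface. The `generate` stub, the toy corpus and retriever, and the checkpoint names (`mistral-3b`, `mistral-3b-distilled`) are illustrative assumptions, not the project's actual stack:

```python
# Sketch of the four open-model configurations behind one interface.
# `generate` is a stub standing in for the real model call; the corpus,
# retriever, and checkpoint names are illustrative placeholders.
from dataclasses import dataclass

CORPUS = [
    "Photosynthesis converts light energy into chemical energy.",
    "Newton's second law states that force equals mass times acceleration.",
]

def retrieve(question: str, k: int = 1) -> list[str]:
    """Toy lexical retrieval: rank passages by word overlap with the question."""
    words = set(question.lower().split())
    ranked = sorted(CORPUS, key=lambda p: -len(words & set(p.lower().split())))
    return ranked[:k]

def generate(model_name: str, prompt: str) -> str:
    # Placeholder: swap in a Hugging Face pipeline or an API call here.
    return f"[{model_name}] answer to: {prompt[:40]}..."

@dataclass
class Variant:
    model_name: str   # base vs. distilled checkpoint (names are hypothetical)
    use_rag: bool

    def answer(self, question: str) -> str:
        prompt = question
        if self.use_rag:
            context = "\n".join(retrieve(question))
            prompt = f"Context:\n{context}\n\nQuestion: {question}"
        return generate(self.model_name, prompt)

VARIANTS = [
    Variant("mistral-3b", use_rag=False),            # Original
    Variant("mistral-3b-distilled", use_rag=False),  # Distilled
    Variant("mistral-3b", use_rag=True),             # RAG-enhanced
    Variant("mistral-3b-distilled", use_rag=True),   # Distilled + RAG
]
```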
Model: GPT-4 (OpenAI)
Why:
- Industry benchmark for reasoning and summarization
- Strong performance across tasks
- Compatible with G-Eval evaluation
- API access available (paid); see the query sketch after this list
Alternative: Claude 3 Opus (Anthropic), strong in reasoning but
slightly weaker in summarization.
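A sketch of querying the commercial baseline through the OpenAI Python client (v1+), assuming `OPENAI_API_KEY` is set in the environment; the system prompt and `temperature=0` are our choices for comparability, not requirements of the evaluation:

```python
# Hedged sketch: querying the commercial baseline via the OpenAI API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_gpt4(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer step by step, then state the final answer."},
            {"role": "user", "content": question},
        ],
        temperature=0,  # near-deterministic output for a fair comparison
    )
    return response.choices[0].message.content

print(ask_gpt4("If a train travels 60 km in 45 minutes, what is its speed in km/h?"))
```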
Model: Mistral-3B
Why:
- Lightweight and energy-efficient → smaller carbon footprint than the 7B
- Good performance for its size and architecture
- Easy to distill and integrate with RAG (distillation loss sketched after this list)
- Active open-source community on Hugging Face
Alternative: Mistral-7B (legacy, more accurate but heavier) or
LLaMA-3-8B (requires stronger GPUs).
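For the distilled variants, the usual starting point is Hinton-style knowledge distillation: the student is trained to match the teacher's softened output distribution. A PyTorch sketch with toy shapes; the temperature value is illustrative, and real training would add the task loss on gold labels:

```python
# Minimal sketch of a knowledge-distillation objective: the student matches
# the teacher's softened output distribution (Hinton-style KD).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    # F.kl_div(log q, p) computes KL(p || q), i.e. teacher vs. student.
    log_q_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_q_student, p_teacher, reduction="batchmean") * temperature**2

# Toy check with random logits (batch=2, vocab=10):
s, t = torch.randn(2, 10), torch.randn(2, 10)
print(distillation_loss(s, t))
```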
Reasoning benchmarks (scoring sketch after this list):
- ARC (AI2 Reasoning Challenge: grade-school science questions)
- GSM8K (Math reasoning)
- ProofWriter (Step-by-step inference)
- LogiQA (Logical multiple choice)
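A sketch of sampling and scoring one of these benchmarks with the Hugging Face `datasets` package. GSM8K reference solutions end in `#### <answer>`, so exact-match accuracy is straightforward; the model call is a stub to be replaced with a GPT-4 or Mistral-3B pipeline:

```python
# Sketch: sample a reasoning benchmark and compute exact-match accuracy.
from datasets import load_dataset

def gold_answer(example: dict) -> str:
    # GSM8K reference solutions end with "#### <final answer>".
    return example["answer"].split("####")[-1].strip()

def model_answer(question: str) -> str:
    return "42"  # placeholder: replace with the actual model call

data = load_dataset("gsm8k", "main", split="test").select(range(50))  # preliminary tier
correct = sum(model_answer(ex["question"]) == gold_answer(ex) for ex in data)
print(f"accuracy: {correct / len(data):.2%}")
```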
Summarization datasets (judging sketch after this list):
- News articles
- Academic abstracts
- Narrative texts
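For summarization quality, a simplified G-Eval-style judge can reuse the GPT-4 client shown earlier: an LLM rates each summary against its source on a fixed rubric. The full G-Eval method also auto-generates chain-of-thought evaluation steps and weights scores by token probabilities; the rubric below is our own stripped-down approximation:

```python
# Simplified G-Eval-style judge: an LLM scores a summary against its source.
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a summary.
Source text:
{source}

Summary:
{summary}

Rate the summary's coherence, consistency, and coverage on a 1-5 scale.
Reply with the integer score only."""

def geval_score(source: str, summary: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(source=source, summary=summary)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```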
| Sampling Level | Purpose / Use Case | Reasoning | Summarization |
|---|---|---|---|
| Preliminary | Quick validation and failure detection | 50–100 | 50–100 |
| Reliable | Statistically meaningful trends | 200–500+ | 200–500+ |
| Academic | Comprehensive, benchmark-level evaluation | 1,000–10,000+ | 1,000–10,000+ |
Rationale:
- Preliminary: Initial signal of model behavior.
- Reliable: Minimum for statistically meaningful trends (200–500+ examples; see the margin-of-error sketch below).
- Academic: Benchmark-scale evaluation, in line with MMLU and MATH (1,000+ examples).
| Ref | Benchmark / Source | Justification |
|---|---|---|
| G1 | MMLU Benchmark | 57 subjects, thousands of questions → 1,000+ needed |
| G2 | MATH Benchmark | 12,500 math problems → 1,000+ subset valid |
| G3 | ANLI / LLM Eval | 1,200 test examples → supports 200–500+ |
| G4 | ML Sample Size | 500+ examples gives strong validity in ML research |
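These tiers line up with a back-of-the-envelope check: the 95% margin of error for an accuracy estimate is roughly z·sqrt(p(1-p)/n), worst case at p = 0.5:

```python
# 95% margin of error for an estimated accuracy, by sample size.
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 200, 500, 1000, 10_000):
    print(f"n={n:>6}: ±{margin_of_error(n):.1%}")
# n=50 -> ±13.9%, n=500 -> ±4.4%, n=1000 -> ±3.1%: hence the tiers above.
```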
Unique because:
- Compares versions of the same open-source model.
- Evaluates accuracy + environmental impact (energy, CO₂; tracking sketch after this list).
Valuable because:
- Helps understand trade-offs between performance and footprint.
- Designed for student teams with limited resources.
- Provides replicable framework for ethical + technical evaluation.
- Supports the global shift toward eco-conscious AI.
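For the environmental side of the comparison, one practical option is the `codecarbon` package, which estimates the energy use and CO₂-equivalent emissions of a code block. A sketch, assuming `codecarbon` is installed; the inference loop is a stand-in for the real benchmark run:

```python
# Estimate energy use and CO2-equivalent emissions of an evaluation run.
from codecarbon import EmissionsTracker

tracker = EmissionsTracker(project_name="mistral-3b-eval")
tracker.start()
try:
    for question in ["Q1", "Q2"]:  # stand-in for the benchmark loop
        _ = question.upper()       # replace with the model inference call
finally:
    emissions_kg = tracker.stop()  # estimated kg CO2-equivalent
print(f"estimated emissions: {emissions_kg:.6f} kg CO2eq")
```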
Model notes:
- Mistral-7B is a legacy model (as of March 2025) but still benchmarked.
- Mistral-3B offers better efficiency, lower GPU use, smaller footprint.
- Our main open-source model: Mistral-3B
- Mistral-7B appears as a baseline reference.
- Mistral-Nemo: Mentioned as a next-generation model for discussion.