Commit 2c0ec37 (2 parents: 0b41605 + 4dd8995)
Merge pull request #5 from MIT-Emerging-Talent/model
Model selection and justification document
# Comparing Open-Source and Commercial LLMs on Reasoning and Summarization Tasks

## Summary

**Model Pairing:** GPT-4 vs. Mistral-3B (4 variants)

**Tasks:** Reasoning + Summarization

**Evaluation:** Accuracy, stepwise logic, summarization quality

**Sample Size:** Start small (50–100), scale to 500+ for statistical significance

---
## Goal

To compare the accuracy and environmental impact of:

- One commercial LLM (closed-source)
- One open-source LLM in four configurations:
  - Original
  - Distilled
  - RAG-enhanced (Retrieval-Augmented Generation)
  - Distilled + RAG
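The baseline plus four configurations can be laid out as a small experiment grid. A minimal sketch (the variant names and the `experiment_runs` helper are illustrative placeholders, not official model identifiers):

```python
# Hypothetical experiment grid for the four open-source variants.
# Variant names are placeholders, not official Hugging Face model IDs.
VARIANTS = {
    "original":      {"distilled": False, "rag": False},
    "distilled":     {"distilled": True,  "rag": False},
    "rag":           {"distilled": False, "rag": True},
    "distilled_rag": {"distilled": True,  "rag": True},
}

def experiment_runs(commercial="gpt-4", open_model="mistral-3b"):
    """Yield every (model, variant) pair the study evaluates."""
    yield (commercial, None)  # closed-source baseline, run as-is
    for variant in VARIANTS:
        yield (open_model, variant)

runs = list(experiment_runs())  # 1 baseline + 4 variants = 5 runs
```

Enumerating the runs up front keeps the comparison symmetric: every task and sample is scored once per entry in this grid.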
---

## Recommended Commercial Model

**Model:** GPT-4 (OpenAI)

**Why:**

- Industry benchmark for reasoning and summarization
- Strong performance across tasks
- Compatible with G-Eval evaluation
- API access available (paid)

**Alternative:** Claude 3 Opus (Anthropic): strong in reasoning, slightly weaker in summarization.

---
## Recommended Open-Source Model

**Model:** Mistral-3B

**Why:**

- Lightweight and energy-efficient — smaller carbon footprint than 7B
- Good performance for its size and architecture
- Easy to distill and integrate with RAG
- Active open-source community on Hugging Face

**Alternative:** Mistral-7B (legacy, more accurate but heavier) or LLaMA-3-8B (requires stronger GPUs).

---
## Evaluation Strategy

### 1. Reasoning Tasks

- ARC (AI2 Reasoning Challenge / grade-school science questions)
- GSM8K (Math reasoning)
- ProofWriter (Step-by-step inference)
- LogiQA (Logical multiple choice)
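For the math-style sets (e.g. GSM8K, where gold answers end in `#### <number>`), scoring can start as exact match on the final numeric answer. A minimal sketch, assuming free-form text predictions (`extract_final_answer` and `accuracy` are our names, not part of any benchmark toolkit):

```python
def extract_final_answer(text: str) -> str:
    """Return the last number-like token in a response.
    GSM8K gold answers end with '#### <number>', so the same
    extractor works for model output and reference answers."""
    for token in reversed(text.replace("####", " ").split()):
        cleaned = token.strip(".,$").replace(",", "")
        if cleaned.lstrip("-").replace(".", "", 1).isdigit():
            return cleaned
    return ""

def accuracy(predictions, references):
    """Fraction of examples whose final numeric answer matches the gold."""
    hits = sum(extract_final_answer(p) == extract_final_answer(r)
               for p, r in zip(predictions, references))
    return hits / len(references)
```

Multiple-choice sets (ARC, LogiQA) would swap the numeric extractor for a choice-letter one; the accuracy loop stays the same.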
65+
66+
### 2. Summarization Tasks
67+
68+
- News articles
69+
- Academic abstracts
70+
- Narrative texts
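Summaries over these sources can be scored with n-gram overlap. Below is a simplified ROUGE-1 F1 (unigram overlap only; the standard ROUGE implementation adds stemming and the ROUGE-2/ROUGE-L variants):

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a candidate summary and a reference.
    A simplified stand-in for ROUGE-1; no stemming or stopword handling."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Overlap metrics alone miss faithfulness, which is why the plan pairs them with G-Eval-style LLM judging.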
---

## Sample Size Recommendations

| Sampling Level | Purpose / Use Case | Reasoning (examples) | Summarization (examples) |
|----------------|--------------------|----------------------|--------------------------|
| Preliminary | Quick validation and failure detection | 50–100 | 50–100 |
| Reliable | Statistically meaningful trends | 200–500+ | 200–500+ |
| Academic | Comprehensive benchmark-level evaluation | 1,000–10,000+ | 1,000–10,000+ |
**Rationale:**

- Preliminary: Initial signal of model behavior.
- Reliable: Minimum for academic validity (500+ examples).
- Academic: Derived from MMLU and MATH benchmarks (1,000+ examples).
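These tiers are consistent with the standard sample-size formula for estimating a proportion (here, accuracy) within a margin of error at roughly 95% confidence. The helper below is our own back-of-envelope check, not taken from the cited benchmarks:

```python
import math

def required_sample_size(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """n = z^2 * p * (1 - p) / margin^2 for a proportion at ~95% confidence.
    p = 0.5 is the worst case (maximum variance), so this is conservative."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

# A +/-5% margin needs 385 examples; +/-10% needs 97, which lines up
# with the "Reliable" (200-500+) and "Preliminary" (50-100) tiers.
```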
---

## Academic Justification of Sample Size

| Ref | Benchmark / Source | Justification |
|-----|--------------------|---------------|
| G1 | MMLU Benchmark | 57 subjects, thousands of Qs → 1,000+ needed |
| G2 | MATH Benchmark | 12,500 math problems → 1,000+ subset valid |
| G3 | ANLI / LLM Eval | 1,200 test examples → supports 200–500+ |
| G4 | ML Sample Size | 500+ gives strong validity in ML research |
---

## Why This Project Is Niche and Valuable

**Unique because:**

- Compares *versions* of the same open-source model.
- Evaluates *accuracy + environmental impact* (energy, CO₂).
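The environmental side can be estimated from measured GPU power draw and runtime. A minimal sketch, where the 400 gCO₂e/kWh grid intensity is a rough placeholder (tools such as CodeCarbon measure energy use and local grid intensity directly):

```python
def co2e_grams(gpu_watts: float, runtime_hours: float,
               grid_intensity_g_per_kwh: float = 400.0) -> float:
    """Estimate emissions as energy (kWh) times grid carbon intensity.
    400 gCO2e/kWh is a rough global-average placeholder; substitute the
    local grid figure (or measure with a tool like CodeCarbon)."""
    kwh = gpu_watts * runtime_hours / 1000.0
    return kwh * grid_intensity_g_per_kwh
```

For example, a 300 W GPU running for 2 hours uses about 0.6 kWh, roughly 240 gCO₂e at the placeholder intensity; logging this per run lets accuracy and footprint be compared across all five configurations.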
**Valuable because:**

- Helps understand trade-offs between performance and footprint.
- Designed for student teams with limited resources.
- Provides replicable framework for *ethical + technical* evaluation.
- Supports the global shift toward *eco-conscious AI*.

**References:**

- [DeepSeek vs GPT-4 vs LLaMA vs Mistral vs Cohere](https://www.aubergine.co/insights/deepseek-v3-vs-gpt-4-vs-llama-3-vs-mistral-7b-vs-cohere)
- [Mistral vs GPT comparison](https://dev.to/abhinowww/mistral-vs-gpt-a-comprehensive-comparison-of-leading-ai-models-2lk2)
---

## Note on Mistral Model Selection

- Mistral-7B is a *legacy model* (as of March 2025) but still benchmarked.
- Mistral-3B offers better efficiency, lower GPU use, and a smaller footprint.
- Our main open-source model: **Mistral-3B**.
- Mistral-7B appears as a baseline reference.
- Mistral-Nemo: mentioned as a next-generation model for discussion.
