Commit 4848d6e

Merge pull request #1 from nmdra/copilot/fine-tune-smollm2-135m-lora
Add SmolLM2 training pipeline with script + notebook outputs
2 parents b524e0d + 2bad1a3 commit 4848d6e

6 files changed

Lines changed: 676 additions & 1 deletion

.gitignore

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+__pycache__/
+*.py[cod]
+.venv/
+outputs/
+smollm-student-extractor/
+smollm-student-gguf/
+data/dataset.json

README.md

Lines changed: 55 additions & 1 deletion
@@ -1,2 +1,56 @@
 # Assignment-Metadata-Extractor
-A lightweight, locally hosted LLM pipeline that extracts and normalizes unstructured student assignment text into a strict JSON schema using a fine-tuned SmolLM2-135M model served via Ollama.
+
+A lightweight, locally hosted LLM pipeline that extracts and normalizes unstructured student assignment text into a strict JSON schema using a fine-tuned SmolLM2 model served via Ollama.
+
+## Model
+
+Hugging Face model: https://huggingface.co/nimendraai/SmolLM2-360M-Assignment-Metadata-Extractor
+
+## Training (Part 1)
+
+This repository includes:
+
+- `data/generate_dataset.py`: Generates a diverse synthetic training dataset.
+- `training/train.py`: Fine-tunes `HuggingFaceTB/SmolLM2-135M-Instruct` with Unsloth + LoRA and exports GGUF.
+- `training/train.ipynb`: Jupyter notebook version of the same training pipeline.
+
+## Environment Setup (UV)
+
+This project uses **uv** for package management.
+
+```bash
+uv venv
+source .venv/bin/activate # Linux/macOS
+# .venv\Scripts\activate # Windows PowerShell
+
+uv sync
+```
+
+If you want to run the notebook locally:
+
+```bash
+uv run python -m ipykernel install --user --name assignment-metadata-extractor
+```
+
+## Generate Dataset
+
+```bash
+uv run python data/generate_dataset.py --size 400 --output data/dataset.json
+```
+
+## Run Fine-Tuning Script
+
+```bash
+uv run python training/train.py
+```
+
+Outputs:
+
+- `./smollm-student-extractor/` (HuggingFace format)
+- `./smollm-student-gguf/model-Q4_K_M.gguf` (Ollama-ready GGUF)
+
+## Run Notebook Version
+
+```bash
+uv run jupyter notebook training/train.ipynb
+```
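
The README's "Generate Dataset" step writes `data/dataset.json`, and the project description promises a strict JSON schema. A minimal sketch of a sanity check that could be run on that file before training is given here. It assumes the record layout produced by `data/generate_dataset.py` in this commit (`instruction`, `input`, and an `output` string holding a JSON object with three string fields). The names `REQUIRED_FIELDS`, `validate_record`, and `validate_dataset` are hypothetical and do not exist in the repository.

```python
import json

# Field names taken from data/generate_dataset.py in this commit.
REQUIRED_FIELDS = ("student_number", "student_name", "assignment_number")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one dataset record (empty if none)."""
    problems = [
        f"missing or non-string key: {key}"
        for key in ("instruction", "input", "output")
        if not isinstance(record.get(key), str)
    ]
    if problems:
        return problems

    # The training target is stored as a JSON string, so it must parse.
    try:
        target = json.loads(record["output"])
    except json.JSONDecodeError as exc:
        return [f"output is not valid JSON: {exc}"]
    if not isinstance(target, dict):
        return ["output is not a JSON object"]

    for field in REQUIRED_FIELDS:
        if not isinstance(target.get(field), str):
            problems.append(f"target field missing or not a string: {field}")
    extra = sorted(set(target) - set(REQUIRED_FIELDS))
    if extra:
        problems.append(f"unexpected target fields: {extra}")
    return problems


def validate_dataset(records: list[dict]) -> dict[int, list[str]]:
    """Map record index -> problems, keeping only records that have problems."""
    report = {}
    for index, record in enumerate(records):
        problems = validate_record(record)
        if problems:
            report[index] = problems
    return report


# Small in-memory demonstration; the second record has a numeric student number.
sample = [
    {
        "instruction": "Extract student info as JSON from the following text.",
        "input": "Student No: 20210000 | Name: Amal Perera | HW: 1",
        "output": json.dumps(
            {
                "student_number": "20210000",
                "student_name": "Amal Perera",
                "assignment_number": "1",
            }
        ),
    },
    {
        "instruction": "Extract student info as JSON from the following text.",
        "input": "ID = 20210001, Name = Nimal Silva, HW = 2",
        "output": '{"student_number": 20210001}',
    },
]
print(validate_dataset(sample))
```

To check the generated file instead of the in-memory sample, load it with `json.loads(Path("data/dataset.json").read_text(encoding="utf-8"))` (with `from pathlib import Path`) and pass the resulting list to `validate_dataset`.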

data/generate_dataset.py

Lines changed: 89 additions & 0 deletions
@@ -0,0 +1,89 @@
+import argparse
+import json
+import random
+from pathlib import Path
+
+STUDENT_NUMBER_KEYS = [
+    "Student No",
+    "Stu. ID",
+    "Student ID",
+    "ID",
+    "Reg No",
+    "Registration No",
+    "Index No",
+    "Reg. Number",
+]
+STUDENT_NAME_KEYS = ["Name", "Full Name", "Student Name", "Student", "Stu. Name"]
+ASSIGNMENT_KEYS = [
+    "Assignment #",
+    "Assignment No",
+    "HW",
+    "Task No",
+    "Submission No",
+    "Assgn #",
+    "Worksheet No",
+]
+SEPARATORS = [": ", " - ", " = ", ": "]
+LINE_BREAKS = ["\n", " | ", ", ", " "]
+
+NAMES = [
+    "Amal Perera",
+    "Nimal Silva",
+    "Kasun Fernando",
+    "Dilini Rathnayake",
+    "Chamara Bandara",
+    "Sithum Jayawardena",
+    "Amali Gunasekara",
+]
+
+
+def make_example(student_num: int, name: str, assign_num: int) -> dict:
+    sk = random.choice(STUDENT_NUMBER_KEYS)
+    nk = random.choice(STUDENT_NAME_KEYS)
+    ak = random.choice(ASSIGNMENT_KEYS)
+    sep = random.choice(SEPARATORS)
+    lb = random.choice(LINE_BREAKS)
+    text = f"{sk}{sep}{student_num}{lb}{nk}{sep}{name}{lb}{ak}{sep}{assign_num}"
+    return {
+        "instruction": "Extract student info as JSON from the following text.",
+        "input": text,
+        "output": json.dumps(
+            {
+                "student_number": str(student_num),
+                "student_name": name,
+                "assignment_number": str(assign_num),
+            }
+        ),
+    }
+
+
+def generate_dataset(size: int) -> list[dict]:
+    dataset = [
+        make_example(20210000 + i, random.choice(NAMES), (i % 10) + 1) for i in range(size)
+    ]
+    random.shuffle(dataset)
+    return dataset
+
+
+def main() -> None:
+    parser = argparse.ArgumentParser(description="Generate student extraction dataset.")
+    parser.add_argument("--size", type=int, default=400, help="Number of examples to generate.")
+    parser.add_argument(
+        "--output",
+        default="data/dataset.json",
+        help="Output JSON path.",
+    )
+    args = parser.parse_args()
+
+    output_path = Path(args.output)
+    output_path.parent.mkdir(parents=True, exist_ok=True)
+
+    dataset = generate_dataset(args.size)
+    with output_path.open("w", encoding="utf-8") as f:
+        json.dump(dataset, f, indent=2)
+
+    print(f"Generated {len(dataset)} examples at {output_path}")
+
+
+if __name__ == "__main__":
+    main()
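
For orientation, here is a minimal sketch of a single record as `make_example` would build it, with the random choices pinned to fixed values so the result is reproducible. The template string and field names are copied from the script; `render_example` is a hypothetical helper written only for this illustration and is not part of the commit.

```python
import json


def render_example(student_key, name_key, assignment_key, sep, line_break,
                   student_num, name, assign_num):
    """Build one record with the same template as make_example, minus the randomness."""
    text = (
        f"{student_key}{sep}{student_num}{line_break}"
        f"{name_key}{sep}{name}{line_break}"
        f"{assignment_key}{sep}{assign_num}"
    )
    return {
        "instruction": "Extract student info as JSON from the following text.",
        "input": text,
        # The target is stored as a JSON string with every value cast to str.
        "output": json.dumps(
            {
                "student_number": str(student_num),
                "student_name": name,
                "assignment_number": str(assign_num),
            }
        ),
    }


record = render_example(
    "Student No", "Name", "HW", ": ", " | ", 20210000, "Amal Perera", 1
)
print(record["input"])
# Student No: 20210000 | Name: Amal Perera | HW: 1
print(record["output"])
# {"student_number": "20210000", "student_name": "Amal Perera", "assignment_number": "1"}
```

Because the script itself never seeds `random`, two runs with the same `--size` will generally produce different files. Calling `random.seed(...)` before `generate_dataset` would be one way to make a run repeatable, if that matters for comparing training runs.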

pyproject.toml

Lines changed: 18 additions & 0 deletions
@@ -0,0 +1,18 @@
+[project]
+name = "assignment-metadata-extractor"
+version = "0.1.0"
+description = "SmolLM2 fine-tuning pipeline for student assignment metadata extraction"
+readme = "README.md"
+requires-python = ">=3.10,<3.12"
+dependencies = [
+    "accelerate>=1.2.1",
+    "datasets>=3.2.0",
+    "ipykernel>=6.29.5",
+    "jupyter>=1.1.1",
+    "llama-cpp-python>=0.3.7",
+    "trl>=0.12.2",
+    "unsloth>=2025.2.15",
+]
+
+[tool.uv]
+package = false
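
The `requires-python = ">=3.10,<3.12"` constraint means only Python 3.10 and 3.11 interpreters qualify. A minimal sketch of the same check in Python follows, useful for example as a guard at the top of a notebook that may be opened with a different kernel. `python_supported` is a hypothetical helper, not something the commit defines, and comparing only the major and minor version is a simplification of full version-specifier matching.

```python
import sys


def python_supported(version=None) -> bool:
    """Return True if (major, minor) falls in the >=3.10,<3.12 range."""
    if version is None:
        version = sys.version_info[:2]
    return (3, 10) <= tuple(version) < (3, 12)


if not python_supported():
    print(
        f"Python {sys.version_info.major}.{sys.version_info.minor} "
        "is outside the range declared in pyproject.toml"
    )
```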
