Commit bd38d1f (2 parents: 6c89019 + 957aa21)

Merge pull request #9 from martysai/ecstatic-mclaren

Add multi-model support for Qwen 2.5 and Qwen3 Coder

File tree: 7 files changed, +1023 −49 lines

.gitattributes (27 additions, 0 deletions)

```diff
@@ -0,0 +1,27 @@
+# Auto detect text files and perform LF normalization
+* text=auto
+
+# Python files
+*.py text eol=lf
+
+# Markdown files
+*.md text eol=lf
+
+# JSON files
+*.json text eol=lf
+
+# YAML files
+*.yml text eol=lf
+*.yaml text eol=lf
+
+# Shell scripts
+*.sh text eol=lf
+
+# Configuration files
+*.toml text eol=lf
+*.cfg text eol=lf
+*.ini text eol=lf
+
+# Keep Windows batch files with CRLF
+*.bat text eol=crlf
+*.cmd text eol=crlf
```
README.md (125 additions, 11 deletions)
````diff
@@ -54,8 +54,10 @@ Test Dataset + Model Predictions --> [benchmark.py] --> Metrics Report
   QLoRA (4-bit quantization) for training on 1-2 A100 GPUs.
 
 - **`serve.py`** - FastAPI inference server that uses ollama API to generate
-  docstrings. The server uses a hard-coded system prompt for NumPy-style docstring
-  generation.
+  docstrings. Supports multiple Qwen Coder models with model-specific configurations.
+
+- **`models.py`** - Model configuration registry with sampling parameters for
+  Qwen 2.5 Coder and Qwen3 Coder variants.
 
 ### Evaluation (`src/evaluation/`)
 
````
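The `models.py` registry introduced in this hunk could be sketched roughly as below. This is an illustrative reconstruction from the README's model table and sampling parameters, not the commit's actual code; the class name, field names, and registry shape are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Per-model metadata and sampling defaults for an Ollama-served model."""
    key: str             # registry key, e.g. "qwen2.5-coder-32b"
    ollama_model: str    # name passed through to Ollama
    temperature: float
    top_p: float
    top_k: int
    context_window: int


# Entries mirror the README's model table and sampling parameters.
MODELS = {
    "qwen2.5-coder-32b": ModelConfig("qwen2.5-coder-32b", "qwen2.5-coder:32b", 0.7, 0.9, 40, 32768),
    "qwen2.5-coder-14b": ModelConfig("qwen2.5-coder-14b", "qwen2.5-coder:14b", 0.7, 0.9, 40, 32768),
    "qwen2.5-coder-7b": ModelConfig("qwen2.5-coder-7b", "qwen2.5-coder:7b", 0.7, 0.9, 40, 32768),
    "qwen3-coder-30b": ModelConfig("qwen3-coder-30b", "qwen3-coder:30b-a3b", 1.0, 0.95, 40, 262144),
}
```

A frozen dataclass keeps each configuration immutable, so request handlers can share the registry without copying.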
````diff
@@ -97,15 +99,22 @@ ollama as the backend. The server uses a system prompt stored in
 ### Prerequisites
 
 1. **Install ollama**: Make sure [ollama](https://ollama.ai/) is installed and running locally
-2. **Pull a model**: Download a code model (e.g., `qwen2.5-coder:32b`):
+2. **Pull a model**: Download one of the supported code models:
 ```bash
-ollama pull qwen2.5-coder:32b
+# Qwen 2.5 Coder (dense models)
+ollama pull qwen2.5-coder:32b    # Default, ~18GB Q4
+ollama pull qwen2.5-coder:14b    # Mid-size, ~8GB Q4
+ollama pull qwen2.5-coder:7b     # Fast, ~4GB Q4
+
+# Qwen3 Coder (MoE model)
+ollama pull qwen3-coder:30b-a3b  # Best quality, ~18GB Q4, 256K context
 ```
 
 ### Starting the Server
 
 Start the FastAPI server using uvicorn:
 
+**Linux/macOS:**
 ```bash
 # Using uvicorn directly
 uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
````
````diff
@@ -114,19 +123,75 @@ uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
 python -m src.training.serve
 ```
 
+**Windows (PowerShell):**
+```powershell
+uvicorn src.training.serve:app --host 0.0.0.0 --port 8000
+```
+
 The server will start on `http://localhost:8000` by default.
 
 ### Configuration
 
 The server can be configured using environment variables:
 
 - `OLLAMA_URL` - Ollama API endpoint (default: `http://localhost:11434/api/chat`)
-- `OLLAMA_MODEL` - Model name to use (default: `qwen2.5-coder:32b`)
+- `OLLAMA_MODEL` - Model key or Ollama model name (default: `qwen2.5-coder-32b`)
 - `REQUEST_TIMEOUT` - Request timeout in seconds (default: `120.0`)
 
-Example:
+**Linux/macOS:**
+```bash
+OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app --port 8000
+```
+
+**Windows (PowerShell):**
+```powershell
+$env:OLLAMA_MODEL="qwen3-coder-30b"; uvicorn src.training.serve:app --port 8000
+```
+
+**Windows (CMD):**
+```cmd
+set OLLAMA_MODEL=qwen3-coder-30b && uvicorn src.training.serve:app --port 8000
+```
+
+### Available Models
+
+| Model Key | Ollama Model | Architecture | Memory (Q4) | Context | Description |
+|-----------|--------------|--------------|-------------|---------|-------------|
+| `qwen2.5-coder-32b` | `qwen2.5-coder:32b` | Dense | ~18GB | 32K | Default, balanced quality/speed |
+| `qwen2.5-coder-14b` | `qwen2.5-coder:14b` | Dense | ~8GB | 32K | Mid-size, good performance |
+| `qwen2.5-coder-7b` | `qwen2.5-coder:7b` | Dense | ~4GB | 32K | Fast inference |
+| `qwen3-coder-30b` | `qwen3-coder:30b-a3b` | MoE | ~18GB | 256K | Best quality, 3.3B active params |
+
+Each model has optimized sampling parameters:
+- **Qwen 2.5 Coder**: temperature=0.7, top_p=0.9, top_k=40
+- **Qwen3 Coder**: temperature=1.0, top_p=0.95, top_k=40 (per official recommendations)
+
+### Model Selection
+
+You can select a model in two ways:
+
+1. **Environment variable** (applies to all requests):
+   ```bash
+   OLLAMA_MODEL=qwen3-coder-30b uvicorn src.training.serve:app
+   ```
+
+2. **Per-request** (via API):
+   ```bash
+   curl -X POST http://localhost:8000/generate \
+     -H "Content-Type: application/json" \
+     -d '{"code": "def add(x, y): return x + y", "model": "qwen3-coder-30b"}'
+   ```
+
+### List Available Models
+
+**Via CLI:**
+```bash
+python scripts/run_ollama.py --list-models
+```
+
+**Via API:**
 ```bash
-OLLAMA_MODEL=qwen2.5-coder:7b uvicorn src.training.serve:app --port 8000
+curl http://localhost:8000/models
 ```
 
 ### API Endpoints
````
````diff
@@ -143,7 +208,9 @@ curl http://localhost:8000/health
 ```json
 {
   "status": "healthy",
-  "service": "ollama"
+  "service": "ollama",
+  "active_model": "Qwen 2.5 Coder 32B",
+  "ollama_model": "qwen2.5-coder:32b"
 }
 ```
 
````
````diff
@@ -169,12 +236,14 @@ curl -X POST http://localhost:8000/generate \
 
 **Request Body:**
 - `code` (required): Python function code as a string
-- `max_new_tokens` (optional): Maximum number of tokens to generate (default: 256)
+- `max_new_tokens` (optional): Maximum number of tokens to generate (uses model default if not specified)
+- `model` (optional): Model key or Ollama model name to use for this request
 
 **Response (200 OK):**
 ```json
 {
-  "docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n    First number.\ny : int\n    Second number.\n\nReturns\n-------\nint\n    Sum of x and y.\n\"\"\""
+  "docstring": "\"\"\"Compute the sum of two numbers.\n\nParameters\n----------\nx : int\n    First number.\ny : int\n    Second number.\n\nReturns\n-------\nint\n    Sum of x and y.\"\"\"",
+  "model": "qwen2.5-coder:32b"
 }
 ```
 
````
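Once `/generate` returns a docstring, a caller still has to splice it into the function source. A naive sketch of that step (`apply_docstring` is a hypothetical helper, not part of the commit; it assumes a one-line `def` signature, where a robust version would use the `ast` module):

```python
import textwrap


def apply_docstring(func_src: str, docstring: str) -> str:
    """Insert a generated docstring directly after the first `def ...:` line.

    Deliberately simple: splits off the header line and indents the
    docstring one level to match the function body.
    """
    header, _, body = func_src.partition("\n")
    return header + "\n" + textwrap.indent(docstring, "    ") + "\n" + body


src = "def add(x, y):\n    return x + y"
new_src = apply_docstring(src, '"""Compute the sum of two numbers."""')
```

The result is itself valid Python, so it can be round-tripped through `exec` or written back to the source file.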
````diff
@@ -185,12 +254,57 @@ curl -X POST http://localhost:8000/generate \
 }
 ```
 
+#### List Models
+
+Get available model configurations:
+
+```bash
+curl http://localhost:8000/models
+```
+
+**Response (200 OK):**
+```json
+{
+  "default": "qwen2.5-coder-32b",
+  "active": "qwen2.5-coder-32b",
+  "models": [
+    {
+      "key": "qwen2.5-coder-32b",
+      "name": "Qwen 2.5 Coder 32B",
+      "ollama_model": "qwen2.5-coder:32b",
+      "context_window": 32768,
+      "architecture": "dense",
+      "memory_q4": "~18GB",
+      "description": "Dense 32B model, good balance of quality and speed"
+    }
+  ]
+}
+```
+
+### CLI Tool
+
+The CLI tool allows testing docstring generation directly:
+
+```bash
+# Use default model
+python scripts/run_ollama.py --user "def add(x, y): return x + y"
+
+# Use specific model by key
+python scripts/run_ollama.py --model-key qwen3-coder-30b --user "def foo(): pass"
+
+# Use raw Ollama model name
+python scripts/run_ollama.py --model qwen2.5-coder:7b --user "def bar(): pass"
+
+# List available models
+python scripts/run_ollama.py --list-models
+```
+
 ### Testing
 
 Run the test suite to verify the API endpoints:
 
 ```bash
-pytest tests/test_serve.py -v
+pytest tests/test_serve.py tests/test_models.py -v
 ```
 
 ## Dataset
````
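The updated pytest command also exercises `tests/test_models.py`. A server-independent test of key-versus-raw-name resolution, one behavior the README documents, might look like the sketch below; the registry and resolver here are stand-ins, not the commit's actual test code.

```python
# Stand-in registry; the real tests would exercise the project's models module.
REGISTRY = {
    "qwen2.5-coder-32b": "qwen2.5-coder:32b",
    "qwen3-coder-30b": "qwen3-coder:30b-a3b",
}


def resolve_ollama_name(name: str) -> str:
    """Accept either a registry key or a raw Ollama model name."""
    if name in REGISTRY:
        return REGISTRY[name]
    if name in REGISTRY.values():
        return name
    raise KeyError(f"unknown model: {name}")


def test_key_resolves():
    assert resolve_ollama_name("qwen3-coder-30b") == "qwen3-coder:30b-a3b"


def test_raw_name_passes_through():
    assert resolve_ollama_name("qwen2.5-coder:32b") == "qwen2.5-coder:32b"
```

Keeping these tests free of a running Ollama instance lets them run in CI alongside the mocked server tests.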
