StewAlexander-com
diff --git a/README.md b/README.md
Lines changed: 42 additions & 0 deletions
diff --git a/backend/README.md b/backend/README.md
Lines changed: 92 additions & 9 deletions
diff --git a/backend/app/main.py b/backend/app/main.py
Lines changed: 146 additions & 0 deletions
@@ -164,6 +164,48 @@ The chat module reads (in order): `window.TUTOR_BACKEND_URL` → `<meta
 name="tutor-backend">` → `localStorage["tutor-backend"]` → port heuristic →
 same origin.
 
+## UX Workflow — read · run · evaluate
+
+Every section view now ends with an inline **code lab**: an editor seeded
+with the section's example snippet, a **Run** button that executes the code
+locally, and an **Evaluate** button that sends the code + the actual run
+output to the tutor for evidence-based feedback. The floating chat panel
+stays available for free-form questions.
+
+```mermaid
+flowchart LR
+  Lesson["Lesson view"] --> Lab["Code lab editor"]
+  Lab -- "Run" --> RunApi["/api/run<br/>subprocess + timeout"]
+  RunApi --> Lab
+  Lab -- "Evaluate" --> EvalApi["/api/evaluate<br/>evidence packet → LLM"]
+  EvalApi --> Lab
+  Lab -. "free-form" .-> Chat["Floating chat → /api/chat"]
+```
+
+The five candidate workflows considered, the trade-offs, and the chosen
+blend (lesson-first spine + inline code-lab + evidence-packet evaluation)
+are written up in [`docs/ux-workflow.md`](docs/ux-workflow.md).
+
+### New backend endpoints
+
+- `POST /api/run` — runs the submitted code in an isolated Python subprocess
+  (`python -I`, empty env, temp cwd, hard wall-clock timeout, size-limited
+  output). Returns `{stdout, stderr, exit_code, duration_ms, timed_out,
+  truncated}`. This is **prototype safety only** — subprocess + timeout +
+  restricted env. Not a real sandbox. See
+  [`docs/safety-and-sandboxing.md`](docs/safety-and-sandboxing.md) for the
+  controls a serious deployment would add (containers, seccomp, network
+  namespaces, CPU/memory limits).
+- `POST /api/evaluate` — accepts `{code, section?, question?, run_output?}`,
+  runs the code if `run_output` is missing, builds an evidence packet, and
+  asks the LLM for a hint-first assessment. Returns
+  `{assessment, feedback, next_step, run, model}` where `assessment` is one
+  of `passed | needs_work | error`.
+
+Configurable via env: `TUTOR_RUN_TIMEOUT` (default 5s, clamped 0.5–30s),
+`TUTOR_RUN_MAX_CODE_BYTES` (default 50 000), `TUTOR_RUN_MAX_OUTPUT_BYTES`
+(default 32 000).
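
Review note: `runner.py` is not shown in this diff, so the following is only a minimal sketch of the controls listed for `POST /api/run` (isolated interpreter, near-empty env, temp cwd, wall-clock timeout, size-limited output). `RunResult` and `run_python_sketch` are hypothetical names, the `-1` exit code on timeout is an assumption, and the sketch is synchronous while the real `run_python` is awaited. It also assumes a POSIX host where the interpreter can start with an almost empty environment.

```python
import subprocess
import sys
import tempfile
import time
from dataclasses import dataclass
from pathlib import Path


@dataclass
class RunResult:  # hypothetical; mirrors the documented response fields
    stdout: str
    stderr: str
    exit_code: int
    duration_ms: int
    timed_out: bool
    truncated: bool


def run_python_sketch(
    code: str,
    stdin: str = "",
    timeout: float = 5.0,
    max_output_bytes: int = 32_000,
) -> RunResult:
    """Run `code` in a child interpreter. This is not a sandbox."""
    timeout = max(0.5, min(30.0, timeout))  # clamp to the documented range
    timed_out = False
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "main.py"
        script.write_text(code, encoding="utf-8")
        started = time.monotonic()
        try:
            proc = subprocess.run(
                [sys.executable, "-I", str(script)],  # -I: isolated mode
                input=stdin.encode("utf-8"),
                capture_output=True,
                cwd=workdir,                # throwaway working directory
                env={"LC_ALL": "C.UTF-8"},  # nothing inherited from the server
                timeout=timeout,            # child is killed on expiry
            )
            out, err, exit_code = proc.stdout, proc.stderr, proc.returncode
        except subprocess.TimeoutExpired as exc:
            timed_out = True
            # Assumption: report -1 when the child was killed by the timeout.
            out, err, exit_code = exc.stdout or b"", exc.stderr or b"", -1
        duration_ms = int((time.monotonic() - started) * 1000)
    truncated = len(out) > max_output_bytes or len(err) > max_output_bytes
    return RunResult(
        stdout=out[:max_output_bytes].decode("utf-8", errors="replace"),
        stderr=err[:max_output_bytes].decode("utf-8", errors="replace"),
        exit_code=exit_code,
        duration_ms=duration_ms,
        timed_out=timed_out,
        truncated=truncated,
    )
```

None of this stops the child from reading files or opening sockets, which is why the README calls it prototype safety only.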
+
 ## Core Components
 
 - **Tutor UI**: A local web app, terminal interface, or desktop shell where the student reads lessons, submits code, and receives feedback.
 
@@ -4,9 +4,12 @@ A small, local-first FastAPI service that proxies an [Ollama](https://ollama.com
 LLM (default: `gemma3:4b`) and exposes a tutor-shaped HTTP API for the
 [`frontend/`](../frontend/) PWA and other clients.
 
-The backend is intentionally minimal. It does not yet execute student code; the
-sandboxed runner described in [`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md)
-is a separate milestone.
+The backend now also exposes a *prototype-grade* Python runner
+(`POST /api/run`) and an LLM evaluator (`POST /api/evaluate`) used by the
+frontend's inline code lab. The runner uses subprocess isolation with a
+hard wall-clock timeout and a restricted env — see
+[`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md) for the
+controls a real deployment would still need to add.
 
 ## Layout
 
@@ -16,6 +19,7 @@ backend/
 │   ├── config.py         # env-driven Settings + tutor system prompt loader
 │   ├── main.py           # FastAPI app factory and routes
 │   ├── ollama_client.py  # async client for /api/tags and /api/chat
+│   ├── runner.py         # prototype Python subprocess runner (timeout + restricted env)
 │   └── schemas.py        # pydantic request/response models
 ├── tests/
 │   └── test_api.py       # mocked Ollama tests via respx
@@ -75,6 +79,9 @@ before launching uvicorn — the static frontend will be mounted at `/`.
 | `TUTOR_SERVE_FRONTEND` | `0` | Set to `1` to mount `frontend/` at `/`. |
 | `TUTOR_FRONTEND_DIR` | `../frontend` | Override the directory used when serving the frontend. |
 | `TUTOR_SYSTEM_PROMPT_PATH` | `../prompts/tutor-system-prompt.md` | Markdown file whose first fenced block is used as the default system prompt. |
+| `TUTOR_RUN_TIMEOUT` | `5` | Wall-clock seconds for `/api/run` and `/api/evaluate` code execution. Clamped to 0.5–30s. |
+| `TUTOR_RUN_MAX_CODE_BYTES` | `50000` | Max UTF-8 bytes accepted for a single submission. Clamped to 1 000–200 000. |
+| `TUTOR_RUN_MAX_OUTPUT_BYTES` | `32000` | Each of stdout/stderr is truncated past this. Clamped to 1 000–200 000. |
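
Review note: the clamping described in these three rows could look roughly like the sketch below. `env_number` is a hypothetical helper, not a function from `config.py` or `runner.py`, and the silent fallback for unparsable values is an assumption about the intended behaviour.

```python
import os


def env_number(name: str, default: float, low: float, high: float) -> float:
    """Read a numeric env var, fall back to `default`, clamp to [low, high]."""
    raw = os.environ.get(name, "").strip()
    try:
        value = float(raw) if raw else default
    except ValueError:
        value = default  # assumption: unparsable values fall back silently
    return max(low, min(high, value))


# Defaults and bounds are the ones listed in the table above.
run_timeout = env_number("TUTOR_RUN_TIMEOUT", 5.0, 0.5, 30.0)
max_code_bytes = int(env_number("TUTOR_RUN_MAX_CODE_BYTES", 50_000, 1_000, 200_000))
max_output_bytes = int(env_number("TUTOR_RUN_MAX_OUTPUT_BYTES", 32_000, 1_000, 200_000))
```

With this shape, `TUTOR_RUN_TIMEOUT=120` resolves to 30 seconds and `TUTOR_RUN_TIMEOUT=0.1` resolves to 0.5 seconds.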
 
 ## Endpoints
 
@@ -145,6 +152,81 @@ curl -N http://localhost:8001/api/chat \
 
 Each streamed line is a JSON object forwarded from Ollama's `/api/chat` stream.
 
+### `POST /api/run`
+
+Executes student code in an isolated Python subprocess. **Prototype safety
+only** — subprocess + hard timeout + restricted env (`python -I`, empty env
+except `LC_ALL`/`PYTHONIOENCODING`, temp cwd). This is *not* a real sandbox.
+
+Request:
+
+```jsonc
+{
+  "code": "print(2 + 2)\n",
+  "stdin": "",            // optional
+  "timeout": 3.0          // optional, default 5s, clamped 0.5–30s
+}
+```
+
+Response:
+
+```json
+{
+  "stdout": "4\n",
+  "stderr": "",
+  "exit_code": 0,
+  "duration_ms": 16,
+  "timed_out": false,
+  "truncated": false
+}
+```
+
+Errors:
+
+- `400` if `code` exceeds `TUTOR_RUN_MAX_CODE_BYTES`.
+- `422` for malformed bodies.
+- Student-side failures (syntax errors, non-zero exits, timeouts) are
+  **not** errors — they come back in the normal response with
+  `exit_code != 0` and/or `timed_out: true`.
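
Review note: because student-side failures arrive as normal responses, a client has to inspect the body rather than the HTTP status. A minimal client sketch follows, assuming the backend listens on `http://localhost:8001` as in the `curl` example earlier in this README. `post_run` and `describe_run` are hypothetical helper names, and the status wording is illustrative.

```python
import json
import urllib.request


def post_run(code: str, base_url: str = "http://localhost:8001") -> dict:
    """POST code to /api/run and return the decoded JSON body."""
    req = urllib.request.Request(
        f"{base_url}/api/run",
        data=json.dumps({"code": code}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for the 400 and 422 cases.
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def describe_run(resp: dict) -> str:
    """Summarise a /api/run response body for display."""
    if resp["timed_out"]:
        status = "timed out"
    elif resp["exit_code"] != 0:
        status = f"exited with code {resp['exit_code']}"
    else:
        status = "ran cleanly"
    note = " (output truncated)" if resp["truncated"] else ""
    return f"{status} in {resp['duration_ms']} ms{note}"
```

For the sample response above, `describe_run` returns `"ran cleanly in 16 ms"`.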
+
+### `POST /api/evaluate`
+
+Combines a code run (as in `/api/run`) and an LLM call in one request. Builds
+a compact evidence packet (code + actual runtime output + optional section
+context and learner question) and asks the tutor model for a hint-first
+assessment.
+
+Request:
+
+```jsonc
+{
+  "code": "for n in [1,2,3]: print(n)\n",
+  "section": "10 — Loops",            // optional
+  "question": "Is this idiomatic?",   // optional
+  "run_output": {                     // optional — if present, /api/run is skipped
+    "stdout": "...", "stderr": "", "exit_code": 0,
+    "duration_ms": 5, "timed_out": false, "truncated": false
+  },
+  "model": "gemma3:4b",               // optional
+  "temperature": 0.2                  // optional
+}
+```
+
+Response:
+
+```json
+{
+  "assessment": "passed",
+  "feedback": "Your loop iterates correctly and prints each item...",
+  "next_step": "Try the same with a list comprehension.",
+  "run": { "stdout": "1\n2\n3\n", "stderr": "", "exit_code": 0, "duration_ms": 14, "timed_out": false, "truncated": false },
+  "model": "gemma3:4b"
+}
+```
+
+`assessment` is one of `passed | needs_work | error`. `next_step` is a
+best-effort extraction from the model's reply; it may be `null` if the
+tutor's response did not include a recognisable next-step line.
+
 ## Tests
 
 ```bash
@@ -154,13 +236,14 @@ cd backend
 
 Tests use `respx` to mock the Ollama HTTP API, so they run without a real model
 server. The suite covers health (reachable + degraded), config, default and
-custom system prompt injection, and upstream error handling.
+custom system prompt injection, upstream error handling, the frontend chat
+wiring, and the `/api/run` + `/api/evaluate` endpoints (including the runner
+module's timeout, isolation, and output-truncation behaviour).
 
 ## Roadmap
 
-- Add a `/api/run` endpoint that wraps the sandboxed Python runner described in
-  [`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md).
-- Add a `/api/tutor/turn` endpoint that orchestrates: run code → collect
-  evidence → call LLM with the structured context template from
-  [`prompts/tutor-system-prompt.md`](../prompts/tutor-system-prompt.md).
+- Tighten `/api/run` isolation: container or microVM, network namespace,
+  CPU/memory limits, seccomp/AppArmor where available.
+- Stream `/api/evaluate` responses (the LLM call already streams; the
+  evidence-packet shape just needs an NDJSON variant).
 - Persist learner state (see roadmap M4).
@@ -12,12 +12,22 @@
 
 from .config import Settings, get_settings
 from .ollama_client import OllamaClient, OllamaError
+from .runner import (
+    DEFAULT_TIMEOUT_SEC,
+    MAX_CODE_BYTES,
+    RunnerError,
+    run_python,
+)
 from .schemas import (
     ChatMessage,
     ChatRequest,
     ChatResponse,
     ConfigResponse,
+    EvaluateRequest,
+    EvaluateResponse,
     HealthResponse,
+    RunRequest,
+    RunResponse,
 )
 
 
@@ -76,6 +86,142 @@ async def config() -> ConfigResponse:
             ollama_url=settings.ollama_url,
             default_model=settings.model,
             request_timeout=settings.request_timeout,
+            run_timeout_default=DEFAULT_TIMEOUT_SEC,
+            run_max_code_bytes=MAX_CODE_BYTES,
+        )
+
+    def _result_to_response(result) -> RunResponse:
+        return RunResponse(
+            stdout=result.stdout,
+            stderr=result.stderr,
+            exit_code=result.exit_code,
+            duration_ms=result.duration_ms,
+            timed_out=result.timed_out,
+            truncated=result.truncated,
+        )
+
+    @app.post("/api/run", response_model=RunResponse)
+    async def run(req: RunRequest) -> RunResponse:
+        try:
+            result = await run_python(
+                req.code, stdin=req.stdin, timeout=req.timeout
+            )
+        except RunnerError as exc:
+            raise HTTPException(status_code=400, detail=str(exc)) from exc
+        return _result_to_response(result)
+
+    def _build_evaluation_prompt(
+        code: str,
+        run_resp: RunResponse,
+        section: str | None,
+        question: str | None,
+    ) -> str:
+        # Build a compact, factual evidence packet. The LLM is told to act
+        # on these facts and not to invent runtime behaviour.
+        lines: list[str] = []
+        lines.append(
+            "You are reviewing a student's Python attempt. Use only the runtime"
+            " evidence below — do not claim outputs or behaviour you can't see."
+            " Reply in three short parts:"
+        )
+        lines.append("  1. Assessment — one line: passed | needs_work | error.")
+        lines.append(
+            "  2. Feedback — 2-4 sentences, hint-first. If the code errored,"
+            " explain the error in beginner terms. If it ran cleanly, judge"
+            " whether the approach is right; otherwise give a hint, not a fix."
+        )
+        lines.append(
+            "  3. Next step — one short concrete suggestion (a small change to"
+            " try, or a follow-up exercise)."
+        )
+        lines.append("")
+        if section:
+            lines.append(f'Section context: "{section}".')
+        if question:
+            lines.append(f"Student question: {question}")
+        lines.append("")
+        lines.append("Student code:")
+        lines.append("```python")
+        lines.append(code)
+        lines.append("```")
+        lines.append("")
+        lines.append(f"Exit code: {run_resp.exit_code}")
+        lines.append(f"Duration: {run_resp.duration_ms} ms")
+        if run_resp.timed_out:
+            lines.append("NOTE: execution hit the runner's timeout.")
+        lines.append("Stdout:")
+        lines.append("```")
+        lines.append(run_resp.stdout or "(empty)")
+        lines.append("```")
+        lines.append("Stderr:")
+        lines.append("```")
+        lines.append(run_resp.stderr or "(empty)")
+        lines.append("```")
+        return "\n".join(lines)
+
+    def _classify_assessment(text: str, run_resp: RunResponse) -> str:
+        """Best-effort parse of the model's first line; fall back to evidence."""
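+        # Strip first: whitespace-only text is truthy but has no lines to index.
+        stripped_text = (text or "").strip()
+        first = stripped_text.splitlines()[0].lower() if stripped_text else ""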
+        for label in ("passed", "needs_work", "needs work", "error"):
+            if label in first:
+                return "needs_work" if label == "needs work" else label
+        if run_resp.timed_out or run_resp.exit_code != 0:
+            return "error" if run_resp.stderr else "needs_work"
+        return "needs_work"
+
+    def _extract_next_step(text: str) -> str | None:
+        if not text:
+            return None
+        for line in text.splitlines():
+            stripped = line.strip().lstrip("-*0123456789. ").strip()
+            low = stripped.lower()
+            if low.startswith("next step"):
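+                # "Next step: ..." or "Next step — ..."; split at the earliest
+                # separator so punctuation later in the sentence is ignored.
+                cuts = [i for i in map(stripped.find, (":", "—", "-")) if i != -1]
+                if cuts:
+                    return stripped[min(cuts) + 1 :].strip() or None
+                return stripped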
+        return None
+
+    @app.post("/api/evaluate", response_model=EvaluateResponse)
+    async def evaluate(req: EvaluateRequest) -> EvaluateResponse:
+        if req.run_output is not None:
+            run_resp = req.run_output
+        else:
+            try:
+                result = await run_python(
+                    req.code, stdin=req.stdin, timeout=None
+                )
+            except RunnerError as exc:
+                raise HTTPException(status_code=400, detail=str(exc)) from exc
+            run_resp = _result_to_response(result)
+
+        prompt = _build_evaluation_prompt(
+            req.code, run_resp, req.section, req.question
+        )
+        model = req.model or settings.model
+        messages = [
+            ChatMessage(role="system", content=settings.system_prompt),
+            ChatMessage(role="user", content=prompt),
+        ]
+        client = make_client()
+        try:
+            raw = await client.chat(
+                model=model,
+                messages=messages,
+                temperature=req.temperature,
+            )
+        except OllamaError as exc:
+            raise HTTPException(status_code=502, detail=str(exc)) from exc
+
+        msg = raw.get("message") or {}
+        feedback = msg.get("content", "") or ""
+        return EvaluateResponse(
+            assessment=_classify_assessment(feedback, run_resp),
+            feedback=feedback,
+            next_step=_extract_next_step(feedback),
+            run=run_resp,
+            model=raw.get("model", model),
         )
 
     @app.post("/api/chat", response_model=ChatResponse)