StewAlexander-com
diff --git a/README.md b/README.md
Lines changed: 101 additions & 16 deletions
diff --git a/backend/README.md b/backend/README.md
Lines changed: 161 additions & 16 deletions
@@ -188,23 +188,108 @@ are written up in [`docs/ux-workflow.md`](docs/ux-workflow.md).
 
 ### New backend endpoints
 
-- `POST /api/run` — runs the submitted code in an isolated Python subprocess
-  (`python -I`, empty env, temp cwd, hard wall-clock timeout, size-limited
-  output). Returns `{stdout, stderr, exit_code, duration_ms, timed_out,
-  truncated}`. This is **prototype safety only** — subprocess + timeout +
-  restricted env. Not a real sandbox. See
-  [`docs/safety-and-sandboxing.md`](docs/safety-and-sandboxing.md) for the
-  controls a serious deployment would add (containers, seccomp, network
-  namespaces, CPU/memory limits).
+- `POST /api/run` — runs the submitted code in an isolated Python subprocess.
+  See [Sandbox controls](#sandbox-controls) below for what's in force. The
+  response now includes `blocked` and `safety_events` so the frontend can
+  show why the static scanner refused a submission.
 - `POST /api/evaluate` — accepts `{code, section?, question?, run_output?}`,
-  runs the code if `run_output` is missing, builds an evidence packet, and
-  asks the LLM for a hint-first assessment. Returns
-  `{assessment, feedback, next_step, run, model}` where `assessment` is one
-  of `passed | needs_work | error`.
-
-Configurable via env: `TUTOR_RUN_TIMEOUT` (default 5s, clamped 0.5–30s),
-`TUTOR_RUN_MAX_CODE_BYTES` (default 50 000), `TUTOR_RUN_MAX_OUTPUT_BYTES`
-(default 32 000).
+  runs the code if `run_output` is missing, builds an evidence packet, looks
+  up credible Python documentation references (see
+  [Documentation references](#documentation-references)), and asks the LLM
+  for a hint-first assessment. Returns
+  `{assessment, feedback, next_step, run, model, docs}`.
+- `POST /api/chat` — chat with the tutor. The response includes a `docs`
+  block with the same reference policy when the user's last turn matches a
+  curated topic.
+- `GET /api/exercises` / `GET /api/exercises/{id}` /
+  `POST /api/exercises/{id}/grade` — structured exercises (prompt, starter
+  code, visible + hidden tests). The grader appends a JSON-emitting harness
+  to the student's code, runs it through the sandbox, and reports
+  per-assertion pass/fail. See
+  [`curriculum/exercises/README.md`](curriculum/exercises/README.md) for
+  the schema.
+- `POST /api/docs/lookup` — return reference URLs for a given
+  code/question/section without involving the LLM. Useful for the frontend
+  to surface docs alongside any view.
+
+### Sandbox controls
+
+The runner remains **prototype safety**, not production isolation, but is
+stronger than the original subprocess + timeout combo:
+
+- subprocess with `python -I -B` (isolated mode, no `.pyc`),
+- empty environment — `PATH` is not propagated; `HOME` is `/nonexistent`,
+- per-call tempdir at mode `0o700`, removed after the run,
+- hard wall-clock timeout (`TUTOR_RUN_TIMEOUT`, default 5s, clamped 0.5–30s),
+- POSIX resource limits via `setrlimit` in a `preexec_fn` (CPU seconds,
+  address space, file size, core file, process count),
+- process-group kill on timeout so any children die with the parent,
+- static AST scan ([`backend/app/safety.py`](backend/app/safety.py)) that
+  refuses obvious hostile patterns (`subprocess`, `socket`, `ctypes`,
+  `urllib`, `pickle`, `os.system`, `exec`, `eval`, `__import__`, …) before
+  the subprocess starts and reports `safety_events`,
+- output truncation per stream,
+- code-size cap.
+
+Resource-limit defaults are configurable: `TUTOR_RUN_CPU_SECONDS=5`,
+`TUTOR_RUN_MEM_MB=256`, `TUTOR_RUN_FSIZE_MB=16`, `TUTOR_RUN_NPROC=64`. Set
+`TUTOR_STRICT_IMPORTS=1` to also block `os`, `pathlib`, `shutil`,
+`tempfile`, `glob`, `importlib`, and bare `open()`.
+
+**Known limits.** None of these defend against kernel-level escape or
+side-channel attacks. macOS does not honour `RLIMIT_AS` for Python (the
+runner logs this and continues). Windows has no `resource` module, so the
+runner still applies the timeout, env scrubbing, tempdir, and static scan,
+but no rlimits. For multi-tenant or hostile workloads, run inside a
+container, microVM, or restricted user; see
+[`docs/safety-and-sandboxing.md`](docs/safety-and-sandboxing.md).
+
+### Documentation references
+
+Tutor answers now include source links to **official documentation** (for
+Python and the libraries on the host allowlist) when a topic matches the
+curated map. Every reference the tutor surfaces is one of:
+
+1. a hit from the in-repo curated map
+   ([`backend/app/docs_refs.py`](backend/app/docs_refs.py)) keyed off
+   tokens in the student's code, question, or section title,
+2. an exercise-supplied URL on the allowlist
+   ([`curriculum/exercises/*.json` `references`](curriculum/exercises/README.md)).
+
+There is **no LLM-authored URL generation**. The model is instructed (via
+the evaluation prompt and the augmented chat system message) to cite only
+from the supplied list verbatim or not at all.
+
+When `TUTOR_DOCS_ONLINE=1` (default) the backend issues a short HEAD
+request to each candidate URL with `TUTOR_DOCS_TIMEOUT` (default 2s,
+clamped 0.5–10s). Unreachable URLs are dropped; if *every* URL fails the
+references are still returned with `online_ok=false` and a `note` so the
+UI can show them dimmed and labelled "unverified". Set
+`TUTOR_DOCS_ONLINE=0` to skip the network entirely (fully offline).
+
+Override the host allowlist with `TUTOR_DOCS_ALLOWLIST="docs.python.org,…"`
+(comma-separated). Defaults are listed in `docs_refs.DEFAULT_ALLOWED_HOSTS`
+and cover docs.python.org, packaging.python.org, peps.python.org,
+docs.pytest.org, mypy/typing readthedocs, pip/setuptools, and the official
+docs sites of NumPy, pandas, Matplotlib, SciPy, Flask, FastAPI, Django,
+Requests, HTTPX, and SQLAlchemy.
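The host allowlist check described above could look roughly like the sketch below. It is not the code in `docs_refs.py`: the names `EXAMPLE_DEFAULT_HOSTS`, `allowed_hosts`, and `is_allowed` are hypothetical, and both the exact-host match and the `https`-only rule are assumptions made for illustration.

```python
import os
from urllib.parse import urlsplit

# Shortened, illustrative default. The real list is
# docs_refs.DEFAULT_ALLOWED_HOSTS and is longer.
EXAMPLE_DEFAULT_HOSTS = ("docs.python.org", "peps.python.org", "docs.pytest.org")


def allowed_hosts() -> frozenset[str]:
    """Parse the comma-separated override, falling back to the defaults."""
    raw = os.environ.get("TUTOR_DOCS_ALLOWLIST", "")
    hosts = [h.strip().lower() for h in raw.split(",") if h.strip()]
    return frozenset(hosts or EXAMPLE_DEFAULT_HOSTS)


def is_allowed(url: str, hosts: frozenset[str]) -> bool:
    """Accept only https URLs whose hostname is exactly on the list (assumed policy)."""
    parts = urlsplit(url)
    return parts.scheme == "https" and (parts.hostname or "") in hosts
```

Matching on the parsed hostname rather than a substring keeps look-alike hosts such as `docs.python.org.evil.example` off the list.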
+
+### Configurable via env
+
+| Variable                   | Default | Notes                                     |
+|----------------------------|--------:|-------------------------------------------|
+| `TUTOR_RUN_TIMEOUT`        | 5       | wall-clock timeout (s, clamp 0.5–30)      |
+| `TUTOR_RUN_MAX_CODE_BYTES` | 50 000  | refuse oversize submissions               |
+| `TUTOR_RUN_MAX_OUTPUT_BYTES` | 32 000 | per-stream truncation                    |
+| `TUTOR_RUN_CPU_SECONDS`    | 5       | POSIX RLIMIT_CPU                          |
+| `TUTOR_RUN_MEM_MB`         | 256     | POSIX RLIMIT_AS                           |
+| `TUTOR_RUN_FSIZE_MB`       | 16      | POSIX RLIMIT_FSIZE                        |
+| `TUTOR_RUN_NPROC`          | 64      | POSIX RLIMIT_NPROC                        |
+| `TUTOR_STRICT_IMPORTS`     | 0       | also block `os`, `pathlib`, `open(…)`     |
+| `TUTOR_DOCS_ONLINE`        | 1       | HEAD-check candidate URLs                 |
+| `TUTOR_DOCS_TIMEOUT`       | 2.0     | online check timeout (s, clamp 0.5–10)    |
+| `TUTOR_DOCS_ALLOWLIST`     | curated | CSV override of allowed doc hosts         |
+| `TUTOR_EXERCISES_DIR`      | repo    | override exercise directory               |
 
 ## Core Components
 
 
@@ -4,12 +4,21 @@ A small, local-first FastAPI service that proxies an [Ollama](https://ollama.com
 LLM (default: `gemma3:4b`) and exposes a tutor-shaped HTTP API for the
 [`frontend/`](../frontend/) PWA and other clients.
 
-The backend now also exposes a *prototype-grade* Python runner
-(`POST /api/run`) and an LLM evaluator (`POST /api/evaluate`) used by the
-frontend's inline code lab. The runner uses subprocess isolation with a
-hard wall-clock timeout and a restricted env — see
-[`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md) for the
-controls a real deployment would still need to add.
+The backend exposes:
+
+- `POST /api/chat` (including its streaming variant) — the LLM proxy,
+- `POST /api/run` — sandboxed Python execution (timeout + rlimits + static
+  scan),
+- `POST /api/evaluate` — runs the student's code, looks up curated docs,
+  and asks the LLM for hint-first feedback,
+- `GET /api/exercises`, `GET /api/exercises/{id}`,
+  `POST /api/exercises/{id}/grade` — structured exercises with a
+  visible/hidden test split,
+- `POST /api/docs/lookup` — curated Python documentation references.
+
+Reference URLs come from a curated allowlist; no LLM-authored URLs are
+ever shown. See [Documentation references](#documentation-references) and
+[Sandbox controls](#sandbox-controls) below for the policy details.
 
 ## Layout
 
@@ -19,10 +28,18 @@ backend/
 │   ├── config.py         # env-driven Settings + tutor system prompt loader
 │   ├── main.py           # FastAPI app factory and routes
 │   ├── ollama_client.py  # async client for /api/tags and /api/chat
-│   ├── runner.py         # prototype Python subprocess runner (timeout + restricted env)
+│   ├── runner.py         # prototype Python subprocess runner (timeout + rlimits)
+│   ├── safety.py         # static AST scanner — blocks hostile imports / calls
+│   ├── docs_refs.py      # curated docs allowlist + optional online HEAD check
+│   ├── exercises.py      # JSON exercise loader + grading harness
 │   └── schemas.py        # pydantic request/response models
 ├── tests/
-│   └── test_api.py       # mocked Ollama tests via respx
+│   ├── test_api.py
+│   ├── test_run_evaluate.py
+│   ├── test_runner_sandbox.py
+│   ├── test_safety.py
+│   ├── test_exercises.py
+│   └── test_docs_refs.py
 ├── requirements.txt
 ├── requirements-dev.txt
 └── pytest.ini
@@ -82,6 +99,15 @@ before launching uvicorn — the static frontend will be mounted at `/`.
 | `TUTOR_RUN_TIMEOUT` | `5` | Wall-clock seconds for `/api/run` and `/api/evaluate` code execution. Clamped to 0.5–30s. |
 | `TUTOR_RUN_MAX_CODE_BYTES` | `50000` | Max UTF-8 bytes accepted for a single submission. Clamped to 1 000–200 000. |
 | `TUTOR_RUN_MAX_OUTPUT_BYTES` | `32000` | Each of stdout/stderr is truncated past this. Clamped to 1 000–200 000. |
+| `TUTOR_RUN_CPU_SECONDS` | `5` | POSIX `RLIMIT_CPU` (CPU seconds). Clamped 1–60. |
+| `TUTOR_RUN_MEM_MB` | `256` | POSIX `RLIMIT_AS` (address space, MB). Clamped 32–4096. |
+| `TUTOR_RUN_FSIZE_MB` | `16` | POSIX `RLIMIT_FSIZE` (max file size, MB). Clamped 1–256. |
+| `TUTOR_RUN_NPROC` | `64` | POSIX `RLIMIT_NPROC` (max processes). Clamped 8–1024. |
+| `TUTOR_STRICT_IMPORTS` | `0` | Also block `os`, `pathlib`, `shutil`, `tempfile`, `glob`, `importlib`, and bare `open(...)`. |
+| `TUTOR_DOCS_ONLINE` | `1` | HEAD-check each candidate doc URL before returning it. |
+| `TUTOR_DOCS_TIMEOUT` | `2.0` | Online check timeout (s). Clamped 0.5–10. |
+| `TUTOR_DOCS_ALLOWLIST` | curated | CSV of allowed doc hostnames; overrides defaults entirely. |
+| `TUTOR_EXERCISES_DIR` | repo `curriculum/exercises` | Override exercise directory. |
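The "default plus clamp" behaviour in the table can be pictured with a small helper. This is only a sketch: `env_float` is a hypothetical name, and falling back to the default on an unparsable value is an assumption (the real settings loader in `config.py` may behave differently).

```python
import os


def env_float(name: str, default: float, lo: float, hi: float) -> float:
    """Read a float from the environment and clamp it to [lo, hi].

    Missing or unparsable values fall back to the default (assumed behaviour).
    """
    raw = os.environ.get(name)
    try:
        value = float(raw) if raw is not None else default
    except ValueError:
        value = default
    return max(lo, min(hi, value))


# Ranges taken from the table above.
run_timeout = env_float("TUTOR_RUN_TIMEOUT", 5.0, 0.5, 30.0)
docs_timeout = env_float("TUTOR_DOCS_TIMEOUT", 2.0, 0.5, 10.0)
```

With this shape, `TUTOR_RUN_TIMEOUT=120` yields 30 seconds and `TUTOR_RUN_TIMEOUT=0.1` yields 0.5 seconds rather than an error.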
 
 ## Endpoints
 
@@ -155,8 +181,8 @@ Each streamed line is a JSON object forwarded from Ollama's `/api/chat` stream.
 ### `POST /api/run`
 
 Executes student code in an isolated Python subprocess. **Prototype safety
-only** — subprocess + hard timeout + restricted env (`python -I`, empty env
-except `LC_ALL`/`PYTHONIOENCODING`, temp cwd). This is *not* a real sandbox.
+only** — see [Sandbox controls](#sandbox-controls). Static AST scanner runs
+first and may short-circuit with `blocked: true`.
 
 Request:
 
@@ -177,7 +203,26 @@ Response:
   "exit_code": 0,
   "duration_ms": 16,
   "timed_out": false,
-  "truncated": false
+  "truncated": false,
+  "blocked": false,
+  "safety_events": []
+}
+```
+
+When the static scanner refuses execution, `blocked` is true, `exit_code`
+is `-1`, and `safety_events` lists each finding (`type`, `detail`,
+`lineno`):
+
+```json
+{
+  "stdout": "",
+  "stderr": "[safety] execution blocked: blocked_import: subprocess\n",
+  "exit_code": -1,
+  "duration_ms": 0,
+  "timed_out": false,
+  "truncated": false,
+  "blocked": true,
+  "safety_events": [{"type": "blocked_import", "detail": "subprocess", "lineno": 1}]
 }
 ```
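A blocked response like the one above could come from a scanner along these lines. This is a minimal sketch, not the contents of `app/safety.py`: the block lists are shortened, the `blocked_call` event type is an assumed name (only `blocked_import` appears in the example response), and syntax errors are simply skipped here, whereas the real scanner flags them without blocking.

```python
import ast

# Shortened, illustrative block lists; the real ones are longer.
BLOCKED_IMPORTS = {"subprocess", "socket", "ctypes", "urllib", "pickle"}
BLOCKED_CALLS = {"exec", "eval", "__import__"}
BLOCKED_ATTR_CALLS = {("os", "system"), ("os", "popen")}


def scan(code: str) -> list[dict]:
    """Return safety events; an empty list means nothing was flagged."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return []  # simplification: leave syntax errors to the interpreter

    events = []
    for node in ast.walk(tree):
        # Imports: compare the top-level package name against the block list.
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            root = name.split(".")[0]
            if root in BLOCKED_IMPORTS:
                events.append(
                    {"type": "blocked_import", "detail": root, "lineno": node.lineno}
                )

        # Calls: bare names such as eval(...) and attribute calls such as os.system(...).
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BLOCKED_CALLS:
                events.append(
                    {"type": "blocked_call", "detail": func.id, "lineno": node.lineno}
                )
            elif (
                isinstance(func, ast.Attribute)
                and isinstance(func.value, ast.Name)
                and (func.value.id, func.attr) in BLOCKED_ATTR_CALLS
            ):
                detail = f"{func.value.id}.{func.attr}"
                events.append(
                    {"type": "blocked_call", "detail": detail, "lineno": node.lineno}
                )
    return events
```

A scan like this only catches patterns that are visible in the source text, which is why the runner still applies the process-level controls after it passes.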
 
@@ -227,18 +272,118 @@ Response:
 best-effort extraction from the model's reply; it may be `null` if the
 tutor's response did not include a recognisable next-step line.
 
+The `docs` field carries any references found by the lookup pipeline (see
+[Documentation references](#documentation-references)). The same evidence
+packet is sent to the LLM with the URLs spelled out, so the model can cite
+them verbatim and has no reason to invent its own.
+
+### `GET /api/exercises` and grading
+
+```bash
+curl -s http://localhost:8001/api/exercises | jq
+curl -s http://localhost:8001/api/exercises/loops.counting-evens | jq
+curl -s -X POST http://localhost:8001/api/exercises/loops.counting-evens/grade \
+  -H 'content-type: application/json' \
+  -d '{"code":"def count_even(numbers):\n    return sum(1 for n in numbers if n%2==0)\n"}' | jq
+```
+
+The detail endpoint never exposes `hidden_tests`. The grade endpoint
+appends a small JSON-emitting harness to the student's code, runs it
+through the sandbox, and reports per-test outcomes; the harness chatter
+is stripped from the visible `stdout`.
+
+See [`curriculum/exercises/README.md`](../curriculum/exercises/README.md)
+for the exercise schema and authoring rules.
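The "append a harness, then strip its output" flow might be sketched as follows. Everything here is hypothetical: the `MARKER` sentinel, the `(function name, args, expected)` test shape, and the result keys are illustrative choices, and a plain subprocess stands in for the sandboxed runner.

```python
import json
import subprocess
import sys

MARKER = "__HARNESS_RESULT__"  # hypothetical sentinel that tags the harness line


def build_program(student_code: str, tests: list) -> str:
    """Append a harness that calls each named function and prints JSON results."""
    harness = f"""
import json as _json
_results = []
for _name, _args, _expected in {tests!r}:
    try:
        _passed = bool(globals()[_name](*_args) == _expected)
        _error = None
    except Exception as _exc:
        _passed, _error = False, repr(_exc)
    _results.append({{"test": _name, "passed": _passed, "error": _error}})
print({MARKER!r} + _json.dumps(_results))
"""
    return student_code.rstrip() + "\n" + harness


def grade(student_code: str, tests: list, timeout: float = 5.0) -> dict:
    """Run student code plus harness, then split harness output from stdout."""
    proc = subprocess.run(
        [sys.executable, "-I", "-B", "-c", build_program(student_code, tests)],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    visible, results = [], []
    for line in proc.stdout.splitlines():
        if line.startswith(MARKER):
            results = json.loads(line[len(MARKER):])  # harness chatter
        else:
            visible.append(line)  # what the student should see
    return {"stdout": "\n".join(visible), "tests": results}
```

For the `count_even` exercise in the curl example, a call such as `grade(code, [("count_even", [[1, 2, 4]], 2)])` would return one entry in `tests`, with any `print` output from the student's function left in `stdout`.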
+
+### `POST /api/docs/lookup`
+
+Returns curated reference URLs for a code/question/section without
+involving the LLM. Useful for the frontend to surface docs anywhere.
+
+```bash
+curl -s http://localhost:8001/api/docs/lookup \
+  -H 'content-type: application/json' \
+  -d '{"code":"for i in range(3): print(i)", "section":"Loops"}' | jq
+```
+
+## Sandbox controls
+
+In addition to the env-var knobs above:
+
+- `python -I -B` (isolated mode, no `.pyc`).
+- Environment is hand-built: only `PYTHONIOENCODING`, `PYTHONDONTWRITEBYTECODE`,
+  `LC_ALL`, and a placeholder `HOME=/nonexistent` are passed. No `PATH`.
+- Per-call tempdir at mode `0o700`, removed after the run.
+- POSIX `setrlimit` in a `preexec_fn` for CPU, address space, file size,
+  core files, and process count.
+- `start_new_session=True` plus `killpg` on timeout so any descendant
+  processes die with the parent.
+- Static AST scan ([`app/safety.py`](app/safety.py)) — blocks `subprocess`,
+  `socket`, `ctypes`, `urllib`, `http`, `pickle`, `multiprocessing`,
+  `ssl`, `os.system`, `os.popen`, raw `exec`/`eval`/`__import__`, …
+- `TUTOR_STRICT_IMPORTS=1` adds `os`, `pathlib`, `shutil`, `tempfile`,
+  `glob`, `importlib`, and bare `open(…)` to the block list.
+
+**Known limits.** None of these defend against kernel-level escape or
+side-channel attacks. macOS does not honour `RLIMIT_AS` for Python (the
+runner logs this and continues). Windows lacks `resource`, so no rlimits
+apply there; the runner still applies the timeout, env scrubbing, tempdir,
+and static scan. For multi-tenant or hostile workloads, run inside a
+container, microVM, or restricted user; see
+[`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md).
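Put together, the process-level controls listed above could be sketched like this. It is an approximation under stated assumptions, not the code in `runner.py`: `run_code`, `_set_limit`, and `_apply_limits` are hypothetical names, the limits are hard-coded instead of read from the env vars, and `RLIMIT_AS` and `RLIMIT_NPROC` are left out to keep the sketch portable.

```python
import os
import signal
import subprocess
import sys
import tempfile

try:
    import resource  # POSIX only; absent on Windows
except ImportError:
    resource = None


def _set_limit(kind: int, value: int) -> None:
    """Lower one rlimit, never exceeding the current hard limit."""
    try:
        _, hard = resource.getrlimit(kind)
        if hard != resource.RLIM_INFINITY:
            value = min(value, hard)
        resource.setrlimit(kind, (value, value))
    except (ValueError, OSError):
        pass  # the platform refused this limit; continue without it


def _apply_limits() -> None:
    """Runs in the child between fork and exec."""
    _set_limit(resource.RLIMIT_CPU, 5)                   # CPU seconds
    _set_limit(resource.RLIMIT_FSIZE, 16 * 1024 * 1024)  # max file size
    _set_limit(resource.RLIMIT_CORE, 0)                  # no core dumps
    # RLIMIT_AS and RLIMIT_NPROC would be set the same way.


def run_code(code: str, timeout: float = 5.0) -> dict:
    # Hand-built environment: note that PATH is deliberately missing.
    env = {
        "PYTHONIOENCODING": "utf-8",
        "PYTHONDONTWRITEBYTECODE": "1",
        "LC_ALL": "C.UTF-8",
        "HOME": "/nonexistent",
    }
    posix = resource is not None
    # TemporaryDirectory creates the directory with mode 0o700 and removes it on exit.
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.Popen(
            [sys.executable, "-I", "-B", "-c", code],
            cwd=workdir,
            env=env,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            text=True,
            start_new_session=posix,  # child leads its own process group
            preexec_fn=_apply_limits if posix else None,
        )
        try:
            out, err = proc.communicate(timeout=timeout)
            timed_out = False
        except subprocess.TimeoutExpired:
            try:
                if posix:
                    os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
                else:
                    proc.kill()
            except ProcessLookupError:
                pass  # already exited
            out, err = proc.communicate()
            timed_out = True
    return {
        "stdout": out,
        "stderr": err,
        "exit_code": proc.returncode,
        "timed_out": timed_out,
    }
```

Starting the child in its own session is what makes `killpg` safe to use: the group contains only the student process and anything it spawned, never the server itself.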
+
+## Documentation references
+
+The tutor cites only official documentation, for Python itself and for a
+fixed set of common libraries, and only via URLs from an allowlist
+(`docs.python.org`, `packaging.python.org`, `peps.python.org`,
+`docs.pytest.org`, `mypy.readthedocs.io`, `typing.readthedocs.io`,
+`pip.pypa.io`, `setuptools.pypa.io`, plus the official sites for NumPy,
+pandas, Matplotlib, SciPy, Flask, FastAPI, Django, Requests, HTTPX, and
+SQLAlchemy).
+
+The lookup pipeline:
+
+1. Tokenise the student's code, question, section title, and any
+   `concepts` passed in.
+2. Match tokens against the curated map in
+   [`app/docs_refs.py`](app/docs_refs.py) — only allowlisted URLs.
+3. Add exercise-supplied URLs that pass the allowlist filter.
+4. If `TUTOR_DOCS_ONLINE=1` (default), issue a HEAD request to each URL
+   with `TUTOR_DOCS_TIMEOUT` (default 2s); drop unreachable URLs. If
+   every URL fails, return the curated list anyway with `online_ok=false`
+   and a note so the UI can label them "unverified".
+
+No URL is ever sourced from the LLM. The evaluation prompt and the chat
+system message are explicit: cite only from the supplied list verbatim,
+or don't cite.
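Step 4 of the pipeline, including the "every URL failed" fallback, can be sketched as below. The function names, the `urls` key, and the note text are assumptions; only `online_ok` and `note` come from the description above. The standard-library `urllib` is used here as a stand-in for whatever HTTP client the backend actually uses, and the retry with `GET` after a 405 mirrors the fallback mentioned in the test list.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def head_ok(url: str, timeout: float = 2.0) -> bool:
    """HEAD-check a URL, retrying once with GET if the server answers 405."""
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(url, method=method)
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return 200 <= resp.status < 400
        except urllib.error.HTTPError as exc:
            if exc.code == 405 and method == "HEAD":
                continue  # server rejects HEAD; try GET
            return False
        except OSError:  # URLError and timeouts are OSError subclasses
            return False
    return False


def verify_refs(urls: Iterable[str], is_reachable: Callable[[str], bool]) -> dict:
    """Drop unreachable URLs; if all fail, keep them but mark as unverified."""
    candidates = list(dict.fromkeys(urls))  # de-duplicate, keep order
    reachable = [u for u in candidates if is_reachable(u)]
    if candidates and not reachable:
        return {
            "urls": candidates,
            "online_ok": False,
            "note": "references could not be verified online",
        }
    return {"urls": reachable, "online_ok": True, "note": None}
```

Passing the reachability check in as a callable keeps `verify_refs` testable offline; in production it would be called as `verify_refs(candidate_urls, head_ok)`.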
+
 ## Tests
 
 ```bash
 cd backend
 .venv/bin/pytest -q
 ```
 
-Tests use `respx` to mock the Ollama HTTP API, so they run without a real model
-server. The suite covers health (reachable + degraded), config, default and
-custom system prompt injection, upstream error handling, the frontend chat
-wiring, and the `/api/run` + `/api/evaluate` endpoints (including the runner
-module's timeout, isolation, and output-truncation behaviour).
+Tests use `respx` to mock the Ollama HTTP API, so they run without a
+real model server. The suite covers:
+
+- health (reachable + degraded), config, system-prompt injection, and
+  upstream error handling;
+- the `/api/run` and `/api/evaluate` endpoints, the runner's timeout,
+  environment isolation, output truncation, and oversized-code rejection;
+- the **strengthened sandbox controls**: static blocking of
+  `subprocess`/`socket` imports, `PATH` non-propagation, tempdir CWD, and
+  (on Linux) the address-space rlimit (`test_runner_sandbox.py`);
+- the **safety AST scanner**: hostile imports, dangerous calls, syntax
+  errors flagged but not blocked, strict-mode behaviour
+  (`test_safety.py`);
+- the **exercise schema and grader**: loader validation, allowlist
+  filtering of references, passing/failing solutions, runtime errors,
+  and the harness output-stripping (`test_exercises.py`);
+- the **docs reference layer**: allowlist filtering, curated lookups,
+  offline-only behaviour, mocked online HEAD verification with both
+  full-success and full-failure cases, the `405 → GET` fallback, the
+  `/api/docs/lookup` endpoint, and the `docs` block on `/api/evaluate`
+  and `/api/chat` responses (`test_docs_refs.py`).
 
 ## Roadmap