@@ -4,12 +4,21 @@ A small, local-first FastAPI service that proxies an [Ollama](https://ollama.com
44LLM (default: ` gemma3:4b ` ) and exposes a tutor-shaped HTTP API for the
55[ ` frontend/ ` ] ( ../frontend/ ) PWA and other clients.
66
7- The backend now also exposes a * prototype-grade* Python runner
8- (` POST /api/run ` ) and an LLM evaluator (` POST /api/evaluate ` ) used by the
9- frontend's inline code lab. The runner uses subprocess isolation with a
10- hard wall-clock timeout and a restricted env — see
11- [ ` docs/safety-and-sandboxing.md ` ] ( ../docs/safety-and-sandboxing.md ) for the
12- controls a real deployment would still need to add.
7+ The backend exposes:
8+
9+ - `POST /api/chat`, in plain and streaming form: the LLM proxy,
10+ - ` POST /api/run ` — sandboxed Python execution (timeout + rlimits + static
11+ scan),
12+ - ` POST /api/evaluate ` — runs the student's code, looks up curated docs,
13+ and asks the LLM for hint-first feedback,
14+ - ` GET /api/exercises ` , ` GET /api/exercises/{id} ` ,
15+ ` POST /api/exercises/{id}/grade ` — structured exercises with a
16+ visible/hidden test split,
17+ - ` POST /api/docs/lookup ` — curated Python documentation references.
18+
19+ Reference URLs come from a curated allowlist; no LLM-authored URLs are
20+ ever shown. See [ Documentation references] ( #documentation-references ) and
21+ [ Sandbox controls] ( #sandbox-controls ) below for the policy details.
1322
1423## Layout
1524
@@ -19,10 +28,18 @@ backend/
1928│ ├── config.py # env-driven Settings + tutor system prompt loader
2029│ ├── main.py # FastAPI app factory and routes
2130│ ├── ollama_client.py # async client for /api/tags and /api/chat
22- │ ├── runner.py # prototype Python subprocess runner (timeout + restricted env)
31+ │ ├── runner.py # prototype Python subprocess runner (timeout + rlimits)
32+ │ ├── safety.py # static AST scanner — blocks hostile imports / calls
33+ │ ├── docs_refs.py # curated docs allowlist + optional online HEAD check
34+ │ ├── exercises.py # JSON exercise loader + grading harness
2335│ └── schemas.py # pydantic request/response models
2436├── tests/
25- │ └── test_api.py # mocked Ollama tests via respx
37+ │ ├── test_api.py
38+ │ ├── test_run_evaluate.py
39+ │ ├── test_runner_sandbox.py
40+ │ ├── test_safety.py
41+ │ ├── test_exercises.py
42+ │ └── test_docs_refs.py
2643├── requirements.txt
2744├── requirements-dev.txt
2845└── pytest.ini
@@ -82,6 +99,15 @@ before launching uvicorn — the static frontend will be mounted at `/`.
8299| `TUTOR_RUN_TIMEOUT` | `5` | Wall-clock seconds for `/api/run` and `/api/evaluate` code execution. Clamped to 0.5–30 s. |
83100| `TUTOR_RUN_MAX_CODE_BYTES` | `50000` | Max UTF-8 bytes accepted for a single submission. Clamped to 1 000–200 000. |
84101| `TUTOR_RUN_MAX_OUTPUT_BYTES` | `32000` | Each of stdout/stderr is truncated past this. Clamped to 1 000–200 000. |
102+ | `TUTOR_RUN_CPU_SECONDS` | `5` | POSIX `RLIMIT_CPU` (CPU seconds). Clamped to 1–60. |
103+ | `TUTOR_RUN_MEM_MB` | `256` | POSIX `RLIMIT_AS` (address space, MB). Clamped to 32–4096. |
104+ | `TUTOR_RUN_FSIZE_MB` | `16` | POSIX `RLIMIT_FSIZE` (max file size, MB). Clamped to 1–256. |
105+ | `TUTOR_RUN_NPROC` | `64` | POSIX `RLIMIT_NPROC` (max processes). Clamped to 8–1024. |
106+ | `TUTOR_STRICT_IMPORTS` | `0` | Set to `1` to also block `os`, `pathlib`, `shutil`, `tempfile`, `glob`, `importlib`, and bare `open(...)`. |
107+ | `TUTOR_DOCS_ONLINE` | `1` | HEAD-check each candidate doc URL before returning it. |
108+ | `TUTOR_DOCS_TIMEOUT` | `2.0` | Online check timeout in seconds. Clamped to 0.5–10. |
109+ | `TUTOR_DOCS_ALLOWLIST` | curated | CSV of allowed doc hostnames; overrides the defaults entirely. |
110+ | `TUTOR_EXERCISES_DIR` | repo `curriculum/exercises` | Overrides the exercise directory. |
85111
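The clamping behaviour listed in the table can be sketched as follows. This is a minimal illustration, not the contents of `app/config.py`: the helper name `clamped_env` and the fall-back-to-default handling of unparseable values are assumptions; only the variable names, defaults, and ranges come from the table.

```python
import os


def clamped_env(name: str, default: float, low: float, high: float) -> float:
    """Read a numeric env var, fall back to `default`, then clamp into [low, high]."""
    raw = os.environ.get(name)
    try:
        value = float(raw) if raw is not None else default
    except ValueError:
        # Assumption of this sketch: unparseable input silently uses the default.
        value = default
    return max(low, min(high, value))


# Defaults and ranges as documented in the table above.
run_timeout = clamped_env("TUTOR_RUN_TIMEOUT", 5, 0.5, 30)
cpu_seconds = clamped_env("TUTOR_RUN_CPU_SECONDS", 5, 1, 60)
mem_mb = clamped_env("TUTOR_RUN_MEM_MB", 256, 32, 4096)
```

Clamping rather than rejecting out-of-range values means a typo such as `TUTOR_RUN_TIMEOUT=999` degrades to the documented maximum instead of preventing startup.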
86112## Endpoints
87113
@@ -155,8 +181,8 @@ Each streamed line is a JSON object forwarded from Ollama's `/api/chat` stream.
155181### ` POST /api/run `
156182
157183Executes student code in an isolated Python subprocess. ** Prototype safety
158- only** — subprocess + hard timeout + restricted env ( ` python -I ` , empty env
159- except ` LC_ALL ` / ` PYTHONIOENCODING ` , temp cwd). This is * not * a real sandbox .
184+ only**; see [Sandbox controls](#sandbox-controls). A static AST scanner runs
185+ first and may short-circuit the request with `blocked: true`.
160186
161187Request:
162188
@@ -177,7 +203,26 @@ Response:
177203 "exit_code" : 0 ,
178204 "duration_ms" : 16 ,
179205 "timed_out" : false ,
180- "truncated" : false
206+ "truncated" : false ,
207+ "blocked" : false ,
208+ "safety_events" : []
209+ }
210+ ```
211+
212+ When the static scanner refuses execution, ` blocked ` is true, ` exit_code `
213+ is ` -1 ` , and ` safety_events ` lists each finding (` type ` , ` detail ` ,
214+ ` lineno ` ):
215+
216+ ``` json
217+ {
218+ "stdout" : " " ,
219+ "stderr" : " [safety] execution blocked: blocked_import: subprocess\n " ,
220+ "exit_code" : -1 ,
221+ "duration_ms" : 0 ,
222+ "timed_out" : false ,
223+ "truncated" : false ,
224+ "blocked" : true ,
225+ "safety_events" : [{"type" : " blocked_import" , "detail" : " subprocess" , "lineno" : 1 }]
181226}
182227```
183228
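A client might interpret the two response shapes roughly as sketched here. The function name `summarize_run` and the wording of the returned strings are hypothetical; the field names (`blocked`, `safety_events`, `timed_out`, `truncated`, `exit_code`) are the ones shown in the responses above.

```python
def summarize_run(result: dict) -> str:
    """Turn a /api/run response body into a one-line status for display."""
    if result.get("blocked"):
        # The static scanner refused execution; list each finding.
        findings = ", ".join(
            f"{event['type']}: {event['detail']} (line {event['lineno']})"
            for event in result.get("safety_events", [])
        )
        return f"blocked before execution ({findings})"
    if result.get("timed_out"):
        return "timed out"
    exit_code = result.get("exit_code")
    status = "ok" if exit_code == 0 else f"exit code {exit_code}"
    if result.get("truncated"):
        status += ", output truncated"
    return status
```

Checking `blocked` before `exit_code` matters because a blocked submission also reports `exit_code: -1`, which would otherwise look like an ordinary crash.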
@@ -227,18 +272,118 @@ Response:
227272best-effort extraction from the model's reply; it may be ` null ` if the
228273tutor's response did not include a recognisable next-step line.
229274
275+ The `docs` field carries any references found by the lookup pipeline (see
276+ [Documentation references](#documentation-references)). The same evidence
277+ packet is sent to the LLM with the URLs spelled out, so the model can cite
278+ them verbatim and has no reason to invent its own.
279+
280+ ### ` GET /api/exercises ` and grading
281+
282+ ``` bash
283+ curl -s http://localhost:8001/api/exercises | jq
284+ curl -s http://localhost:8001/api/exercises/loops.counting-evens | jq
285+ curl -s -X POST http://localhost:8001/api/exercises/loops.counting-evens/grade \
286+ -H ' content-type: application/json' \
287+ -d ' {"code":"def count_even(numbers):\n return sum(1 for n in numbers if n%2==0)\n"}' | jq
288+ ```
289+
290+ The detail endpoint never exposes ` hidden_tests ` . The grade endpoint
291+ appends a small JSON-emitting harness to the student's code, runs it
292+ through the sandbox, and reports per-test outcomes; the harness chatter
293+ is stripped from the visible ` stdout ` .
294+
295+ See [ ` curriculum/exercises/README.md ` ] ( ../curriculum/exercises/README.md )
296+ for the exercise schema and authoring rules.
297+
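The "append a harness, then strip its chatter" idea can be sketched as below. Everything here is an assumption made for illustration: the sentinel `MARKER`, the helper names, and the `{"args": ..., "expected": ...}` test shape are hypothetical, and the real exercise schema is the one documented in `curriculum/exercises/README.md`.

```python
import json
import textwrap

MARKER = "__HARNESS_RESULT__"  # hypothetical sentinel marking the harness line


def with_harness(student_code: str, func_name: str, tests: list[dict]) -> str:
    """Append a small test loop that prints one marked JSON line of results."""
    harness = textwrap.dedent(f"""
        import json as _json
        _results = []
        for _t in {tests!r}:
            try:
                _got = {func_name}(*_t["args"])
                _results.append({{"passed": _got == _t["expected"]}})
            except Exception as _exc:
                _results.append({{"passed": False, "error": repr(_exc)}})
        print({MARKER!r} + _json.dumps(_results))
    """)
    return student_code + "\n" + harness


def split_output(stdout: str) -> tuple[str, list[dict]]:
    """Separate the student's visible output from the marked harness line."""
    visible, results = [], []
    for line in stdout.splitlines():
        if line.startswith(MARKER):
            results = json.loads(line[len(MARKER):])
        else:
            visible.append(line)
    return "\n".join(visible), results
```

The combined source returned by `with_harness` would be handed to the sandboxed runner, and `split_output` applied to the captured `stdout` before anything is shown to the student.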
298+ ### ` POST /api/docs/lookup `
299+
300+ Returns curated reference URLs for a code/question/section without
301+ involving the LLM. Useful for the frontend to surface docs anywhere.
302+
303+ ``` bash
304+ curl -s http://localhost:8001/api/docs/lookup \
305+ -H ' content-type: application/json' \
306+ -d ' {"code":"for i in range(3): print(i)", "section":"Loops"}' | jq
307+ ```
308+
309+ ## Sandbox controls
310+
311+ In addition to the env-var knobs above:
312+
313+ - ` python -I -B ` (isolated mode, no ` .pyc ` ).
314+ - Environment is hand-built: only ` PYTHONIOENCODING ` , ` PYTHONDONTWRITEBYTECODE ` ,
315+ ` LC_ALL ` , and a placeholder ` HOME=/nonexistent ` are passed. No ` PATH ` .
316+ - Per-call tempdir at mode ` 0o700 ` , removed after the run.
317+ - POSIX ` setrlimit ` in a ` preexec_fn ` for CPU, address space, file size,
318+ core files, and process count.
319+ - ` start_new_session=True ` plus ` killpg ` on timeout so any descendant
320+ processes die with the parent.
321+ - Static AST scan ([ ` app/safety.py ` ] ( app/safety.py ) ) — blocks ` subprocess ` ,
322+ ` socket ` , ` ctypes ` , ` urllib ` , ` http ` , ` pickle ` , ` multiprocessing ` ,
323+ ` ssl ` , ` os.system ` , ` os.popen ` , raw ` exec ` /` eval ` /` __import__ ` , …
324+ - ` TUTOR_STRICT_IMPORTS=1 ` adds ` os ` , ` pathlib ` , ` shutil ` , ` tempfile ` ,
325+ ` glob ` , ` importlib ` , and bare ` open(…) ` to the block list.
326+
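A static scan of the kind listed above can be sketched with the standard `ast` module. This is not the code in `app/safety.py`: the block lists are a small illustrative subset, and the `blocked_call` and `syntax_error` event types are assumed names (only `blocked_import` appears in the documented response).

```python
import ast

BLOCKED_MODULES = {"subprocess", "socket", "ctypes"}  # illustrative subset
BLOCKED_CALLS = {"exec", "eval", "__import__"}


def scan(code: str) -> list[dict]:
    """Return a list of findings shaped like the documented `safety_events`."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        # Reported as a finding; whether it blocks execution is up to the caller.
        return [{"type": "syntax_error", "detail": exc.msg, "lineno": exc.lineno}]

    events = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            root = name.split(".")[0]  # "socket.foo" is blocked via "socket"
            if root in BLOCKED_MODULES:
                events.append({"type": "blocked_import", "detail": root, "lineno": node.lineno})
        # Only direct calls by bare name are caught here, e.g. eval("...").
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                events.append({"type": "blocked_call", "detail": node.func.id, "lineno": node.lineno})
    return events
```

A scan like this only sees names that appear literally in the source, so it is a first filter ahead of the process-level controls and not a replacement for them.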
327+ **Known limits.** None of these defend against kernel-level escape or
328+ side-channel attacks. macOS does not honour `RLIMIT_AS` for Python (the runner
329+ logs the failure and continues). Windows has no `resource` module, so no rlimits
330+ apply there; the runner still enforces the timeout, env scrubbing, tempdir, and
331+ static scan. For multi-tenant or hostile workloads, run inside a container,
332+ microVM, or restricted user account; see [`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md).
333+
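The process-level controls can be approximated with the standard library alone. The sketch below is an assumption-laden illustration and not `app/runner.py`: `run_limited` and `_apply_limits` are hypothetical names, the limits are hard-coded to the documented defaults, and it deliberately omits `RLIMIT_AS`, `RLIMIT_NPROC`, and the `start_new_session` + `killpg` handling described above.

```python
import subprocess
import sys
import tempfile

try:
    import resource  # POSIX only; unavailable on Windows
except ImportError:
    resource = None


def _apply_limits() -> None:
    # Runs in the child process after fork and before exec.
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                       # CPU seconds
    resource.setrlimit(resource.RLIMIT_FSIZE, (16 * 2**20, 16 * 2**20))   # 16 MB files
    resource.setrlimit(resource.RLIMIT_CORE, (0, 0))                      # no core dumps


def run_limited(code: str, timeout: float = 5.0) -> subprocess.CompletedProcess:
    # Hand-built environment: note there is no PATH entry.
    env = {"PYTHONIOENCODING": "utf-8", "PYTHONDONTWRITEBYTECODE": "1",
           "LC_ALL": "C.UTF-8", "HOME": "/nonexistent"}
    # TemporaryDirectory creates a private directory and removes it afterwards.
    with tempfile.TemporaryDirectory() as workdir:
        return subprocess.run(
            [sys.executable, "-I", "-B", "-c", code],
            cwd=workdir,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,  # raises subprocess.TimeoutExpired on overrun
            preexec_fn=_apply_limits if resource else None,
        )
```

On platforms without `resource`, the sketch falls back to the timeout, scrubbed environment, and temporary working directory only, which mirrors the degraded mode described under Known limits.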
334+ ## Documentation references
335+
336+ The tutor cites only official documentation, and only via URLs from an
337+ allowlist of hosts (`docs.python.org`, `packaging.python.org`,
338+ `peps.python.org`, `docs.pytest.org`, `mypy.readthedocs.io`,
339+ `typing.readthedocs.io`, `pip.pypa.io`, `setuptools.pypa.io`, plus the
340+ official sites for NumPy, pandas, Matplotlib, SciPy, Flask, FastAPI,
341+ Django, Requests, HTTPX, and SQLAlchemy).
342+
343+ The lookup pipeline:
344+
345+ 1 . Tokenise the student's code, question, section title, and any
346+ ` concepts ` passed in.
347+ 2 . Match tokens against the curated map in
348+ [ ` app/docs_refs.py ` ] ( app/docs_refs.py ) — only allowlisted URLs.
349+ 3 . Add exercise-supplied URLs that pass the allowlist filter.
350+ 4 . If ` TUTOR_DOCS_ONLINE=1 ` (default), issue a HEAD request to each URL
351+ with ` TUTOR_DOCS_TIMEOUT ` (default 2s); drop unreachable URLs. If
352+ every URL fails, return the curated list anyway with ` online_ok=false `
353+ and a note so the UI can label them "unverified".
354+
355+ No URL is ever sourced from the LLM. The evaluation prompt and the chat
356+ system message are explicit: cite only from the supplied list verbatim,
357+ or don't cite.
358+
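The allowlist filter at the heart of steps 2 and 3 can be sketched as follows. This is an illustration, not `app/docs_refs.py`: the function names are hypothetical, the host set is a subset of the list above, and both the exact-hostname match and the HTTPS-only rule are assumptions of this sketch.

```python
from urllib.parse import urlsplit

# Subset of the documented allowlist, for illustration only.
ALLOWED_HOSTS = {"docs.python.org", "peps.python.org", "docs.pytest.org"}


def is_allowed(url: str, allowed_hosts: set[str] = ALLOWED_HOSTS) -> bool:
    """Accept a URL only if its hostname is exactly one of the allowed hosts."""
    parts = urlsplit(url)
    # Assumption: only https links are accepted; hostname is already lowercased.
    return parts.scheme == "https" and parts.hostname in allowed_hosts


def filter_references(urls: list[str]) -> list[str]:
    """Keep allowlisted URLs in their original order, dropping duplicates."""
    seen, kept = set(), []
    for url in urls:
        if is_allowed(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

Matching on the parsed hostname, not on a substring of the URL, is what keeps a look-alike such as `https://docs.python.org.example.net/` out of the results.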
230359## Tests
231360
232361``` bash
233362cd backend
234363.venv/bin/pytest -q
235364```
236365
237- Tests use ` respx ` to mock the Ollama HTTP API, so they run without a real model
238- server. The suite covers health (reachable + degraded), config, default and
239- custom system prompt injection, upstream error handling, the frontend chat
240- wiring, and the ` /api/run ` + ` /api/evaluate ` endpoints (including the runner
241- module's timeout, isolation, and output-truncation behaviour).
366+ Tests use ` respx ` to mock the Ollama HTTP API, so they run without a
367+ real model server. The suite covers:
368+
369+ - health (reachable + degraded), config, system-prompt injection, and
370+ upstream error handling;
371+ - the ` /api/run ` and ` /api/evaluate ` endpoints, the runner's timeout,
372+ environment isolation, output truncation, and oversized-code rejection;
373+ - the **strengthened sandbox controls**: static blocking of
374+ `subprocess`/`socket` imports, `PATH` non-propagation, tempdir CWD, and
375+ (on Linux) the address-space rlimit (`test_runner_sandbox.py`);
376+ - the ** safety AST scanner** : hostile imports, dangerous calls, syntax
377+ errors flagged but not blocked, strict-mode behaviour
378+ (` test_safety.py ` );
379+ - the ** exercise schema and grader** : loader validation, allowlist
380+ filtering of references, passing/failing solutions, runtime errors,
381+ and the harness output-stripping (` test_exercises.py ` );
382+ - the ** docs reference layer** : allowlist filtering, curated lookups,
383+ offline-only behaviour, mocked online HEAD verification with both
384+ full-success and full-failure cases, the ` 405 → GET ` fallback, the
385+ ` /api/docs/lookup ` endpoint, and the ` docs ` block on ` /api/evaluate `
386+ and ` /api/chat ` responses (` test_docs_refs.py ` ).
242387
243388## Roadmap
244389