Commit f9d1be4

Merge pull request #5 from StewAlexander-com/sandbox-docs-exercises
Stronger sandbox, structured exercises, and credible docs references
2 parents 069730f + e051ad7 commit f9d1be4

24 files changed

Lines changed: 2688 additions & 62 deletions

README.md

Lines changed: 101 additions & 16 deletions
@@ -188,23 +188,108 @@ are written up in [`docs/ux-workflow.md`](docs/ux-workflow.md).
188188

189189
### New backend endpoints
190190

191-
- `POST /api/run` — runs the submitted code in an isolated Python subprocess
192-
(`python -I`, empty env, temp cwd, hard wall-clock timeout, size-limited
193-
output). Returns `{stdout, stderr, exit_code, duration_ms, timed_out,
194-
truncated}`. This is **prototype safety only** — subprocess + timeout +
195-
restricted env. Not a real sandbox. See
196-
[`docs/safety-and-sandboxing.md`](docs/safety-and-sandboxing.md) for the
197-
controls a serious deployment would add (containers, seccomp, network
198-
namespaces, CPU/memory limits).
191+
- `POST /api/run` — runs the submitted code in an isolated Python subprocess.
192+
See [Sandbox controls](#sandbox-controls) below for what's in force. The
193+
response now includes `blocked` and `safety_events` so the frontend can
194+
show why the static scanner refused to run the code.
199195
- `POST /api/evaluate` — accepts `{code, section?, question?, run_output?}`,
200-
runs the code if `run_output` is missing, builds an evidence packet, and
201-
asks the LLM for a hint-first assessment. Returns
202-
`{assessment, feedback, next_step, run, model}` where `assessment` is one
203-
of `passed | needs_work | error`.
204-
205-
Configurable via env: `TUTOR_RUN_TIMEOUT` (default 5s, clamped 0.5–30s),
206-
`TUTOR_RUN_MAX_CODE_BYTES` (default 50 000), `TUTOR_RUN_MAX_OUTPUT_BYTES`
207-
(default 32 000).
196+
runs the code if `run_output` is missing, builds an evidence packet, looks
197+
up credible Python documentation references (see
198+
[Documentation references](#documentation-references)), and asks the LLM
199+
for a hint-first assessment. Returns
200+
`{assessment, feedback, next_step, run, model, docs}`.
201+
- `POST /api/chat` — chat with the tutor. The response includes a `docs`
202+
block with the same reference policy when the user's last turn matches a
203+
curated topic.
204+
- `GET /api/exercises` / `GET /api/exercises/{id}` /
205+
`POST /api/exercises/{id}/grade` — structured exercises (prompt, starter
206+
code, visible + hidden tests). The grader appends a JSON-emitting harness
207+
to the student's code, runs it through the sandbox, and reports
208+
per-assertion pass/fail. See
209+
[`curriculum/exercises/README.md`](curriculum/exercises/README.md) for
210+
the schema.
211+
- `POST /api/docs/lookup` — return reference URLs for a given
212+
code/question/section without involving the LLM. Useful for the frontend
213+
to surface docs alongside any view.
214+
215+
### Sandbox controls
216+
217+
The runner remains **prototype safety**, not production isolation, but is
218+
stronger than the original subprocess + timeout combo:
219+
220+
- subprocess with `python -I -B` (isolated mode, no `.pyc`),
221+
- empty environment — `PATH` is not propagated; `HOME` is `/nonexistent`,
222+
- per-call tempdir at mode `0o700`, removed after the run,
223+
- hard wall-clock timeout (`TUTOR_RUN_TIMEOUT`, default 5s, clamped 0.5–30s),
224+
- POSIX resource limits via `setrlimit` in a `preexec_fn` (CPU seconds,
225+
address space, file size, core file, process count),
226+
- process-group kill on timeout so any children die with the parent,
227+
- static AST scan ([`backend/app/safety.py`](backend/app/safety.py)) that
228+
refuses obvious hostile patterns (`subprocess`, `socket`, `ctypes`,
229+
`urllib`, `pickle`, `os.system`, `exec`, `eval`, `__import__`, …) before
230+
the subprocess starts and reports `safety_events`,
231+
- output truncation per stream,
232+
- code-size cap.
233+
234+
Resource-limit defaults are configurable: `TUTOR_RUN_CPU_SECONDS=5`,
235+
`TUTOR_RUN_MEM_MB=256`, `TUTOR_RUN_FSIZE_MB=16`, `TUTOR_RUN_NPROC=64`. Set
236+
`TUTOR_STRICT_IMPORTS=1` to also block `os`, `pathlib`, `shutil`,
237+
`tempfile`, `glob`, `importlib`, and bare `open()`.
238+
239+
**Known limits.** None of these defend against kernel-level escape or
240+
side-channel attacks. macOS does not honour `RLIMIT_AS` for Python (we
241+
log + continue). Windows has no `resource` module — the runner still
242+
applies timeout, env scrubbing, tempdir, and the static scan, but no
243+
rlimits. For multi-tenant or hostile workloads, run inside a container,
244+
microVM, or restricted user — see
245+
[`docs/safety-and-sandboxing.md`](docs/safety-and-sandboxing.md).
246+
247+
### Documentation references
248+
249+
Tutor answers now include source links to **official Python documentation**
250+
when topics match a curated allowlist. The references the tutor surfaces
251+
are always one of:
252+
253+
1. a hit from the in-repo curated map
254+
([`backend/app/docs_refs.py`](backend/app/docs_refs.py)) keyed off
255+
tokens in the student's code, question, or section title,
256+
2. an exercise-supplied URL on the allowlist
257+
([`curriculum/exercises/*.json` `references`](curriculum/exercises/README.md)).
258+
259+
There is **no LLM-authored URL generation**. The model is instructed (via
260+
the evaluation prompt and the augmented chat system message) to cite only
261+
from the supplied list verbatim or not at all.
262+
263+
When `TUTOR_DOCS_ONLINE=1` (default) the backend issues a short HEAD
264+
request to each candidate URL with `TUTOR_DOCS_TIMEOUT` (default 2s,
265+
clamped 0.5–10s). Unreachable URLs are dropped; if *every* URL fails the
266+
references are still returned with `online_ok=false` and a `note` so the
267+
UI can show them dimmed and labelled "unverified". Set
268+
`TUTOR_DOCS_ONLINE=0` to skip the network entirely (fully offline).
269+
270+
Override the host allowlist with `TUTOR_DOCS_ALLOWLIST="docs.python.org,…"`
271+
(comma-separated). Defaults are listed in `docs_refs.DEFAULT_ALLOWED_HOSTS`
272+
and cover docs.python.org, packaging.python.org, peps.python.org,
273+
docs.pytest.org, mypy/typing readthedocs, pip/setuptools, and the official
274+
docs sites of NumPy, pandas, Matplotlib, SciPy, Flask, FastAPI, Django,
275+
Requests, HTTPX, and SQLAlchemy.
276+
277+
### Configurable via env
278+
279+
| Variable | Default | Notes |
280+
|----------------------------|--------:|-------------------------------------------|
281+
| `TUTOR_RUN_TIMEOUT` | 5 | wall-clock timeout (s, clamp 0.5–30) |
282+
| `TUTOR_RUN_MAX_CODE_BYTES` | 50 000 | refuse oversize submissions |
283+
| `TUTOR_RUN_MAX_OUTPUT_BYTES` | 32 000 | per-stream truncation |
284+
| `TUTOR_RUN_CPU_SECONDS` | 5 | POSIX RLIMIT_CPU |
285+
| `TUTOR_RUN_MEM_MB` | 256 | POSIX RLIMIT_AS |
286+
| `TUTOR_RUN_FSIZE_MB` | 16 | POSIX RLIMIT_FSIZE |
287+
| `TUTOR_RUN_NPROC` | 64 | POSIX RLIMIT_NPROC |
288+
| `TUTOR_STRICT_IMPORTS` | 0 | also block `os`, `pathlib`, `open(…)` |
289+
| `TUTOR_DOCS_ONLINE` | 1 | HEAD-check candidate URLs |
290+
| `TUTOR_DOCS_TIMEOUT` | 2.0 | online check timeout (s, clamp 0.5–10) |
291+
| `TUTOR_DOCS_ALLOWLIST` | curated | CSV override of allowed doc hosts |
292+
| `TUTOR_EXERCISES_DIR` | repo | override exercise directory |
208293

209294
## Core Components
210295

backend/README.md

Lines changed: 161 additions & 16 deletions
@@ -4,12 +4,21 @@ A small, local-first FastAPI service that proxies an [Ollama](https://ollama.com
44
LLM (default: `gemma3:4b`) and exposes a tutor-shaped HTTP API for the
55
[`frontend/`](../frontend/) PWA and other clients.
66

7-
The backend now also exposes a *prototype-grade* Python runner
8-
(`POST /api/run`) and an LLM evaluator (`POST /api/evaluate`) used by the
9-
frontend's inline code lab. The runner uses subprocess isolation with a
10-
hard wall-clock timeout and a restricted env — see
11-
[`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md) for the
12-
controls a real deployment would still need to add.
7+
The backend exposes:
8+
9+
- `POST /api/chat` (plain and streaming) — the LLM proxy,
10+
- `POST /api/run` — sandboxed Python execution (timeout + rlimits + static
11+
scan),
12+
- `POST /api/evaluate` — runs the student's code, looks up curated docs,
13+
and asks the LLM for hint-first feedback,
14+
- `GET /api/exercises`, `GET /api/exercises/{id}`,
15+
`POST /api/exercises/{id}/grade` — structured exercises with a
16+
visible/hidden test split,
17+
- `POST /api/docs/lookup` — curated Python documentation references.
18+
19+
Reference URLs come from a curated allowlist; no LLM-authored URLs are
20+
ever shown. See [Documentation references](#documentation-references) and
21+
[Sandbox controls](#sandbox-controls) below for the policy details.
1322

1423
## Layout
1524

@@ -19,10 +28,18 @@ backend/
1928
│ ├── config.py # env-driven Settings + tutor system prompt loader
2029
│ ├── main.py # FastAPI app factory and routes
2130
│ ├── ollama_client.py # async client for /api/tags and /api/chat
22-
│ ├── runner.py # prototype Python subprocess runner (timeout + restricted env)
31+
│ ├── runner.py # prototype Python subprocess runner (timeout + rlimits)
32+
│ ├── safety.py # static AST scanner — blocks hostile imports / calls
33+
│ ├── docs_refs.py # curated docs allowlist + optional online HEAD check
34+
│ ├── exercises.py # JSON exercise loader + grading harness
2335
│ └── schemas.py # pydantic request/response models
2436
├── tests/
25-
│ └── test_api.py # mocked Ollama tests via respx
37+
│ ├── test_api.py
38+
│ ├── test_run_evaluate.py
39+
│ ├── test_runner_sandbox.py
40+
│ ├── test_safety.py
41+
│ ├── test_exercises.py
42+
│ └── test_docs_refs.py
2643
├── requirements.txt
2744
├── requirements-dev.txt
2845
└── pytest.ini
@@ -82,6 +99,15 @@ before launching uvicorn — the static frontend will be mounted at `/`.
8299
| `TUTOR_RUN_TIMEOUT` | `5` | Wall-clock seconds for `/api/run` and `/api/evaluate` code execution. Clamped to 0.5–30s. |
83100
| `TUTOR_RUN_MAX_CODE_BYTES` | `50000` | Max UTF-8 bytes accepted for a single submission. Clamped to 1 000–200 000. |
84101
| `TUTOR_RUN_MAX_OUTPUT_BYTES` | `32000` | Each of stdout/stderr is truncated past this. Clamped to 1 000–200 000. |
102+
| `TUTOR_RUN_CPU_SECONDS` | `5` | POSIX `RLIMIT_CPU` (CPU seconds). Clamped 1–60. |
103+
| `TUTOR_RUN_MEM_MB` | `256` | POSIX `RLIMIT_AS` (address space, MB). Clamped 32–4096. |
104+
| `TUTOR_RUN_FSIZE_MB` | `16` | POSIX `RLIMIT_FSIZE` (max file size, MB). Clamped 1–256. |
105+
| `TUTOR_RUN_NPROC` | `64` | POSIX `RLIMIT_NPROC` (max processes). Clamped 8–1024. |
106+
| `TUTOR_STRICT_IMPORTS` | `0` | Also block `os`, `pathlib`, `shutil`, `tempfile`, `glob`, `importlib`, and bare `open(...)`. |
107+
| `TUTOR_DOCS_ONLINE` | `1` | HEAD-check each candidate doc URL before returning it. |
108+
| `TUTOR_DOCS_TIMEOUT` | `2.0` | Online check timeout (s). Clamped 0.5–10. |
109+
| `TUTOR_DOCS_ALLOWLIST` | curated | CSV of allowed doc hostnames; overrides defaults entirely. |
110+
| `TUTOR_EXERCISES_DIR` | repo `curriculum/exercises` | Override exercise directory. |
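
The clamping described in the table can be sketched as follows. This is a minimal illustration, not the code in `config.py`: the helper name `clamped_float` is hypothetical, and falling back to the default on an unparseable value is an assumption.

```python
import os
from typing import Mapping


def clamped_float(
    name: str,
    default: float,
    lo: float,
    hi: float,
    environ: Mapping[str, str] = os.environ,
) -> float:
    """Read a float setting from the environment and clamp it to [lo, hi]."""
    raw = environ.get(name)
    try:
        value = float(raw) if raw is not None else default
    except ValueError:
        # Assumption: an unparseable value falls back to the default.
        value = default
    return max(lo, min(hi, value))


# TUTOR_RUN_TIMEOUT: default 5s, clamped to 0.5-30s (values from the table).
timeout = clamped_float("TUTOR_RUN_TIMEOUT", 5.0, 0.5, 30.0)
```

Passing `environ` explicitly keeps the helper easy to test without touching the real process environment.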
85111

86112
## Endpoints
87113

@@ -155,8 +181,8 @@ Each streamed line is a JSON object forwarded from Ollama's `/api/chat` stream.
155181
### `POST /api/run`
156182

157183
Executes student code in an isolated Python subprocess. **Prototype safety
158-
only**: subprocess + hard timeout + restricted env (`python -I`, empty env
159-
except `LC_ALL`/`PYTHONIOENCODING`, temp cwd). This is *not* a real sandbox.
184+
only**: see [Sandbox controls](#sandbox-controls). The static AST scanner runs
185+
first and may short-circuit with `blocked: true`.
160186

161187
Request:
162188

@@ -177,7 +203,26 @@ Response:
177203
"exit_code": 0,
178204
"duration_ms": 16,
179205
"timed_out": false,
180-
"truncated": false
206+
"truncated": false,
207+
"blocked": false,
208+
"safety_events": []
209+
}
210+
```
211+
212+
When the static scanner refuses execution, `blocked` is true, `exit_code`
213+
is `-1`, and `safety_events` lists each finding (`type`, `detail`,
214+
`lineno`):
215+
216+
```json
217+
{
218+
"stdout": "",
219+
"stderr": "[safety] execution blocked: blocked_import: subprocess\n",
220+
"exit_code": -1,
221+
"duration_ms": 0,
222+
"timed_out": false,
223+
"truncated": false,
224+
"blocked": true,
225+
"safety_events": [{"type": "blocked_import", "detail": "subprocess", "lineno": 1}]
181226
}
182227
```
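
A static scan that produces events of this shape could look roughly like the sketch below. It is an illustration under assumptions, not the contents of `app/safety.py`: the function names, the `blocked_call` and `syntax_error` event types, and the exact block lists are hypothetical (only `blocked_import` and the `type`/`detail`/`lineno` keys appear above).

```python
from __future__ import annotations

import ast

# Assumption: a subset of the modules and calls the README lists as blocked.
BLOCKED_IMPORTS = {"subprocess", "socket", "ctypes", "urllib", "http",
                   "pickle", "multiprocessing", "ssl"}
BLOCKED_NAME_CALLS = {"exec", "eval", "__import__"}
BLOCKED_ATTR_CALLS = {("os", "system"), ("os", "popen")}


def scan(code: str) -> list[dict]:
    """Return a list of safety events found by walking the parsed AST."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        # Flagged but not blocking: the interpreter will report it anyway.
        return [{"type": "syntax_error", "detail": exc.msg, "lineno": exc.lineno}]

    events = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            root = module.split(".")[0]  # "urllib.request" -> "urllib"
            if root in BLOCKED_IMPORTS:
                events.append({"type": "blocked_import", "detail": root,
                               "lineno": node.lineno})

        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Name) and func.id in BLOCKED_NAME_CALLS:
                events.append({"type": "blocked_call", "detail": func.id,
                               "lineno": node.lineno})
            elif (isinstance(func, ast.Attribute)
                  and isinstance(func.value, ast.Name)
                  and (func.value.id, func.attr) in BLOCKED_ATTR_CALLS):
                events.append({"type": "blocked_call",
                               "detail": f"{func.value.id}.{func.attr}",
                               "lineno": node.lineno})
    return events


def is_blocked(events: list[dict]) -> bool:
    """Any event other than a syntax error refuses execution."""
    return any(event["type"] != "syntax_error" for event in events)
```

A scan like this only catches obvious patterns; aliased or dynamically constructed calls slip past it, which is why it complements the runtime controls rather than replacing them.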
183228

@@ -227,18 +272,118 @@ Response:
227272
best-effort extraction from the model's reply; it may be `null` if the
228273
tutor's response did not include a recognisable next-step line.
229274

275+
The `docs` field carries any references found by the lookup pipeline (see
276+
[Documentation references](#documentation-references)). The same evidence
277+
packet is sent to the LLM with the URLs spelled out so the model can cite
278+
them verbatim and has no reason to invent its own.
279+
280+
### `GET /api/exercises` and grading
281+
282+
```bash
283+
curl -s http://localhost:8001/api/exercises | jq
284+
curl -s http://localhost:8001/api/exercises/loops.counting-evens | jq
285+
curl -s -X POST http://localhost:8001/api/exercises/loops.counting-evens/grade \
286+
-H 'content-type: application/json' \
287+
-d '{"code":"def count_even(numbers):\n return sum(1 for n in numbers if n%2==0)\n"}' | jq
288+
```
289+
290+
The detail endpoint never exposes `hidden_tests`. The grade endpoint
291+
appends a small JSON-emitting harness to the student's code, runs it
292+
through the sandbox, and reports per-test outcomes; the harness chatter
293+
is stripped from the visible `stdout`.
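
The harness approach can be sketched as below. Everything here is an assumption for illustration: the sentinel string, the function names, and the `(name, expression)` test format are hypothetical (the real schema lives in `curriculum/exercises/README.md`), and a plain `subprocess.run` stands in for the sandboxed runner.

```python
import json
import subprocess
import sys

SENTINEL = "__TUTOR_RESULTS__"  # hypothetical marker for the harness output line


def build_harness(tests):
    """Build source that evaluates each test expression and prints JSON results."""
    lines = ["", "import json as _json", "_results = []"]
    for name, expr in tests:
        lines += [
            "try:",
            f"    _passed, _error = bool({expr}), None",
            "except Exception as _exc:",
            "    _passed, _error = False, repr(_exc)",
            f"_results.append({{'name': {name!r}, 'passed': _passed, 'error': _error}})",
        ]
    lines.append(f"print({SENTINEL!r} + _json.dumps(_results))")
    return "\n".join(lines) + "\n"


def split_output(stdout):
    """Separate the student's own output from the harness result line."""
    visible, results = [], []
    for line in stdout.splitlines():
        if line.startswith(SENTINEL):
            results = json.loads(line[len(SENTINEL):])
        else:
            visible.append(line)
    return "\n".join(visible), results


def grade(code, tests, timeout=10.0):
    """Append the harness to the student's code, run it, and report per-test outcomes."""
    source = code.rstrip("\n") + "\n" + build_harness(tests)
    proc = subprocess.run(
        [sys.executable, "-I", "-c", source],
        capture_output=True, text=True, timeout=timeout,
    )
    visible, results = split_output(proc.stdout)
    return {"stdout": visible, "tests": results}
```

Wrapping each expression in its own `try` block means one failing or crashing test does not stop the remaining tests from being reported.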
294+
295+
See [`curriculum/exercises/README.md`](../curriculum/exercises/README.md)
296+
for the exercise schema and authoring rules.
297+
298+
### `POST /api/docs/lookup`
299+
300+
Returns curated reference URLs for a code/question/section without
301+
involving the LLM. Useful for the frontend to surface docs anywhere.
302+
303+
```bash
304+
curl -s http://localhost:8001/api/docs/lookup \
305+
-H 'content-type: application/json' \
306+
-d '{"code":"for i in range(3): print(i)", "section":"Loops"}' | jq
307+
```
308+
309+
## Sandbox controls
310+
311+
In addition to the env-var knobs above:
312+
313+
- `python -I -B` (isolated mode, no `.pyc`).
314+
- Environment is hand-built: only `PYTHONIOENCODING`, `PYTHONDONTWRITEBYTECODE`,
315+
`LC_ALL`, and a placeholder `HOME=/nonexistent` are passed. No `PATH`.
316+
- Per-call tempdir at mode `0o700`, removed after the run.
317+
- POSIX `setrlimit` in a `preexec_fn` for CPU, address space, file size,
318+
core files, and process count.
319+
- `start_new_session=True` plus `killpg` on timeout so any descendant
320+
processes die with the parent.
321+
- Static AST scan ([`app/safety.py`](app/safety.py)) — blocks `subprocess`,
322+
`socket`, `ctypes`, `urllib`, `http`, `pickle`, `multiprocessing`,
323+
`ssl`, `os.system`, `os.popen`, raw `exec`/`eval`/`__import__`, …
324+
- `TUTOR_STRICT_IMPORTS=1` adds `os`, `pathlib`, `shutil`, `tempfile`,
325+
`glob`, `importlib`, and bare `open(…)` to the block list.
326+
327+
**Known limits.** None of these defend against kernel-level escape or
328+
side-channel attacks. macOS does not honour `RLIMIT_AS` for Python (we
329+
log + continue). Windows lacks `resource` — the runner still applies the
330+
timeout, env scrubbing, tempdir, and static scan. For multi-tenant or
331+
hostile workloads, run inside a container/microVM/restricted user — see
332+
[`docs/safety-and-sandboxing.md`](../docs/safety-and-sandboxing.md).
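
The runtime controls listed above combine roughly as in this POSIX-only sketch. It is not the code in `runner.py`: the function names are hypothetical, the environment values are assumptions (the README names the variables but not their values), the limits are the documented defaults hard-coded, and the code is passed with `-c` for brevity.

```python
import os
import resource  # POSIX only; not available on Windows
import signal
import subprocess
import sys
import tempfile

MB = 1024 * 1024


def _cap(kind, value):
    """Lower one resource limit, never above the current hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    try:
        resource.setrlimit(kind, (value, value))
    except (ValueError, OSError):
        pass  # some platforms reject certain limits; continue without them


def _apply_limits():
    """Runs in the child between fork and exec (via preexec_fn)."""
    _cap(resource.RLIMIT_CPU, 5)           # CPU seconds
    _cap(resource.RLIMIT_AS, 256 * MB)     # address space
    _cap(resource.RLIMIT_FSIZE, 16 * MB)   # largest file the child may write
    _cap(resource.RLIMIT_CORE, 0)          # no core dumps
    _cap(resource.RLIMIT_NPROC, 64)        # process count


def run_sandboxed(code, timeout=5.0):
    # Hand-built environment: no PATH, placeholder HOME (values are assumptions).
    env = {"PYTHONIOENCODING": "utf-8", "PYTHONDONTWRITEBYTECODE": "1",
           "LC_ALL": "C.UTF-8", "HOME": "/nonexistent"}
    # TemporaryDirectory uses mkdtemp, which creates the directory with mode 0o700.
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.Popen(
            [sys.executable, "-I", "-B", "-c", code],
            cwd=workdir, env=env, stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True,
            preexec_fn=_apply_limits,
            start_new_session=True,  # child becomes leader of its own process group
        )
        try:
            stdout, stderr = proc.communicate(timeout=timeout)
            timed_out = False
        except subprocess.TimeoutExpired:
            try:
                os.killpg(proc.pid, signal.SIGKILL)  # kill the whole group
            except ProcessLookupError:
                pass  # the group already exited
            stdout, stderr = proc.communicate()
            timed_out = True
    return {"stdout": stdout, "stderr": stderr,
            "exit_code": proc.returncode, "timed_out": timed_out}
```

Because `start_new_session=True` makes the child's process-group ID equal to its PID, `os.killpg(proc.pid, ...)` reaches any descendants that stayed in that group.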
333+
334+
## Documentation references
335+
336+
The tutor cites only official Python documentation, and only via URLs
337+
from an allowlist (`docs.python.org`, `packaging.python.org`,
338+
`peps.python.org`, `docs.pytest.org`, `mypy.readthedocs.io`,
339+
`typing.readthedocs.io`, `pip.pypa.io`, `setuptools.pypa.io`, plus the
340+
official sites for NumPy, pandas, Matplotlib, SciPy, Flask, FastAPI,
341+
Django, Requests, HTTPX, SQLAlchemy).
342+
343+
The lookup pipeline:
344+
345+
1. Tokenise the student's code, question, section title, and any
346+
`concepts` passed in.
347+
2. Match tokens against the curated map in
348+
[`app/docs_refs.py`](app/docs_refs.py) — only allowlisted URLs.
349+
3. Add exercise-supplied URLs that pass the allowlist filter.
350+
4. If `TUTOR_DOCS_ONLINE=1` (default), issue a HEAD request to each URL
351+
with `TUTOR_DOCS_TIMEOUT` (default 2s); drop unreachable URLs. If
352+
every URL fails, return the curated list anyway with `online_ok=false`
353+
and a note so the UI can label them "unverified".
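
Steps 3 and 4 can be sketched as follows. The function names, the `references` key, the two-host allowlist, and the HTTPS-only rule are assumptions for illustration; only `online_ok` and the note come from the description above. The reachability check is injected as a callable so the logic can be exercised without network access; in a real deployment it would wrap the HEAD request.

```python
from __future__ import annotations

from typing import Callable, Iterable
from urllib.parse import urlsplit

# Assumption: a small subset of the default allowlist.
ALLOWED_HOSTS = frozenset({"docs.python.org", "peps.python.org"})


def is_allowed(url: str, allowed_hosts: frozenset = ALLOWED_HOSTS) -> bool:
    """Accept only HTTPS URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed_hosts


def verify(urls: Iterable[str], is_reachable: Callable[[str], bool]) -> dict:
    """Filter by allowlist, then drop unreachable URLs unless all of them fail."""
    candidates = [url for url in urls if is_allowed(url)]
    if not candidates:
        return {"references": [], "online_ok": True, "note": None}

    reachable = [url for url in candidates if is_reachable(url)]
    if reachable:
        return {"references": reachable, "online_ok": True, "note": None}

    # Every check failed: keep the curated list but mark it unverified.
    return {"references": candidates, "online_ok": False,
            "note": "unverified: no reference URL could be reached"}
```

Comparing the parsed hostname exactly, rather than searching the URL string, rejects lookalikes such as `docs.python.org.example.com`.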
354+
355+
No URL is ever sourced from the LLM. The evaluation prompt and the chat
356+
system message are explicit: cite only from the supplied list verbatim,
357+
or don't cite.
358+
230359
## Tests
231360

232361
```bash
233362
cd backend
234363
.venv/bin/pytest -q
235364
```
236365

237-
Tests use `respx` to mock the Ollama HTTP API, so they run without a real model
238-
server. The suite covers health (reachable + degraded), config, default and
239-
custom system prompt injection, upstream error handling, the frontend chat
240-
wiring, and the `/api/run` + `/api/evaluate` endpoints (including the runner
241-
module's timeout, isolation, and output-truncation behaviour).
366+
Tests use `respx` to mock the Ollama HTTP API, so they run without a
367+
real model server. The suite covers:
368+
369+
- health (reachable + degraded), config, system-prompt injection, and
370+
upstream error handling;
371+
- the `/api/run` and `/api/evaluate` endpoints, the runner's timeout,
372+
environment isolation, output truncation, and oversized-code rejection;
373+
- the **strengthened sandbox controls**: static blocking of
374+
`subprocess`/`socket`, `PATH` non-propagation, tempdir CWD, and
375+
(on Linux) the address-space rlimit (`test_runner_sandbox.py`);
376+
- the **safety AST scanner**: hostile imports, dangerous calls, syntax
377+
errors flagged but not blocked, strict-mode behaviour
378+
(`test_safety.py`);
379+
- the **exercise schema and grader**: loader validation, allowlist
380+
filtering of references, passing/failing solutions, runtime errors,
381+
and the harness output-stripping (`test_exercises.py`);
382+
- the **docs reference layer**: allowlist filtering, curated lookups,
383+
offline-only behaviour, mocked online HEAD verification with both
384+
full-success and full-failure cases, the `405 → GET` fallback, the
385+
`/api/docs/lookup` endpoint, and the `docs` block on `/api/evaluate`
386+
and `/api/chat` responses (`test_docs_refs.py`).
242387

243388
## Roadmap
244389
