Skip to content

Commit 80c2106

Browse files
authored
Merge pull request #67 from benoitc/security-fixes
Security and robustness fixes from NIF audit
2 parents 09f3191 + 07a78a8 commit 80c2106

22 files changed

Lines changed: 1052 additions & 187 deletions

CHANGELOG.md

Lines changed: 61 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,67 @@
22

33
## Unreleased
44

5+
### Fixed
6+
7+
- **NIF robustness hardening** - `make_py_error` no longer passes a NULL message/type
8+
to `enif_make_string`/`enif_make_atom` when a Python exception's text isn't
9+
UTF-8-encodable; `binary_to_string` rejects names/code containing an embedded NUL
10+
(which would silently truncate a module/function/attr/code string) rather than
11+
truncating; a leaked `split` method object in the reactor buffer is released; and a
12+
stray debug `fprintf` on the normal worker send path is removed.
13+
14+
### Security
15+
16+
- **No shell for venv/installer commands** - `py:ensure_venv` and dependency
17+
installation now run the executables via `open_port({spawn_executable, ...})` with an
18+
argument list instead of building a shell string for `os:cmd`. Venv paths, requirement
19+
files, and extras are passed literally, so shell metacharacters can't be injected. For
20+
`uv`, `VIRTUAL_ENV` is passed via the port `{env, ...}` option rather than a shell prefix.
21+
- **Bounded shared state + safe stream/log builders** - `py_state` gained an optional
22+
`max_state_entries` cap (default `infinity`, unchanged behavior) enforced with atomic
23+
admission so Python-driven `state_set` can't exhaust node memory, and its size counter
24+
is protected from corruption. The `py:stream` and logging helpers that build Python
25+
source now strictly validate module/function/kwarg names as identifiers (rejecting
26+
injection at positions where quoting is meaningless) and escape string-literal values
27+
including control characters.
28+
- **Validated event-loop fd handles** - The asyncio reader/writer integration no longer
29+
hands Python a raw `fd_resource` pointer as an integer key. Each handle is an opaque id
30+
validated against a registry on every use, so a stale, duplicate, or fabricated id is a
31+
safe no-op (or clean error) instead of a double-free or arbitrary-pointer dereference
32+
that crashed the node. `fd_read`/`fd_write` also moved to dirty IO schedulers.
33+
- **OWN_GIL worker robustness** (Python 3.14+) - A per-request allocation failure in
34+
a subinterpreter worker no longer `break`s (and permanently kills) the worker command
35+
loop; it returns an error and keeps serving. The `owngil_*` dispatch NIFs now run on
36+
dirty IO schedulers and use non-blocking, deadline-bounded pipe reads and writes, so a
37+
stalled or dead worker can't wedge a scheduler forever. The internal `SuspensionRequired`
38+
exception is now looked up per-interpreter (like `ProcessError`), avoiding cross-
39+
interpreter object use under OWN_GIL.
40+
- **Callback suspend/resume lifetime hardening** - The worker resource is now kept
41+
alive for the lifetime of a suspended callback (it could previously be GC'd mid-
42+
suspension, causing a use-after-free on resume). A resume frees any prior result
43+
before storing a new one (no leak/double-replay on a duplicate resume), the
44+
pending-callback thread-local is cleared at the worker request boundary, and the
45+
callback-response pipe writes run on dirty schedulers with non-blocking, deadline-
46+
bounded writes so a stalled reader or large payload can't wedge a scheduler or
47+
desync the framed protocol.
48+
- **Zero-copy buffer pinning** - `py_buffer` no longer relocates (and frees) its
49+
storage while a Python `memoryview` points into it. A write that would grow the
50+
buffer while a view is held now returns an error instead of dangling the view into
51+
freed memory (a use-after-free that crashed the whole node).
52+
- **Bounded recursion in type conversion** - The Erlang<->Python converters now cap
53+
nesting depth, so a deeply nested term (or Python structure) returns a clean error
54+
instead of overflowing the C stack and crashing the whole node.
55+
- **NULL-checked tuple allocation** - Argument-tuple allocations in the call/eval paths
56+
are checked before use, and the Python->Erlang map conversion is bounded against
57+
mid-iteration dict mutation, closing two ways an allocation failure or re-entrant
58+
`__str__` could corrupt memory.
59+
- **Safe term decoding at the NIF boundary** - All `enif_binary_to_term` calls now
60+
pass `ERL_NIF_BIN2TERM_SAFE`, preventing attacker-influenced data (notably a Python
61+
`"__etf__:<base64>"` callback result) from minting new, non-GC'd atoms and exhausting
62+
the atom table. Local-node pids/refs and already-existing atoms still round-trip
63+
unchanged; only brand-new atoms, remote-node pids/refs, and external funs in
64+
Python-supplied payloads are now rejected.
65+
566
### Changed
667

768
- **Support Erlang/OTP 28 and 29** - Validated builds and the full Common Test

c_src/py_buffer.c

Lines changed: 8 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -152,6 +152,14 @@ int py_buffer_write(py_buffer_resource_t *buf, const unsigned char *data, size_t
152152
/* Check if we need to grow the buffer */
153153
size_t required = buf->write_pos + size;
154154
if (required > buf->capacity) {
155+
/* A live Python memoryview (from PyBuffer_getbuffer) holds a raw pointer
156+
* into buf->data. Relocating/freeing the buffer now would leave that
157+
* pointer dangling -> use-after-free that crashes the whole node. Refuse
158+
* to grow while any view is pinned; the caller gets a write error. */
159+
if (buf->view_count > 0) {
160+
pthread_mutex_unlock(&buf->mutex);
161+
return -1;
162+
}
155163
/* Calculate new capacity */
156164
size_t new_capacity = buf->capacity;
157165
while (new_capacity < required) {

c_src/py_callback.c

Lines changed: 59 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -147,6 +147,33 @@ static PyObject *get_current_process_error(void) {
147147
return exc_class;
148148
}
149149

150+
/**
151+
* Get the SuspensionRequired exception class from the current interpreter's
152+
* erlang module. Under OWN_GIL subinterpreters each interpreter has its own
153+
* erlang module/class, so raising the process-global object (which belongs to
154+
* whichever interpreter initialized last) is cross-interpreter UB. Mirrors
155+
* get_current_process_error().
156+
*/
157+
static PyObject *get_current_suspension_required(void) {
158+
PyObject *erlang_module = PyImport_ImportModule("erlang");
159+
if (erlang_module == NULL) {
160+
PyErr_Clear();
161+
return SuspensionRequiredException; /* Fallback to global */
162+
}
163+
164+
PyObject *exc_class = PyObject_GetAttrString(erlang_module, "SuspensionRequired");
165+
Py_DECREF(erlang_module);
166+
167+
if (exc_class == NULL) {
168+
PyErr_Clear();
169+
return SuspensionRequiredException; /* Fallback to global */
170+
}
171+
172+
/* See get_current_process_error: decref and rely on the module keeping it alive. */
173+
Py_DECREF(exc_class);
174+
return exc_class;
175+
}
176+
150177
/* ============================================================================
151178
* Callback Name Registry
152179
*
@@ -399,6 +426,14 @@ static suspended_state_t *create_suspended_state_ex(
399426
} else {
400427
state->worker = source->data.existing->worker;
401428
}
429+
/* Keep the worker resource alive for as long as the suspended state exists.
430+
* Without this the worker can be GC'd while a callback is suspended, and
431+
* nif_resume_callback_dirty would dereference a freed worker (use-after-free
432+
* with the GIL held). Mirrors the enif_keep_resource(ctx) on the context path;
433+
* suspended_state_destructor releases it. */
434+
if (state->worker != NULL) {
435+
enif_keep_resource(state->worker);
436+
}
402437

403438
state->callback_id = PyLong_AsUnsignedLongLong(callback_id_obj);
404439

@@ -977,7 +1012,7 @@ static PyObject *decode_etf_string(const char *str, Py_ssize_t len) {
9771012

9781013
/* Decode the ETF binary to an Erlang term */
9791014
ERL_NIF_TERM term;
980-
if (enif_binary_to_term(tmp_env, (unsigned char *)bin_data, bin_len, &term, 0) == 0) {
1015+
if (enif_binary_to_term(tmp_env, (unsigned char *)bin_data, bin_len, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
9811016
/* Decoding failed */
9821017
enif_free_env(tmp_env);
9831018
Py_DECREF(decoded);
@@ -2109,7 +2144,7 @@ static PyObject *erlang_call_impl(PyObject *self, PyObject *args) {
21092144
Py_XSETREF(tl_pending_args, call_args);
21102145

21112146
/* Raise exception to abort Python execution */
2112-
PyErr_SetString(SuspensionRequiredException, "callback pending");
2147+
PyErr_SetString(get_current_suspension_required(), "callback pending");
21132148
return NULL;
21142149
}
21152150

@@ -2859,7 +2894,7 @@ static PyObject *erlang_channel_try_receive_impl(PyObject *self, PyObject *args)
28592894
}
28602895

28612896
ERL_NIF_TERM term;
2862-
if (enif_binary_to_term(tmp_env, data, size, &term, 0) == 0) {
2897+
if (enif_binary_to_term(tmp_env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
28632898
enif_free(data);
28642899
enif_free_env(tmp_env);
28652900
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
@@ -2939,7 +2974,7 @@ static PyObject *erlang_channel_receive_impl(PyObject *self, PyObject *args) {
29392974
}
29402975

29412976
ERL_NIF_TERM term;
2942-
if (enif_binary_to_term(tmp_env, data, size, &term, 0) == 0) {
2977+
if (enif_binary_to_term(tmp_env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
29432978
enif_free(data);
29442979
enif_free_env(tmp_env);
29452980
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
@@ -3251,7 +3286,7 @@ static PyObject *erlang_channel_wait_impl(PyObject *self, PyObject *args) {
32513286
}
32523287

32533288
ERL_NIF_TERM term;
3254-
if (enif_binary_to_term(tmp_env, data, msg_size, &term, 0) == 0) {
3289+
if (enif_binary_to_term(tmp_env, data, msg_size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
32553290
enif_free(data);
32563291
enif_free_env(tmp_env);
32573292
PyErr_SetString(PyExc_RuntimeError, "failed to decode term");
@@ -4316,7 +4351,14 @@ static ERL_NIF_TERM nif_resume_callback(ErlNifEnv *env, int argc, const ERL_NIF_
43164351
/* Store the result in the suspended state */
43174352
pthread_mutex_lock(&state->mutex);
43184353

4319-
/* Copy result data */
4354+
/* Copy result data. Free any prior result first: a duplicate/raced resume
4355+
* would otherwise leak the previous buffer. (has_result is not a one-shot
4356+
* flag -- it toggles during nested replay -- so result_data is the real
4357+
* pending-result indicator.) */
4358+
if (state->result_data != NULL) {
4359+
enif_free(state->result_data);
4360+
state->result_data = NULL;
4361+
}
43204362
state->result_data = enif_alloc(result_bin.size);
43214363
if (state->result_data == NULL) {
43224364
pthread_mutex_unlock(&state->mutex);
@@ -4364,6 +4406,12 @@ static ERL_NIF_TERM nif_resume_callback_dirty(ErlNifEnv *env, int argc, const ER
43644406
return make_error(env, "no_result");
43654407
}
43664408

4409+
/* The worker is kept alive for the lifetime of the suspended state, but
4410+
* guard rather than dereference NULL in the replay below. */
4411+
if (state->worker == NULL) {
4412+
return make_error(env, "no_worker");
4413+
}
4414+
43674415
/* Set up thread-local state for replay */
43684416
tl_current_worker = state->worker;
43694417
tl_callback_env = env;
@@ -4430,6 +4478,11 @@ static ERL_NIF_TERM nif_resume_callback_dirty(ErlNifEnv *env, int argc, const ER
44304478
}
44314479

44324480
PyObject *args = PyTuple_New(args_len);
4481+
if (args == NULL) {
4482+
Py_DECREF(func);
4483+
result = make_error(env, "alloc_failed");
4484+
goto call_cleanup;
4485+
}
44334486
ERL_NIF_TERM head, tail = state->orig_args;
44344487
for (unsigned int i = 0; i < args_len; i++) {
44354488
enif_get_list_cell(state->orig_env, tail, &head, &tail);

c_src/py_channel.c

Lines changed: 3 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -555,7 +555,7 @@ static ERL_NIF_TERM nif_channel_receive(ErlNifEnv *env, int argc, const ERL_NIF_
555555
if (result == 0) {
556556
/* Data available - convert back to term */
557557
ERL_NIF_TERM term;
558-
if (enif_binary_to_term(env, data, size, &term, 0) == 0) {
558+
if (enif_binary_to_term(env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
559559
enif_free(data);
560560
return make_error(env, "binary_to_term_failed");
561561
}
@@ -590,7 +590,7 @@ static ERL_NIF_TERM nif_channel_try_receive(ErlNifEnv *env, int argc, const ERL_
590590
if (result == 0) {
591591
/* Data available - convert back to term */
592592
ERL_NIF_TERM term;
593-
if (enif_binary_to_term(env, data, size, &term, 0) == 0) {
593+
if (enif_binary_to_term(env, data, size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
594594
enif_free(data);
595595
return make_error(env, "binary_to_term_failed");
596596
}
@@ -760,7 +760,7 @@ ERL_NIF_TERM nif_channel_wait(ErlNifEnv *env, int argc, const ERL_NIF_TERM argv[
760760

761761
/* Convert back to term */
762762
ERL_NIF_TERM term;
763-
if (enif_binary_to_term(env, data, msg_size, &term, 0) == 0) {
763+
if (enif_binary_to_term(env, data, msg_size, &term, ERL_NIF_BIN2TERM_SAFE) == 0) {
764764
enif_free(data);
765765
return make_error(env, "binary_to_term_failed");
766766
}

0 commit comments

Comments
 (0)