Commit a15c188

Forbid randomness inside variable formulas (#500)
* Forbid randomness inside variable formulas

A rules-engine formula must be a pure, deterministic function of its inputs: identical inputs must always produce identical outputs. Calling a random number generator inside a formula breaks that contract and makes datasets non-reproducible (a property_purchased assignment built on unseeded np.random recently spiked a UK income decile's tax rate and blocked data releases for ~2 weeks).

While a formula runs, forbid_randomness replaces the public callables of numpy.random and the stdlib random module with functions that raise NonDeterministicFormulaError. Seeding does not make randomness acceptable in a formula; stochastic inputs must be precomputed deterministically when building the dataset and stored as inputs. The guard is re-entrant and restores the namespaces once the outermost formula returns.

Removes the per-variable np.random.seed previously applied at the top of calculate(), which existed only to make formula-level randomness reproducible and now conflicts with the guard. Core-internal seeding in __init__ (for non-formula randomness) is unchanged.

No formula in policyengine-uk or policyengine-us uses randomness, so this is enforcement of an already-held invariant. Full core suite passes (580 tests).

* Fix review findings: document RNG-guard limits, pin with tests

Independent review of the determinism guard surfaced two interception gaps and missing exception-safety coverage:

- A generator hoisted to module scope (rng = np.random.default_rng(0)) and used inside a formula is not caught: numpy's generator classes are immutable C extension types, so their bound methods cannot be patched. (Building a generator inside the formula IS caught, since the np.random.default_rng constructor is patched.)
- A drawing function imported by name before the guard installs (from numpy.random import random) is not caught.

Neither is closeable without heavy machinery, and no PolicyEngine formula uses randomness, so document both accurately in the module docstring and pin them with tests so the boundary cannot drift silently. Corrects an earlier draft docstring that wrongly claimed the generator classes were patched.

Adds tests:

- randomness namespaces are restored after a formula raises a non-RNG exception (pins _run_formula guarded-block exception safety),
- constructing a generator inside a formula raises,
- the two known gaps behave as documented.
1 parent 02a4926 commit a15c188

4 files changed

Lines changed: 334 additions & 5 deletions

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+Forbid random number generation inside variable formulas. A rules-engine formula must be a pure, deterministic function of its inputs, so any call to `numpy.random` or the standard library `random` module while a formula runs now raises `NonDeterministicFormulaError`. Stochastic inputs must be precomputed deterministically when building the dataset. Removes the per-variable `np.random.seed` previously applied before each calculation, which existed only to make formula-level randomness reproducible.
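
The changelog's advice to precompute stochastic inputs when building the dataset could look roughly like the sketch below. It is not PolicyEngine's dataset code: `stable_seed`, `build_dataset`, `purchase_probability`, and the row layout are hypothetical names chosen for illustration, and it uses only the standard library. The idea shown is that each draw is seeded from a stable hash of the record's identifier, so the stored value depends only on that identifier and not on record order or global RNG state.

```python
import hashlib
import random


def stable_seed(*parts: str) -> int:
    """Derive an integer seed via SHA-256, which is stable across runs."""
    digest = hashlib.sha256("|".join(parts).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_dataset(household_ids, purchase_probability=0.1):
    """Hypothetical dataset builder: draw once per household and store it."""
    rows = []
    for household_id in household_ids:
        # A private generator per record: the draw depends only on the id.
        rng = random.Random(stable_seed("property_purchased", str(household_id)))
        rows.append(
            {
                "household_id": household_id,
                "property_purchased": rng.random() < purchase_probability,
            }
        )
    return rows


def formula(row):
    """A formula then only reads the stored input; it never draws."""
    return 1000.0 if row["property_purchased"] else 0.0


first = build_dataset([1, 2, 3])
second = build_dataset([3, 2, 1])  # same households, different order
```

Because the seed comes from the identifier alone, rebuilding the dataset in a different order assigns each household the same stored value.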
Lines changed: 132 additions & 0 deletions
@@ -0,0 +1,132 @@
+"""Forbid non-deterministic randomness inside variable formulas.
+
+A rules engine must be a pure function of its inputs: identical inputs must
+always produce identical outputs. Calling a random number generator inside a
+formula breaks that contract (the same household can get different results on
+different runs) and makes whole datasets non-reproducible.
+
+While a formula is executing, this guard replaces the callables exposed by
+``numpy.random`` and the standard library ``random`` module with functions that
+raise :class:`NonDeterministicFormulaError`. This includes the RNG constructors
+``np.random.default_rng``/``Generator``/``RandomState``, so *building* a
+generator inside a formula, even a seeded one, also raises. Seeding does not
+make randomness acceptable in a formula: stochastic inputs belong in the dataset
+(computed once, deterministically, and stored), not in the formula. The guard is
+re-entrant, so nested formula evaluation is handled correctly, and it restores
+the originals once the outermost guarded formula returns.
+
+Known limitations (pinned by tests in ``test_randomness_guard`` so they cannot
+drift silently):
+
+* A ``Generator``/``RandomState`` instance built *before* the formula runs
+  (e.g. ``rng = np.random.default_rng(0)`` at module scope) is not intercepted
+  when its methods are called inside the formula. ``numpy``'s generator classes
+  are immutable C extension types, so their bound methods cannot be patched.
+* A *bare drawing function* bound into another module's namespace before the
+  guard installs (e.g. ``from numpy.random import random``) is not intercepted,
+  because the guard patches the ``numpy.random`` module attribute, not every
+  rebinding of it.
+
+Both require deliberately hoisting randomness out of the ``np.random.<fn>`` form;
+use ``np.random.<fn>(...)`` or construct the generator inside the formula (both
+caught) rather than importing or pre-building drawing callables.
+"""
+
+from __future__ import annotations
+
+import random as _stdlib_random
+from typing import Callable
+
+import numpy as np
+
+
+class NonDeterministicFormulaError(RuntimeError):
+    """Raised when a formula invokes a random number generator."""
+
+
+# Public callables exposed by each randomness namespace, captured once at
+# import so the per-formula swap is a cheap dict iteration rather than a
+# fresh ``dir()`` scan.
+def _public_callables(module) -> dict[str, Callable]:
+    return {
+        name: getattr(module, name)
+        for name in dir(module)
+        if not name.startswith("_") and callable(getattr(module, name))
+    }
+
+
+_GUARDED_NAMESPACES = (
+    ("numpy.random", np.random, _public_callables(np.random)),
+    ("random", _stdlib_random, _public_callables(_stdlib_random)),
+)
+
+# Re-entrancy bookkeeping: only the outermost guarded formula installs and
+# removes the patches; the active variable name is tracked as a stack so the
+# error message always names the formula that actually made the call.
+_depth = 0
+_variable_stack: list[str] = []
+
+
+def _make_raiser(namespace: str, attribute: str) -> Callable:
+    qualified = f"{namespace}.{attribute}"
+
+    def _raise(*args, **kwargs):
+        variable = _variable_stack[-1] if _variable_stack else "<unknown>"
+        raise NonDeterministicFormulaError(
+            f"The formula for '{variable}' called {qualified}(), but rules-engine "
+            f"formulas must be deterministic functions of their inputs. Remove the "
+            f"random call. If you need a stochastic input, compute it once when "
+            f"building the dataset (with a seeded generator) and store it as an "
+            f"input variable instead."
+        )
+
+    return _raise
+
+
+# Pre-build the raisers once per (namespace, attribute).
+_RAISERS = {
+    id(module): {name: _make_raiser(namespace, name) for name in originals}
+    for namespace, module, originals in _GUARDED_NAMESPACES
+}
+
+
+def _install() -> None:
+    for _namespace, module, originals in _GUARDED_NAMESPACES:
+        raisers = _RAISERS[id(module)]
+        for name in originals:
+            setattr(module, name, raisers[name])
+
+
+def _restore() -> None:
+    for _namespace, module, originals in _GUARDED_NAMESPACES:
+        for name, original in originals.items():
+            setattr(module, name, original)
+
+
+class forbid_randomness:
+    """Context manager that bans RNG use while a formula runs.
+
+    Re-entrant: nested formulas reuse the single installed patch set and only
+    the outermost context restores the originals.
+    """
+
+    __slots__ = ("variable_name",)
+
+    def __init__(self, variable_name: str):
+        self.variable_name = variable_name
+
+    def __enter__(self) -> "forbid_randomness":
+        global _depth
+        if _depth == 0:
+            _install()
+        _depth += 1
+        _variable_stack.append(self.variable_name)
+        return self
+
+    def __exit__(self, exc_type, exc_value, traceback) -> bool:
+        global _depth
+        _variable_stack.pop()
+        _depth -= 1
+        if _depth == 0:
+            _restore()
+        return False
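
To see why a by-name import escapes this kind of guard, here is a stripped-down sketch of the same patching idea applied only to the standard library's `random.random`. `toy_guard`, `ToyError`, and `run_guarded` are hypothetical names, and this is not the module above. It only illustrates that swapping a module attribute affects lookups made through the module at call time, while a reference taken earlier still points at the original function object.

```python
import random
from contextlib import contextmanager

# Bound before any guard is installed, like `from random import random`.
early_reference = random.random


class ToyError(RuntimeError):
    """Hypothetical stand-in for NonDeterministicFormulaError."""


@contextmanager
def toy_guard():
    """Swap random.random for a raiser and restore it on exit."""
    original = random.random

    def _raise(*args, **kwargs):
        raise ToyError("random.random() called while guarded")

    random.random = _raise
    try:
        yield
    finally:
        random.random = original


def run_guarded():
    """Return (attribute_lookup_blocked, early_reference_blocked)."""
    with toy_guard():
        try:
            random.random()  # looked up on the module at call time
            attribute_blocked = False
        except ToyError:
            attribute_blocked = True
        try:
            early_reference()  # the old function object, untouched by the swap
            early_blocked = False
        except ToyError:
            early_blocked = True
    return attribute_blocked, early_blocked


result = run_guarded()
```

In this sketch the module-attribute call is blocked and the early reference is not, which mirrors the second documented limitation.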

policyengine_core/simulations/simulation.py

Lines changed: 12 additions & 5 deletions
@@ -14,6 +14,7 @@
 from policyengine_core.entities.entity import Entity
 from policyengine_core.enums import Enum, EnumArray
 from policyengine_core.errors import CycleError, SpiralError
+from policyengine_core.simulations.randomness_guard import forbid_randomness
 from policyengine_core.holders.holder import Holder
 from policyengine_core.periods import Period
 from policyengine_core.periods.config import ETERNITY, MONTH, YEAR
@@ -591,7 +592,9 @@ def calculate(
 
         self.tracer.record_calculation_start(variable_name, period, self.branch_name)
 
-        np.random.seed(_stable_hash_to_seed(variable_name + str(period)))
+        # No per-variable RNG seeding: formulas may not use randomness at all
+        # (enforced by forbid_randomness in _run_formula), so there is nothing
+        # to make reproducible here.
 
         try:
             result = self._calculate(variable_name, period)
@@ -1102,10 +1105,14 @@ def _run_formula(
         self.tax_benefit_system.parameters.tracer = self.tracer
         parameters_at = self.tax_benefit_system.parameters
 
-        if formula.__code__.co_argcount == 2:
-            array = formula(population, period)
-        else:
-            array = formula(population, period, parameters_at)
+        # A rules-engine formula must be a pure, deterministic function of its
+        # inputs. Forbid any random number generation while it runs so the same
+        # inputs always produce the same outputs.
+        with forbid_randomness(variable.name):
+            if formula.__code__.co_argcount == 2:
+                array = formula(population, period)
+            else:
+                array = formula(population, period, parameters_at)
 
         return array

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
+"""A variable formula must not invoke a random number generator.
+
+Rules-engine formulas have to be deterministic functions of their inputs, so
+calling ``numpy.random`` or the stdlib ``random`` module inside a formula raises
+:class:`NonDeterministicFormulaError`. These tests pin that behaviour and verify
+the guard restores the randomness namespaces afterwards.
+"""
+
+import random
+from numpy.random import random as _bare_random_import
+
+import numpy as np
+import pytest
+
+# A generator hoisted to module scope, built before any formula runs. Its bound
+# methods cannot be patched (numpy generator classes are immutable C types), so
+# using it inside a formula is a known, pinned gap in the guard.
+_PREBUILT_RNG = np.random.default_rng(0)
+
+from policyengine_core import periods
+from policyengine_core.country_template import CountryTaxBenefitSystem, entities
+from policyengine_core.simulations import SimulationBuilder
+from policyengine_core.simulations.randomness_guard import (
+    NonDeterministicFormulaError,
+    forbid_randomness,
+)
+from policyengine_core.variables import Variable
+
+PERIOD = "2013-01"
+
+
+def _simulation_with(*variable_classes):
+    system = CountryTaxBenefitSystem()
+    system.add_variables(*variable_classes)
+    return SimulationBuilder().build_default_simulation(system)
+
+
+class uses_numpy_random(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that draws from numpy.random"
+
+    def formula(person, period):
+        return np.random.random(person.count)
+
+
+class uses_stdlib_random(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that draws from the random module"
+
+    def formula(person, period):
+        return random.random()
+
+
+class uses_seeded_generator(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that builds a seeded generator"
+
+    def formula(person, period):
+        # Seeding does not make randomness acceptable inside a formula.
+        return np.random.default_rng(0).random(person.count)
+
+
+class deterministic(Variable):
+    value_type = int
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "deterministic formula"
+
+    def formula(person, period):
+        return person.count
+
+
+class raises_value_error(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that raises a non-RNG exception"
+
+    def formula(person, period):
+        raise ValueError("boom")
+
+
+class uses_prebuilt_generator(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that uses a module-scope generator (known gap)"
+
+    def formula(person, period):
+        return _PREBUILT_RNG.random(person.count)
+
+
+class uses_bare_imported_function(Variable):
+    value_type = float
+    entity = entities.Person
+    definition_period = periods.MONTH
+    label = "formula that uses a by-name imported drawing function (known gap)"
+
+    def formula(person, period):
+        return _bare_random_import()
+
+
+def test_numpy_random_in_formula_raises():
+    simulation = _simulation_with(uses_numpy_random)
+    with pytest.raises(NonDeterministicFormulaError, match="uses_numpy_random"):
+        simulation.calculate("uses_numpy_random", PERIOD)
+
+
+def test_stdlib_random_in_formula_raises():
+    simulation = _simulation_with(uses_stdlib_random)
+    with pytest.raises(NonDeterministicFormulaError, match="random"):
+        simulation.calculate("uses_stdlib_random", PERIOD)
+
+
+def test_seeded_generator_in_formula_still_raises():
+    simulation = _simulation_with(uses_seeded_generator)
+    with pytest.raises(NonDeterministicFormulaError):
+        simulation.calculate("uses_seeded_generator", PERIOD)
+
+
+def test_deterministic_formula_is_unaffected():
+    simulation = _simulation_with(deterministic)
+    result = simulation.calculate("deterministic", PERIOD)
+    assert (result == 1).all()
+
+
+def test_randomness_restored_after_guarded_formula():
+    simulation = _simulation_with(uses_numpy_random)
+    with pytest.raises(NonDeterministicFormulaError):
+        simulation.calculate("uses_numpy_random", PERIOD)
+    # Outside any formula, numpy and stdlib randomness work normally again.
+    assert isinstance(float(np.random.random()), float)
+    assert isinstance(random.random(), float)
+
+
+def test_randomness_restored_after_non_rng_exception_in_formula():
+    # If a formula raises a normal exception, _run_formula's guarded block must
+    # still restore the randomness namespaces (no leak of the patched state).
+    simulation = _simulation_with(raises_value_error)
+    with pytest.raises(ValueError, match="boom"):
+        simulation.calculate("raises_value_error", PERIOD)
+    assert isinstance(float(np.random.random()), float)
+    assert isinstance(random.random(), float)
+
+
+def test_constructing_generator_inside_formula_is_caught():
+    # Building a generator inside the formula hits the patched np.random
+    # constructor, so even this seeded form raises (covered by
+    # uses_seeded_generator too; kept explicit for the boundary).
+    simulation = _simulation_with(uses_seeded_generator)
+    with pytest.raises(NonDeterministicFormulaError):
+        simulation.calculate("uses_seeded_generator", PERIOD)
+
+
+def test_prebuilt_generator_is_a_known_gap():
+    # Documented limitation: a generator built before the formula runs cannot be
+    # intercepted (numpy generator classes are immutable). Pin the behaviour so a
+    # future change to it is noticed.
+    simulation = _simulation_with(uses_prebuilt_generator)
+    result = simulation.calculate("uses_prebuilt_generator", PERIOD)
+    assert result is not None
+
+
+def test_by_name_imported_function_is_a_known_gap():
+    # Documented limitation: a drawing function imported by name before the guard
+    # installs is not intercepted. Pin the behaviour.
+    simulation = _simulation_with(uses_bare_imported_function)
+    result = simulation.calculate("uses_bare_imported_function", PERIOD)
+    assert result is not None
+
+
+def test_guard_is_reentrant():
+    # Entering twice and leaving the inner context must not restore the
+    # originals while the outer context is still active.
+    with forbid_randomness("outer"):
+        with forbid_randomness("inner"):
+            with pytest.raises(NonDeterministicFormulaError, match="inner"):
+                np.random.random()
+        # Still guarded: the outer context owns the patch.
+        with pytest.raises(NonDeterministicFormulaError, match="outer"):
+            np.random.random()
+    # Fully restored once the outermost context exits.
+    assert isinstance(float(np.random.random()), float)
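
The exception-safety test relies on a general property of context managers: `__exit__` runs even when the body raises, and returning `False` lets the exception propagate. The sketch below shows that property together with a depth counter. `target`, `guard`, `Blocked`, and `failing_body` are hypothetical names, and `target` is a plain `SimpleNamespace` standing in for a randomness namespace rather than the real module.

```python
import types

# Hypothetical stand-in for a namespace whose callable gets patched.
target = types.SimpleNamespace(draw=lambda: 0.5)
_original = target.draw


class Blocked(RuntimeError):
    """Raised when draw() is called while the guard is active."""


def _raiser():
    raise Blocked("draw() called while guarded")


class guard:
    """Depth-counted guard: only the outermost enter/exit patches/restores."""

    _depth = 0

    def __enter__(self):
        if guard._depth == 0:
            target.draw = _raiser
        guard._depth += 1
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        guard._depth -= 1
        if guard._depth == 0:
            target.draw = _original
        return False  # never swallow the body's exception


def failing_body():
    with guard():
        with guard():
            raise ValueError("boom")  # an ordinary, non-RNG failure


try:
    failing_body()
    outcome = "no error"
except ValueError:
    outcome = "ValueError propagated"

restored = target.draw is _original
```

Both nested exits run during unwinding, so the counter returns to zero and the original callable is back in place by the time the `ValueError` reaches the caller.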
