Disclaimer: This code and the statistical methods it implements were generated by AI. A statistician has not yet verified the implementation or the appropriateness of the methods for this application, although it is reassuring that the results of the distinct methods are consistent with each other and that this sort of problem is one AI likely handles well.
Important note on data: Raw data from Karwowski et al. (2018) is not publicly available. All simulations use plausible approximations based on the paper's reported summaries (n, median, IQR, range, and typical vaccination clustering). Results are robust across multiple tie distributions, which provides reasonable confidence despite the lack of exact data.
Statistical power and confidence-interval framework for Spearman rank correlations, designed to evaluate the non-significant associations reported in Karwowski et al. (2018) between blood/hair aluminum levels and cumulative vaccine aluminum exposure.
Karwowski et al. (2018) found no significant Spearman correlations between cumulative vaccine aluminum and either blood aluminum (B-Al) or hair aluminum (H-Al) in a cohort of 9-13-month-old infants. This simulation framework asks: what correlations could the study have detected?
Because most children followed a similar vaccination schedule, cumulative aluminum values cluster around a narrow range (median 2.9 mg, IQR 0.11 mg), producing ties in the x-variable. The framework explicitly models these ties to assess their impact on statistical power.
Karwowski et al. (2018) recruited 85 infants. Missing B-Al and/or H-Al data and extreme outliers reduce the available sample for each calculation (hence the lower N values in the table below). The paper does not state whether the reported Spearman correlations with vaccine aluminum load were computed with or without outliers. However, when reporting the correlation between B-Al and H-Al, the authors exclude participants with missing data and all extreme outliers, which suggests the vaccine correlations may also have been computed with outliers excluded. To cover both possibilities, the simulation runs four cases: B-Al and H-Al, each with outliers included and excluded.
| Case | Variable | Outliers | N | Observed rho | P |
|---|---|---|---|---|---|
| 1 | B-Al | Included | 80 | -0.13 | 0.26 |
| 2 | H-Al | Included | 82 | +0.06 | 0.56 |
| 3 | B-Al | Excluded | 73 | -0.13 | 0.26 |
| 4 | H-Al | Excluded | 81 | +0.06 | 0.56 |
We test H₀: ρ = 0 vs H₁: ρ ≠ 0 (two-sided) at α = 0.05. Power is the probability of rejecting H₀ when the true correlation is ρ.
- Minimum detectable correlation -- the smallest |ρ| detectable with 80% power at α = 0.05, for each case × tie structure × distribution shape.
- Confidence intervals for observed rho -- bootstrap and asymptotic CIs. These are the expected intervals under the model when the true correlation equals the observed rho, averaged over many simulated datasets, showing the plausible range of true correlations consistent with the data.
Note on search bounds: The bisection for min detectable rho searches [0.25, 0.42] (positive) or [-0.42, -0.25] (negative). These bounds are tuned for the Karwowski study (n = 73-82, α = 0.05); for other applications they should be widened. The asymptotic search uses wider bounds [0, 0.6].
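The bisection described above can be sketched as follows, using the asymptotic (noncentral-t) power approximation covered under Statistical methods; `asymptotic_power` and `min_detectable_rho` are illustrative names, not the project's API, and the Monte Carlo version would replace the power function with simulated power:

```python
import numpy as np
from scipy import stats

def asymptotic_power(rho, n, alpha=0.05):
    # Noncentral-t approximation: t = rho*sqrt((n-2)/(1-rho^2)), df = n-2
    nc = rho * np.sqrt((n - 2) / (1 - rho**2))
    tcrit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    # Two-sided rejection probability under the noncentral t
    return stats.nct.sf(tcrit, n - 2, nc) + stats.nct.cdf(-tcrit, n - 2, nc)

def min_detectable_rho(n, target=0.80, lo=0.25, hi=0.42, tol=1e-4):
    """Bisection on the tuned positive bounds; power is increasing in |rho|."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if asymptotic_power(mid, n) < target else (lo, mid)
    return 0.5 * (lo + hi)

rho_min = min_detectable_rho(n=80)   # roughly 0.31 for n = 80
```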
- Y marginal distribution (aluminum levels): Y is modeled as log-normal using the reported median and IQR, or, when the empirical generator is used, via the empirical distribution (resampling from the digitized Karwowski B-Al/H-Al figure data). Because Spearman's rank correlation depends only on ranks, the specific continuous distribution of Y has no effect on the correlation coefficient, its null distribution, power, or confidence intervals in the non-parametric, linear, or asymptotic methods. The log-normal choice is kept purely for realism (capturing skewness and outliers) and because the Gaussian copula method requires a continuous marginal. The log-normal is fitted using `q75 = median + IQR/2` (a symmetric split of the reported IQR); for a truly log-normal distribution the split is asymmetric, so sigma is slightly underestimated. This has no effect on rank-based results, since only the ordering of y matters. The Karwowski (2018) figure showing the B-Al vs H-Al correlation (N = 71, excluding missing values and outliers) has few and mild ties, so generating y from a distribution of distinct values is a good approximation. Note, however, that those data points show B-Al following a log-normal distribution (likely also at N = 73, though the outliers probably shift it away from log-normal), whereas the H-Al data is not log-normal. (Implementation note: in the default batch execution path, log-normal y values are never actually generated for the nonparametric, copula, or linear generators. The skip-y optimization returns rank arrays directly, bypassing lognormal/exp generation entirely — these generators consume no RNG draws for y marginals in production. Only the empirical generator produces actual y values. See Skip-y for data generation.)
- Tie structure distributions: Three frequency distributions (`even`, `heavy_tail`, `heavy_center`) are tested for sensitivity; differences across them are small, which is reassuring given that the raw data is unavailable. `heavy_center` is considered most plausible for real vaccination schedules: most children followed the schedule and would be clustered in the middle for ages 9–13 months, with the middle 50% under 12 months. More realistic skewed patterns have also been explored but are currently undocumented here (e.g., for k=4 distinct values: only 2–5 observations in the tails, heavy concentration in the second value, and substantial but lower counts in the third). Users can test custom frequencies with `scripts/run_single_scenario.py --freq`.
- Method consistency: The linear Monte Carlo method produces results that are virtually identical to the recommended non-parametric rank-mixing method. The Gaussian copula method now uses multipoint calibration by default (same as nonparametric). Calibration runs on the same data-generating process that includes jittering, so it compensates for the bias; with multipoint calibration, the copula achieves the target Spearman within tolerance and can be used for power and CI estimation. It remains a useful alternative to nonparametric and performs very well in all-distinct (no-ties) scenarios.
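The log-normal fit from a reported median and IQR (first bullet above) can be sketched as follows; the summary values used here are hypothetical placeholders for illustration, not the paper's numbers:

```python
import numpy as np
from scipy import stats

def fit_lognormal(median, iqr):
    """Fit (mu, sigma) from a reported median and IQR using the symmetric
    split q75 = median + IQR/2 described in the text."""
    mu = np.log(median)                    # log-normal median = exp(mu)
    z75 = stats.norm.ppf(0.75)             # ~0.6745
    sigma = (np.log(median + iqr / 2.0) - mu) / z75
    return mu, sigma

# Hypothetical summary values for illustration (not the paper's numbers)
mu, sigma = fit_lognormal(median=10.0, iqr=8.0)
rng = np.random.default_rng(0)
y = rng.lognormal(mu, sigma, size=100_000)
```

The fitted distribution reproduces the median exactly and the 75th percentile by construction; as noted above, the symmetric IQR split slightly underestimates sigma, which is harmless for rank-based results.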
- X-values are generated with 4-10 distinct values using three frequency distributions: `even`, `heavy_tail`, and `heavy_center`.
- An additional all-distinct baseline (no ties) is computed for each case.
- Total: 84 tied scenarios (4 cases × 7 values of k × 3 distributions) + 4 all-distinct = 88 scenarios per method.
This is a bivariate simplification of the Iman-Conover method, applied directly in standardized rank space using Cholesky decomposition to induce correlation, followed by reordering of y draws. Like standard Iman-Conover, it preserves the marginal distribution of y exactly via reordering and handles ties in x naturally through midranks. It is the recommended Monte Carlo approach, especially when x has ties. It works by:
- Computing the standardised midranks of x (handling ties naturally).
- Generating independent noise ranks via random permutation.
- Mixing the two at weight rho: `mixed = rho_cal * s_x + sqrt(1 - rho_cal^2) * s_noise`. This is essentially the Cholesky decomposition (Iman-Conover) in the bivariate case applied directly to ranks.
- Mapping the resulting ordering onto random draws from the target log-normal y-marginal. (In the default batch path, this step is skipped entirely: `_return_ranks=True` returns `rank(mixed)` directly without drawing any lognormal values. See Skip-y for data generation.)
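The steps above can be sketched as a minimal generator (illustrative only, not the project's implementation; the calibration of `rho_cal` described next is omitted here):

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def rank_mix(x, rho_cal, rng):
    """Bivariate Iman-Conover-style rank mixing (sketch of the steps above)."""
    s_x = rankdata(x)                                  # midranks handle ties in x
    s_x = (s_x - s_x.mean()) / s_x.std()
    noise = rng.permutation(len(x)).astype(float)      # independent noise ranks
    s_noise = (noise - noise.mean()) / noise.std()
    mixed = rho_cal * s_x + np.sqrt(1 - rho_cal**2) * s_noise
    # Map the ordering of `mixed` onto sorted draws from the y marginal
    y_sorted = np.sort(rng.lognormal(0.0, 0.5, size=len(x)))
    y = np.empty_like(y_sorted)
    y[np.argsort(mixed)] = y_sorted                    # scatter sorted y draws
    return y

rng = np.random.default_rng(1)
x = np.repeat([1.0, 2.0, 3.0, 4.0], 20)                # k=4 heavy ties, n=80
rhos = [spearmanr(x, rank_mix(x, 0.5, rng))[0] for _ in range(400)]
mean_rho = float(np.mean(rhos))                        # attenuated below 0.5
```

The mean realised Spearman falls below the mixing weight 0.5, which is exactly the attenuation that motivates the calibration step discussed next.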
Calibration: Because ties attenuate the realised Spearman rho relative to the mixing weight, the method includes an automatic calibration step. Two modes are available (configurable via config.CALIBRATION_MODE or --calibration-mode). In both modes, calibration runs once per unique tie structure and is reused across all rho values during bisection — a major performance improvement over per-rho calibration:
- Multipoint (default) (non-parametric rank-mixing and copula): Probes at several rho values (e.g. 0.10, 0.30, 0.50) and interpolates the calibration curve. Captures nonlinear attenuation (ratio varies with rho) and reduces bias to typically under 0.01. For the copula, calibration runs on the same DGP that includes jittering, so it compensates for jittering-induced attenuation. Accuracy: Better than single-point when attenuation is nonlinear (e.g. custom or extreme tie structures). Performance: ~3× slower calibration on first run per tie structure (~9s vs ~3s) because it probes multiple values instead of one; cached thereafter and reused for all rho targets.
- Single-point (non-parametric rank-mixing and copula): Probes only at rho = 0.30 and applies a constant ratio. Faster (~3× less calibration cost) but can have 0.01–0.03 bias when attenuation is nonlinear. Use for quick exploratory runs.
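A toy version of multipoint calibration can make the probe-and-interpolate idea concrete: measure the mean realised Spearman at a few mixing weights, then invert the curve by interpolation. This is a sketch of the principle, not the project's cached implementation:

```python
import numpy as np
from scipy.stats import rankdata, spearmanr

def realized_rho(x, rho_in, rng, n_reps=300):
    """Mean realised Spearman of the rank-mixing DGP at mixing weight rho_in."""
    s_x = rankdata(x)
    s_x = (s_x - s_x.mean()) / s_x.std()
    out = np.empty(n_reps)
    for i in range(n_reps):
        noise = rng.permutation(len(x)).astype(float)
        s_n = (noise - noise.mean()) / noise.std()
        mixed = rho_in * s_x + np.sqrt(1 - rho_in**2) * s_n
        out[i] = spearmanr(x, mixed)[0]    # Spearman only needs ranks of mixed
    return float(out.mean())

rng = np.random.default_rng(2)
x = np.repeat([1.0, 2.0, 3.0, 4.0], 20)            # heavy-tie scenario
probes = np.array([0.10, 0.30, 0.50])              # probe points from the text
realized = np.array([realized_rho(x, p, rng) for p in probes])
target = 0.25
# Invert the calibration curve: find the input weight whose mean realised
# Spearman equals the target.
rho_cal = float(np.interp(target, realized, probes))
```

Because the attenuation curve is measured on the same data-generating process, the interpolated `rho_cal` compensates for the tie-induced bias, as described above.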
The method induces an exact Pearson correlation between the standardized x-ranks and the mixed vector:

$\mathrm{corr}(s_x, \mathrm{mixed}) = \rho_{\mathrm{cal}}$

where:

- $s_x$ = standardized midranks of x (mean 0, variance 1, incorporating ties via averaging)
- $s_{\mathrm{noise}}$ = standardized ranks from a random permutation (independent, mean 0, variance 1)
Derivation of exact Pearson correlation: Let $\mathrm{mixed} = \rho_{\mathrm{cal}}\, s_x + \sqrt{1 - \rho_{\mathrm{cal}}^2}\, s_{\mathrm{noise}}$. Since $s_x$ and $s_{\mathrm{noise}}$ are standardized and uncorrelated, $\mathrm{Cov}(s_x, \mathrm{mixed}) = \rho_{\mathrm{cal}}\,\mathrm{Var}(s_x) = \rho_{\mathrm{cal}}$ and $\mathrm{Var}(\mathrm{mixed}) = \rho_{\mathrm{cal}}^2 + (1 - \rho_{\mathrm{cal}}^2) = 1$, so $\mathrm{corr}(s_x, \mathrm{mixed}) = \rho_{\mathrm{cal}}$.
This holds purely from covariance algebra — no assumption is required that $s_x$ or $s_{\mathrm{noise}}$ be normally distributed.
The final Spearman rho between x and the reordered y is not exactly $\rho_{\mathrm{cal}}$: ranking the mixed vector is a nonlinear step, and ties in x attenuate the realised Spearman relative to the mixing weight. The calibration step corrects for this.
Approximate attenuation factor:

$\rho_{\mathrm{realised}} \approx \rho_{\mathrm{cal}} \sqrt{1 - T_x/(n^3 - n)}$

where $T_x = \sum_i (t_i^3 - t_i)$ over the tie groups of x ($t_i$ = size of tie group i). This is only an approximation; the calibration step measures and corrects the actual attenuation.
All rank-based simulation methods (including full Iman-Conover) face similar attenuation with heavy ties; the calibration step ensures the realised Spearman matches the target within tolerance (typically < 0.01 bias).
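The covariance algebra behind the exact-Pearson identity can be checked numerically. In the simulation, $s_{\mathrm{noise}}$ from a random permutation is uncorrelated with $s_x$ only in expectation, so this check enforces exact sample orthogonality (via projection) to show that the identity needs nothing beyond standardized, uncorrelated vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho_cal = 80, 0.4

v = rng.normal(size=n)
s_x = (v - v.mean()) / v.std()              # any standardized vector works
w = rng.normal(size=n)
w -= w.mean()
w -= (w @ s_x) / (s_x @ s_x) * s_x          # make s_noise exactly uncorrelated with s_x
s_noise = w / w.std()

mixed = rho_cal * s_x + np.sqrt(1 - rho_cal**2) * s_noise
pearson = float(np.corrcoef(s_x, mixed)[0, 1])   # equals rho_cal to float precision
```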
Why this method? Unlike the Gaussian copula, it does not rely on the continuous-marginals assumption. Unlike the linear model, it does not assume a parametric relationship between x and y. It handles tied x-values naturally because the mixing operates directly in rank space.
The rank-mixing step assigns y-values by scattering the sorted y draws through argsort(mixed):

y_final[argsort(mixed)[k]] = y_{(k+1)}, for k = 0, …, n−1,

where y_{(1)} ≤ … ≤ y_{(n)} are the sorted y draws.
Claim. When all y draws are distinct, the orderings of mixed and y_final are identical: rank(y_final) = rank(mixed) element-wise.
Proof. Let $k_j$ denote the 1-based rank of $\mathrm{mixed}[j]$ within mixed (so $j = \mathrm{argsort}(\mathrm{mixed})[k_j - 1]$). The scatter assignment gives $y_{\mathrm{final}}[j] = y_{(k_j)}$. Then:
- $\mathrm{mixed}[j_1] < \mathrm{mixed}[j_2] \iff k_{j_1} < k_{j_2}$ (by definition of rank)
- $k_{j_1} < k_{j_2} \iff y_{(k_{j_1})} < y_{(k_{j_2})}$ (since sorted y is strictly increasing)
- Therefore $\mathrm{mixed}[j_1] < \mathrm{mixed}[j_2] \iff y_{\mathrm{final}}[j_1] < y_{\mathrm{final}}[j_2]$. $\square$
Since mixed is almost surely tie-free (see below), both vectors have identical orderings with no ties, so their rank vectors are equal element-wise: rank(y_final) = rank(mixed).
Consequence for Spearman correlation. Spearman's $\rho$ between x and y_final equals Spearman's $\rho$ between x and rank(mixed): $\rho_S(x, y_{\mathrm{final}}) = \rho_S(x, \mathrm{rank}(\mathrm{mixed}))$.
The right-hand side depends only on the x-template, the noise permutations, and rank(mixed). This is implemented in _eval_mean_rho_fast and _precompute_calibration_arrays_fast.
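The rank identity can be checked directly with NumPy; `mixed` here is a stand-in vector, since the identity depends only on the scatter assignment, not on how mixed was produced:

```python
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(0)
n = 80
mixed = rng.normal(size=n)                       # stand-in for the mixed vector
y_sorted = np.sort(rng.lognormal(0.0, 0.5, n))   # distinct sorted y draws

y_final = np.empty(n)
y_final[np.argsort(mixed)] = y_sorted            # the scatter assignment

# Skip-y identity: rank(y_final) equals rank(mixed) element-wise
ranks_equal = bool(np.array_equal(rankdata(y_final), rankdata(mixed)))
```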
Batch vs sequential RNG. The fast path (_precompute_calibration_arrays_fast) generates all noise permutations in one batched call (cal_rng.permuted), whereas the legacy single-point path generates each shuffle sequentially in a Python loop, resetting the RNG to the seed at the start of each _mean_rho call. Both produce IID uniform random permutations per dataset — the same estimator distribution, just different specific random numbers.
Tie probability for lognormal y. For $n \approx 80$ float64 lognormal draws, the probability that any two draws collide is ≈ 3.5 × 10⁻¹³, so the identity holds for all practical purposes.
Why mixed is almost surely tie-free. Even when x has heavy ties (so $s_x$ contains repeated midranks), the noise ranks within a tie group are distinct, so mixed values differ there; across tie groups, a tie requires the exact coincidence $\rho_{\mathrm{cal}}(s_x[i] - s_x[j]) = \sqrt{1 - \rho_{\mathrm{cal}}^2}\,(s_{\mathrm{noise}}[j] - s_{\mathrm{noise}}[i])$, which for generic values of $\rho_{\mathrm{cal}}$ essentially never occurs in floating point.
Tied y (empirical marginal). When the y pool has ties (e.g. from digitization precision in the Karwowski data), the identity breaks: tied y-values at adjacent sorted positions receive averaged ranks that differ from rank(mixed). The empirical generator calibration path handles this separately and does not use the skip-y optimization.
Uses a Gaussian copula with non-parametric marginals and a fitted log-normal for Y. Target Spearman rho is converted to Pearson correlation for sampling; then draws from a conditional bivariate normal; Y-values are mapped through the inverse CDF of the fitted log-normal.
Conversion: data_generator._spearman_to_pearson.
Steps: (1) x → ranks (average) → jitter to break ties → normal scores $z_x$; (2) draw $z_y$ from the conditional bivariate normal given $z_x$ at the converted Pearson correlation; (3) map $z_y$ through the inverse CDF of the fitted log-normal to obtain y. See data_generator.generate_y_copula. (In the default batch path, `_return_ranks=True` returns rank(z_y) directly.)
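The copula steps can be sketched as follows for an all-distinct x (the no-ties case, where the method performs best). The jitter width, clip bounds, and function names here are illustrative assumptions, not the project's implementation:

```python
import numpy as np
from scipy import stats

def spearman_to_pearson(rho_s):
    # Standard bivariate-normal grade-correlation inversion: r = 2*sin(pi*rho_s/6)
    return 2.0 * np.sin(np.pi * rho_s / 6.0)

def copula_y(x, rho_s, mu, sigma, rng, jitter=0.25):
    """Gaussian-copula sketch: jittered ranks -> normal scores -> conditional
    normal draw -> log-normal quantile."""
    n = len(x)
    r = spearman_to_pearson(rho_s)
    u = (stats.rankdata(x) - 0.5 + rng.uniform(-jitter, jitter, n)) / n
    z_x = stats.norm.ppf(np.clip(u, 1e-9, 1 - 1e-9))
    z_x = (z_x - z_x.mean()) / z_x.std()
    z_y = r * z_x + np.sqrt(1 - r**2) * rng.normal(size=n)
    return np.exp(mu + sigma * z_y)      # log-normal inverse CDF of Phi(z_y)

rng = np.random.default_rng(0)
x = np.sort(rng.normal(size=80))         # all-distinct x: no ties to jitter apart
rhos = [stats.spearmanr(x, copula_y(x, 0.35, 1.0, 0.5, rng))[0]
        for _ in range(400)]
mean_rho = float(np.mean(rhos))          # close to the 0.35 target without ties
```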
With ties: Jittering collapses rank information, so without calibration the realised Spearman would be attenuated. Calibration runs on the same data-generating process (including jittering): we probe at several rho values and interpolate to find the input rho that yields each target. So calibration compensates for the bias. The copula now supports the same calibration modes as nonparametric: multipoint (default) and single-point. With multipoint, mean realised rho matches the target within tolerance (e.g. ±0.01), so the copula can be used for power and CI estimation even with ties.
Historical limitation without multipoint calibration: When x has heavy ties (e.g. k=4), the jittering step that breaks tied ranks collapses rank information and attenuates the realised Spearman rho, leading to underestimated power. Single-point calibration (probe at rho = 0.30) helped compensate but was not enough. Alternatives to the current jitter — a distributional transform (random uniform within each tie group's CDF band) and adaptive jitter (scaling jitter by tie group size) — improved the mild cases (k=10) but still failed the 0.01 accuracy threshold for heavy ties (k=4). This is a fundamental limitation of the continuous-marginals assumption, though the method works reliably when there are no ties. Multipoint calibration compensates by learning the mapping on the jittered process, so the method remains usable. Nonparametric rank-mixing remains the recommended default because it operates directly in rank space and avoids the continuous-marginals assumption, but the copula is a valid alternative when multipoint calibration is used.
The copula maps $z_y \mapsto y = F^{-1}(\Phi(z_y))$, a composition of strictly increasing functions (the standard normal CDF and the log-normal quantile function). Since strictly increasing maps preserve ordering, rank(y) = rank(z_y). Therefore Spearman rho computed from y equals Spearman rho computed from rank(z_y).
Clipping caveat. The production code clips the argument of _lognormal_quantile, making the composed transform non-decreasing but flat outside the clipped range, so two z_y draws beyond the clip bound would map to tied y-values. The probability of this is < 6.3 × 10⁻¹³, so the identity holds for all practical purposes. The fast path bypasses the clipping entirely (Spearman is computed from rank(z_y) directly).
| Identity | Probability of failure |
|---|---|
| Nonparametric: rank(y_final) = rank(mixed) | ≈ 3.5 × 10⁻¹³ (float64 lognormal tie) |
| Copula: rank(y) = rank(z_y) | < 6.3 × 10⁻¹³ (clipping creates tie) |
| Batch vs sequential RNG | N/A — same distribution, different numbers |
The skip-y identities above were originally developed for calibration. The same identities extend to the actual data generation step: in the default batch execution path, all three non-empirical generators are called with _return_ranks=True in both estimate_power (power simulation) and bootstrap_ci_averaged with batch bootstrap (CI). The empirical generator is unchanged.
Why every downstream consumer is correct. All uses of y_all in both paths depend only on the ranks of y, not on the scale of y values. The key property is that rank(y) is an order-isomorphism: for any two positions i and j, $y[i] < y[j] \iff \mathrm{rank}(y)[i] < \mathrm{rank}(y)[j]$, and all values in the rank array are distinct (continuous y ⇒ no ties). This makes the rank array interchangeable with the float array for any function that operates via ranks.
- `spearman_rho_2d(x_all, y_all)` and `spearman_rho_pvalue_2d(...)` — both call `_rank_rows(y)` internally. Ranking a tie-free rank array is idempotent: `rank({1,2,...,n} permuted) = {1,2,...,n} permuted` (each value already equals its own rank). Identical rho values.
- `pvalues_from_precomputed_null(rhos_obs, ...)` — depends only on `rhos_obs`, not directly on `y_all`.
- Permutation p-values (`pvalues_mc`) — see below.
- Batch bootstrap (`_batch_bootstrap_rhos_jit`) — see below.
Each permutation used by pvalues_mc is a bijection — `perm_idx[b, rep, :]` is generated by `np.argsort(rng.random(...))`, which always produces a permutation of {0, …, n−1}.
Claim. For any bijection σ of positions: `_tied_rank(y_float[σ]) = _tied_rank(y_ranks[σ])`.

Proof. Since y is tie-free, neither y_float[σ] nor y_ranks[σ] contains any duplicates. For any two positions i and j, $y_{\mathrm{float}}[\sigma][i] < y_{\mathrm{float}}[\sigma][j] \iff y_{\mathrm{ranks}}[\sigma][i] < y_{\mathrm{ranks}}[\sigma][j]$, because rank is strictly order-preserving. Identical strict ordering, no ties in either array → _tied_rank produces the same output.
The bootstrap step in _batch_bootstrap_rhos_jit draws with replacement, so a bootstrap resample $B = (b_1, \ldots, b_n)$ can select the same original index more than once.

Claim. For any bootstrap resample B: `_tied_rank(y_float[B]) = _tied_rank(y_ranks[B])`.

Proof. Let $y$ be the original tie-free float array and $r = \mathrm{rank}(y)$, so $y_{\mathrm{ranks}} = r$.
Tie structure. Two positions $i$ and $j$ of the resample are tied exactly when they draw the same source index:

- Float: $y[b_i] = y[b_j] \iff b_i = b_j$ (y has no original ties, so equal values imply the same source index).
- Ranks: $r[b_i] = r[b_j] \iff b_i = b_j$ (r is a permutation of distinct integers).
Both arrays have identical tie patterns: a tie occurs precisely when the same original index is drawn more than once.
Ordering of non-tied positions. For $b_i \neq b_j$: $y[b_i] < y[b_j] \iff r[b_i] < r[b_j]$, since rank is strictly order-preserving. Therefore both resampled arrays have the same strict ordering and the same tie structure, so _tied_rank (which uses average-rank tie-breaking) assigns identical values.
Intuition. A bootstrap resample is a multiset of y values. Whether we represent each element as its float value or as its rank, the relative order and equality pattern within the multiset are preserved. _tied_rank depends only on these, so it cannot distinguish the two representations.
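Both invariance claims — permutation and bootstrap resampling — can be checked numerically with scipy's average-rank `rankdata`, used here as a stand-in for the project's `_tied_rank`:

```python
import numpy as np
from scipy.stats import rankdata   # average-rank ties, like _tied_rank

rng = np.random.default_rng(0)
n = 80
y_float = rng.lognormal(0.0, 0.5, n)    # continuous y: no original ties
y_ranks = rankdata(y_float)             # tie-free permutation of 1..n

# Permutation case: a bijection of positions
sigma = rng.permutation(n)
perm_ok = bool(np.array_equal(rankdata(y_float[sigma]), rankdata(y_ranks[sigma])))

# Bootstrap case: indices drawn with replacement (ties appear in the resample)
B = rng.integers(0, n, size=n)
boot_ok = bool(np.array_equal(rankdata(y_float[B]), rankdata(y_ranks[B])))
```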
RNG note (nonparametric only). With _return_ranks=True, the nonparametric generator skips rng.lognormal(...), consuming fewer RNG draws than the float-y path in estimate_power. In the CI path this is harmless because data and bootstrap use separate RNG streams (SeedSequence.spawn(2)). In the power simulation path, the p-value step (permutation indices or null binary search) draws from a different RNG state than the float-y path would — but any valid random permutations give an unbiased permutation test. Results are statistically equivalent, not bit-identical, for the nonparametric generator. Copula and linear have no RNG divergence (the skipped operations are deterministic transforms).
Regression tests. tests/test_skip_y_ranks.py Steps 1–6 verify: rank identity (rank(y_float) = y_ranks), Spearman rho identity, batch bootstrap rho matrix identity (bit-exact), rho=0 edge case, all-distinct x, and permutation MC p-value identity (pvalues_mc bit-exact for float vs rank y). The last test directly covers the permutation subtlety above.
Model: data_generator.generate_y_linear. (In the default batch path, exp() is never applied: _return_ranks=True returns rank(log_y) directly.)
Target: the linear coefficient in log space is set from the target correlation.
Caveat: Spearman(x, y) is only approximately equal to the Pearson correlation of the underlying linear model (ranks are a nonlinear transform), so the realised Spearman can deviate slightly from the target.
Uses the same rank-mixing core as nonparametric (standardised midranks of x, mix with noise ranks, reorder); y-values come from resampling the digitized Karwowski pool (no log-normal). The 71 digitized values (B-Al and H-Al each have 71) appear exactly once in every simulated dataset; only the remainder (2 for n=73, 10 for n=81, plus outliers for 80/82) is resampled per sim/rep. Pool structure: 73 ⊂ 80 (the first 73 elements of an n=80 pool equal the n=73 pool) and 81 ⊂ 82. Sample sizes:
- n=73 → B-Al, outliers excluded (71 digitized + 2 resampled)
- n=80 → B-Al, full sample (73 clean + 7 outliers)
- n=81 → H-Al, outliers excluded (71 digitized + 10 resampled)
- n=82 → H-Al, full sample (81 clean + 1 outlier)
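The fixed-71 pool construction can be sketched as follows; the digitized values here are synthetic placeholders (the real arrays live in data/digitized.py), and `empirical_y` is an illustrative name, not the project's API:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 71 digitized values
digitized71 = np.sort(rng.lognormal(0.0, 0.6, size=71))

def empirical_y(n, digitized, rng):
    """Fixed-71 design: the digitized values appear exactly once per dataset;
    only the remainder is resampled with replacement from the pool."""
    fill = rng.choice(digitized, size=n - len(digitized), replace=True)
    return np.concatenate([digitized, fill])

y73 = empirical_y(73, digitized71, rng)   # B-Al, outliers excluded
y81 = empirical_y(81, digitized71, rng)   # H-Al, outliers excluded
```

Because the fill values are drawn from the 71-value pool, every dataset necessarily contains tied y-values, which is why the empirical generator cannot use the skip-y optimization.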
Requires data/digitized.py (see Digitized data below). Use --skip-empirical to exclude this method when digitized data is unavailable. Calibration uses the same multipoint/single-point logic as nonparametric, with a separate cache so empirical and nonparametric do not share calibration state.
For p-value and calibration decisions (new y per sim, single reference pool, per-dataset permutation for empirical) and the alternative design for future generalization, see Empirical generator: p-value and calibration below.
What we do (current design):
- Pool and y per sim/rep: The empirical generator uses a pool of y-values (e.g. 71 digitized + 2 or 10 “fill” for n=73 or 81; or 71 fixed + remainder resampled per sim if the “fixed 71” design is applied). Each n_sim or n_rep produces a new y dataset: either by resampling n values from the pool with replacement (current), or by using exactly the 71 digitized values plus a new random draw for the remainder (fixed-71 design). We do not reuse the same y across sims; we intentionally generate new y-data each sim/rep.
- Why new y each time is correct: We are averaging over many (x, y) realizations. The pool (or the 71) is fixed; the draws from it (or the remainder) are random. So power and CI estimates average over the randomness in y, giving unbiased expectations and valid assessment of variability. This is the same principle as other generators (nonparametric, copula, linear), where y is newly generated every sim; here the “population” is the finite pool (or 71 + random remainder) instead of a parametric distribution.
- Calibration: Calibration for empirical runs once per scenario (cached by n, n_distinct, distribution_type, etc.). It uses a single reference pool from `get_pool(n)` (fixed seed), not the per-sim pool, so the same `cal_rho` is used for all n_sims/n_reps in that scenario. Any small mismatch between that reference pool's tie structure and a given sim's pool (e.g. different 2 or 10 fill values) is averaged over by the randomness of the different y-data across sims: over many n_sims we average over (a) different x, (b) different y-pools/realizations, and (c) rank-mix assignment. The Monte Carlo mean remains close to the target; per-rep variation contributes to the variance we are estimating. So one reference pool and one `cal_rho` per scenario is a reasonable and consistent setup.
- P-value: By default (`EMPIRICAL_USE_PRECOMPUTED_NULL = True`), the empirical generator uses the same precomputed null as other generators (keyed on x tie structure, built from all-distinct y-ranks 1..n). Empirical y always has ties: the pool is 71 digitized values with repeated measurements, resampled into n = 73–81, so by pigeonhole and pre-existing duplicates, tied y is guaranteed in every realization. The approximation error is nevertheless negligible (< 1.4 × 10⁻⁵ on p-values; see Precomputed null approximation). Set `EMPIRICAL_USE_PRECOMPUTED_NULL = False` to revert to per-dataset Monte Carlo permutation (exact, much slower). The MC path uses adaptive n_perm (1k when n_sims ≥ 5000, else 2k), batched over all (sim, perm) pairs via Numba.
Why this is legitimate:
- New y per sim: Intentional and correct for operating characteristics (power, CI). We are not reusing the same dataset; we are drawing new (x, y) each time and averaging. No analogue to the earlier n_boot bug (where data depended on n_boot); here data and bootstrap/RNG streams are separated as intended.
- Single reference pool for calibration: Cost and simplicity (one calibration per scenario). The approximation (one `cal_rho` for all sims) is valid because any bias from pool-to-pool variation is averaged out over n_sims.
Alternative: “generate new y datasets each time” as population sampling (future generalization):
A different design would be to treat the pool as the population and, each sim, sample n values with replacement from that pool (current “resampling from whole pool” is already in this spirit). That answers: “If the pool were the true population and we ran many experiments, each drawing n from it, what would we see?” — i.e. what might happen if multiple such experiments were run, with the 71 not special but part of the population (so in a given simulated sample, the 71 can appear 0, 1, 2, … times).
- Contrast with current (fixed-71) target: Our target is “For this Karwowski study (these 71), what would power/CI look like under uncertainty in the missing/extra values?” So the 71 are fixed in every sim; only the remainder varies. That is a conditional question given this study’s digitized values.
- Generalization: The “pool as population, new sample of n each time” approach generalizes naturally to other scenarios: different studies, different n, or different pools. Scripts could be extended to accept an arbitrary pool (or parametric model) and simulate “many replications from a representative population.” This is documented as a potential future generalization so readers understand the design choice (conditional on this study vs. unconditional over replications) and how one might extend the code.
Closed-form expressions (no simulation). Power uses the non-central t-distribution with df = n − 2.
Power simulation uses permutation-based p-values instead of the t-approximation, because the t-approximation is unreliable with heavy ties. Two paths:
-
Non-empirical generators (nonparametric, copula, linear): A precomputed null distribution of Spearman rhos is built once per (n, tie structure) and cached. P-values are computed by binary search against sorted
$|\rho_{\mathrm{null}}|$. On first use for a scenario (~1s for $n \approx 80$, 50k null samples), the null is built and cached; subsequent calls reuse it — including across different generators for the same scenario. This is exact (not an approximation): all non-empirical generators produce continuous y with no ties, so the permutation null depends only on n and the x tie structure, not the generator. Optional: call `permutation_pvalue.warm_precomputed_null_cache()` to pre-build nulls for all scenarios before long runs. -
Empirical generator (default: precomputed null): By default, the empirical generator uses the same precomputed null as other generators (
`config.EMPIRICAL_USE_PRECOMPUTED_NULL = True`). The approximation error from ignoring y-ties is < 1.4 × 10⁻⁵ on p-values — negligible for all precision targets (see Precomputed null approximation for empirical generator below). Set `config.EMPIRICAL_USE_PRECOMPUTED_NULL = False` to revert to per-dataset Monte Carlo permutation p-values (exact but much slower).
Set config.USE_PERMUTATION_PVALUE = False to revert to the t-based p-value for power (e.g. for comparison or tests). For the full equations (precomputed null construction, binary-search p-value, per-dataset MC, fallback t-approximation) and the MC-on-cache-miss option, see Statistical methods below.
MC-on-cache-miss: When config.PVALUE_MC_ON_CACHE_MISS = True, the code does not build the precomputed null on a cache miss. It looks up the cache only; if the null is present it is used, otherwise the per-dataset Monte Carlo permutation path is used (no build, no cache write). Default is False (always build and cache on first use). Use this to avoid precomputed-null build and memory for custom frequencies or one-off scenarios.
- H₀: ρ = 0 vs H₁: ρ ≠ 0, two-sided, α = 0.05.
- The same test statistic is used in both asymptotic power and in the Monte Carlo power loop (see below).
Default: permutation-based p-values (config.USE_PERMUTATION_PVALUE = True). Two paths:
-
Precomputed null (non-empirical generators: nonparametric, copula, linear): Build once per (n, tie structure): generate
`n_pre` (default 50,000) random permutations of y-ranks (1…n) while keeping x-midranks fixed; compute Spearman $\rho$ for each → sort $|\rho_{\mathrm{null}}|$. P-value for observed $|\rho_{\mathrm{obs}}|$: $p = (1 + N) / (1 + n_{\mathrm{pre}})$, where N is the number of null $\rho$ with $|\rho_{\mathrm{null}}| \geq |\rho_{\mathrm{obs}}|$, using binary search against the sorted array. Cached and reused across all sims in a scenario. See `permutation_pvalue.get_precomputed_null` and `pvalues_from_precomputed_null`.
-
Why the null is shared across generators (exact, not an approximation): All three non-empirical generators produce continuous y, so y has no ties and its ranks are a uniform random permutation of {1,…,n} under H₀, regardless of which generator was used. The permutation null distribution of Spearman ρ therefore depends only on n and the x tie structure — not on the generator. The in-memory cache key is
(n, all_distinct, tuple(x_counts), n_pre); the generator is intentionally excluded, and the same cached null is correctly reused across nonparametric, copula, and linear for the same scenario. The empirical generator also uses this null by default (EMPIRICAL_USE_PRECOMPUTED_NULL = True), with a negligible approximation from y-ties (< 1.4 × 10⁻⁵; see Precomputed null approximation). -
MC on cache miss: When
config.PVALUE_MC_ON_CACHE_MISS = True, the code does not build the null on a cache miss. It uses a cache lookup only; if the null is in cache it is used, otherwise the per-dataset MC path is used (no build, no cache write). Default isFalse.
-
Per-dataset Monte Carlo (empirical generator when `EMPIRICAL_USE_PRECOMPUTED_NULL = False`; also used as fallback when `PVALUE_MC_ON_CACHE_MISS = True` and the null is not cached): For each sim, generate `n_perm` permutations of y (adaptive: 1k when n_sims ≥ 5000, else 2k), compute Spearman $\rho$ for each, and $p = (1 + N) / (1 + n_{\mathrm{perm}})$, where N is the number of permutation $\rho$ with $|\rho_{\mathrm{perm}}| \geq |\rho_{\mathrm{obs}}|$. Batched over all (sim, perm) pairs via Numba; chunked when n_sims is large. See `permutation_pvalue.pvalues_mc`.
Power = proportion of sims with p < α.
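A minimal sketch of the precomputed-null path — build the null once per x tie structure, then answer p-value queries by binary search. Function names are illustrative, and the production code uses 50k null samples with batched Numba kernels:

```python
import numpy as np
from scipy.stats import spearmanr

def build_null(x, n_pre, rng):
    """Sorted |rho| under H0: permute y-ranks while keeping x (and its ties) fixed."""
    n = len(x)
    null = np.empty(n_pre)
    for i in range(n_pre):
        null[i] = spearmanr(x, rng.permutation(n))[0]
    return np.sort(np.abs(null))

def pvalue_from_null(rho_obs, null_abs_sorted):
    # N = #{|rho_null| >= |rho_obs|} via binary search; p = (1+N)/(1+n_pre)
    n_pre = len(null_abs_sorted)
    N = n_pre - np.searchsorted(null_abs_sorted, abs(rho_obs), side="left")
    return (1 + N) / (1 + n_pre)

rng = np.random.default_rng(0)
x = np.repeat([1.0, 2.0, 3.0, 4.0], 20)     # k=4 tie structure, n=80
null = build_null(x, n_pre=4000, rng=rng)   # smaller than production for speed
p_sig = pvalue_from_null(0.30, null)
p_one = pvalue_from_null(0.0, null)         # every null |rho| >= 0, so p = 1
```

The `(1 + N)/(1 + n_pre)` form guarantees p > 0 and keeps the test valid at finite `n_pre`.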
Fallback: t-approximation p-value (config.USE_PERMUTATION_PVALUE = False):
-
Equation:
$t = \rho\sqrt{(n-2)/(1-\rho^2)}$ with df =$n-2$ ; two-sided p-value =$2 \times P(T \geq |t|)$ for$T \sim t_{n-2}$ . Seespearman_helpers.spearman_rho_pvalue_2d. - Where also used: The asymptotic power derivation uses the same t-statistic to define the noncentrality parameter (below).
-
Caveat: The t-approximation is unreliable with heavy ties in x (few distinct values); the null distribution of
$\rho$ is not well approximated by the t-distribution in this setting, and the approximation can be anticonservative (p-values too small, type I error inflated). This is why permutation is the default. The t-approximation remains available for comparison or backward compatibility.
- Noncentrality (all-distinct): $\mathrm{nc} = \rho\sqrt{(n-2)/(1-\rho^2)}$, i.e. the t-statistic evaluated at the true $\rho$.
- Power = $P(|T'| \geq t_{1-\alpha/2,\,n-2})$, where $T' \sim t_{n-2}(\mathrm{nc})$ is noncentral t.
- Tie adjustment (FHP): When ties are present, nc is scaled by the ratio of null standard errors, $\sqrt{\mathrm{var}_0^{\mathrm{no\ ties}} / \mathrm{var}_0^{\mathrm{ties}}}$ (see FHP below).
-
Removed factor (historical note): A previous AI-generated version of this code multiplied nc by an additional
$\sqrt{1/1.06}$ in the tie-correction branch, attempting to account for the Bonett-Wright efficiency loss (the 1.06 factor used in the CI formula). No published source justifies applying this factor to the noncentrality parameter; the 1.06 is specific to the Fisher z standard error for confidence intervals and does not belong in the power formula. The factor has been removed. The Bonett-Wright 1.06 remains inasymptotic_ciwhere it is well-established.
Asymptotic CI (Bonett-Wright): Fisher z-transform of rho with standard error $\sqrt{1.06/(n-3)}$; the CI is computed in z-space, then back-transformed via tanh.
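A sketch of this CI, assuming the Fisher-z standard error $\sqrt{1.06/(n-3)}$ (the Bonett-Wright factor discussed in the historical note above); `bonett_wright_ci` is an illustrative name:

```python
import numpy as np
from scipy import stats

def bonett_wright_ci(rho, n, alpha=0.05):
    """Fisher-z CI for Spearman rho with the 1.06 variance factor (assumed
    form based on the description in the text)."""
    z = np.arctanh(rho)
    half = stats.norm.ppf(1 - alpha / 2) * np.sqrt(1.06 / (n - 3))
    return float(np.tanh(z - half)), float(np.tanh(z + half))

lo, hi = bonett_wright_ci(-0.13, 80)   # Case 1: B-Al, outliers included
```

For the Case 1 numbers the interval spans zero, consistent with the non-significant result reported in the table above.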
Tx = Σ(tᵢ³ − tᵢ) over x tie groups (similarly Ty for y); Dx = (n³−n−Tx)/12, Dy = (n³−n−Ty)/12; Var(ρ) = (1/(n−1)) × (n³−n)²/(144·Dx·Dy). No ties ⇒ 1/(n−1).
Tie correction is currently for x only in asymptotic expressions: In this codebase, the asymptotic power and CI apply FHP to x only. Only x_counts is passed to the variance calculation; y_counts is not used (y is effectively treated as tie-free in the asymptotic expressions). So the implemented var₀ uses Dx from the x tie groups and Dy = (n³−n)/12 (no y ties). This matches the design: x is cumulative vaccine aluminum (heavy ties); y is aluminum level (typically few or no ties in our generators). The code also implements options to include y ties.
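The FHP null variance with the x-only correction follows directly from the formulas above; `fhp_null_variance` is an illustrative name, not the project's API:

```python
import numpy as np

def fhp_null_variance(n, x_counts, y_counts=None):
    """FHP null variance of Spearman rho. y is treated as tie-free when
    y_counts is None (the x-only correction described above)."""
    def D(counts):
        T = sum(t**3 - t for t in counts)   # T = sum(t_i^3 - t_i) over tie groups
        return (n**3 - n - T) / 12.0
    Dx = D(x_counts)
    Dy = D(y_counts) if y_counts is not None else (n**3 - n) / 12.0
    return (1.0 / (n - 1)) * (n**3 - n) ** 2 / (144.0 * Dx * Dy)

n = 80
var_ties = fhp_null_variance(n, x_counts=[20, 20, 20, 20])   # k=4 heavy ties
var_none = fhp_null_variance(n, x_counts=[1] * n)            # all distinct
```

With no ties the expression collapses to 1/(n−1), as stated above; heavy x ties inflate the null variance, which is what widens the CI and reduces power.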
Average-rank tie handling (midranks) throughout; same as scipy.stats.rankdata(..., method="average").
Bootstrap uses separate RNG streams for data and bootstrap so the same seed yields the same n_reps datasets regardless of n_boot; see Separate RNG streams for data and bootstrap under Bootstrap CI.
The empirical generator uses digitized H-Al and B-Al values from the Karwowski scatterplot: the 71 digitized values are fixed in every dataset; only the remainder is resampled per sim/rep (see Empirical generator above). The values in data/digitized.py were obtained using WebPlotDigitizer v4. To use the empirical generator yourself, create data/digitized.py with arrays H_AL71, B_AL71, H_AL_OUTLIER, B_AL_OUTLIER_MIN, B_AL_OUTLIER_MAX (see the module docstring in data_generator.py for details). If data/digitized.py is missing or fails to import, a warning is emitted: when only empirical was selected it falls back to nonparametric; when multiple generators were selected empirical is skipped. Use --skip-empirical to avoid the warning when you do not have digitized data.
```
pip install -r requirements.txt
```

Required packages: numpy, scipy, pandas, joblib. Numba (recommended) is included in requirements.txt for significant speedups. If Numba cannot be installed on your platform, the code falls back to pure NumPy automatically.
On the first run after installation, Numba compiles JIT functions (5-15 seconds). To pre-warm the cache:
```
python scripts/warm_up_numba.py
```

To copy the cache to a cloud VM for zero compile delay:

```
rsync -avz ~/.cache/numba/ user@YOUR-IP:~/.cache/numba/
```

All scripts expose a main() function that can be called from Python:
```python
# Full simulation
from scripts.run_simulation import main as run_sim
power_df, ci_df, all_distinct_df = run_sim(n_sims=500, skip_linear=True)

# Single scenario
from scripts.run_single_scenario import main as run_single
result = run_single(case=3, n_distinct=4, dist_type="heavy_center", n_sims=500)
result = run_single(case=3, freq=[19, 18, 18, 18], power_only=True, verbose=False)

# Accuracy testing
from tests.test_calibration_accuracy import main as test_accuracy
df = test_accuracy(n_sims=50, cases=[3], custom_freq=[(3, [19, 18, 18, 18])])
```

Run the full simulation:

```
python scripts/run_simulation.py
python scripts/run_simulation.py --n-sims 500 --seed 42
python scripts/run_simulation.py --n-sims 500 --seed 42 --no-numba
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500 --no-numba
```

Select generators:

```
python scripts/run_simulation.py --skip-linear --skip-copula   # nonparametric + empirical + asymptotic
python scripts/run_simulation.py --skip-nonparametric          # copula + linear + empirical + asymptotic
python scripts/run_simulation.py --skip-empirical              # omit empirical (e.g. when digitized data unavailable)
```

Choose a calibration mode:

```
python scripts/run_simulation.py --calibration-mode multipoint # default: more accurate, ~3× calibration cost
python scripts/run_simulation.py --calibration-mode single     # faster calibration for exploratory runs
```

Filter the scenario grid:

```
python scripts/run_simulation.py --cases 1,3 --n-distinct 4,10 --dist-types even,heavy_center
```

Run power and/or CI for a specific (case, k, distribution) combination:
```
# Power + CI for Case 3, k=4, heavy_center
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500

# Power only, skip copula, linear, and empirical
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type even --power-only --skip-copula --skip-linear --skip-empirical

# CI only for all-distinct baseline
python scripts/run_single_scenario.py --case 1 --all-distinct --ci-only --n-reps 20 --n-boot 500

# Custom frequency distribution (counts must sum to case's n)
python scripts/run_single_scenario.py --case 3 --freq 19,18,18,18 --n-sims 500

# Calibration mode (multipoint default, single for faster runs)
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500 --calibration-mode single
```

Tie-correction reporting:

```
python scripts/run_simulation.py --tie-correction both                   # report both corrected and uncorrected
python scripts/run_simulation.py --tie-correction without_tie_correction
```

Test whether generators achieve the target Spearman rho:
```
# Quick check on worst-case scenario
python tests/test_calibration_accuracy.py --n-sims 50 --case 3 --n-distinct 4

# Full sweep, all generators
python tests/test_calibration_accuracy.py --n-sims 200

# Specific generators only (empirical requires digitized data)
python tests/test_calibration_accuracy.py --generators nonparametric,copula --n-sims 100
python tests/test_calibration_accuracy.py --generators empirical --case 3 --n-sims 100

# Custom frequency distribution (requires --case; counts must sum to case's n)
python tests/test_calibration_accuracy.py --case 3 --freq 19,18,18,18 --n-sims 50

# Save results to CSV
python tests/test_calibration_accuracy.py --n-sims 200 --outfile accuracy_report.csv

# Calibration mode (multipoint default, single for faster runs)
python tests/test_calibration_accuracy.py --n-sims 200 --calibration-mode single
```

The script flags scenarios where |mean_simulated_rho - target_rho| > 0.01 (configurable via --threshold). With default n_sims=50–200, some flags may be Monte Carlo noise. To get all scenarios to pass, you may need to increase --n-sims to 1000, 5000, or 10000 depending on the tie structure and threshold.
Run all core tests (fast, ~1-2 min). Calibration runs without strict by default (reports flags only):

```
python tests/run_tests.py
```

Run all tests including slower parallelism benchmarks:

```
python tests/run_tests.py --full
```

Add --strict to either command to run the calibration test in strict mode (exit 1 if any scenario is flagged):

```
python tests/run_tests.py --strict
```
Results are saved to the results/ directory:
- `min_detectable_rho.csv` — minimum detectable rho for all scenarios and methods
- `confidence_intervals.csv` — bootstrap and asymptotic CIs for observed rho
- `all_distinct_summary.csv` — combined power + CI table for the 4 all-distinct baselines
Root (core modules):
- `config.py` — Case parameters, frequency dictionary, settings
- `data_generator.py` — X generation (ties) and Y generation (all methods)
- `power_simulation.py` — Unified Monte Carlo power (all generators)
- `power_asymptotic.py` — Asymptotic power and CI formulas
- `confidence_interval_calculator.py` — Bootstrap CIs (averaged over multiple datasets)
- `table_outputs.py` — Summary table construction and CSV export
- `spearman_helpers.py` — Vectorized Spearman and Numba JIT helpers
scripts/ — Entry points and one-off utilities
- `run_simulation.py` — Main orchestrator / CLI entry point
- `run_single_scenario.py` — Quick single-scenario testing script
- `warm_up_numba.py` — Pre-compile Numba JIT cache
- `estimate_bisection_c.py` — Estimate bisection coefficient c for planning n_sims
- `estimate_calibration_k.py` — Estimate calibration coefficient k for planning n_cal
- `estimate_interrep_sd.py` — Estimate inter-rep SD of CI endpoints for planning n_reps
benchmarks/ — Runtime benchmarks (run one at a time)
- `benchmark_power.py` — Power: vectorized vs scalar, calibration modes
- `benchmark_full_grid.py` — CI: full grid, per-rep vs batch bootstrap
- `benchmark_batch_vs_no_batch.py` — CI: batch vs per-rep path (single scenario)
- `benchmark_ci_bootstrap.py` — CI: multipoint vs single calibration
tests/ — Validation and verification
- `test_calibration_accuracy.py` — Calibration accuracy testing
- `test_spearman_2d.py` — Spearman 2D vs reference
- `test_batch_bootstrap_ci.py` — Batch CI bit-identical and timing
- `test_nested_parallelism.py`, `test_batch_sequential_vs_parallel.py` — Parallelism verification
- `run_tests.py` — Test runner (quick/full, optional --strict for calibration)
Run all commands from the project root (e.g. python scripts/run_simulation.py).
The Gaussian copula assumes continuous marginals; with tied x, a small jitter is used to break ties before the rank-to-normal transform, which can attenuate realised Spearman. When x has heavy ties (e.g., k=4 with N=73--82), the jittering step that breaks ties is too small to restore the lost rank information, causing systematic attenuation of 0.01--0.06 in the realised rho. Distributional transform and adaptive jitter alternatives were tested but do not fix this for heavy ties. However, the copula can use multipoint calibration by default (same as nonparametric). Calibration runs on the same data generating process that includes jittering, so it learns the correct input-rho-to-realised-rho mapping and compensates for the bias; with multipoint, the copula achieves target Spearman within tolerance and can be used for power and CI estimation. The non-parametric rank-mixing method remains the recommended default because it operates directly in rank space and avoids the continuous-marginals assumption, but the copula is a valid alternative when multipoint calibration is used.
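The jitter-then-transform step the copula needs can be sketched as follows. This is an illustrative helper under assumed conventions (`rank_to_normal_with_jitter` and its parameters are invented names, not the codebase's functions):

```python
import numpy as np
from scipy import stats

def rank_to_normal_with_jitter(x, rng, jitter_scale=1e-9):
    # A tiny jitter breaks ties so every observation gets a distinct rank,
    # then ranks are mapped to normal scores for the Gaussian copula.
    xj = x + rng.normal(scale=jitter_scale, size=x.shape)
    ranks = stats.rankdata(xj)                    # all distinct after jitter
    return stats.norm.ppf(ranks / (x.size + 1))  # uniform ranks -> normal scores
```

With heavy ties (e.g. 73 values in 4 groups) the jittered ordering within each tie group is arbitrary, which is exactly the information loss the text describes; calibration then compensates for the resulting attenuation.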
With tied x-values, the midrank representation has lower variance than distinct ranks. This means the rank-mixing equation mixed = rho * s_x + sqrt(1-rho^2) * s_noise produces a Spearman rho slightly below the target rho. The calibration step computes a rho-independent attenuation ratio by probing at a fixed rho (0.30) and using bisection over 300 samples. This ratio is then multiplied by any target rho to compensate for the attenuation. The ratio is cached per (n, k, dist_type), so the cost is ~3s per unique tie structure.
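A minimal sketch of this idea is below. It is illustrative only: the real implementation uses bisection over n_cal samples and caches the result, while this version estimates the attenuation by direct averaging; `rank_mix_attenuation` is a hypothetical name:

```python
import numpy as np
from scipy import stats

def rank_mix_attenuation(x, rho_probe=0.30, n_cal=300, seed=99):
    # Probe once at rho_probe: mix standardized x midranks with standardized
    # noise ranks, measure the realised Spearman, and return the ratio to
    # multiply into any target rho (rho-independent attenuation assumption).
    rng = np.random.default_rng(seed)
    s_x = stats.zscore(stats.rankdata(x, method="average"))
    realised = np.empty(n_cal)
    for i in range(n_cal):
        s_noise = stats.zscore(stats.rankdata(rng.normal(size=x.size)))
        mixed = rho_probe * s_x + np.sqrt(1 - rho_probe**2) * s_noise
        realised[i] = stats.spearmanr(x, mixed)[0]
    return rho_probe / realised.mean()
```

With a heavily tied x (e.g. counts [19, 18, 18, 18]) the realised mean falls slightly below the probe, so the returned ratio is slightly above 1.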
If calibration fails for a custom tie structure: Run tests/test_calibration_accuracy.py on the new structure. If mean realised rho deviates from target by >0.01, try increasing n_cal (e.g. 300 → 500 or 1000) or use multipoint calibration (default). Multipoint probes at 0.10, 0.30, 0.50 and interpolates, fixing nonlinear attenuation. Use --calibration-mode single for faster runs when accuracy is less critical. If even rho_input=0.999 cannot reach the probe, the tie structure has hit a structural ceiling (maximum achievable |rho|) and the method cannot reach that target.
Calibration uses a fixed internal seed (99) for reproducibility; the same (n, k, dist_type) always produces the same calibration ratio regardless of the caller's seed.
With the default n_cal=300, calibration accuracy (SE of mean rho) is approximately 0.005–0.006, adequate for the ~0.01 target. The planning formula is

$$\mathrm{SE}_{\mathrm{cal}} = \frac{k}{\sqrt{n_{\mathrm{cal}}}}.$$

Analytical k (Bonett-Wright). Each calibration draw computes one sample Spearman rho from n observations. By the Bonett-Wright SE (the same one used for CIs), the SD of the sample Spearman rho at true $\rho$ is approximately

$$k = (1-\rho^2)\sqrt{\frac{1+\rho^2/2}{n-3}},$$

evaluated at the calibration probe $\rho = 0.30$:
| Case | n | k (analytical) | SE at n_cal=300 | SE at n_cal=1000 |
|---|---|---|---|---|
| 1 | 80 | 0.107 | 0.0062 | 0.0034 |
| 2 | 82 | 0.105 | 0.0061 | 0.0033 |
| 3 | 73 | 0.112 | 0.0065 | 0.0035 |
| 4 | 81 | 0.106 | 0.0061 | 0.0034 |
Empirical k (from scripts/estimate_calibration_k.py at n_cal = 300, 500, 1000 with 200 replications for case 3) is approximately 0.10, about 10% below the analytical value. The analytical formula is conservative because the BW asymptotic slightly overestimates variance for the rank-mixing mechanism, especially with ties. The benchmark scripts use k = 0.112 (analytical worst case, n=73) and c = 0.17 (asymptotic value at worst-case ρ*; empirical c for current data converges to ~0.12–0.15) for conservative estimates. Increase n_cal to 500–1000 for tighter accuracy.
The standard Fisher z-transform SE for Pearson r is $1/\sqrt{n-3}$; for Spearman rho, the Bonett-Wright adjustment inflates this to $\sqrt{1.06/(n-3)}$ to account for the larger sampling variance of the rank correlation.
Why the 1.06 is not applied to asymptotic power: A previous version also applied $\sqrt{1/1.06}$ to the noncentrality parameter in the power formula; the 1.06 factor is specific to the Fisher z standard error for confidence intervals, so it has been removed from the power calculation (see the note above).
The Spearman test statistic t = rho * sqrt(n-2) / sqrt(1 - rho^2) follows a non-central t-distribution under the alternative. This is more accurate than the normal approximation (which uses constant H0 variance) because the sqrt(1 - rho^2) denominator captures the variance reduction as |rho| grows toward 1. For the noncentrality equation and tie adjustment, see Statistical methods.
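Under these definitions, the noncentral-t power (without tie correction) can be sketched as follows; `asymptotic_power` is an illustrative name, not the codebase's function:

```python
import numpy as np
from scipy import stats

def asymptotic_power(rho, n, alpha=0.05):
    # Noncentrality under the alternative; H0 critical value from central t
    ncp = rho * np.sqrt(n - 2) / np.sqrt(1 - rho**2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    # Two-sided rejection probability under the noncentral t
    return (stats.nct.sf(t_crit, df=n - 2, nc=ncp)
            + stats.nct.cdf(-t_crit, df=n - 2, nc=ncp))
```

At rho=0 this returns exactly alpha, and at n=73 it crosses 80% power near rho ≈ 0.32, consistent with the no-tie values elsewhere in this document.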
The arctanh transform stabilises the variance of the correlation coefficient and improves normality of the sampling distribution. The back-transform via tanh guarantees the CI stays within [-1, 1] and is properly asymmetric around the point estimate.
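A sketch of the tanh back-transformed interval with the Bonett-Wright SE (no tie correction shown; `spearman_asymptotic_ci` is an illustrative name):

```python
import numpy as np
from scipy import stats

def spearman_asymptotic_ci(rho, n, conf=0.95):
    # Fisher z-space interval with the Bonett-Wright SE, then tanh back
    z = np.arctanh(rho)
    se = np.sqrt(1.06 / (n - 3))
    z_crit = stats.norm.ppf(0.5 + conf / 2)
    return np.tanh(z - z_crit * se), np.tanh(z + z_crit * se)
```

The back-transform keeps both endpoints inside [-1, 1] and makes the interval asymmetric: for a positive point estimate, the lower arm is wider than the upper arm.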
The Fieller-Hartley-Pearson correction adjusts the variance of Spearman rho under the null hypothesis (H0) when ties are present in the rank data. Ties reduce the effective rank information (e.g., many observations sharing the same x-value), so the sampling variance of rho increases. The correction uses the standard equation: define Tx = Σ(tᵢ³ − tᵢ) over tie groups in x (and Ty for y), Dx = (n³−n − Tx)/12, Dy = (n³−n − Ty)/12; then Var(rho) = (1/(n−1)) × (n³−n)²/(144·Dx·Dy). When there are no ties, Dx = Dy = (n³−n)/12 and this reduces to 1/(n−1).
Where it is used: (1) Asymptotic power — the noncentrality parameter is scaled by $\sqrt{\mathrm{var}_0^{\mathrm{no\,ties}} / \mathrm{var}_0^{\mathrm{ties}}}$ (below 1 when ties are present), reducing effective power when ties inflate variance. (2) Asymptotic CI — the standard error in z-space is scaled by $\sqrt{\mathrm{var}_0^{\mathrm{ties}} / \mathrm{var}_0^{\mathrm{no\,ties}}}$ (above 1 when ties are present), widening the interval appropriately. The impact on SE ranges from negligible (k=10, even: +0.5%) to moderate (k=4, heavy_center: +5.7%). Asymptotic power and CI apply FHP for x only: only the x tie structure is used; y is treated as having no ties in the asymptotic formulas (only x_counts is passed; y_counts is not used). See Statistical methods for the full FHP equation and this convention.
Approximation note: The combination of Bonett-Wright (1.06 in SE) with the FHP tie correction is a heuristic. No single published reference validates applying both corrections simultaneously. Each is well-established individually; their product is a practical approximation.
The bootstrap uses paired (x,y) resampling with replacement and reports the percentile interval (2.5th and 97.5th percentiles of the bootstrap distribution). The reported CIs average these endpoints over many simulated datasets (n_reps) to estimate the expected bootstrap CI under the model. Two design choices affect correctness and precision:
Data generation and bootstrap resampling use separate RNG streams derived from the same seed via np.random.SeedSequence.spawn(2). A shared RNG would advance by n_boot * n extra draws per rep inside the bootstrap loop, so the datasets for reps 1..n_reps would depend on n_boot. That would make results non-comparable across different n_boot values and invalidate the interpretation that "more bootstraps = more accurate." With separate streams, the same seed always produces the same n_reps datasets regardless of n_boot.
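The stream separation can be sketched with `numpy.random.SeedSequence` (illustrative code mirroring the design, not the exact implementation):

```python
import numpy as np

def make_streams(seed):
    # One parent seed -> two independent child streams
    data_ss, boot_ss = np.random.SeedSequence(seed).spawn(2)
    return np.random.default_rng(data_ss), np.random.default_rng(boot_ss)

# Consuming the bootstrap stream never perturbs the data stream:
data_rng, boot_rng = make_streams(42)
boot_rng.integers(0, 80, size=(1000, 80))      # n_boot=1000 resampling draws
d_large_boot = data_rng.normal(size=80)

data_rng, boot_rng = make_streams(42)
boot_rng.integers(0, 80, size=(10, 80))        # n_boot=10
d_small_boot = data_rng.normal(size=80)

assert np.allclose(d_large_boot, d_small_boot)  # identical datasets
```

With a single shared RNG, the two datasets above would differ, because the bootstrap draws between them would advance the generator state.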
The CI endpoints vary across reps. The inter-rep SD is determined analytically: SD(endpoint) = (1 − endpoint²) × √(1.06/(n−3)), with FHP tie correction for x (up to +5.7% for k=4 heavy_center). The worst case is Case 3 (n=73, upper endpoint near 0): 0.122 with no ties, 0.129 with worst-case ties — giving a conservative planning value of 0.13. The empirical inter-rep SD in results/confidence_intervals.csv (n_reps=200) is 0.130, confirming the analytical value. To re-estimate for new data: scripts/estimate_interrep_sd.py --analytical --with-ties (instant). The SE of the mean endpoint is SD/√n_reps:
| Target | SE | 95% CI half-width | n_reps | When it matters |
|---|---|---|---|---|
| Borderline | 0.005 | ±0.01 | ≈650 | Second decimal can still be off by one unit |
| Strong (README "reliable") | 0.0025 | ±0.005 | ≈2600 | Useful precision; fine when not near a rounding boundary |
| Rounding guarantee | 0.00128 | ±0.0025 | ≈10200 | Needed only when the value is near a boundary (e.g. 0.345, 0.355) |
Rounding boundaries: The boundary between rounding to 0.34 vs 0.35 is 0.345. With n_reps=2600 (95% CI ±0.005), if you observe 0.345 the CI spans [0.34, 0.35] and crosses the boundary—you cannot confidently round. When the value is not near a boundary (e.g. 0.32, 0.38), n_reps=2600 is adequate. The n_reps≈10200 "rounding guarantee" is only needed when you happen to land near a boundary and want confidence in which way to round.
With n_reps=200, SE ≈ 0.009 (worst case N=73), so the third decimal is uncertain and the second decimal is borderline. The default n_reps=200 is a practical trade-off.
Exact variance of the mean CI endpoint. Each rep i produces one CI endpoint (the 2.5th or 97.5th percentile of its n_boot bootstrap rhos). Its variance has two independent components: (1) inter-rep variability (different datasets), and (2) bootstrap quantile noise (finite n_boot). For the p-th sample quantile from n_boot i.i.d. draws, the asymptotic variance is $p(1-p)/\big(n_{\mathrm{boot}}\, f(q_p)^2\big)$, where $f$ is the density of the bootstrap distribution at the quantile. Averaging over n_reps independent reps:

$$\mathrm{Var}(\overline{\mathrm{endpoint}}) = \frac{\sigma_{\mathrm{rep}}^2 + \sigma_{\mathrm{boot}}^2}{n_{\mathrm{reps}}},$$

where $\sigma_{\mathrm{rep}}$ is the inter-rep SD of the endpoint (see n_reps, SE, and reliability above) and $\sigma_{\mathrm{boot}}^2$ is the per-rep bootstrap quantile variance. The ratio of the bootstrap term to the inter-rep term, $\sigma_{\mathrm{boot}}^2/\sigma_{\mathrm{rep}}^2$, is small for the recommended settings:
With n_reps ≥ 2600:
- n_boot=200–400 suffices: bootstrap variance adds under 0.5% to total; the 2.5th percentile is estimated from 5–10 order statistics.
- n_boot=500 is comfortable; going higher gives no meaningful improvement.
- n_boot=1000 (default in `config.py`) is more than needed for high `n_reps`; use 200–500 to save time.
When n_reps=200, n_boot=500 is sufficient; the ~0.001–0.002 difference from n_boot=1000 is swamped by the ~0.009 SE from inter-rep variability.
To check whether your chosen n_boot is sufficient, run the same scenario with n_boot=500 and n_boot=2000 (same seed). Compare the printed bootstrap CI endpoints:
Quick check (n_reps=50, ~1–2 min total): If the difference in CI endpoints is under ~0.003, n_boot=500 is fine for 2-decimal precision. Example:
```
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --ci-only --n-reps 50 --n-boot 500 --seed 42 --skip-copula --skip-linear
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --ci-only --n-reps 50 --n-boot 2000 --seed 42 --skip-copula --skip-linear
```

Thorough check (n_reps=200, ~5–10 min total): Use n_reps=200 for a more stable comparison and to confirm the averaged CI is well converged.
Vectorization and parallelization: The original implementation used a scalar loop over bootstrap replicates and power simulations. Several optimizations combine to speed things up:
- Vectorized Spearman — argsort-based ranking, O(n log n) per row, replaces per-row scipy calls.
- Vectorized data generation — `config.VECTORIZE_DATA_GENERATION` (default True): power and CI generate all n_sims or n_reps datasets in one batch via `generate_*_batch` functions instead of looping per dataset.
- Batch bootstrap — `config.BATCH_CI_BOOTSTRAP` (default True): CI generates all n_reps datasets and n_boot resamples in one batch, then computes Spearman across all bootstrap samples in a single Numba call. Requires `VECTORIZE_DATA_GENERATION=True`. Gives ~2.5× sequential speedup over per-rep bootstrap; see Benchmark data below.
- Skip-y data generation — In the default batch path, non-empirical generators (`generate_y_nonparametric_batch`, `generate_y_copula_batch`, `generate_y_linear_batch`) return rank arrays directly (`_return_ranks=True`) instead of generating float y, skipping lognormal/exp generation entirely. Saves ~15–20% of the data-generation step for both power simulation and CI batch bootstrap. The empirical generator is unchanged. See Skip-y for data generation.
- Scenario-level parallelization — joblib (`--n-jobs`) farms scenarios out across cores. Caveat: With batch bootstrap, parallel (n_jobs=-1) is often slower than sequential (n_jobs=1) due to joblib overhead and nested parallelism with Numba; see Parallelization and Numba caveats below.
These flags live in config.py: VECTORIZE_DATA_GENERATION, BATCH_CI_BOOTSTRAP, CALIBRATION_MODE (multipoint vs single), and USE_PERMUTATION_PVALUE (permutation vs t-based p-value for power).
Calibration: Uses a rho-independent attenuation ratio cached per (n, k, dist_type). The calibration cost is paid once per tie structure, not once per rho value tested during bisection — eliminating the dominant bottleneck of the original implementation (~60× speedup for power estimation).
Other optimizations: _fit_lognormal results are LRU-cached; x-value templates are cached and only shuffled per call. Power simulation uses permutation-based p-values (precomputed null or batched Monte Carlo) when USE_PERMUTATION_PVALUE=True; see P-value method (power simulation).
When using the Monte Carlo p-value path (empirical generator, or any generator with permutation enabled), peak memory is dominated by: (1) permutation indices perm_idx_all of shape (n_perm, n_sims, n) int32 → 4 × n_perm × n_sims × n bytes; (2) permutation rhos (n_sims, n_perm) float64 → 8 × n_sims × n_perm bytes. Data x_all, y_all are smaller by comparison.
Equation: Extra memory ≈ (4n + 8) × n_sims_chunk × n_perm bytes. For n ≈ 80: ≈ 328 × n_sims_chunk × n_perm bytes (e.g. 10k sims × 1k perms at n≈80 ≈ 3.3 GB for a full batch).
Chunking: When n_sims > PVALUE_N_SIMS_BATCH_THRESHOLD (default 5k), sims are processed in chunks of PVALUE_N_SIMS_CHUNK_SIZE (default 2k). Peak memory is then for one chunk: e.g. 2k sims × 1k n_perm × n≈80 → ~660 MB per chunk.
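The peak estimate can be checked with a one-line helper; `perm_pvalue_extra_bytes` is an illustrative name for the formula above:

```python
def perm_pvalue_extra_bytes(n, n_sims_chunk, n_perm):
    # int32 permutation indices: 4 * n_perm * n_sims * n bytes
    # float64 permutation rhos:  8 * n_sims * n_perm bytes
    return (4 * n + 8) * n_sims_chunk * n_perm

print(perm_pvalue_extra_bytes(80, 10_000, 1_000) / 1e9)  # 3.28 (GB, full batch)
print(perm_pvalue_extra_bytes(80, 2_000, 1_000) / 1e6)   # 656.0 (MB, one chunk)
```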
Example table (n ≈ 80):
| n_sims | n_perm | Mode | Peak extra memory |
|---|---|---|---|
| 500 | 1k | full | ~130 MB |
| 2k | 1k | full | ~525 MB |
| 5k | 1k | full | ~1.6 GB |
| 10k | 1k | chunked | ~660 MB per chunk |
| 50k | 1k | chunked | ~660 MB per chunk |
| 5k | 2k | full | ~3.3 GB |
For user-computed errors and planning n_sims / n_cal:
- Power estimate (binomial): $\hat{p}$ = proportion of sims with p < α; $\mathrm{SE}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p})/n_{\mathrm{sims}}}$. For power near 0.80, $\mathrm{SE} \approx 0.4/\sqrt{n_{\mathrm{sims}}}$.
- Bisection contribution to SE(min ρ): By the implicit function theorem, bisection finds $\rho^*$ where $\text{power}(\rho^*) = \pi_{\mathrm{target}}$. The SE of the estimated root is

$$\mathrm{SE}(\hat{\rho}^*) \approx \frac{\mathrm{SE}(\hat{\pi})}{\big|\text{power}'(\rho^*)\big|} = \frac{c}{\sqrt{n_{\mathrm{sims}}}}, \qquad c = \frac{0.4}{\big|\text{power}'(\rho^*)\big|}.$$
This is a standard result in stochastic root-finding (Waeber, Frazier & Henderson, 2013, SIAM J. Control Optim. 51(3)). The coefficient c depends on the slope of the power curve at the 80% crossing. For the asymptotic (no-tie) power function, the values are:
| n | $\rho^*$ | slope | c |
|---|---|---|---|
| 73 | 0.316 | 2.76 | 0.145 |
| 80 | 0.302 | 2.86 | 0.140 |
| 82 | 0.299 | 2.88 | 0.139 |
With ties (which attenuate rho, shifting the crossing to a higher $\rho^*$), c changes; run scripts/estimate_bisection_c.py for stable estimates. If you use different data or heavier ties, revisit precision parameters and re-run benchmarks; see Precision when data changes.
- Calibration contribution: Calibration uncertainty scales as $1/\sqrt{n_{\mathrm{cal}}}$ and propagates to the min detectable ρ: $\mathrm{SE}_{\mathrm{cal}}(\rho) = k/\sqrt{n_{\mathrm{cal}}}$, where the coefficient k is the SD of one calibration realised rho. From the Bonett-Wright asymptotic formula (the same one used for CIs, via the delta method on Fisher z):

$$k = (1-\rho^2)\sqrt{\frac{1+\rho^2/2}{n-3}}.$$

For case-specific values, run scripts/estimate_calibration_k.py --analytical (instant) or omit --analytical for the empirical value.
- Combined uncertainty (independent): $\mathrm{SE}_{\mathrm{total}}(\rho) = \sqrt{\mathrm{SE}_{\mathrm{bisect}}^2 + \mathrm{SE}_{\mathrm{cal}}^2}$.
- 95% CI half-width: Half-width ≈ 1.96 × $\mathrm{SE}_{\mathrm{total}}(\rho)$. For target half-width w, aim for $\mathrm{SE}_{\mathrm{total}} \leq w / 1.96$.
- Rounding: For the reported value to round safely to a band (e.g. 0.35), the whole 95% CI should lie inside that band (e.g. half-width ≤ 0.005 at the boundary). Wrong-rounding probability at a boundary depends on half-width; overall rounding confidence is typically ~90–95% when half-width is ±0.002 and true values are not at boundaries. Numerical guidance: n_sims ≈ 3k, n_cal ≈ 1k for ±0.01; for rounding safety near 0.35, wrong-rounding is ~15–25% at 0.346, overall ~90–95%.
- Consolidated error budget: See docs/UNCERTAINTY_BUDGET.md for all power and CI error sources (including precomputed null realization variance, calibration interpolation, permutation noise, and bootstrap quantile noise) in a single reference table, with numerical values for each precision tier and a quick-reference parameter table.
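Putting the pieces together, a back-of-envelope planner using the worst-case coefficients from the tables above (c = 0.145 and k = 0.112 at n = 73; the function name is illustrative):

```python
import math

def se_total(n_sims, n_cal, c=0.145, k=0.112):
    # Combine independent bisection and calibration contributions
    se_bisect = c / math.sqrt(n_sims)
    se_cal = k / math.sqrt(n_cal)
    return math.sqrt(se_bisect**2 + se_cal**2)

# 95% half-width for the guidance values n_sims=3000, n_cal=1000:
print(1.96 * se_total(3000, 1000))  # ≈ 0.009, i.e. about ±0.01
```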
Power is estimated as $\hat{\pi}$, the fraction of the n_sims simulated datasets with p < α.

Exact variance decomposition. For each simulation i, let $R_i \in \{0,1\}$ indicate rejection. The variance of $\hat{\pi}$ splits into a binomial term and a permutation-noise term:

$$\mathrm{Var}(\hat{\pi}) = \frac{\pi(1-\pi) + \sigma_{\mathrm{perm}}^2}{n_{\mathrm{sims}}},$$

where the permutation contribution $\sigma_{\mathrm{perm}}^2$ is the extra indicator variance induced by Monte Carlo error in each estimated p-value.

Bounding $\sigma_{\mathrm{perm}}^2$: in the worst case (p-values uniform near the threshold, approximately normal permutation noise), the extra variance is bounded by

$$\sigma_{\mathrm{perm}}^2 \leq \frac{\sigma_p}{\sqrt{\pi}} \approx 0.564\,\sigma_p,$$

where $\sigma_p = \sqrt{\alpha(1-\alpha)/n_{\mathrm{perm}}}$ is the SD of the estimated p-value at p = α.

Approximation used by benchmark scripts. Since $\sigma_{\mathrm{perm}}^2 \ll \pi(1-\pi)$, the scripts use the binomial term alone, $\mathrm{SE}(\hat{\pi}) \approx \sqrt{\pi(1-\pi)/n_{\mathrm{sims}}}$.

Numerical comparison (α = 0.05, power π = 0.80, worst-case uniform p-values):
| P-value path | n_perm / n_pre | $\sigma_p$ | $\sigma_{\mathrm{perm}}^2$ | Variance inflation | SE inflation |
|---|---|---|---|---|---|
| Precomputed null (non-empirical) | 50 000 | 0.00097 | 5.5 × 10⁻⁴ | 0.34% | +0.17% |
| MC (empirical, n_sims < 5 000) | 2 000 | 0.00487 | 2.7 × 10⁻³ | 1.72% | +0.86% |
| MC (empirical, n_sims ≥ 5 000) | 1 000 | 0.00689 | 3.9 × 10⁻³ | 2.43% | +1.21% |
Even in the worst case (MC path with n_perm=1000, uniform p-values), permutation noise inflates SE by only ~1.2%. For the precomputed null (n_pre=50,000) used by the non-empirical generators, the inflation is ~0.17%. Under realistic conditions (p-values not uniform near the threshold), the inflation is smaller still.
MC path noise vs precomputed null realization variance — a key distinction. The variance decomposition above (Var(π̂) = [...] / n_sims) applies to the MC path: each sim uses fresh, independent permutations, so the per-sim critical value errors are IID and their contribution to the mean rejection rate averages out as 1/√n_sims. The precomputed null behaves differently: one null is built once and shared by all n_sims — its critical value offset is the same for every sim, fully correlated, and does not reduce with n_sims. The fix for the precomputed null is increasing n_pre, not n_sims. See AUDIT.md "Known Precision Limitations" (PREC-1) for the quantitative analysis and the contrast table.
Why n_perm=1,000 is adequate for all precision tiers (including empirical). The empirical generator uses the precomputed null by default (see Precomputed null approximation). When EMPIRICAL_USE_PRECOMPUTED_NULL = False, it falls back to the MC path. The +1.2% worst-case SE inflation on power propagates through bisection: the SE of min detectable rho is inflated by at most the same ~1.2%, negligible against the binomial and calibration contributions, so the default n_perm (1,000 when n_sims ≥ 5,000; 2,000 otherwise) requires no adjustment.
When config.EMPIRICAL_USE_PRECOMPUTED_NULL = True (default), the empirical
generator uses the same precomputed null as other generators. The precomputed
null permutes all-distinct y-ranks {1..n}; but empirical y-data always has
ties — the pool is 71 digitized values with pre-existing duplicates, resampled
into n = 73–81 draws, so ties are guaranteed in every realization by both
pigeonhole and the pool's own repeated measurements. This
section quantifies the approximation error.
Source of error. The precomputed null computes each null rho by permuting the all-distinct ranks {1..n} against the x midranks, whereas spearman_rho_2d uses midranks of the actual y-data (with ties), which have slightly smaller variance than distinct ranks. The size of the mismatch is governed by the y tie sum

$$\Sigma_y = \sum_i (t_i^3 - t_i)$$

over y tie groups, which inflates the rank-variance denominator by a factor of roughly $1 + \Sigma_y / \big(2(n^3 - n)\big)$.
Tie counts in digitized data (data/digitized.py):
-
H_AL71: 4 tie pairs —
$\Sigma = 4 \times 6 = 24$ -
B_AL71: 1 triple + 6 pairs —
$\Sigma = 24 + 36 = 60$
Pool totals after resampling (base-71 ties + expected contribution
from build_empirical_pool resampled fill):
-
n=73 (B-Al, 2 resampled from B_AL71):
$\Sigma_y \approx 79$ -
n=80 (B-Al + 7 outliers):
$\Sigma_y \approx 79$ (outliers are continuous) -
n=81 (H-Al, 10 resampled from H_AL71):
$\Sigma_y \approx 98$ -
n=82 (H-Al + 1 outlier):
$\Sigma_y \approx 98$
Denominator inflation (worst cases):
| Case | n | n³−n | Σ_y | Inflation | p-value shift |
|---|---|---|---|---|---|
| n=73 | 73 | 388,944 | ~79 | 1 + 1.0×10⁻⁴ | ~2.3×10⁻⁵ |
| n=81 | 81 | 531,360 | ~98 | 1 + 9.2×10⁻⁵ | ~2.0×10⁻⁵ |
Impact on p-values. The CDF shift from the kurtosis mismatch (Edgeworth expansion) plus the denominator inflation gives a total p-value bias on the order of 10⁻⁵ (last column of the table above).
Impact on power. The p-value bias shifts the rejection rate by roughly the same order, ~10⁻⁵. Against each precision target:
| Target | Ratio (bias / target) | Verdict |
|---|---|---|
| ±0.01 | 10⁻³ (0.1%) | Safe |
| ±0.002 | 5×10⁻³ (0.5%) | Safe |
| ±0.001 | 10⁻² (1%) | Safe |
Variance invariance. Under permutation, the first two moments of Spearman rho are exactly invariant to tie structure: for any fixed score vectors, $E[\rho] = 0$ and $\mathrm{Var}(\rho) = 1/(n-1)$; only higher moments (e.g. kurtosis) depend on ties.
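The invariance is easy to spot-check by Monte Carlo (a sketch; any fixed tied x scores work):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = np.array([1, 1, 2, 2, 3, 3, 4, 4], dtype=float)  # heavily tied x scores
y = np.arange(1, n + 1, dtype=float)                 # all-distinct ranks

zx = (x - x.mean()) / x.std()
zy = (y - y.mean()) / y.std()

# Pearson correlation of zx with each permuted zy, computed in one shot
perms = np.array([rng.permutation(zy) for _ in range(100_000)])
rhos = perms @ zx / n
print(rhos.mean(), rhos.var())  # close to 0 and 1/(n-1) ≈ 0.1429
```

Despite the heavy ties in x, the permutation mean and variance match the no-tie values; only the shape of the distribution differs.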
Fallback. Set config.EMPIRICAL_USE_PRECOMPUTED_NULL = False to use
per-dataset Monte Carlo permutation p-values (exact, no approximation,
much slower). The MC path is documented in
Per-dataset Monte Carlo above.
Verification. Run benchmarks/verify_precomputed_null_empirical.py
to empirically confirm the approximation is safe — i.e. deltas between
precomputed and MC are bounded well below all precision targets for all
four cases. The script does not isolate the 10⁻⁵ y-tie error; that rests
on the analytic derivation above. The dominant noise source in the script
output is precomputed null realization variance (~SD 0.001 at n_pre=50k),
which is the same for all generators using the same null and is not
introduced by OPT-1. See AUDIT.md "Known Precision Limitations" for the
realization-variance analysis and the n_pre recommendation for ±0.001.
The reported CI endpoint is the mean of n_reps independent bootstrap-quantile estimates. Each rep i produces one endpoint (a percentile of its n_boot bootstrap rho values). Its variance decomposes into two independent components:

$$\mathrm{Var}(\overline{\mathrm{endpoint}}) = \frac{\sigma_{\mathrm{rep}}^2 + \sigma_{\mathrm{boot}}^2}{n_{\mathrm{reps}}},$$

where $\sigma_{\mathrm{rep}}$ is the inter-rep SD of the endpoint (see n_reps, SE, and reliability above) and $\sigma_{\mathrm{boot}}^2$ is the bootstrap quantile variance from finite n_boot.

Approximation used by benchmark scripts. Since the bootstrap term is negligible for n_boot ≥ 500, the scripts use the inter-rep term alone and plan with

$$n_{\mathrm{reps}} \approx \left(\frac{1.96\,\sigma_{\mathrm{rep}}}{w}\right)^2,$$

where w is the target 95% CI half-width.
Numerical comparison ($\sigma_{\mathrm{rep}} = 0.13$, n_boot = 500):
| Target half-width | n_reps | SE (approx) | SE (exact) | Ratio | ΔHW |
|---|---|---|---|---|---|
| ±0.01 | 649 | 0.005103 | 0.005109 | 1.0012 | 1.2 × 10⁻⁵ |
| ±0.002 | 16 231 | 0.001020 | 0.001022 | 1.0015 | 3.9 × 10⁻⁶ |
| ±0.001 | 64 923 | 0.000510 | 0.000511 | 1.0014 | 1.4 × 10⁻⁶ |
The bootstrap term inflates SE by only ~0.15% across all tiers with n_boot = 500, so increasing n_boot beyond that is unnecessary.
Two CI bootstrap modes are available. Per-rep bootstrap (BATCH_CI_BOOTSTRAP=False): for each of n_reps datasets, many calls to _bootstrap_rhos_jit (Numba). Batch bootstrap (BATCH_CI_BOOTSTRAP=True, default): one call per scenario to _batch_bootstrap_rhos_jit (Numba), processing all n_reps × n_boot tasks in one batch. Both use vectorized data generation; batch is bit-identical to per-rep for the same data and bootstrap indices.
Timings below are averages across multiple runs (earlier range data and a single measured run) to reduce variance. Conditions: n_reps=200, n_boot=1000, single-point calibration, nonparametric generator, 4-logical-core Windows machine (2 physical cores with hyperthreading). Run benchmarks one at a time to avoid CPU contention.
Single scenario (e.g. Case 3, k=4, even):
| Config | Time | Speedup vs per-rep |
|---|---|---|
| Per-rep bootstrap | ~4.5s | — |
| Batch bootstrap | ~1.2s | ~3.8× |
Full grid (88 scenarios), averaged times:
| Config | Time |
|---|---|
| Per-rep bootstrap, sequential (n_jobs=1) | ~6 min (~364s) |
| Batch bootstrap, sequential (n_jobs=1) | ~2.4 min (~142s) |
| Per-rep bootstrap, parallel (n_jobs=-1) | ~3.9 min (~235s) |
| Batch bootstrap, parallel (n_jobs=-1) | ~4 min (~243s) — often slower than batch sequential |
Recommendation: For batch bootstrap, use n_jobs=1 (sequential); it is the fastest (~2.4 min full grid). For per-rep bootstrap, parallel (n_jobs=-1) does help (~3.9 min vs ~6 min sequential).
- Machine note: Benchmarks were taken on a machine with 4 logical cores (2 physical cores with hyperthreading). On such a machine, `n_jobs=-1` uses 4 joblib workers.
- Batch bootstrap + joblib: With batch bootstrap, parallel is often slower than sequential. Reasons: (1) Joblib overhead — process spawn, serialization, and scheduling cost is a large fraction of the per-scenario time (~1.8s) for batch, so parallel adds net cost. (2) Nested parallelism — joblib spawns N workers; each worker runs Numba's `prange` (default 4 threads), so 4×4 = 16 threads on 4 cores cause oversubscription. Most of the "parallel slower than sequential" gap (~89%) is joblib; ~11% is Numba thread contention.
- Per-rep bootstrap + joblib: Per-rep has many short Numba bursts per scenario, so joblib's overhead is better amortized; parallel typically beats sequential for per-rep bootstrap.
- Memory: Batch bootstrap generates bootstrap indices in chunks of 2,000 reps at a time (configurable via `_BATCH_BOOTSTRAP_CHUNK_SIZE`). Peak memory per chunk is ~328 MB (n_boot=500 × 2000 reps × n≈82 × 4 bytes), bounded regardless of total `n_reps`. The dominant allocations at large n_reps are `x_all`/`y_all` of shape (n_reps, n) float64. With `n_jobs=-1`, multiple workers hold their own copies of these arrays; use `n_jobs=1` for batch bootstrap to avoid multiplication by worker count.
- Validation: Batch and per-rep bootstrap are bit-identical for the same data and bootstrap indices. See `tests/test_batch_bootstrap_ci.py`, `tests/test_nested_parallelism.py`, and `tests/test_batch_sequential_vs_parallel.py` (and `docs/BENCHMARKING_FINDINGS.md`) to reproduce.
Without vectorized data generation or batching (parallelization and vectorized Spearman only):
| Task | Approximate time |
|---|---|
| Full grid CI (88 scenarios), n_jobs=1, 200 reps × 1000 boot | ~16 min |
| Full grid CI (88 scenarios), n_jobs=4, 200 reps × 1000 boot | ~8 min |
With this setup, joblib at n_jobs=4 gives ~2× on the full grid (16 min → 8 min) and ~3.5× on single-scenario CI (200×1000): ~42s sequential → ~12s parallel. Combined with vectorized Spearman, the full CI grid is ~7× faster than the original sequential code.
With vectorized data generation and batch bootstrap (default) and Numba (recommended). Sequential is preferred for CI; parallel can be used for power.
| Task | Approximate time |
|---|---|
| Single scenario CI (200 reps × 1000 boot), batch bootstrap, n_jobs=1 | ~1.2s |
| Single scenario power (500 sims, n_cal=300) | ~5s (includes calibration) |
| Single scenario power (10,000 sims, n_cal=300) | ~45s |
| Full grid CI (88 scenarios), batch bootstrap, n_jobs=1, 200×1000 | ~2.4 min |
| Full grid CI (88 scenarios), per-rep bootstrap, n_jobs=1, 200×1000 | ~6 min |
| Full grid CI (88 scenarios), per-rep bootstrap, n_jobs=4, 200×1000 | ~3.9 min |
| Full grid power (88 scenarios, 500 sims, n_cal=300), n_jobs=1 | ~6 min |
| Full grid power (88 scenarios, 500 sims, n_cal=300), n_jobs=4 | ~3 min |
Numba JIT compilation speeds up the bootstrap and ranking loops. Pre-warm with python scripts/warm_up_numba.py to avoid first-run compile delay. On a 4-logical-core machine, use n_jobs=1 for CI with batch bootstrap as above; for power or per-rep bootstrap CI, n_jobs=4 or -1 can reduce wall time.
| Task | Without Numba | With Numba | Speedup |
|---|---|---|---|
| Single scenario CI (200 reps × 500 boot) | ~12s | ~3–5s | ~3× |
| Single scenario power (500 sims, n_cal=300) | ~5s | ~2–3s | ~2× |
| Full grid CI (88 scenarios, n_reps=10200, n_boot=500, n_jobs=1 batch) | ~3.5 h | ~28–62 min | ~4–8× |
| Full grid power (88 scenarios, n_sims=10k, n_cal=300, n_jobs=4) | ~30–60 min | ~5–12 min | ~4–6× |
On a 16-vCPU cloud machine (e.g. Hetzner CPX51), full CI grid per generator (88 scenarios, n_reps=10200, n_boot=500, n_jobs=-1) completes in under 17 minutes with Numba and pre-warmed cache.
Copula and linear generators have similar per-sim cost but no calibration overhead. The asymptotic method is instantaneous.
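The asymptotic method needs no simulation at all: a closed-form CI for Spearman's rho follows from the Fisher z-transform with the Bonett-Wright (2000) standard error sqrt((1 + r²/2)/(n − 3)). A minimal stdlib-only sketch (function name is illustrative, not the project's API):

```python
import math
from statistics import NormalDist

def spearman_ci_asymptotic(r, n, conf=0.95):
    """Approximate CI for Spearman's rho via the Fisher z-transform,
    using the Bonett-Wright (2000) SE sqrt((1 + r^2/2)/(n - 3))."""
    z = math.atanh(r)                              # Fisher transform
    se = math.sqrt((1 + r**2 / 2) / (n - 3))
    zcrit = NormalDist().inv_cdf(0.5 + conf / 2)   # e.g. 1.96 for 95%
    return math.tanh(z - zcrit * se), math.tanh(z + zcrit * se)

lo, hi = spearman_ci_asymptotic(0.15, 82)          # e.g. r=0.15 at n=82
```

Because it is a single closed-form expression, evaluating it across all 88 scenarios takes microseconds, hence "instantaneous" relative to the bootstrap methods.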
- Benchmarking: Run sequential and parallel benchmarks separately, one at a time. Concurrent runs cause CPU contention and invalidate timing results.
- CI (batch bootstrap): Use `--n-jobs 1` for batch bootstrap; parallel often slows it down (see Parallelization and Numba caveats). For per-rep bootstrap, `--n-jobs 4` or `-1` helps.
- Use `--n-sims 500` for exploratory runs (seconds to minutes) vs `10000` for production.
- On a 4-logical-core machine (e.g. 2 physical cores with hyperthreading), `--n-jobs 4` or `-1` parallelizes across logical cores; for power and per-rep bootstrap CI this can roughly halve full-grid runtimes.
- The calibration step adds ~3s (single-point) or ~9s (multipoint, default) per unique (N, k, distribution) tie structure on first run; cached thereafter and reused across all rho values. Use `--calibration-mode single` for faster exploratory runs.
- Bootstrap CIs dominate total runtime when using many reps and resamples. With `--n-reps 200`, use `--n-boot 500` (bootstrap noise is negligible). With `--n-reps 2600` or higher, `--n-boot 200-400` suffices; `--n-boot 500` is comfortable. Use `--n-reps 20 --n-boot 500` for quick checks.
- Use `scripts/run_single_scenario.py` to test individual scenarios quickly before committing to a full run.
- Filter scenarios with `--cases`, `--n-distinct`, `--dist-types` to reduce the grid.
- Use `--skip-copula --skip-linear` to run only the recommended nonparametric + empirical + asymptotic methods. Use `--skip-empirical` when digitized data is unavailable.
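The exploratory-vs-production guidance above follows from standard Monte Carlo error formulas (a sketch of the general reasoning; the exact formulas used by the precision-params benchmark may differ). A power estimate from n_sims Bernoulli trials has SE ≈ sqrt(p(1−p)/n_sims), so the n_sims needed for a target CI half-width is:

```python
import math
from statistics import NormalDist

def n_sims_for_power_halfwidth(halfwidth, p=0.5, conf=0.95):
    """Smallest n_sims so a conf-level CI on estimated power p has the
    given half-width, from the binomial SE sqrt(p(1-p)/n). p=0.5 is the
    worst case (largest variance), so it gives a conservative n."""
    z = NormalDist().inv_cdf(0.5 + conf / 2)
    return math.ceil(z**2 * p * (1 - p) / halfwidth**2)

print(n_sims_for_power_halfwidth(0.01))  # 9604: why ~10000 sims for ±0.01
```

At the worst case p=0.5, ±0.01 requires 9604 sims (hence `--n-sims 10000` for production), while `--n-sims 500` gives roughly ±0.044, adequate for exploration.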
Six scripts measure performance of the main optimizations (vectorized data generation, batch CI bootstrap) and precision/runtime planning. Run scripts one at a time (concurrent runs cause CPU contention and invalidate timings; see Tips above).
| Script | Purpose |
|---|---|
| `benchmarks/benchmark_precision_params.py` | Precision only: computes (n_sims, n_cal) and (n_reps, n_boot) for ±0.01, ±0.002, ±0.001 accuracy tiers from README formulas. No long runs. |
| `benchmarks/benchmark_realistic_runtimes.py` | Runtimes: per-generator power and CI at small params; sequential and parallel full grid; scaled to three precision tiers; best-config combined table and high-core estimates. `--quick` single-scenario only; `--skip-parallel` skips `n_jobs=-1`; `--generators` all or comma-separated list (e.g. nonparametric, empirical). |
| `benchmarks/benchmark_batch_vs_no_batch.py` | Single scenario: per-rep bootstrap vs batch bootstrap. Case 3, k=4, even, n_reps=200, n_boot=1000. |
| `benchmarks/benchmark_full_grid.py` | Full 88-scenario CI grid: per-rep vs batch bootstrap, sequential vs parallel, and Numba thread variants. Uses n_reps=200, n_boot=1000. |
| `benchmarks/benchmark_ci_bootstrap.py` | Batch CI bootstrap: multipoint vs single-point calibration. Single scenario and 22-scenario subset. `--quick` runs only single-scenario (~1 min). |
| `benchmarks/benchmark_power.py` | Power simulation: vectorized vs scalar data generation, multipoint vs single calibration, sequential vs parallel full grid. `--quick` skips full grid. |
```shell
# Precision params only (no simulation)
python benchmarks/benchmark_precision_params.py

# Quick single-scenario comparisons
python benchmarks/benchmark_batch_vs_no_batch.py
python benchmarks/benchmark_ci_bootstrap.py --quick
python benchmarks/benchmark_power.py --quick
python benchmarks/benchmark_realistic_runtimes.py --quick --skip-parallel

# Full benchmarks (several minutes each)
python benchmarks/benchmark_full_grid.py
python benchmarks/benchmark_ci_bootstrap.py
python benchmarks/benchmark_power.py
python benchmarks/benchmark_realistic_runtimes.py
```

See docs/BENCHMARKING_FINDINGS.md for detailed results, decomposition of joblib vs Numba overhead, nested parallelism, memory notes, and verification scripts in tests/ (test_batch_bootstrap_ci.py, test_nested_parallelism.py, test_batch_sequential_vs_parallel.py).
With pre-copied Numba cache (recommended):

```shell
# On local machine (after running scripts/warm_up_numba.py):
rsync -avz ~/.cache/numba/ root@YOUR-IP:~/.cache/numba/

# On cloud VM:
python scripts/run_simulation.py --n-sims 10000 --skip-copula --skip-linear --n-jobs -1 --seed 42
```

Full CI grid (all generators):
```shell
python -c '
import time, pickle
from confidence_interval_calculator import run_all_ci_scenarios
for gen in ["nonparametric", "copula", "linear", "empirical"]:
    t0 = time.time()
    print(f"Starting {gen}...")
    res = run_all_ci_scenarios(generator=gen, n_reps=10200, n_boot=500, n_jobs=-1, seed=42)
    with open(f"ci_{gen}.pkl", "wb") as f:
        pickle.dump(res, f)
    print(f"{gen} done in {time.time()-t0:.1f} s")
'
```

- Karwowski MP, Stamoulis C, Wenren LM, Faboyede GM, Quinn N, Grodin MA, Bellinger DC, Woolf AD. Blood and Hair Aluminum Levels, Vaccine History, and Early Infant Development: A Cross-Sectional Study. Acad Pediatr. 2018 Mar;18(2):161-165. doi:10.1016/j.acap.2017.09.003
- Bonett DG, Wright TA (2000). Sample size requirements for Pearson, Kendall, and Spearman correlations. Psychometrika, 65(1), 23-28. doi:10.1007/BF02294183
- Conover WJ, Iman RL (1981). Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35(3), 124-129.
- Fieller EC, Hartley HO, Pearson ES (1957). Tests for rank correlation coefficients. I. Biometrika, 44(3-4), 470-481. doi:10.1093/biomet/44.3-4.470
- Iman RL, Conover WJ (1982). A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics - Simulation and Computation, 11(3), 311-334.
- Wikipedia: Cholesky decomposition (Monte Carlo simulation section)
- Wikipedia: Spearman's rank correlation coefficient (Fisher transformation, references)