Spearman Power Simulation: Vaccine Aluminum Association

Disclaimer: This code and the statistical methods it implements were generated by AI. A statistician has not yet verified the implementation or the appropriateness of the methods for this application, although it is reassuring that the results of the distinct methods are consistent with each other and that this sort of problem is one AI likely handles well.

Important note on data: Raw data from Karwowski et al. (2018) is not publicly available. All simulations use plausible approximations based on the paper's reported summaries (n, median, IQR, range, and typical vaccination clustering). Results are robust across multiple tie distributions, which provides reasonable confidence despite the lack of exact data.

Statistical power and confidence-interval framework for Spearman rank correlations, designed to evaluate the non-significant associations reported in Karwowski et al. (2018) between blood/hair aluminum levels and cumulative vaccine aluminum exposure.

Background

Karwowski et al. (2018) found no significant Spearman correlations between cumulative vaccine aluminum and either blood aluminum (B-Al) or hair aluminum (H-Al) in a cohort of 9-13-month-old infants. This simulation framework asks: what correlations could the study have detected?

Because most children followed a similar vaccination schedule, cumulative aluminum values cluster around a narrow range (median 2.9 mg, IQR 0.11 mg), producing ties in the x-variable. The framework explicitly models these ties to assess their impact on statistical power.

Four Cases

Karwowski et al. (2018) recruited 85 infants. Missing B-Al and/or H-Al data and extreme outliers reduce the available sample for each calculation (hence the lower N values in the table below). The paper does not state whether the reported Spearman correlations with vaccine aluminum load were computed with or without outliers. However, when reporting the correlation between B-Al and H-Al, the authors exclude participants with missing data and all extreme outliers, which suggests the vaccine correlations may also have been computed with outliers excluded. To cover both possibilities, the simulation runs four cases: B-Al and H-Al, each with outliers included and excluded.

Case	Variable	Outliers	N	Observed rho	P
1	B-Al	Included	80	-0.13	0.26
2	H-Al	Included	82	+0.06	0.56
3	B-Al	Excluded	73	-0.13	0.26
4	H-Al	Excluded	81	+0.06	0.56

Analysis Goals

We test H₀: ρ = 0 vs H₁: ρ ≠ 0 (two-sided) at α = 0.05. Power is the probability of rejecting H₀ when the true correlation is ρ.

Minimum detectable correlation -- the smallest |ρ| detectable with 80% power at α = 0.05, for each case × tie structure × distribution shape.
Confidence intervals for observed rho -- bootstrap and asymptotic CIs. These are the expected intervals under the model when the true correlation equals the observed rho, averaged over many simulated datasets, showing the plausible range of true correlations consistent with the data.

Note on search bounds: The bisection for min detectable rho searches [0.25, 0.42] (positive) or [-0.42, -0.25] (negative). These bounds are tuned for the Karwowski study (n = 73-82, α = 0.05); for other applications they should be widened. The asymptotic search uses wider bounds [0, 0.6].

Assumptions

Y marginal distribution (aluminum levels): Y is modeled as log-normal using the reported median and IQR, or when using the empirical generator, via the empirical distribution (resampling from digitized Figure of Karwowski B-Al/H-Al data). Because Spearman's rank correlation depends only on ranks, the specific continuous distribution of Y has no effect on the correlation coefficient, its null distribution, power, or confidence intervals in the non-parametric, linear, or asymptotic methods. The log-normal choice is kept purely for realism (capturing skewness and outliers) and because the Gaussian copula method requires a continuous marginal. The log-normal is fitted using q75 = median + IQR/2 (symmetric split of the reported IQR); for a truly log-normal distribution the split is asymmetric, so sigma is slightly underestimated. This has no effect on rank-based results since only the ordering of y matters. Based on the Karwowski (2018) Figure showing the B-Al vs H-Al correlation (N = 71, excluding missing values and outliers), these values have few and mild ties, so choosing a distribution of distinct values for generating y in our simulation is a good approximation. It should be noted though that those data points show B-Al follows a log-normal distribution, and so likely does also at N = 73, but the outliers likely shift it away from log-normal, and the H-Al data is not log-normal. (Implementation note: in the default batch execution path, lognormal y values are never actually generated for the nonparametric, copula, or linear generators. The skip-y optimization returns rank arrays directly, bypassing lognormal/exp generation entirely — these generators consume no RNG draws for y marginals in production. Only the empirical generator generates actual y values. See Skip-y for data generation.)
Tie structure distributions: Three frequency distributions (even, heavy_tail, heavy_center) are tested for sensitivity; differences across them are small, which is reassuring given unavailable raw data. heavy_center is considered most plausible for real vaccination schedules: most children followed the schedule and would be clustered in the middle for ages 9–13 months, with the middle 50% under 12 months. More realistic skewed patterns have also been explored but are currently undocumented here (e.g., for k=4 distinct values: only 2–5 observations in the tails, heavy concentration in the second value, and substantial but lower counts in the third). Users can test custom frequencies with scripts/run_single_scenario.py --freq.
Method consistency: The linear Monte Carlo method produces results that are virtually identical to the recommended non-parametric rank-mixing method. The Gaussian copula method now uses multipoint calibration by default (same as nonparametric). Calibration runs on the same data-generating process that includes jittering, so it compensates for the bias; with multipoint, the copula achieves target Spearman within tolerance and can be used for power and CI estimation. It remains a useful alternative to nonparametric and performs very well in all-distinct (no-ties) scenarios.

Tie Handling

X-values are generated with 4-10 distinct values using three frequency distributions: even, heavy_tail, and heavy_center.
An additional all-distinct baseline (no ties) is computed for each case.
Total: 84 tied scenarios + 4 all-distinct = 88 scenarios per method.

Four Monte Carlo Methods (plus Asymptotic)

Non-parametric rank-mixing (recommended)

This is a bivariate simplification of the Iman-Conover method, applied directly in standardized rank space using Cholesky decomposition to induce correlation, followed by reordering of y draws. Like standard Iman-Conover, it preserves the marginal distribution of y exactly via reordering and handles ties in x naturally through midranks. It is the recommended Monte Carlo approach, especially when x has ties. It works by:

Computing the standardised midranks of x (handling ties naturally).
Generating independent noise ranks via random permutation.
Mixing the two at weight rho: mixed = rho_cal * s_x + sqrt(1 - rho_cal^2) * s_noise. This is essentially the Cholesky decomposition (Iman-Conover) in the bivariate case applied directly to ranks.
Mapping the resulting ordering onto random draws from the target log-normal y-marginal. (In the default batch path, this step is skipped entirely: _return_ranks=True returns rank(mixed) directly without drawing any lognormal values. See Skip-y for data generation.)

Calibration: Because ties attenuate the realised Spearman rho relative to the mixing weight, the method includes an automatic calibration step. Two modes are available (configurable via config.CALIBRATION_MODE or --calibration-mode). In both modes, calibration runs once per unique tie structure and is reused across all rho values during bisection — a major performance improvement over per-rho calibration:

Multipoint (default) (non-parametric rank-mixing and copula): Probes at several rho values (e.g. 0.10, 0.30, 0.50) and interpolates the calibration curve. Captures nonlinear attenuation (ratio varies with rho) and reduces bias to typically under 0.01. For the copula, calibration runs on the same DGP that includes jittering, so it compensates for jittering-induced attenuation. Accuracy: Better than single-point when attenuation is nonlinear (e.g. custom or extreme tie structures). Performance: ~3× slower calibration on first run per tie structure (~9s vs ~3s) because it probes multiple values instead of one; cached thereafter and reused for all rho targets.
Single-point (non-parametric rank-mixing and copula): Probes only at rho = 0.30 and applies a constant ratio. Faster (~3× less calibration cost) but can have 0.01–0.03 bias when attenuation is nonlinear. Use for quick exploratory runs.

Validity, Assumptions, and Attenuation with Heavy Ties

The method induces an exact Pearson correlation $\rho_{\mathrm{cal}}$ between the mixed reference ranks and the standardized midranks $s_x$ via the bivariate Cholesky mixing:

$$\mathrm{mixed} = \rho_{\mathrm{cal}} \cdot s_x + \sqrt{1 - \rho_{\mathrm{cal}}^2} \cdot s_{\mathrm{noise}}$$

where:

$s_x$ = standardized midranks of x (mean 0, variance 1, incorporating ties via averaging)
$s_{\mathrm{noise}}$ = standardized ranks from a random permutation (independent, mean 0, variance 1)

Derivation of exact Pearson correlation: Let $X = s_x$, $Z = s_{\mathrm{noise}}$ (independent, $\mathrm{E}[X]=\mathrm{E}[Z]=0$, $\mathrm{Var}(X)=\mathrm{Var}(Z)=1$). Then $B = \rho X + \sqrt{1-\rho^2} Z$:

$$\mathrm{Var}(B) = \rho^2 \mathrm{Var}(X) + (1-\rho^2) \mathrm{Var}(Z) + 2\rho\sqrt{1-\rho^2}\,\mathrm{Cov}(X,Z) = \rho^2 + (1-\rho^2) = 1$$

$$\mathrm{Cov}(X, B) = \rho\,\mathrm{Var}(X) + \sqrt{1-\rho^2}\,\mathrm{Cov}(X,Z) = \rho$$

$$\mathrm{Corr}(X, B) = \frac{\mathrm{Cov}(X,B)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(B)}} = \rho$$

This holds purely from covariance algebra — no assumption is required that $X$ and $s_{\mathrm{noise}}$ come from the same distribution, only that they are independent with mean 0 and variance 1.

The final Spearman $\rho_s$ is the Pearson correlation of the ranks of x and the reordered y. With no ties, this would equal $\rho_{\mathrm{cal}}$ (ranks are monotone transforms). With heavy ties in x, midranks compress the effective variance/resolution of $s_x$ (large blocks of identical values), so the mapping from the finely resolved mixed ranks back to the clumped ranks of x attenuates the realised Spearman below $\rho_{\mathrm{cal}}$.

Approximate attenuation factor:

$$\sqrt{1 - \sum(m_j^3 - m_j)/(n^3 - n)}$$

where $m_j$ are tie group sizes (from Fieller-Hartley-Pearson variance correction). The attenuation is nonlinear in $\rho_{\mathrm{cal}}$, which is why multipoint calibration (probing at 0.10, 0.30, 0.50 and interpolating) is preferred over single-point.

All rank-based simulation methods (including full Iman-Conover) face similar attenuation with heavy ties; the calibration step ensures the realised Spearman matches the target within tolerance (typically < 0.01 bias).

Why this method? Unlike the Gaussian copula, it does not rely on the continuous-marginals assumption. Unlike the linear model, it does not assume a parametric relationship between x and y. It handles tied x-values naturally because the mixing operates directly in rank space.

Skip-y identity for continuous marginals

The rank-mixing step assigns y-values by sorting $n$ draws $y_{(1)} < y_{(2)} < \cdots < y_{(n)}$ and scattering them onto positions determined by argsort(mixed):

y_final[argsort(mixed)[k]] = y_{(k+1)}

Claim. When all y draws are distinct, the orderings of mixed and y_final are identical:

$$\mathrm{mixed}[j_1] < \mathrm{mixed}[j_2] \iff y_{\mathrm{final}}[j_1] < y_{\mathrm{final}}[j_2]$$

Proof. Let position $j$ receive rank $k_j$ in mixed (i.e. $j$ is the $k_j$-th smallest element of mixed). The scatter assignment gives $y_{\mathrm{final}}[j] = y_{(k_j)}$. For any two positions $j_1, j_2$:

$\mathrm{mixed}[j_1] < \mathrm{mixed}[j_2] \iff k_{j_1} < k_{j_2}$ (by definition of rank)
$k_{j_1} < k_{j_2} \iff y_{(k_{j_1})} < y_{(k_{j_2})}$ (since sorted y is strictly increasing)
Therefore $\mathrm{mixed}[j_1] < \mathrm{mixed}[j_2] \iff y_{\mathrm{final}}[j_1] < y_{\mathrm{final}}[j_2]$. $\square$

Since mixed is almost surely tie-free (see below), both vectors have identical orderings with no ties, so their rank vectors are equal element-wise:

$$\mathrm{rank}(y_{\mathrm{final}}) = \mathrm{rank}(\mathrm{mixed})$$

Consequence for Spearman correlation. Spearman's $\rho_s$ is the Pearson correlation of the rank vectors. Substituting the identity:

$$\rho_s(x,\, y_{\mathrm{final}}) = \mathrm{Pearson}\!\left(\mathrm{rank}(x),\; \mathrm{rank}(y_{\mathrm{final}})\right) = \mathrm{Pearson}\!\left(\mathrm{rank}(x),\; \mathrm{rank}(\mathrm{mixed})\right)$$

The right-hand side depends only on the x-template, the noise permutations, and $\rho_c$ — not on the y marginal parameters (median, IQR, etc.). Therefore, calibration precompute can skip lognormal y generation entirely, replacing the four-step sequence (argsort → scatter y → rank y_final → Pearson-on-ranks) with a single rank(mixed). This is implemented in _eval_mean_rho_fast and _precompute_calibration_arrays_fast.

Batch vs sequential RNG. The fast path (_precompute_calibration_arrays_fast) generates all $n_{\mathrm{cal}}$ shuffles in a single batch call (cal_rng.permuted), whereas the legacy single-point path generates each shuffle sequentially in a Python loop, resetting the RNG to seed at the start of each _mean_rho call. Both produce IID uniform random permutations per dataset — the same estimator bias ($0$), variance ($\mathrm{Var}(\rho_S)/n_{\mathrm{cal}}$), and convergence rate. The calibration ratios differ numerically between paths (different pseudo-random numbers assigned to different samples) but are statistically equivalent. See docs/DERIVATIONS.md §15 for the formal argument.

Tie probability for lognormal y. For $n = 80$ float64 lognormal draws, the probability of any two values being exactly equal is

$$\binom{80}{2} \cdot 2^{-53} \approx 3.5 \times 10^{-13}$$

so the identity holds for all practical purposes.

Why mixed is almost surely tie-free. Even when $s_x$ has repeated values (from tied x), the noise component $s_n$ is a permutation of distinct values $1, \ldots, n$. For $|\rho_c| < 1$, the noise term $\sqrt{1-\rho_c^2}\, s_n$ breaks all ties: two positions $j_1 \neq j_2$ share a mixed value only if $\rho_c(s_x[j_1]-s_x[j_2]) = \sqrt{1-\rho_c^2}(s_n[j_2]-s_n[j_1])$, an exact rational equation that float64 arithmetic satisfies with probability zero.

Tied y (empirical marginal). When the y pool has ties (e.g. from digitization precision in the Karwowski data), the identity breaks: tied y-values at adjacent sorted positions receive averaged ranks that differ from rank(mixed). The empirical generator calibration path handles this separately and does not use the skip-y optimization.

Gaussian copula

Uses a Gaussian copula with non-parametric marginals and a fitted log-normal for Y. Target Spearman rho is converted to Pearson correlation for sampling; then draws from a conditional bivariate normal; Y-values are mapped through the inverse CDF of the fitted log-normal.

Conversion: $\rho_p = 2\sin(\pi \rho_s/6)$ (exact Spearman–Pearson for the bivariate normal). See data_generator._spearman_to_pearson.

Steps: (1) x → ranks (average) → $u_x = (\text{ranks}-0.5)/n$, small jitter to break ties → $z_x = \Phi^{-1}(u_x)$; (2) draw

$$z_y = \rho_p z_x + \sqrt{1-\rho_p^2} Z$$

; $u_y = \Phi(z_y)$; (3) $y = F_{\ln}^{-1}(u_y)$ with log-normal fitted to median/IQR. See data_generator.generate_y_copula. (In the default batch path, the $\Phi$ and log-normal quantile transforms are skipped: _return_ranks=True returns rank(z_y) directly.)

With ties: Jittering collapses rank information, so without calibration the realised Spearman would be attenuated. Calibration runs on the same data-generating process (including jittering): we probe at several rho values and interpolate to find the input rho that yields each target. So calibration compensates for the bias. The copula now supports the same calibration modes as nonparametric: multipoint (default) and single-point. With multipoint, mean realised rho matches the target within tolerance (e.g. ±0.01), so the copula can be used for power and CI estimation even with ties.

Historical limitation without multipoint calibration: When x has heavy ties (e.g. k=4), the jittering step that breaks tied ranks collapses rank information and attenuates the realised Spearman rho, leading to underestimated power. Alternatives to the current jitter (distributional transform, adaptive jitter) were tested and did not meet the 0.01 accuracy threshold without multipoint calibration: single-point calibration (probe at rho = 0.30) helped compensate but was not enough. Alternatives tested — distributional transform (random uniform within each tie group's CDF band) and adaptive jitter (scaling jitter by tie group size) — improved the mild cases (k=10) but still failed the 0.01 accuracy threshold for heavy ties (k=4). This is a fundamental limitation of the continuous-marginals assumption, though it works reliably when there are no ties. Multipoint calibration compensates by learning the mapping on the jittered process, so the method remains usable. Nonparametric rank-mixing remains the recommended default because it operates directly in rank space and avoids the continuous-marginals assumption, but the copula is a valid alternative when multipoint calibration is used.

Skip-y identity for Gaussian copula calibration

The copula maps $z_y$ to $y$ through two steps: $u_y = \Phi(z_y)$ and $y = F_{\ln}^{-1}(u_y)$. Composing:

$$y = F_{\ln}^{-1}(\Phi(z_y)) = \exp\!\bigl(\mu + \sigma\,\Phi^{-1}(\Phi(z_y))\bigr) = \exp(\mu + \sigma\, z_y)$$

Since $f(z) = \exp(\mu + \sigma\, z)$ is strictly increasing for $\sigma > 0$:

$$z_y[j_1] < z_y[j_2] \iff y[j_1] < y[j_2]$$

Therefore $\operatorname{rank}(y) = \operatorname{rank}(z_y)$, and $\rho_s(x,\, y) = \rho_s(x,\, z_y)$. Calibration can skip the $\Phi$ and log-normal quantile steps entirely, computing Spearman directly from $z_y$. This is the basis for the copula calibration fast path, analogous to the nonparametric skip-y identity above. The fast path stores only $(z_x,\, \text{noise})$ — both rho-independent — and at each bisection step computes $z_y = \rho_p z_x + \sqrt{1-\rho_p^2}\,\text{noise}$ followed by rank(z_y).

Clipping caveat. The production code clips $u_y$ to $[10^{-8},\, 1-10^{-8}]$ inside _lognormal_quantile, making the composed transform non-decreasing but flat outside $\Phi^{-1}(10^{-8}) \approx \pm 5.61$. Two $z_y$ values both clipped to the same boundary would receive the same $y$, creating a tie that breaks the identity. The per-draw probability of exceeding the clip boundary is $\Phi(-5.61) \approx 1.0 \times 10^{-8}$; the probability of two or more draws being clipped to the same boundary is bounded by

$$2\binom{n}{2}\bigl(\Phi(-5.61)\bigr)^2 < 2 \cdot \binom{80}{2} \cdot (10^{-8})^2 \approx 6.3 \times 10^{-13}$$

so the identity holds for all practical purposes. The fast path bypasses the clipping entirely (Spearman is computed from $z_y$ directly), so this caveat applies only when asking whether the slow and fast paths agree — and they do, with the same probability.

Exactness of skip-y identities

Identity	Probability of failure
Nonparametric: `rank(y_final) = rank(mixed)`	≈ 3.5 × 10⁻¹³ (float64 lognormal tie)
Copula: `rank(y) = rank(z_y)`	< 6.3 × 10⁻¹³ (clipping creates tie)
Batch vs sequential RNG	N/A — same distribution, different numbers

Skip-y for data generation (power simulation and CI bootstrap)

The skip-y identities above were originally developed for calibration. The same identities extend to the actual data generation step: in the default batch execution path, all three non-empirical generators are called with _return_ranks=True in both estimate_power (power simulation) and bootstrap_ci_averaged with batch bootstrap (CI). The empirical generator is unchanged.

Why every downstream consumer is correct. All uses of y_all in both paths depend only on the ranks of y, not on the scale of y values. The key property is that rank(y) is an order-isomorphism: for any two positions $i, j$,

$$y[i] < y[j] \iff \mathrm{rank}(y)[i] < \mathrm{rank}(y)[j]$$

and all values in the rank array are distinct (continuous y ⇒ no ties). This makes the rank array interchangeable with the float array for any function that operates via ranks.

spearman_rho_2d(x_all, y_all) and spearman_rho_pvalue_2d(...) — both call _rank_rows(y) internally. Ranking a tie-free rank array is idempotent: rank({1,2,...,n} permuted) = {1,2,...,n} permuted (each value already equals its own rank). Identical rho values.
pvalues_from_precomputed_null(rhos_obs, ...) — depends only on rhos_obs, not directly on y_all.
Permutation p-values (pvalues_mc) — see below.
Batch bootstrap (_batch_bootstrap_rhos_jit) — see below.

Permutation path

Each permutation $\sigma$ in pvalues_mc is a bijection — perm_idx[b, rep, :] is generated by np.argsort(rng.random(...)), which always produces a permutation of $\{0, \ldots, n-1\}$ with no repeats. So the permuted-y row contains all $n$ distinct values (from float y) or all $n$ distinct ranks.

Claim. For any bijection $\sigma$: _tied_rank(y_float[σ]) = _tied_rank(y_ranks[σ]).

Proof. Let $r = \mathrm{rank}(y)$. Since $\sigma$ is a bijection, neither y_float[σ] nor y_ranks[σ] contains any duplicates. For any two positions $i, j$:

$$y_{\mathrm{float}}[\sigma(i)] < y_{\mathrm{float}}[\sigma(j)] \iff r[\sigma(i)] < r[\sigma(j)] \iff y_{\mathrm{ranks}}[\sigma(i)] < y_{\mathrm{ranks}}[\sigma(j)]$$

Identical strict ordering, no ties in either array → _tied_rank produces the same output. $\square$

Bootstrap subtlety

The bootstrap step in _batch_bootstrap_rhos_jit draws with replacement, so a bootstrap resample $B = (b_1, \ldots, b_n)$ may contain duplicate indices. This means a bootstrap row can contain repeated values — creating ties that did not exist in the original y. The key is that the tie structure is identical for float and rank representations.

Claim. For any bootstrap resample $B$ (with or without repeated indices): _tied_rank(y_float[B]) = _tied_rank(y_ranks[B]).

Proof. Let $r = \mathrm{rank}(y)$.

Tie structure. Two positions $i, j$ in the resample are tied if and only if $b_i = b_j$:

Float: $y[b_i] = y[b_j] \iff b_i = b_j$ (y has no original ties, so equal values imply same source index).
Ranks: $r[b_i] = r[b_j] \iff b_i = b_j$ (r is a permutation of distinct integers).

Both arrays have identical tie patterns: a tie occurs precisely when the same original index is drawn more than once.

Ordering of non-tied positions. For $b_i \neq b_j$:

$$y[b_i] < y[b_j] \iff r[b_i] < r[b_j]$$

since rank is strictly order-preserving. Therefore both resampled arrays have the same strict ordering and the same tie structure, so _tied_rank (which uses average-rank tie-breaking) assigns identical values. $\square$

Intuition. A bootstrap resample is a multiset of y values. Whether we represent each element as its float value or as its rank, the relative order and equality pattern within the multiset are preserved. _tied_rank depends only on these, so it cannot distinguish the two representations.

RNG note (nonparametric only). With _return_ranks=True, the nonparametric generator skips rng.lognormal(...), consuming $n_{\mathrm{sims}} \times n$ fewer values from the shared RNG in estimate_power. In the CI path this is harmless because data and bootstrap use separate RNG streams (SeedSequence.spawn(2)). In the power simulation path, the p-value step (permutation indices or null binary search) draws from a different RNG state than the float-y path would — but any valid random permutations give an unbiased permutation test. Results are statistically equivalent, not bit-identical, for the nonparametric generator. Copula and linear have no RNG divergence (the skipped operations are deterministic transforms).

Regression tests. tests/test_skip_y_ranks.py Steps 1–6 verify: rank identity (rank(y_float) = y_ranks), Spearman rho identity, batch bootstrap rho matrix identity (bit-exact), rho=0 edge case, all-distinct x, and permutation MC p-value identity (pvalues_mc bit-exact for float vs rank y). The last test directly covers the permutation subtlety above.

Linear Monte Carlo

Model: $\log(y) = \mu_{\ln} + b \cdot x_{\mathrm{std}} + \sigma_{\mathrm{noise}} \cdot Z$, with $x_{\mathrm{std}} = (x - \mathrm{mean}(x))/\mathrm{std}(x)$, and $(\mu_{\ln}, \sigma_{\ln})$ from fitting log-normal to median/IQR. See data_generator.generate_y_linear. (In the default batch path, exp() is never applied: _return_ranks=True returns rank(log_y) directly.)

Target: $\rho_p = 2\sin(\pi \rho_s/6)$; set $b = \rho_p \cdot \sigma_{\ln}$ and noise variance $\sigma_{\ln}^2(1-\rho_p^2)$ so that theoretical Pearson($x_{\mathrm{std}}$, log y) = $\rho_p$.

Caveat: Spearman(x, y) is only approximate to $\rho_s$ because Spearman uses ranks of y, which are a nonlinear transform of log(y); no calibration step. Useful as a parametric complement to the non-parametric method.

Empirical (Karwowski digitized data)

Uses the same rank-mixing core as nonparametric (standardised midranks of x, mix with noise ranks, reorder); y-values come from resampling the digitized Karwowski pool (no log-normal). The 71 digitized values (B-Al and H-Al each have 71) appear exactly once in every simulated dataset; only the remainder (2 for n=73, 10 for n=81, plus outliers for 80/82) is resampled per sim/rep. Pool structure: 73 ⊂ 80 (the first 73 elements of an n=80 pool equal the n=73 pool) and 81 ⊂ 82. Sample sizes:

n=73 → B-Al, outliers excluded (71 digitized + 2 resampled)
n=80 → B-Al, full sample (73 clean + 7 outliers)
n=81 → H-Al, outliers excluded (71 digitized + 10 resampled)
n=82 → H-Al, full sample (81 clean + 1 outlier)

Requires data/digitized.py (see Digitized data below). Use --skip-empirical to exclude this method when digitized data is unavailable. Calibration uses the same multipoint/single-point logic as nonparametric, with a separate cache so empirical and nonparametric do not share calibration state.

For p-value and calibration decisions (new y per sim, single reference pool, per-dataset permutation for empirical) and the alternative design for future generalization, see Empirical generator: p-value and calibration below.

Empirical generator: p-value and calibration

What we do (current design):

Pool and y per sim/rep: The empirical generator uses a pool of y-values (e.g. 71 digitized + 2 or 10 “fill” for n=73 or 81; or 71 fixed + remainder resampled per sim if the “fixed 71” design is applied). Each n_sim or n_rep produces a new y dataset: either by resampling n values from the pool with replacement (current), or by using exactly the 71 digitized values plus a new random draw for the remainder (fixed-71 design). We do not reuse the same y across sims; we intentionally generate new y-data each sim/rep.
Why new y each time is correct: We are averaging over many (x, y) realizations. The pool (or the 71) is fixed; the draws from it (or the remainder) are random. So power and CI estimates average over the randomness in y, giving unbiased expectations and valid assessment of variability. This is the same principle as other generators (nonparametric, copula, linear), where y is newly generated every sim; here the “population” is the finite pool (or 71 + random remainder) instead of a parametric distribution.
Calibration: Calibration for empirical runs once per scenario (cached by n, n_distinct, distribution_type, etc.). It uses a single reference pool from get_pool(n) (fixed seed), not the per-sim pool. So the same cal_rho is used for all n_sims/n_reps in that scenario. Any small mismatch between that reference pool’s tie structure and a given sim’s pool (e.g. different 2 or 10 fill values) is averaged over by the randomness of the different y-data across sims: over many n_sims we average over (a) different x, (b) different y-pools/realizations, and (c) rank-mix assignment. The Monte Carlo mean remains close to the target; per-rep variation contributes to the variance we are estimating. So one reference pool and one cal_rho per scenario is a reasonable and consistent setup.
P-value: By default (EMPIRICAL_USE_PRECOMPUTED_NULL = True), the empirical generator uses the same precomputed null as other generators (keyed on x tie structure, built from all-distinct y-ranks 1..n). Empirical y always has ties: the pool is 71 digitized values with repeated measurements, resampled into n = 73–81, so by pigeonhole and pre-existing duplicates, tied y is guaranteed in every realization. The approximation error is nevertheless negligible (< 1.4 × 10⁻⁵ on p-values; see Precomputed null approximation). Set EMPIRICAL_USE_PRECOMPUTED_NULL = False to revert to per-dataset Monte Carlo permutation (exact, much slower). The MC path uses adaptive n_perm (1k when n_sims ≥ 5000, else 2k), batched over all (sim, perm) pairs via Numba.

Why this is legitimate:

New y per sim: Intentional and correct for operating characteristics (power, CI). We are not reusing the same dataset; we are drawing new (x, y) each time and averaging. No analogue to the earlier n_boot bug (where data depended on n_boot); here data and bootstrap/RNG streams are separated as intended.
Single reference pool for calibration: Cost and simplicity (one calibration per scenario). The approximation (one cal_rho for all sims) is valid because any bias from pool-to-pool variation is averaged out over n_sims.

Alternative: “generate new y datasets each time” as population sampling (future generalization):

A different design would be to treat the pool as the population and, each sim, sample n values with replacement from that pool (current “resampling from whole pool” is already in this spirit). That answers: “If the pool were the true population and we ran many experiments, each drawing n from it, what would we see?” — i.e. what might happen if multiple such experiments were run, with the 71 not special but part of the population (so in a given simulated sample, the 71 can appear 0, 1, 2, … times).

Contrast with current (fixed-71) target: Our target is “For this Karwowski study (these 71), what would power/CI look like under uncertainty in the missing/extra values?” So the 71 are fixed in every sim; only the remainder varies. That is a conditional question given this study’s digitized values.
Generalization: The “pool as population, new sample of n each time” approach generalizes naturally to other scenarios: different studies, different n, or different pools. Scripts could be extended to accept an arbitrary pool (or parametric model) and simulate “many replications from a representative population.” This is documented as a potential future generalization so readers understand the design choice (conditional on this study vs. unconditional over replications) and how one might extend the code.

Asymptotic

Closed-form expressions (no simulation). Power uses the non-central t-distribution with df = $n-2$ and Fieller-Hartley-Pearson tie correction for x only. CIs use the Fisher z-transform with Bonett-Wright SE = $\sqrt{1.06/(n-3)}$; ties handled via FHP for x only (see Tie correction). Fast and stable, serves as a cross-check against the Monte Carlo methods.

P-value method (power simulation)

Power simulation uses permutation-based p-values instead of the t-approximation, because the t-approximation is unreliable with heavy ties. Two paths:

Non-empirical generators (nonparametric, copula, linear): A precomputed null distribution of Spearman rhos is built once per (n, tie structure) and cached. P-values are computed by binary search against sorted $|\rho_{\mathrm{null}}|$. On first use for a scenario (~1s for $n \approx 80$, 50k null samples), the null is built and cached; subsequent calls reuse it — including across different generators for the same scenario. This is exact (not an approximation): all non-empirical generators produce continuous y with no ties, so the permutation null depends only on n and the x tie structure, not the generator. Optional: call permutation_pvalue.warm_precomputed_null_cache() to pre-build nulls for all scenarios before long runs.
Empirical generator (default: precomputed null): By default, the empirical generator uses the same precomputed null as other generators (config.EMPIRICAL_USE_PRECOMPUTED_NULL = True). The approximation error from ignoring y-ties is < 1.4 × 10⁻⁵ on p-values — negligible for all precision targets (see Precomputed null approximation for empirical generator below). Set config.EMPIRICAL_USE_PRECOMPUTED_NULL = False to revert to per-dataset Monte Carlo permutation p-values (exact but much slower).

Set config.USE_PERMUTATION_PVALUE = False to revert to the t-based p-value for power (e.g. for comparison or tests). For the full equations (precomputed null construction, binary-search p-value, per-dataset MC, fallback t-approximation) and the MC-on-cache-miss option, see Statistical methods below.

MC-on-cache-miss: When config.PVALUE_MC_ON_CACHE_MISS = True, the code does not build the precomputed null on a cache miss. It looks up the cache only; if the null is present it is used, otherwise the per-dataset Monte Carlo permutation path is used (no build, no cache write). Default is False (always build and cache on first use). Use this to avoid precomputed-null build and memory for custom frequencies or one-off scenarios.

Statistical methods

Hypothesis and test

H₀: ρ = 0 vs H₁: ρ ≠ 0, two-sided, α = 0.05.
The same test statistic is used in both asymptotic power and in the Monte Carlo power loop (see below).

P-value methods for Monte Carlo power

Default: permutation-based p-values (config.USE_PERMUTATION_PVALUE = True). Two paths:

Precomputed null (non-empirical generators: nonparametric, copula, linear): Build once per (n, tie structure): generate n_pre (default 50,000) random permutations of y-ranks (1…n) while keeping x-midranks fixed; compute Spearman $\rho$ for each → sort $|\rho_{\mathrm{null}}|$. P-value for observed $|\rho_{\mathrm{obs}}|$: $p = (1 + N) / (1 + n_{\mathrm{pre}})$, where N is the number of null $\rho$ with $|\rho_{\mathrm{null}}| \geq |\rho_{\mathrm{obs}}|$, using binary search against the sorted array. Cached and reused across all sims in a scenario. See permutation_pvalue.get_precomputed_null and pvalues_from_precomputed_null.
- Why the null is shared across generators (exact, not an approximation): All three non-empirical generators produce continuous y, so y has no ties and its ranks are a uniform random permutation of {1,…,n} under H₀, regardless of which generator was used. The permutation null distribution of Spearman ρ therefore depends only on n and the x tie structure — not on the generator. The in-memory cache key is (n, all_distinct, tuple(x_counts), n_pre); the generator is intentionally excluded, and the same cached null is correctly reused across nonparametric, copula, and linear for the same scenario. The empirical generator also uses this null by default (EMPIRICAL_USE_PRECOMPUTED_NULL = True), with a negligible approximation from y-ties (< 1.4 × 10⁻⁵; see Precomputed null approximation).
- MC on cache miss: When config.PVALUE_MC_ON_CACHE_MISS = True, the code does not build the null on a cache miss. It uses a cache lookup only; if the null is in cache it is used, otherwise the per-dataset MC path is used (no build, no cache write). Default is False.
Per-dataset Monte Carlo (empirical generator when EMPIRICAL_USE_PRECOMPUTED_NULL = False; also used as fallback when PVALUE_MC_ON_CACHE_MISS = True and null is not cached): For each sim, generate n_perm permutations of y (adaptive: 1k when n_sims ≥ 5000, else 2k), compute Spearman $\rho$ for each, and $p = (1 + N) / (1 + n_{\mathrm{perm}})$, where N is the number of permutation $\rho$ with $|\rho_{\mathrm{perm}}| \geq |\rho_{\mathrm{obs}}|$. Batched over all (sim, perm) pairs via Numba; chunked when n_sims is large. See permutation_pvalue.pvalues_mc.

Power = proportion of sims with p < α.

Fallback: t-approximation p-value (config.USE_PERMUTATION_PVALUE = False):

Equation: $t = \rho\sqrt{(n-2)/(1-\rho^2)}$ with df = $n-2$; two-sided p-value = $2 \times P(T \geq |t|)$ for $T \sim t_{n-2}$. See spearman_helpers.spearman_rho_pvalue_2d.
Where also used: The asymptotic power derivation uses the same t-statistic to define the noncentrality parameter (below).
Caveat: The t-approximation is unreliable with heavy ties in x (few distinct values); the null distribution of $\rho$ is not well approximated by the t-distribution in this setting, and the approximation can be anticonservative (p-values too small, type I error inflated). This is why permutation is the default. The t-approximation remains available for comparison or backward compatibility.

Asymptotic power (non-central t)

Noncentrality (all-distinct):

$$\mathrm{nc} = \frac{\rho_{\mathrm{true}} \sqrt{n-2}}{\sqrt{1 - \rho_{\mathrm{true}}^2}}$$

Power = $P(|T| > t_{\mathrm{crit}})$ for $T \sim$ noncentral $t$(df = $n-2$, nc).

Tie adjustment (FHP): When ties are present, nc is scaled by

$$\sqrt{\mathrm{var}_0^{\mathrm{no-ties}} / \mathrm{var}_0^{\mathrm{ties}}}$$

(see FHP below).

Removed factor (historical note): A previous AI-generated version of this code multiplied nc by an additional $\sqrt{1/1.06}$ in the tie-correction branch, attempting to account for the Bonett-Wright efficiency loss (the 1.06 factor used in the CI formula). No published source justifies applying this factor to the noncentrality parameter; the 1.06 is specific to the Fisher z standard error for confidence intervals and does not belong in the power formula. The factor has been removed. The Bonett-Wright 1.06 remains in asymptotic_ci where it is well-established.

Asymptotic confidence interval

$z = \mathrm{arctanh}(\rho)$; SE in z-space = $\sqrt{1.06/(n-3)}$ (Bonett–Wright). With ties: $\mathrm{SE}_z$ scaled by

$$\sqrt{\mathrm{var}_0^{\mathrm{ties}} / \mathrm{var}_0^{\mathrm{no-ties}}}$$

. CI in z-space then back-transformed via tanh.

Fieller–Hartley–Pearson (FHP) variance under H₀

Tx = Σ(tᵢ³ − tᵢ) over x tie groups (similarly Ty for y); Dx = (n³−n−Tx)/12, Dy = (n³−n−Ty)/12; Var(ρ) = (1/(n−1)) × (n³−n)²/(144·Dx·Dy). No ties ⇒ 1/(n−1).

Tie correction is currently for x only in asymptotic expressions: In this codebase, the asymptotic power and CI use FHP for x only. Only x_counts is passed to the variance calculation; y_counts is not used (y is effectively treated as having no ties for the asymptotic expressions). So the implemented var₀ uses Dx from x tie groups and Dy = (n³−n)/12 (no y ties). This matches the design: x is cumulative vaccine aluminum (heavy ties); y is aluminum level (typically few or no ties in our generators). The code does have implemented some options to include ties for y also.

Ranks

Average-rank tie handling (midranks) throughout; same as scipy.stats.rankdata(..., method="average").

Bootstrap RNG

Bootstrap uses separate RNG streams for data and bootstrap so the same seed yields the same n_reps datasets regardless of n_boot; see Separate RNG streams for data and bootstrap under Bootstrap CI.

Digitized data (for empirical generator)

The empirical generator uses digitized H-Al and B-Al values from the Karwowski scatterplot: the 71 digitized values are fixed in every dataset; only the remainder is resampled per sim/rep (see Empirical generator above). The values in data/digitized.py were obtained using WebPlotDigitizer v4. To use the empirical generator yourself, create data/digitized.py with arrays H_AL71, B_AL71, H_AL_OUTLIER, B_AL_OUTLIER_MIN, B_AL_OUTLIER_MAX (see the module docstring in data_generator.py for details). If data/digitized.py is missing or fails to import, a warning is emitted: when only empirical was selected it falls back to nonparametric; when multiple generators were selected empirical is skipped. Use --skip-empirical to avoid the warning when you do not have digitized data.

Installation

pip install -r requirements.txt

Required packages: numpy, scipy, pandas, joblib. Numba (recommended) is included in requirements.txt for significant speedups. If Numba cannot be installed on your platform, the code falls back to pure NumPy automatically.

Numba warm-up (optional, recommended)

On the first run after installation, Numba compiles JIT functions (5-15 seconds). To pre-warm the cache:

python scripts/warm_up_numba.py

To copy the cache to a cloud VM for zero compile delay:

rsync -avz ~/.cache/numba/ user@YOUR-IP:~/.cache/numba/

Usage

Programmatic usage (no command line)

All scripts expose a main() function that can be called from Python:

# Full simulation
from scripts.run_simulation import main as run_sim
power_df, ci_df, all_distinct_df = run_sim(n_sims=500, skip_linear=True)

# Single scenario
from scripts.run_single_scenario import main as run_single
result = run_single(case=3, n_distinct=4, dist_type="heavy_center", n_sims=500)
result = run_single(case=3, freq=[19, 18, 18, 18], power_only=True, verbose=False)

# Accuracy testing
from tests.test_calibration_accuracy import main as test_accuracy
df = test_accuracy(n_sims=50, cases=[3], custom_freq=[(3, [19, 18, 18, 18])])

Full simulation (all methods, all scenarios)

python scripts/run_simulation.py

Quick test run

python scripts/run_simulation.py --n-sims 500 --seed 42

Disable Numba (force pure NumPy fallback)

python scripts/run_simulation.py --n-sims 500 --seed 42 --no-numba
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500 --no-numba

Skip specific methods

python scripts/run_simulation.py --skip-linear --skip-copula   # nonparametric + empirical + asymptotic
python scripts/run_simulation.py --skip-nonparametric           # copula + linear + empirical + asymptotic
python scripts/run_simulation.py --skip-empirical                # omit empirical (e.g. when digitized data unavailable)

Calibration mode (nonparametric only)

python scripts/run_simulation.py --calibration-mode multipoint   # default: more accurate, ~3× calibration cost
python scripts/run_simulation.py --calibration-mode single       # faster calibration for exploratory runs

Filter scenarios

python scripts/run_simulation.py --cases 1,3 --n-distinct 4,10 --dist-types even,heavy_center

Single-scenario quick script

Run power and/or CI for a specific (case, k, distribution) combination:

# Power + CI for Case 3, k=4, heavy_center
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500

# Power only, skip copula, linear, and empirical
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type even --power-only --skip-copula --skip-linear --skip-empirical

# CI only for all-distinct baseline
python scripts/run_single_scenario.py --case 1 --all-distinct --ci-only --n-reps 20 --n-boot 500

# Custom frequency distribution (counts must sum to case's n)
python scripts/run_single_scenario.py --case 3 --freq 19,18,18,18 --n-sims 500

# Calibration mode (multipoint default, single for faster runs)
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --n-sims 500 --calibration-mode single

Asymptotic tie-correction modes

python scripts/run_simulation.py --tie-correction both          # report both corrected and uncorrected
python scripts/run_simulation.py --tie-correction without_tie_correction

Validation

Test whether generators achieve the target Spearman rho:

# Quick check on worst-case scenario
python tests/test_calibration_accuracy.py --n-sims 50 --case 3 --n-distinct 4

# Full sweep, all generators
python tests/test_calibration_accuracy.py --n-sims 200

# Specific generators only (empirical requires digitized data)
python tests/test_calibration_accuracy.py --generators nonparametric,copula --n-sims 100

python tests/test_calibration_accuracy.py --generators empirical --case 3 --n-sims 100

# Custom frequency distribution (requires --case; counts must sum to case's n)
python tests/test_calibration_accuracy.py --case 3 --freq 19,18,18,18 --n-sims 50

# Save results to CSV
python tests/test_calibration_accuracy.py --n-sims 200 --outfile accuracy_report.csv

# Calibration mode (multipoint default, single for faster runs)
python tests/test_calibration_accuracy.py --n-sims 200 --calibration-mode single

The script flags scenarios where |mean_simulated_rho - target_rho| > 0.01 (configurable via --threshold). With default n_sims=50–200, some flags may be Monte Carlo noise. To get all scenarios to pass, you may need to increase --n-sims to 1000, 5000, or 10000 depending on the tie structure and threshold.

Test suite

Run all core tests (fast, ~1-2 min). Calibration runs without strict by default (reports flags only):

python tests/run_tests.py

Run with calibration strict (exit 1 if any scenario is flagged):

python tests/run_tests.py --strict

Run all tests including slower parallelism benchmarks:

python tests/run_tests.py --full

Add --strict to either command to run the calibration test in strict mode (exit 1 if any scenario is flagged).

Output

Results are saved to the results/ directory:

min_detectable_rho.csv -- minimum detectable rho for all scenarios and methods
confidence_intervals.csv -- bootstrap and asymptotic CIs for observed rho
all_distinct_summary.csv -- combined power + CI table for the 4 all-distinct baselines

Project Structure

Root (core modules):

config.py — Case parameters, frequency dictionary, settings
data_generator.py — X generation (ties) and Y generation (all methods)
power_simulation.py — Unified Monte Carlo power (all generators)
power_asymptotic.py — Asymptotic power and CI formulas
confidence_interval_calculator.py — Bootstrap CIs (averaged over multiple datasets)
table_outputs.py — Summary table construction and CSV export
spearman_helpers.py — Vectorized Spearman and Numba JIT helpers

scripts/ — Entry points and one-off utilities

run_simulation.py — Main orchestrator / CLI entry point
run_single_scenario.py — Quick single-scenario testing script
warm_up_numba.py — Pre-compile Numba JIT cache
estimate_bisection_c.py — Estimate bisection coefficient c for planning n_sims
estimate_calibration_k.py — Estimate calibration coefficient k for planning n_cal
estimate_interrep_sd.py — Estimate inter-rep SD of CI endpoints for planning n_reps

benchmarks/ — Runtime benchmarks (run one at a time)

benchmark_power.py — Power: vectorize vs scalar, calibration modes
benchmark_full_grid.py — CI: full grid, per-rep vs batch bootstrap
benchmark_batch_vs_no_batch.py — CI: batch vs per-rep path (single scenario)
benchmark_ci_bootstrap.py — CI: multipoint vs single calibration

tests/ — Validation and verification

test_calibration_accuracy.py — Calibration accuracy testing
test_spearman_2d.py — Spearman 2D vs reference
test_batch_bootstrap_ci.py — Batch CI bit-identical and timing
test_nested_parallelism.py, test_batch_sequential_vs_parallel.py — Parallelism verification
run_tests.py — Test runner (quick/full, optional --strict for calibration)

Run all commands from the project root (e.g. python scripts/run_simulation.py).

Method Choices and Rationale

Why non-parametric rank-mixing over copula?

The Gaussian copula assumes continuous marginals; with tied x, a small jitter is used to break ties before the rank-to-normal transform, which can attenuate realised Spearman. When x has heavy ties (e.g., k=4 with N=73--82), the jittering step that breaks ties is too small to restore the lost rank information, causing systematic attenuation of 0.01--0.06 in the realised rho. Distributional transform and adaptive jitter alternatives were tested but do not fix this for heavy ties. However, the copula can use multipoint calibration by default (same as nonparametric). Calibration runs on the same data generating process that includes jittering, so it learns the correct input-rho-to-realised-rho mapping and compensates for the bias; with multipoint, the copula achieves target Spearman within tolerance and can be used for power and CI estimation. The non-parametric rank-mixing method remains the recommended default because it operates directly in rank space and avoids the continuous-marginals assumption, but the copula is a valid alternative when multipoint calibration is used.

Why calibration?

With tied x-values, the midrank representation has lower variance than distinct ranks. This means the rank-mixing equation mixed = rho * s_x + sqrt(1-rho^2) * s_noise produces a Spearman rho slightly below the target rho. The calibration step computes a rho-independent attenuation ratio by probing at a fixed rho (0.30) and using bisection over 300 samples. This ratio is then multiplied by any target rho to compensate for the attenuation. The ratio is cached per (n, k, dist_type), so the cost is ~3s per unique tie structure.

If calibration fails for a custom tie structure: Run tests/test_calibration_accuracy.py on the new structure. If mean realised rho deviates from target by >0.01, try increasing n_cal (e.g. 300 → 500 or 1000) or use multipoint calibration (default). Multipoint probes at 0.10, 0.30, 0.50 and interpolates, fixing nonlinear attenuation. Use --calibration-mode single for faster runs when accuracy is less critical. If even rho_input=0.999 cannot reach the probe, the tie structure has hit a structural ceiling (maximum achievable |rho|) and the method cannot reach that target.

Calibration uses a fixed internal seed (99) for reproducibility; the same (n, k, dist_type) always produces the same calibration ratio regardless of the caller's seed.

With the default n_cal=300, calibration accuracy (SE of mean rho) is approximately 0.005–0.006, adequate for the ~0.01 target. The planning formula is $\mathrm{SE}_{\mathrm{cal}} = k/\sqrt{n_{\mathrm{cal}}}$ where k is the SD of a single realised Spearman rho in the calibration process.

Analytical k (Bonett-Wright). Each calibration draw computes one sample Spearman rho from n observations. By the Bonett-Wright SE (the same one used for CIs), the SD of the sample Spearman rho at true $\rho$ is obtained via the delta method on the Fisher z-transform:

$$k = (1 - \rho^2)\,\sqrt{\frac{1.06}{n - 3}}$$

where $\rho$ is the Spearman rho at the calibration probe (default 0.30). For the four study cases at $\rho = 0.30$:

Case	n	k (analytical)	SE at n_cal=300	SE at n_cal=1000
1	80	0.107	0.0062	0.0034
2	82	0.105	0.0061	0.0033
3	73	0.112	0.0065	0.0035
4	81	0.106	0.0061	0.0034

Empirical k (from scripts/estimate_calibration_k.py at n_cal = 300, 500, 1000 with 200 replications for case 3) is approximately 0.10, about 10% below the analytical value. The analytical formula is conservative because the BW asymptotic slightly overestimates variance for the rank-mixing mechanism, especially with ties. The benchmark scripts use k = 0.112 (analytical worst case, n=73) and c = 0.17 (asymptotic value at worst-case ρ*; empirical c for current data converges to ~0.12–0.15) for conservative estimates. Increase n_cal to 500–1000 for tighter accuracy.

Why Bonett-Wright SE (1.06 factor)?

The standard Fisher z-transform SE for Pearson r is $\sqrt{1/(n-3)}$. For Spearman rho, Bonett and Wright (2000) showed that the variance is approximately 6% larger, giving SE = $\sqrt{1.06/(n-3)}$. The theoretical constant is $\pi^2/9 \approx 1.0966$, the reciprocal of the asymptotic relative efficiency (ARE = 9/π²) of Spearman vs Pearson under bivariate normality. Bonett & Wright used 1.06 instead. This is not a rounding of 1.0966 but a simulation-calibrated value that accounts for finite-sample corrections. At practical sample sizes, higher-order terms partially offset the asymptotic efficiency loss, so the effective variance inflation is smaller than π²/9. Using the full 1.0966 would produce slightly conservative (too-wide) CIs, overstating the variance; 1.06 was chosen to achieve closer-to-nominal 95% coverage across a range of n and ρ values. The 1.06 is a compromise value: it falls between the Pearson formula (constant = 1.0, CIs too narrow) and the pure asymptotic ARE (constant = π²/9 ≈ 1.097, CIs slightly too wide), pulled below the asymptotic value by finite-sample corrections. For the CI equation, see Statistical methods.

Why the 1.06 is not applied to asymptotic power: A previous version also applied $\sqrt{1/1.06}$ to the asymptotic power noncentrality parameter; this was removed because (a) no published source applies an efficiency deflation to the nc parameter, (b) the 1.06 was calibrated for CI coverage in z-space, not for the non-central t framework, and (c) even if a deflation were justified, 1.06 is the wrong constant (the theoretical value would be π²/9 ≈ 1.097, not 1.06). See AUDIT.md "Bonett-Wright 1.06 Excluded from Noncentrality Parameter" for the full analysis.

Why non-central t for power?

The Spearman test statistic t = rho * sqrt(n-2) / sqrt(1 - rho^2) follows a non-central t-distribution under the alternative. This is more accurate than the normal approximation (which uses constant H0 variance) because the sqrt(1 - rho^2) denominator captures the variance reduction as |rho| grows toward 1. For the noncentrality equation and tie adjustment, see Statistical methods.

Why Fisher z-transform for CIs?

The arctanh transform stabilises the variance of the correlation coefficient and improves normality of the sampling distribution. The back-transform via tanh guarantees the CI stays within [-1, 1] and is properly asymmetric around the point estimate.

Tie correction (Fieller-Hartley-Pearson)

The Fieller-Hartley-Pearson correction adjusts the variance of Spearman rho under the null hypothesis (H0) when ties are present in the rank data. Ties reduce the effective rank information (e.g., many observations sharing the same x-value), so the sampling variance of rho increases. The correction uses the standard equation: define Tx = Σ(tᵢ³ − tᵢ) over tie groups in x (and Ty for y), Dx = (n³−n − Tx)/12, Dy = (n³−n − Ty)/12; then Var(rho) = (1/(n−1)) × (n³−n)²/(144·Dx·Dy). When there are no ties, Dx = Dy = (n³−n)/12 and this reduces to 1/(n−1).

Where it is used: (1) Asymptotic power — the noncentrality parameter is scaled by

$$\sqrt{\mathrm{var}_0^{\mathrm{no-ties}} / \mathrm{var}_0^{\mathrm{ties}}}$$

, reducing effective power when ties inflate variance. (2) Asymptotic CI — the standard error in z-space is scaled by

$$\sqrt{\mathrm{var}_0^{\mathrm{ties}} / \mathrm{var}_0^{\mathrm{no-ties}}}$$

, widening the interval appropriately. The impact on SE ranges from negligible (k=10, even: +0.5%) to moderate (k=4, heavy_center: +5.7%). Asymptotic power and CI apply FHP for x only: only the x tie structure is used; y is treated as having no ties in the asymptotic formulas (only x_counts is passed; y_counts is not used). See Statistical methods for the full FHP equation and this convention.

Approximation note: The combination of Bonett-Wright (1.06 in SE) with the FHP tie correction is a heuristic. No single published reference validates applying both corrections simultaneously. Each is well-established individually; their product is a practical approximation.

Bootstrap CI

The bootstrap uses paired (x,y) resampling with replacement and reports the percentile interval (2.5th and 97.5th percentiles of the bootstrap distribution). The reported CIs average these endpoints over many simulated datasets (n_reps) to estimate the expected bootstrap CI under the model. Two design choices affect correctness and precision:

Separate RNG streams for data and bootstrap

Data generation and bootstrap resampling use separate RNG streams derived from the same seed via np.random.SeedSequence.spawn(2). A shared RNG would advance by n_boot * n extra draws per rep inside the bootstrap loop, so the datasets for reps 1..n_reps would depend on n_boot. That would make results non-comparable across different n_boot values and invalidate the interpretation that "more bootstraps = more accurate." With separate streams, the same seed always produces the same n_reps datasets regardless of n_boot.

`n_reps`, SE, and reliability of the second decimal

The CI endpoints vary across reps. The inter-rep SD is determined analytically: SD(endpoint) = (1 − endpoint²) × √(1.06/(n−3)), with FHP tie correction for x (up to +5.7% for k=4 heavy_center). The worst case is Case 3 (n=73, upper endpoint near 0): 0.122 no ties, 0.129 with worst-case ties — giving a conservative $\sigma_{\mathrm{rep}} \approx 0.13$. Empirical max from results/confidence_intervals.csv (n_reps=200) is 0.130, confirming the analytical value. To re-estimate for new data: scripts/estimate_interrep_sd.py --analytical --with-ties (instant). The SE of the mean endpoint is $\sigma_{\mathrm{rep}}/\sqrt{n_{\mathrm{reps}}}$. The 95% CI for the true mean is approximately ±1.96×SE.

Target	SE	95% CI half-width	`n_reps`	When it matters
Borderline	0.005	±0.01	≈650	Second decimal can still be off by one unit
Strong (README "reliable")	0.0025	±0.005	≈2600	Useful precision; fine when not near a rounding boundary
Rounding guarantee	0.00128	±0.0025	≈10200	Needed only when the value is near a boundary (e.g. 0.345, 0.355)

Rounding boundaries: The boundary between rounding to 0.34 vs 0.35 is 0.345. With n_reps=2600 (95% CI ±0.005), if you observe 0.345 the CI spans [0.34, 0.35] and crosses the boundary—you cannot confidently round. When the value is not near a boundary (e.g. 0.32, 0.38), n_reps=2600 is adequate. The n_reps≈10200 "rounding guarantee" is only needed when you happen to land near a boundary and want confidence in which way to round.

With n_reps=200, SE ≈ 0.009 (worst case N=73), so the third decimal is uncertain and the second decimal is borderline. The default n_reps=200 is a practical trade-off.

`n_boot` choice (with high `n_reps`)

Exact variance of the mean CI endpoint. Each rep i produces one CI endpoint $L_i$ (e.g. the 2.5th percentile of n_boot bootstrap rhos). Its variance has two independent components: (1) inter-rep variability (different datasets), and (2) bootstrap quantile noise (finite n_boot). For the p-th sample quantile from n_boot i.i.d. draws, the asymptotic variance is $p(1-p)/[n_{\mathrm{boot}}\, f(q_p)^2]$, where $f(q_p)$ is the density of the bootstrap distribution at the quantile. For the 2.5th percentile, $p(1-p) = 0.025 \times 0.975 = 0.024375$. Averaging over n_reps independent reps:

$$\mathrm{Var}(\bar{L}) = \frac{\sigma_{\mathrm{rep}}^2}{n_{\mathrm{reps}}} + \frac{0.024375}{n_{\mathrm{reps}}\, n_{\mathrm{boot}}\, f^2}$$

where $\sigma_{\mathrm{rep}} \approx 0.13$ (inter-rep SD, conservative worst case with ties; see n_reps, SE, and reliability above). The ratio of the bootstrap term to the inter-rep term is $0.024375 / (n_{\mathrm{boot}} \, f^2 \, \sigma_{\mathrm{rep}}^2)$. With $n_{\mathrm{boot}} = 500$ and the simplification $f = 1$, this ratio is $0.024375 / (500 \times 1 \times 0.0169) \approx 0.003$, so the bootstrap term adds ~0.3% to the variance and ~0.15% to SE — negligible. The realistic value $f = \varphi(-1.96)/\sigma_{\mathrm{boot}}$, where $\sigma_{\mathrm{boot}}$ follows the same Bonett-Wright formula as $\sigma_{\mathrm{rep}}$ (evaluated at the observed $\rho$ rather than the CI endpoint), gives $f \approx 0.45$–$0.52$ for the current data. This makes the actual ratio 3.7–4.9× larger (~1.1–1.4% of variance), still negligible. In fact, substituting $\sigma_{\mathrm{rep}}$ for $\sigma_{\mathrm{boot}}$ (a conservative upper bound at the worst case) yields a parameter-free bound: the ratio reduces to $p(1-p) / (n_{\mathrm{boot}} \times \varphi(-1.96)^2) = 1.43\%$ of variance, independent of all data parameters. See UNCERTAINTY_BUDGET.md Part 2, Source 2 for the full derivation. The benchmark scripts therefore use the approximation $\mathrm{SE} \approx \sigma_{\mathrm{rep}} / \sqrt{n_{\mathrm{reps}}}$ (see Equations to compute Precision for the full derivation and numerical comparison).

With n_reps ≥ 2600:

n_boot=200–400 suffices: bootstrap variance adds under 0.5% to total; 2.5th percentile estimated from 5–10 order statistics.
n_boot=500 is comfortable; going higher gives no meaningful improvement.
n_boot=1000 (default in config.py) is more than needed for high n_reps; use 200–500 to save time.

When n_reps=200, n_boot=500 is sufficient; the ~0.001–0.002 difference from n_boot=1000 is swamped by the ~0.009 SE from inter-rep variability.

Verifying `n_boot`

To check whether your chosen n_boot is sufficient, run the same scenario with n_boot=500 and n_boot=2000 (same seed). Compare the printed bootstrap CI endpoints:

Quick check (n_reps=50, ~1–2 min total): If the difference in CI endpoints is under ~0.003, n_boot=500 is fine for 2-decimal precision. Example:

python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --ci-only --n-reps 50 --n-boot 500 --seed 42 --skip-copula --skip-linear
python scripts/run_single_scenario.py --case 3 --n-distinct 4 --dist-type heavy_center --ci-only --n-reps 50 --n-boot 2000 --seed 42 --skip-copula --skip-linear

Thorough check (n_reps=200, ~5–10 min total): Use n_reps=200 for a more stable comparison and to confirm the averaged CI is well converged.

Performance and Runtime Estimates

Optimization summary

Vectorization and parallelization: The original implementation used a scalar loop over bootstrap replicates and power simulations. Several optimizations combine to speed things up:

Vectorized Spearman — argsort-based ranking, O(n log n) per row, replaces per-row scipy calls.
Vectorized data generation — config.VECTORIZE_DATA_GENERATION (default True): power and CI generate all n_sims or n_reps datasets in one batch via generate_*_batch functions instead of looping per dataset.
Batch bootstrap — config.BATCH_CI_BOOTSTRAP (default True): CI generates all n_reps datasets and n_boot resamples in one batch, then computes Spearman across all bootstrap samples in a single Numba call. Requires VECTORIZE_DATA_GENERATION=True. Gives ~2.5× sequential speedup over per-rep bootstrap; see Benchmark data below.
Skip-y data generation — In the default batch path, non-empirical generators (generate_y_nonparametric_batch, generate_y_copula_batch, generate_y_linear_batch) return rank arrays directly (_return_ranks=True) instead of generating float y, skipping lognormal/exp generation entirely. Saves ~15–20% of the data-generation step for both power simulation and CI batch bootstrap. The empirical generator is unchanged. See Skip-y for data generation.
Scenario-level parallelization — joblib (--n-jobs) farms scenarios out across cores. Caveat: With batch bootstrap, parallel (n_jobs=-1) is often slower than sequential (n_jobs=1) due to joblib overhead and nested parallelism with Numba; see Parallelization and Numba caveats below.

These flags live in config.py: VECTORIZE_DATA_GENERATION, BATCH_CI_BOOTSTRAP, CALIBRATION_MODE (multipoint vs single), and USE_PERMUTATION_PVALUE (permutation vs t-based p-value for power).

Calibration: Uses a rho-independent attenuation ratio cached per (n, k, dist_type). The calibration cost is paid once per tie structure, not once per rho value tested during bisection — eliminating the dominant bottleneck of the original implementation (~60× speedup for power estimation).

Other optimizations: _fit_lognormal results are LRU-cached; x-value templates are cached and only shuffled per call. Power simulation uses permutation-based p-values (precomputed null or batched Monte Carlo) when USE_PERMUTATION_PVALUE=True; see P-value method (power simulation).

Memory (Monte Carlo p-value batching)

When using the Monte Carlo p-value path (empirical generator, or any generator with permutation enabled), peak memory is dominated by: (1) permutation indices perm_idx_all of shape (n_perm, n_sims, n) int32 → 4 × n_perm × n_sims × n bytes; (2) permutation rhos (n_sims, n_perm) float64 → 8 × n_sims × n_perm bytes. Data x_all, y_all are smaller by comparison.

Equation: Extra memory ≈ (4n + 8) × n_sims_chunk × n_perm bytes. For n ≈ 80: ≈ 328 × n_sims_chunk × n_perm bytes (e.g. 10k sims × 1k perms at n≈80 ≈ 3.3 GB for a full batch).

Chunking: When n_sims > PVALUE_N_SIMS_BATCH_THRESHOLD (default 5k), sims are processed in chunks of PVALUE_N_SIMS_CHUNK_SIZE (default 2k). Peak memory is then for one chunk: e.g. 2k sims × 1k n_perm × n≈80 → ~660 MB per chunk.

Example table (n ≈ 80):

n_sims	n_perm	Mode	Peak extra memory
500	1k	full	~130 MB
2k	1k	full	~525 MB
5k	1k	full	~1.6 GB
10k	1k	chunked	~660 MB per chunk
50k	1k	chunked	~660 MB per chunk
5k	2k	full	~3.3 GB

Equations to compute Precision

For user-computed errors and planning n_sims / n_cal:

Power estimate (binomial): $\hat{p}$ = proportion of sims with p < α.

$$\mathrm{SE}(\hat{p}) = \sqrt{\hat{p}(1-\hat{p}) / n_{\mathrm{sims}}}$$

For power near 0.80, $\mathrm{SE}(\hat{p}) \approx 0.4 / \sqrt{n_{\mathrm{sims}}}$.

Bisection contribution to SE(min ρ): By the implicit function theorem, bisection finds $\rho^*$ where $\text{power}(\rho^*) = \pi_{\mathrm{target}}$. The SE of the estimated root is:

$$\mathrm{SE}_{\mathrm{bisection}}(\rho^*) = \frac{\sqrt{\pi(1-\pi)/n_{\mathrm{sims}}}}{|\text{power}'(\rho^*)|} = \frac{c}{\sqrt{n_{\mathrm{sims}}}}, \qquad c \equiv \frac{\sqrt{\pi(1-\pi)}}{|\text{power}'(\rho^*)|}$$

This is a standard result in stochastic root-finding (Waeber, Frazier & Henderson, 2013, SIAM J. Control Optim. 51(3)). The coefficient c depends on the slope of the power curve at the 80% crossing. For the asymptotic (no-tie) power function $\text{power}(\rho) = \Phi(\mathrm{ncp}(\rho) - z_{\alpha/2})$ where $\mathrm{ncp} = \rho\sqrt{n-2}/\sqrt{1-\rho^2}$:

n	$\rho^*$ (80% crossing)	slope	c
73	0.316	2.76	0.145
80	0.302	2.86	0.140
82	0.299	2.88	0.139

With ties (which attenuate rho, shifting the crossing to higher $\rho^*$ where the slope is shallower), c is modestly larger. From the actual min detectable rho values (0.28–0.34 across generators and scenarios), c ranges from approximately 0.12 to 0.15 when estimated at converged n_sims (≥5000). The benchmark c = 0.17 is the asymptotic (no-tie) value at worst-case ρ* ≈ 0.33, providing ~20% margin above the empirical maximum (~0.15). At n_sims=2000 the finite-difference slope has high seed-dependent variance (c can range ~0.10–0.18); use n_sims ≥ 5000 in scripts/estimate_bisection_c.py for stable estimates. If you use different data or heavier ties, revisit precision parameters and re-run benchmarks; see Precision when data changes.

Calibration contribution: Calibration uncertainty scales as $1/\sqrt{n_{\mathrm{cal}}}$; propagates to min detectable ρ. $\mathrm{SE}_{\mathrm{cal}}(\rho) = k/\sqrt{n_{\mathrm{cal}}}$ where the coefficient k is the SD of one calibration realised rho. From the Bonett-Wright asymptotic formula (the same one used for CIs, via the delta method on Fisher z):

$$k = (1 - \rho^2)\,\sqrt{\frac{1.06}{n - 3}}$$

For $\rho = 0.30$ (the calibration probe), k ranges from 0.105 (n=82) to 0.112 (n=73) across the four cases; see the table above. Empirical k is ~10% lower (~0.10), making the formula a conservative upper bound. The benchmark scripts use k = 0.112 (analytical worst case, n=73). To estimate k for a new scenario, run scripts/estimate_calibration_k.py --analytical (instant) or without --analytical for the empirical value.

Combined uncertainty (independent):

$$\mathrm{SE}_{\mathrm{total}}(\rho) \approx \sqrt{ \mathrm{SE}_{\mathrm{bisection}}^2 + \mathrm{SE}_{\mathrm{cal}}^2 }$$

95% CI half-width: Half-width ≈ 1.96 × $\mathrm{SE}_{\mathrm{total}}(\rho)$. For target half-width w, aim for $\mathrm{SE}_{\mathrm{total}} \leq w / 1.96$.
Rounding: For the reported value to round safely to a band (e.g. 0.35), the whole 95% CI should lie inside that band (e.g. half-width ≤ 0.005 at the boundary). Wrong-rounding probability at a boundary depends on half-width; overall rounding confidence is typically ~90–95% when half-width is ±0.002 and true values are not at boundaries. Numerical guidance: n_sims ≈ 3k, n_cal ≈ 1k for ±0.01; for rounding safety near 0.35, wrong-rounding ~15–25% at 0.346, overall ~90–95%.
Consolidated error budget: See docs/UNCERTAINTY_BUDGET.md for all power and CI error sources (including precomputed null realization variance, calibration interpolation, permutation noise, and bootstrap quantile noise) in a single reference table, with numerical values for each precision tier and a quick-reference parameter table.

Permutation p-value noise in power estimates

Power is estimated as $\hat{\pi} = (1/n_{\mathrm{sims}}) \sum_i \mathbf{1}(\hat{p}_i < \alpha)$, where each $\hat{p}_i$ is a permutation-based p-value computed from a finite null sample. With an infinite null, the reject decision $\mathbf{1}(p_i < \alpha)$ would be deterministic given the data; with a finite null of size $n_{\mathrm{perm}}$, there is additional randomness.

Exact variance decomposition. For each simulation i, let $p_i$ be the exact (infinite-null) p-value and $\hat{p}_i$ the estimated p-value from $n_{\mathrm{perm}}$ null samples. The variance of $\hat{p}_i$ given the data is approximately $\sigma_{\alpha}^2 = \alpha(1-\alpha)/n_{\mathrm{perm}}$ for sims near the decision boundary. By the law of total variance:

$$\mathrm{Var}(\hat{\pi}) = \frac{1}{n_{\mathrm{sims}}} \Big[\underbrace{\pi(1-\pi)}_{\text{Monte Carlo (data)}} + \underbrace{V_{\mathrm{perm}}}_{\text{permutation noise}}\Big]$$

where the permutation contribution $V_{\mathrm{perm}} = \mathbb{E}[\Phi(z_i)(1 - \Phi(z_i))]$ with $z_i = (\alpha - p_i)/\sigma_\alpha$. Only simulations with $p_i \approx \alpha$ contribute; for these, $\Phi(z_i)(1-\Phi(z_i))$ is non-negligible, while for $p_i$ far from $\alpha$ the term vanishes.

Bounding $V_{\mathrm{perm}}$. For a continuous p-value distribution with density $f_p(\cdot)$ near $\alpha$:

$$V_{\mathrm{perm}} \approx \frac{\sigma_\alpha \cdot f_p(\alpha)}{\sqrt{\pi}}$$

where $\sigma_\alpha = \sqrt{\alpha(1-\alpha)/n_{\mathrm{perm}}}$ and the constant $1/\sqrt{\pi}$ comes from $\int_{-\infty}^{\infty}\Phi(u)(1-\Phi(u))\,du = 1/\sqrt{\pi}$. The worst case is $f_p(\alpha) = 1$ (uniform p-values under the null); under the alternative with power $\approx 0.80$, most p-values are near 0, so $f_p(\alpha) < 1$.

Approximation used by benchmark scripts. Since $V_{\mathrm{perm}} \ll \pi(1-\pi)$, the benchmark scripts use $\mathrm{SE}(\hat{\pi}) \approx \sqrt{\pi(1-\pi)/n_{\mathrm{sims}}}$, ignoring the permutation term.

Numerical comparison ($\alpha = 0.05$, power $\pi = 0.80$, worst-case $f_p(\alpha) = 1$):

P-value path	$n_{\mathrm{perm}}$	$\sigma_\alpha$	$V_{\mathrm{perm}}$	$V / \pi(1-\pi)$	SE inflation
Precomputed null (non-empirical)	50 000	0.00097	5.5 × 10⁻⁴	0.34%	+0.17%
MC (empirical, $n_{\mathrm{sims}} < 5\text{k}$)	2 000	0.00487	2.7 × 10⁻³	1.72%	+0.86%
MC (empirical, $n_{\mathrm{sims}} \geq 5\text{k}$)	1 000	0.00689	3.9 × 10⁻³	2.43%	+1.21%

Even in the worst case (MC path with n_perm=1000, uniform p-values), permutation noise inflates SE by only ~1.2%. For the precomputed null (n_pre=50,000) used by the non-empirical generators, the inflation is ~0.17%. Under realistic conditions ($f_p(\alpha) < 1$), the effect is even smaller. The approximation is therefore safe for all p-value paths.

MC path noise vs precomputed null realization variance — a key distinction. The variance decomposition above (Var(π̂) = [...] / n_sims) applies to the MC path: each sim uses fresh, independent permutations, so the per-sim critical value errors are IID and their contribution to the mean rejection rate averages out as 1/√n_sims. The precomputed null behaves differently: one null is built once and shared by all n_sims — its critical value offset is the same for every sim, fully correlated, and does not reduce with n_sims. The fix for the precomputed null is increasing n_pre, not n_sims. See AUDIT.md "Known Precision Limitations" (PREC-1) for the quantitative analysis and the contrast table.

Why n_perm=1,000 is adequate for all precision tiers (including empirical). The empirical generator uses the precomputed null by default (see Precomputed null approximation). When EMPIRICAL_USE_PRECOMPUTED_NULL = False, it falls back to the MC path. The +1.2% worst-case SE inflation on power propagates through bisection: the SE of min detectable rho is $\mathrm{SE}(\hat{\pi})/|\text{power}'(\rho^*)|$ (see Bisection contribution to SE above), so a +1.2% inflation of $\mathrm{SE}(\hat{\pi})$ becomes a +1.2% inflation of $\mathrm{SE}(\rho^*)$. For the ±0.01 target half-width, that is $0.01 \times 0.012 = 0.00012$ — negligible. Even for the ±0.001 tier it is $0.001 \times 0.012 = 0.0000012$. The adaptive n_perm (1,000 when n_sims ≥ 5,000; 2,000 otherwise) requires no adjustment.

Precomputed null approximation for empirical generator

When config.EMPIRICAL_USE_PRECOMPUTED_NULL = True (default), the empirical generator uses the same precomputed null as other generators. The precomputed null permutes all-distinct y-ranks {1..n}; but empirical y-data always has ties — the pool is 71 digitized values with pre-existing duplicates, resampled into n = 73–81 draws, so ties are guaranteed in every realization by both pigeonhole and the pool's own repeated measurements. This section quantifies the approximation error.

Source of error. The precomputed null computes each null rho as:

$$\rho_{\mathrm{null}} = \frac{\sum_i (x_i - \bar{x})(\pi(y)_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2} \cdot \sqrt{\sum_i (\pi(y)_i - \bar{y})^2}}$$

where $\pi(y)$ are permuted ranks from {1..n} (all distinct). The observed rho from spearman_rho_2d uses midranks of the actual y-data (with ties), which have $\mathrm{std}(y_{\mathrm{ranks}}) < \mathrm{std}(\{1 \ldots n\})$. The mismatch inflates $|\rho_{\mathrm{obs}}|$ relative to the null by the ratio:

$$\frac{\sigma_{y,\mathrm{untied}}}{\sigma_{y,\mathrm{tied}}} = \sqrt{\frac{(n^3-n)/12}{(n^3-n)/12 - \Sigma_y/12}} \approx 1 + \frac{\Sigma_y}{2(n^3-n)}$$

where $\Sigma_y = \sum_j (t_j^3 - t_j)$ summed over y tie groups.

Tie counts in digitized data (data/digitized.py):

H_AL71: 4 tie pairs — $\Sigma = 4 \times 6 = 24$
B_AL71: 1 triple + 6 pairs — $\Sigma = 24 + 36 = 60$

Pool totals after resampling (base-71 ties + expected contribution from build_empirical_pool resampled fill):

n=73 (B-Al, 2 resampled from B_AL71): $\Sigma_y \approx 79$
n=80 (B-Al + 7 outliers): $\Sigma_y \approx 79$ (outliers are continuous)
n=81 (H-Al, 10 resampled from H_AL71): $\Sigma_y \approx 98$
n=82 (H-Al + 1 outlier): $\Sigma_y \approx 98$

Denominator inflation (worst cases):

Case	n	$n^3-n$	$\Sigma_y$	Inflation	$\Delta\rho$ at crit
n=73	73	388,944	~79	1 + 1.0×10⁻⁴	~2.3×10⁻⁵
n=81	81	531,360	~98	1 + 9.2×10⁻⁵	~2.0×10⁻⁵

Impact on p-values. The CDF shift from the kurtosis mismatch (Edgeworth expansion) plus the denominator inflation gives total $\Delta p < 1.4 \times 10^{-5}$. Direction: anticonservative (p slightly too small).

Impact on power. The p-value bias shifts the rejection rate by $\Delta\text{power} \approx 10^{-5}$. This is a systematic bias, not random noise — it is not reduced by increasing n_sims. However, it is negligible for all precision targets:

Target	Ratio (bias / target)	Verdict
±0.01	10⁻³ (0.1%)	Safe
±0.002	5×10⁻³ (0.5%)	Safe
±0.001	10⁻² (1%)	Safe

Variance invariance. Under permutation, the first two moments of Spearman rho are exactly invariant to tie structure: $\mathrm{E}_\pi[\rho] = 0$ and $\mathrm{Var}_\pi(\rho) = 1/(n-1)$, regardless of ties in x or y. The error arises only from higher moments (kurtosis) and from the denominator normalization mismatch between observed and null rho — both of order $\Sigma_y / n^3$.

Fallback. Set config.EMPIRICAL_USE_PRECOMPUTED_NULL = False to use per-dataset Monte Carlo permutation p-values (exact, no approximation, much slower). The MC path is documented in Per-dataset Monte Carlo above.

Verification. Run benchmarks/verify_precomputed_null_empirical.py to empirically confirm the approximation is safe — i.e. deltas between precomputed and MC are bounded well below all precision targets for all four cases. The script does not isolate the 10⁻⁵ y-tie error; that rests on the analytic derivation above. The dominant noise source in the script output is precomputed null realization variance (~SD 0.001 at n_pre=50k), which is the same for all generators using the same null and is not introduced by OPT-1. See AUDIT.md "Known Precision Limitations" for the realization-variance analysis and the n_pre recommendation for ±0.001.

CI endpoint precision

The reported CI endpoint is the mean of n_reps independent bootstrap-quantile estimates. Each rep i produces $L_i$ (e.g. the 2.5th percentile of n_boot bootstrap rho values). Its variance decomposes into two independent components:

$$\mathrm{Var}(\bar{L}) = \underbrace{\frac{\sigma_{\mathrm{rep}}^2}{n_{\mathrm{reps}}}}_{\text{inter-rep}} + \underbrace{\frac{p(1-p)}{n_{\mathrm{reps}}\, n_{\mathrm{boot}}\, f(q_p)^2}}_{\text{bootstrap quantile noise}}$$

where $\sigma_{\mathrm{rep}} \approx 0.13$ is the inter-rep SD (analytical worst case with ties, N = 73; see n_reps, SE, and reliability above), $p = 0.025$ for the 2.5th percentile ($p(1-p) = 0.024375$), and $f(q_p)$ is the density of the bootstrap distribution at the quantile. The exact SE and 95% CI half-width are:

$$\mathrm{SE}_{\mathrm{exact}} = \sqrt{\frac{\sigma_{\mathrm{rep}}^2}{n_{\mathrm{reps}}} + \frac{p(1-p)}{n_{\mathrm{reps}}\, n_{\mathrm{boot}}\, f^2}}, \qquad \text{Half-width} = 1.96 \times \mathrm{SE}_{\mathrm{exact}}$$

Approximation used by benchmark scripts. Since the bootstrap term is negligible for $n_{\mathrm{boot}} \geq 500$ (see below), the benchmark scripts drop it:

$$\mathrm{SE}_{\mathrm{approx}} = \frac{\sigma_{\mathrm{rep}}}{\sqrt{n_{\mathrm{reps}}}}, \qquad n_{\mathrm{reps}} \approx \left(\frac{1.96 \times \sigma_{\mathrm{rep}}}{w}\right)^2$$

where w is the target 95% CI half-width.

Numerical comparison ($\sigma_{\mathrm{rep}} = 0.13$, $n_{\mathrm{boot}} = 500$, $f = 1$ simplification):

Target half-width	`n_reps`	SE (approx)	SE (exact)	Ratio	ΔHW
±0.01	649	0.005103	0.005109	1.0012	1.2 × 10⁻⁵
±0.002	16 231	0.001020	0.001022	1.0015	3.9 × 10⁻⁶
±0.001	64 923	0.000510	0.000511	1.0014	1.4 × 10⁻⁶

The bootstrap term inflates SE by only ~0.15% across all tiers with $f = 1$, and at most ~0.54% with the realistic $f \approx 0.53$ — well below other sources of uncertainty. The approximation is therefore safe regardless of the true $f$.

Benchmark data (CI bootstrap)

Two CI bootstrap modes are available. Per-rep bootstrap (BATCH_CI_BOOTSTRAP=False): for each of n_reps datasets, many calls to _bootstrap_rhos_jit (Numba). Batch bootstrap (BATCH_CI_BOOTSTRAP=True, default): one call per scenario to _batch_bootstrap_rhos_jit (Numba), processing all n_reps × n_boot tasks in one batch. Both use vectorized data generation; batch is bit-identical to per-rep for the same data and bootstrap indices.

Timings below are averages across multiple runs (earlier range data and a single measured run) to reduce variance. Conditions: n_reps=200, n_boot=1000, single-point calibration, nonparametric generator, 4-logical-core Windows machine (2 physical cores with hyperthreading). Run benchmarks one at a time to avoid CPU contention.

Single scenario (e.g. Case 3, k=4, even):

Config	Time	Speedup vs per-rep
Per-rep bootstrap	~4.5s	—
Batch bootstrap	~1.2s	~3.8×

Full grid (88 scenarios), averaged times:

Config	Time
Per-rep bootstrap, sequential (n_jobs=1)	~6 min (~364s)
Batch bootstrap, sequential (n_jobs=1)	~2.4 min (~142s)
Per-rep bootstrap, parallel (n_jobs=-1)	~3.9 min (~235s)
Batch bootstrap, parallel (n_jobs=-1)	~4 min (~243s) — often slower than batch sequential

Recommendation: For batch bootstrap, use n_jobs=1 (sequential); it is the fastest (~2.4 min full grid). For per-rep bootstrap, parallel (n_jobs=-1) does help (~3.9 min vs ~6 min sequential).

Parallelization and Numba caveats

Machine note: Benchmarks were taken on a machine with 4 logical cores (2 physical cores with hyperthreading). On such a machine, n_jobs=-1 uses 4 joblib workers.
Batch bootstrap + joblib: With batch bootstrap, parallel is often slower than sequential. Reasons: (1) Joblib overhead — process spawn, serialization, and scheduling cost is a large fraction of the per-scenario time (~1.8s) for batch, so parallel adds net cost. (2) Nested parallelism — joblib spawns N workers; each worker runs Numba’s prange (default 4 threads), so 4×4 = 16 threads on 4 cores cause oversubscription. Most of the “parallel slower than sequential” gap (~89%) is joblib; ~11% is from Numba thread contention.
Per-rep bootstrap + joblib: Per-rep has many short Numba bursts per scenario, so joblib’s overhead is better amortized; parallel typically beats sequential for per-rep bootstrap.
Memory: Batch bootstrap generates bootstrap indices in chunks of 2000 n_reps at a time (configurable via _BATCH_BOOTSTRAP_CHUNK_SIZE). Peak memory per chunk is ~328 MB (n_boot=500 × 2000 reps × n≈82 × 4 bytes), bounded regardless of total n_reps. The dominant allocations at large n_reps are x_all/y_all of shape (n_reps, n) float64. With n_jobs=-1, multiple workers hold their own copies of these arrays; use n_jobs=1 for batch bootstrap to avoid multiplication by worker count.
Validation: Batch and per-rep bootstrap are bit-identical for the same data and bootstrap indices. See tests/test_batch_bootstrap_ci.py, tests/test_nested_parallelism.py, and tests/test_batch_sequential_vs_parallel.py (and docs/BENCHMARKING_FINDINGS.md) to reproduce.

Typical runtimes (nonparametric generator, 4-logical-core machine)

Without vectorized data generation or batching (parallelization and vectorized Spearman only):

Task	Approximate time
Full grid CI (88 scenarios), n_jobs=1, 200 reps × 1000 boot	~16 min
Full grid CI (88 scenarios), n_jobs=4, 200 reps × 1000 boot	~8 min

With this setup, joblib at n_jobs=4 gives ~2× on the full grid (16 min → 8 min) and ~3.5× on single-scenario CI (200×1000): ~42s sequential → ~12s parallel. Combined with vectorized Spearman, the full CI grid is ~7× faster than the original sequential code.

With vectorized data generation and batch bootstrap (default) and Numba (recommended). Sequential is preferred for CI; parallel can be used for power.

Task	Approximate time
Single scenario CI (200 reps × 1000 boot), batch bootstrap, n_jobs=1	~1.2s
Single scenario power (500 sims, n_cal=300)	~5s (includes calibration)
Single scenario power (10,000 sims, n_cal=300)	~45s
Full grid CI (88 scenarios), batch bootstrap, n_jobs=1, 200×1000	~2.4 min
Full grid CI (88 scenarios), per-rep bootstrap, n_jobs=1, 200×1000	~6 min
Full grid CI (88 scenarios), per-rep bootstrap, n_jobs=4, 200×1000	~3.9 min
Full grid power (88 scenarios, 500 sims, n_cal=300), n_jobs=1	~6 min
Full grid power (88 scenarios, 500 sims, n_cal=300), n_jobs=4	~3 min

With Numba JIT (recommended)

Numba JIT compilation speeds up the bootstrap and ranking loops. Pre-warm with python scripts/warm_up_numba.py to avoid first-run compile delay. On a 4-logical-core machine, use n_jobs=1 for CI with batch bootstrap as above; for power or per-rep bootstrap CI, n_jobs=4 or -1 can reduce wall time.

Task	Without Numba	With Numba	Speedup
Single scenario CI (200 reps × 500 boot)	~12s	~3–5s	~3×
Single scenario power (500 sims, n_cal=300)	~5s	~2–3s	~2×
Full grid CI (88 scenarios, n_reps=10200, n_boot=500, n_jobs=1 batch)	~3.5 h	~28–62 min	~4–8×
Full grid power (88 scenarios, n_sims=10k, n_cal=300, n_jobs=4)	~30–60 min	~5–12 min	~4–6×

On a 16-vCPU cloud machine (e.g. Hetzner CPX51), full CI grid per generator (88 scenarios, n_reps=10200, n_boot=500, n_jobs=-1) completes in under 17 minutes with Numba and pre-warmed cache.

Copula and linear generators have similar per-sim cost but no calibration overhead. The asymptotic method is instantaneous.

Tips

Benchmarking: Run sequential and parallel benchmarks separately, one at a time. Concurrent runs cause CPU contention and invalidate timing results.
CI (batch bootstrap): Use --n-jobs 1 for batch bootstrap; parallel often slows it down (see Parallelization and Numba caveats). For per-rep bootstrap, --n-jobs 4 or -1 helps.
Use --n-sims 500 for exploratory runs (seconds to minutes) vs 10000 for production.
On a 4-logical-core machine (e.g. 2 physical cores with hyperthreading), --n-jobs 4 or -1 parallelizes across logical cores; for power and per-rep bootstrap CI this can roughly halve full-grid runtimes.
The calibration step adds ~3s (single-point) or ~9s (multipoint, default) per unique (N, k, distribution) tie structure on first run; cached thereafter and reused across all rho values. Use --calibration-mode single for faster exploratory runs.
Bootstrap CIs dominate total runtime when using many reps and resamples. With --n-reps 200, use --n-boot 500 (bootstrap noise is negligible). With --n-reps 2600 or higher, --n-boot 200–400 suffices; --n-boot 500 is comfortable. Use --n-reps 20 --n-boot 500 for quick checks.
Use scripts/run_single_scenario.py to test individual scenarios quickly before committing to a full run.
Filter scenarios with --cases, --n-distinct, --dist-types to reduce the grid.
Use --skip-copula --skip-linear to run only the recommended nonparametric + empirical + asymptotic methods. Use --skip-empirical when digitized data is unavailable.

Benchmark scripts

Six scripts measure performance of the main optimizations (vectorized data generation, batch CI bootstrap) and precision/runtime planning. Run scripts one at a time (concurrent runs cause CPU contention and invalidate timings; see Tips above).

Script	Purpose
`benchmarks/benchmark_precision_params.py`	Precision only: Computes (n_sims, n_cal) and (n_reps, n_boot) for ±0.01, ±0.002, ±0.001 accuracy tiers from README formulas. No long runs.
`benchmarks/benchmark_realistic_runtimes.py`	Runtimes: Per-generator power and CI at small params; sequential and parallel full grid; scaled to three precision tiers; best-config combined table and high-core estimates. `--quick` single-scenario only; `--skip-parallel` skips n_jobs=-1; `--generators all` or comma-separated list (e.g. nonparametric, empirical).
`benchmarks/benchmark_batch_vs_no_batch.py`	Single scenario: per-rep bootstrap vs batch bootstrap. Case 3, k=4, even, n_reps=200, n_boot=1000.
`benchmarks/benchmark_full_grid.py`	Full 88-scenario CI grid: per-rep vs batch bootstrap, sequential vs parallel, and Numba thread variants. Uses n_reps=200, n_boot=1000.
`benchmarks/benchmark_ci_bootstrap.py`	Batch CI bootstrap: multipoint vs single-point calibration. Single scenario and 22-scenario subset. `--quick` runs only single-scenario (~1 min).
`benchmarks/benchmark_power.py`	Power simulation: vectorized vs scalar data generation, multipoint vs single calibration, sequential vs parallel full grid. `--quick` skips full grid.

# Precision params only (no simulation)
python benchmarks/benchmark_precision_params.py

# Quick single-scenario comparisons
python benchmarks/benchmark_batch_vs_no_batch.py
python benchmarks/benchmark_ci_bootstrap.py --quick
python benchmarks/benchmark_power.py --quick
python benchmarks/benchmark_realistic_runtimes.py --quick --skip-parallel

# Full benchmarks (several minutes each)
python benchmarks/benchmark_full_grid.py
python benchmarks/benchmark_ci_bootstrap.py
python benchmarks/benchmark_power.py
python benchmarks/benchmark_realistic_runtimes.py

See docs/BENCHMARKING_FINDINGS.md for detailed results, decomposition of joblib vs Numba overhead, nested parallelism, memory notes, and verification scripts in tests/ (test_batch_bootstrap_ci.py, test_nested_parallelism.py, test_batch_sequential_vs_parallel.py).

Cloud Deployment

Running on a cloud VM (e.g. Hetzner CPX51, 16 vCPU)

With pre-copied Numba cache (recommended):

# On local machine (after running scripts/warm_up_numba.py):
rsync -avz ~/.cache/numba/ root@YOUR-IP:~/.cache/numba/

# On cloud VM:
python scripts/run_simulation.py --n-sims 10000 --skip-copula --skip-linear --n-jobs -1 --seed 42

Full CI grid (all generators):

python -c '
import time, pickle
from confidence_interval_calculator import run_all_ci_scenarios
for gen in ["nonparametric", "copula", "linear", "empirical"]:
    t0 = time.time()
    print(f"Starting {gen}...")
    res = run_all_ci_scenarios(generator=gen, n_reps=10200, n_boot=500, n_jobs=-1, seed=42)
    with open(f"ci_{gen}.pkl", "wb") as f: pickle.dump(res, f)
    print(f"{gen} done in {time.time()-t0:.1f} s")
'

References

Karwowski MP, Stamoulis C, Wenren LM, Faboyede GM, Quinn N, Grodin MA, Bellinger DC, Woolf AD. Blood and Hair Aluminum Levels, Vaccine History, and Early Infant Development: A Cross-Sectional Study. Acad Pediatr. 2018 Mar;18(2):161-165. doi:10.1016/j.acap.2017.09.003
Bonett DG, Wright TA (2000). Sample size requirements for Pearson, Kendall, and Spearman correlations. Psychometrika, 65(1), 23-28. doi:10.1007/BF02294183
Conover WJ, Iman RL (1981). Rank transformations as a bridge between parametric and nonparametric statistics. The American Statistician, 35(3), 124-129.
Fieller EC, Hartley HO, Pearson ES (1957). Tests for rank correlation coefficients. I. Biometrika, 44(3-4), 470-481. doi:10.1093/biomet/44.3-4.470
Iman RL, Conover WJ (1982). A distribution-free approach to inducing rank correlation among input variables. Communications in Statistics - Simulation and Computation, 11(3), 311-334.
Wikipedia: Cholesky decomposition (Monte Carlo simulation section)
Wikipedia: Spearman's rank correlation coefficient (Fisher transformation, references)

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
benchmarks		benchmarks
docs		docs
plans		plans
scripts		scripts
tests		tests
.gitignore		.gitignore
PROJECT_SUMMARY.md		PROJECT_SUMMARY.md
README.md		README.md
confidence_interval_calculator.py		confidence_interval_calculator.py
config.py		config.py
data_generator.py		data_generator.py
permutation_pvalue.py		permutation_pvalue.py
power_asymptotic.py		power_asymptotic.py
power_simulation.py		power_simulation.py
requirements.txt		requirements.txt
run_benchmark_5_times.ps1		run_benchmark_5_times.ps1
spearman_helpers.py		spearman_helpers.py
table_outputs.py		table_outputs.py

Folders and files

Latest commit

History

Repository files navigation