Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
54 changes: 52 additions & 2 deletions delphi/docs/CLJ-PARITY-FIXES-JOURNAL.md
Original file line number Diff line number Diff line change
Expand Up @@ -440,8 +440,58 @@ significance filtering. Using different priors is intentional.

### What's Next

1. **PR 5 — Fix D6 (Two-proportion test)**: Add +1 pseudocount to all 4 inputs, change signature
from proportions to raw counts.
1. **PR 6 — Fix D7 (Repness metric)**: Change formula from `pa * (|pat| + |rat|)` to
`ra * rat * pa * pat` (Clojure product formula).

---

## PR 5: Fix D6 — Two-Proportion Test Pseudocounts

### TDD steps
1. **Baseline**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed
2. **Red**: Rewrote `TestD6TwoPropTest` with new signature `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
and correct Clojure formula → 3 failures (TypeError: old function expects proportions)
3. **Fix**: Replaced both `two_prop_test` and `two_prop_test_vectorized` with Clojure formula:
add +1 to all 4 inputs (stats.clj:20), compute `pi1=(s+1)/(p+1)`, standard pooled z-test
4. **Green**: All 3 D6 formula tests pass, 4 blob comparison tests xfail (depend on D10)
5. **Full suite**: 4 regression failures, all in `rat`/`rdt`/`agree_metric`/`disagree_metric` — direct
downstream of the formula change. No unexpected fields affected.
6. **Re-recorded golden snapshots** for all 7 datasets (public + private)
7. **Final**: 1 failed (pakistan-incremental D2, pre-existing), 102 passed, 5 skipped, 143 xfailed, 2 xpassed

### Changes
- `repness.py`: `two_prop_test(p1, n1, p2, n2)` → `two_prop_test(succ_in, succ_out, pop_in, pop_out)`
with +1 pseudocount on all 4 inputs, matching Clojure's `(map inc ...)` (stats.clj:20)
- `repness.py`: `two_prop_test_vectorized` — same signature change
- `repness.py`: Updated callers in `add_comparative_stats` and `compute_group_comment_stats_df`
to pass raw counts `(na, other_na, ns, other_ns)` instead of `(pa, ns, other_pa, other_ns)`
- `test_discrepancy_fixes.py`: Rewrote `TestD6TwoPropTest` with correct formula, 7 test cases,
edge cases, and regularization effect test
- `test_repness_unit.py`: Updated `test_two_prop_test`, `test_two_prop_test_vectorized`,
`test_two_prop_test_vectorized_edge_cases` for new signature
- `test_old_format_repness.py`: Updated `test_two_prop_test` for new signature

### Key insight: existing test had wrong expected formula
The pre-existing D6 test computed expected values using `(succ+1)/(n+2)` — as if two pseudocounts
were added to the denominator. But Clojure's `(map inc ...)` adds +1 to each value independently,
giving `(succ+1)/(pop+1)`. The formula is a standard pooled z-test on the pseudocount-adjusted values,
not a Beta distribution posterior.

### Session 8 (2026-03-13)

- Created branch `jc/clj-parity-d6-two-prop-test` on top of `jc/clj-parity-d5-prop-test`
- Read Clojure source (stats.clj:18-33, repness.clj:97-100) to verify formula and call sites
- Discovered the existing D6 test had wrong expected formula — fixed
- TDD cycle: red (3 TypeError failures) → fix → green (3 pass, 4 xfail)
- Updated all callers: both scalar (`add_comparative_stats`) and vectorized
(`compute_group_comment_stats_df`) now pass raw counts
- Full suite: 4 regression failures, all in rat/rdt/metric fields (expected)
- Re-recorded golden snapshots for all 7 datasets
- Final validation: 102 passed, 1 pre-existing failure (pakistan-incremental D2)

### What's Next

1. **PR 6 — Fix D7 (Repness metric)**: Change from `pa * (|pat| + |rat|)` to `ra * rat * pa * pat`.

---

Expand Down
3 changes: 2 additions & 1 deletion delphi/docs/PLAN_DISCREPANCY_FIXES.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ This plan's "PR N" labels map to actual GitHub PRs as follows:
| (perf) | #2436 | Stack 10/10 | Speed up regression tests |
| PR 3 (D9) | #2446 | — | Fix D9: z-score thresholds (one-tailed) |
| PR 4 (D5) | #2448 | Stack 14/25 | Fix D5: proportion test formula |
| PR 5 (D6) | #2449 | Stack 15/25 | Fix D6: two-proportion test pseudocounts |

Future fix PRs will be appended to the stack as they're created.

Expand Down Expand Up @@ -467,7 +468,7 @@ By this point, we should have good test coverage from all the per-discrepancy te
| D3 | K-smoother buffer | PR 10 | — | Fix |
| D4 | Pseudocount formula | **PR 2** | **#2435** | **DONE** ✓ |
| D5 | Proportion test | **PR 4** | — | **DONE** ✓ |
| D6 | Two-proportion test | PR 5 | — | Fix |
| D6 | Two-proportion test | **PR 5** | — | **DONE** ✓ |
| D7 | Repness metric | PR 6 | — | Fix (with flag for old formula) |
| D8 | Finalize cmt stats | PR 7 | — | Fix |
| D9 | Z-score thresholds | **PR 3** | **#2446** | **DONE** ✓ |
Expand Down
120 changes: 75 additions & 45 deletions delphi/polismath/pca_kmeans_rep/repness.py
Original file line number Diff line number Diff line change
Expand Up @@ -95,33 +95,51 @@ def prop_test(succ: int, n: int) -> float:
return 2 * math.sqrt(n_pc) * (succ_pc / n_pc - 0.5)


def two_prop_test(p1: float, n1: int, p2: float, n2: int) -> float:
def two_prop_test(succ_in: int, succ_out: int, pop_in: int, pop_out: int) -> float:
"""
Two-proportion z-test.

Two-proportion z-test with +1 pseudocount on all inputs.

Matches Clojure's stats/two-prop-test (stats.clj:18-33):
(let [[succ-in succ-out pop-in pop-out] (map inc [succ-in succ-out pop-in pop-out])
pi1 (/ succ-in pop-in)
pi2 (/ succ-out pop-out)
pi-hat (/ (+ succ-in succ-out) (+ pop-in pop-out))]
...)

The +1 pseudocount (Laplace smoothing) regularizes the z-score for small
samples, preventing extreme values when group sizes are tiny.

Args:
p1: First proportion
n1: Number of observations for first proportion
p2: Second proportion
n2: Number of observations for second proportion
succ_in: Number of successes in the group (e.g., agrees)
succ_out: Number of successes outside the group
pop_in: Total votes in the group
pop_out: Total votes outside the group

Returns:
Z-score
Z-score (positive means group proportion > other proportion)
"""
if n1 == 0 or n2 == 0:
if pop_in == 0 or pop_out == 0:
return 0.0
Comment on lines +121 to 122
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two_prop_test() claims to match Clojure's stats/two-prop-test, but it returns 0.0 when pop_in==0 or pop_out==0. Clojure does not special-case zero populations; it increments pop-in/pop-out (so division-by-zero is avoided) and can yield a non-zero z-score when one side has no votes. Either remove this early return for true parity, or update the docstring/expected behavior to explicitly document this intentional deviation (and consider validating succ_* <= pop_* instead).

Copilot uses AI. Check for mistakes.

# Pooled probability
p = (p1 * n1 + p2 * n2) / (n1 + n2)

# Standard error
se = math.sqrt(p * (1 - p) * (1/n1 + 1/n2))

# Z-score calculation

# Add +1 pseudocount to all four inputs (Clojure: map inc)
s1 = succ_in + 1
s2 = succ_out + 1
p1 = pop_in + 1
p2 = pop_out + 1

pi1 = s1 / p1
pi2 = s2 / p2
pi_hat = (s1 + s2) / (p1 + p2)

if pi_hat == 1.0:
# Clojure note (stats.clj:26-27): "this isn't quite right... could
# actually solve this using limits" — returning 0 for now, matching Clojure.
return 0.0

se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
Comment on lines +127 to +139
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside two_prop_test, the local names p1/p2 now represent pseudocount-adjusted population denominators rather than proportions (and previously the function parameters were named p1/p2). Renaming these locals to something denoting denominators (e.g., pop_in_pc, pop_out_pc, denom_in, denom_out) would reduce confusion and make the formula easier to audit against the docstring/Clojure reference.

Suggested change
p1 = pop_in + 1
p2 = pop_out + 1
pi1 = s1 / p1
pi2 = s2 / p2
pi_hat = (s1 + s2) / (p1 + p2)
if pi_hat == 1.0:
# Clojure note (stats.clj:26-27): "this isn't quite right... could
# actually solve this using limits" — returning 0 for now, matching Clojure.
return 0.0
se = math.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
pop_in_pc = pop_in + 1
pop_out_pc = pop_out + 1
pi1 = s1 / pop_in_pc
pi2 = s2 / pop_out_pc
pi_hat = (s1 + s2) / (pop_in_pc + pop_out_pc)
if pi_hat == 1.0:
# Clojure note (stats.clj:26-27): "this isn't quite right... could
# actually solve this using limits" — returning 0 for now, matching Clojure.
return 0.0
se = math.sqrt(pi_hat * (1 - pi_hat) * (1 / pop_in_pc + 1 / pop_out_pc))

Copilot uses AI. Check for mistakes.
if se == 0:
return 0.0
else:
return (p1 - p2) / se
return (pi1 - pi2) / se


def comment_stats(votes: np.ndarray, group_members: List[int]) -> Dict[str, Any]:
Expand Down Expand Up @@ -182,15 +200,17 @@ def add_comparative_stats(comment_stats: Dict[str, Any],
result['ra'] = result['pa'] / other_stats['pa'] if other_stats['pa'] > 0 else 1.0
result['rd'] = result['pd'] / other_stats['pd'] if other_stats['pd'] > 0 else 1.0

# Calculate representativeness tests
# Calculate representativeness tests — pass raw counts, matching Clojure's
# (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
# (:ns in-stats) (sum :ns rest-stats)) (repness.clj:97-100)
result['rat'] = two_prop_test(
result['pa'], result['ns'],
other_stats['pa'], other_stats['ns']
result['na'], other_stats['na'],
result['ns'], other_stats['ns']
)

result['rdt'] = two_prop_test(
result['pd'], result['ns'],
other_stats['pd'], other_stats['ns']
result['nd'], other_stats['nd'],
result['ns'], other_stats['ns']
)

return result
Expand Down Expand Up @@ -493,30 +513,38 @@ def prop_test_vectorized(succ: pd.Series, n: pd.Series) -> pd.Series:
return z


def two_prop_test_vectorized(p1: pd.Series, n1: pd.Series,
p2: pd.Series, n2: pd.Series) -> pd.Series:
def two_prop_test_vectorized(succ_in: pd.Series, succ_out: pd.Series,
pop_in: pd.Series, pop_out: pd.Series) -> pd.Series:
"""
Vectorized two-proportion z-test.
Vectorized two-proportion z-test with +1 pseudocount on all inputs.

Matches Clojure's stats/two-prop-test (stats.clj:18-33).
See two_prop_test() scalar version for formula details.

Args:
p1: Series of first proportions
n1: Series of number of observations for first proportion
p2: Series of second proportions
n2: Series of number of observations for second proportion
succ_in: Series of success counts in the group
succ_out: Series of success counts outside the group
pop_in: Series of total vote counts in the group
pop_out: Series of total vote counts outside the group

Returns:
Series of z-scores
"""
# Pooled probability
p_pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
# Add +1 pseudocount to all four inputs (Clojure: map inc)
s1 = succ_in + 1
s2 = succ_out + 1
p1 = pop_in + 1
p2 = pop_out + 1

# Standard error
se = np.sqrt(p_pooled * (1 - p_pooled) * (1/n1 + 1/n2))
pi1 = s1 / p1
pi2 = s2 / p2
pi_hat = (s1 + s2) / (p1 + p2)

# Z-score calculation
z = (p1 - p2) / se
se = np.sqrt(pi_hat * (1 - pi_hat) * (1/p1 + 1/p2))
z = (pi1 - pi2) / se

# Handle edge cases
# Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
z = z.fillna(0.0)
z = z.replace([np.inf, -np.inf], 0.0)
Comment on lines +533 to 549
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The vectorized implementation can still emit runtime warnings (divide-by-zero/invalid) when se == 0 (e.g., pi_hat == 1), even though you later replace inf/NaN with 0. Consider masking the computation before the division (e.g., only compute z where (pop_in>0)&(pop_out>0)&(se>0)&(pi_hat<1) and set 0 elsewhere) to avoid noisy warnings and improve numerical robustness.

Copilot uses AI. Check for mistakes.
Comment on lines +546 to 549
Copy link

Copilot AI Mar 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

two_prop_test_vectorized() zeroes results when pop_in==0 or pop_out==0, which diverges from the referenced Clojure implementation (it applies +1 pseudocounts and still computes the statistic). If the goal is Clojure parity, drop this mask and rely on the +1 adjustment plus inf/NaN handling; if the goal is to treat no-data rows as 0, please document that deviation clearly (and align scalar/vectorized behavior + tests accordingly).

Suggested change
# Handle edge cases: pop_in=0 or pop_out=0 → 0, pi_hat=1 → 0
z = z.where((pop_in > 0) & (pop_out > 0), 0.0)
z = z.fillna(0.0)
z = z.replace([np.inf, -np.inf], 0.0)
# Handle edge cases: pi_hat=0 or 1 → se=0 → inf/NaN; map these to 0
z = z.replace([np.inf, -np.inf], 0.0)
z = z.fillna(0.0)

Copilot uses AI. Check for mistakes.
return z
Expand Down Expand Up @@ -649,14 +677,16 @@ def compute_group_comment_stats_df(votes_long: pd.DataFrame,
stats_df['ra'] = stats_df['ra'].replace([np.inf, -np.inf], 1.0).fillna(1.0)
stats_df['rd'] = stats_df['rd'].replace([np.inf, -np.inf], 1.0).fillna(1.0)

# Compute representativeness tests (two-proportion z-test: group vs other)
# Compute representativeness tests — pass raw counts, matching Clojure's
# (stats/two-prop-test (:na in-stats) (sum :na rest-stats)
# (:ns in-stats) (sum :ns rest-stats)) (repness.clj:97-100)
stats_df['rat'] = two_prop_test_vectorized(
stats_df['pa'], stats_df['ns'],
stats_df['other_pa'], stats_df['other_votes']
stats_df['na'], stats_df['other_agree'],
stats_df['ns'], stats_df['other_votes']
)
stats_df['rdt'] = two_prop_test_vectorized(
stats_df['pd'], stats_df['ns'],
stats_df['other_pd'], stats_df['other_votes']
stats_df['nd'], stats_df['other_disagree'],
stats_df['ns'], stats_df['other_votes']
)

# Compute metrics
Expand Down
Loading
Loading