Commit a2cee07

jucor and claude authored
[Stack 4/25] Two-level hierarchical clustering matching Clojure architecture (#2431)
* Implement two-level hierarchical clustering matching Clojure architecture

  Level 1: Participants → ~100 base clusters
  Level 2: Base clusters → 2-5 group clusters (silhouette-based k selection)

  Key changes:
  - Add sklearn-based kmeans with first-k initialization (matching Clojure)
  - Add silhouette coefficient calculation for optimal k selection
  - Implement participant filtering (in-conv threshold logic)
  - Add fold/unfold utilities for hierarchical cluster storage
  - Update tests to unfold and compare hierarchical structures
  - Document Clojure two-level architecture and non-determinism issue

  This implements the core two-level clustering but does not yet include incremental clustering (warm-start from previous state).

  Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

* Fix two-level clustering edge cases and update tests

  Bug fixes from review:
  - clusters.py: adjust k down when fewer distinct points than requested
  - conversation.py: group_clusters members are base-cluster IDs (not participants) when only 1 base cluster; convert pandas Index to list for JSON serialization
  - pca.py: fallback comps shape uses min(n_comps, n_cols) not min(n_comps, 2)

  Test fixes:
  - test_clusters: init_clusters returns empty members with two-level clustering
  - test_conversation: use 10 comments to meet vote threshold
  - test_edge_cases: group_repness is {} (not {0: []}) when no clusters

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Mark known Clojure discrepancies as xfail and re-record golden snapshots

  - test_basic_outputs: xfail for D9/D5/D7 (wrong z-score thresholds and repness formulas produce empty comment_repness)
  - test_group_clustering: xfail for D2/D3 (wrong participant threshold and missing k-smoother produce different cluster counts)
  - test_comment_priorities: update xfail reason to reference D12
  - Re-record golden snapshots after two-level clustering changes

  Test baseline: 11 passed, 6 xfailed, 0 failures.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Unfold group-cluster base-cluster IDs to participant IDs for downstream consumers

  Group clusters store base-cluster IDs in 'members' (matching Clojure's two-level clustering architecture), but downstream functions (conv_repness, participant_stats, group_votes) need participant IDs to join against the vote matrix.

  Add _unfolded_group_clusters() helper (equivalent to Clojure's clusters/group-members) and use it in all 5 call sites:
  - _compute_repness
  - _compute_participant_info
  - _compute_group_votes
  - to_dict group-votes
  - to_dynamo_dict group_votes

  Also re-record golden snapshots and remove xfail from test_repness_structure (now passes with correct unfolding).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Tighten Clojure comparison tests and fix outdated docstrings

  - test_group_clustering: unfold Python clusters via _unfolded_group_clusters() (both sides now use two-level clustering), tighten thresholds to 0.99 Jaccard / 0.01 distribution tolerance, require overall_match
  - test_comment_priorities: require exact match (1e-6 tolerance) for all comment IDs instead of 70% at 20% tolerance
  - clojure_comparer: fix docstring to reflect Python also uses two-level clustering

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix serialization outputting base-cluster IDs instead of participant IDs

  to_dict(), get_full_data(), and to_dynamo_dict() were writing self.group_clusters directly, whose 'members' contain base-cluster IDs (integers 0..N). Downstream consumers (group_data.py, Clojure compat, client apps) expect participant IDs. The internal helper _unfolded_group_clusters() already existed and was used for repness/participant_info/group_votes computation, but the five serialization sites were missed.

  Fix all five sites to unfold before serializing:
  - to_dict(): group-clusters, group_clusters, base-clusters
  - get_full_data(): group_clusters
  - to_dynamo_dict(): base_clusters / group_clusters

  Add test_serialization_unfolding.py (TDD: 6 tests fail on the bug, 8/8 pass after fix) using real recompute() pipeline output. Re-record golden snapshots to reflect the corrected output.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* Fix missing Set import in conversation.py

  Added Set to the typing imports; it was used in the _get_in_conv_participants return type annotation but never imported, causing NameError on CI.

  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
1 parent 1f3113f commit a2cee07

16 files changed

Lines changed: 45338 additions & 29052 deletions
Lines changed: 203 additions & 0 deletions
@@ -0,0 +1,203 @@
1+
# Clojure Two-Level Clustering Architecture
2+
3+
## Overview
4+
5+
Clojure **always** uses two-level hierarchical clustering; the second level is unconditional, not an optional mode.
6+
7+
## Architecture
8+
9+
### Level 1: Base Clusters (Participants → ~100 Base Clusters)
10+
11+
**Location**: `polismath/math/conversation.clj` lines 400-407
12+
13+
```clojure
14+
:base-clusters
15+
(plmb/fnk [conv proj-nmat in-conv opts']
16+
(let [in-conv-mat (nm/rowname-subset proj-nmat in-conv)]
17+
(sort-by :id
18+
(clusters/kmeans in-conv-mat
19+
(:base-k opts') ; Default: 100
20+
:last-clusters (:base-clusters conv)
21+
:max-iters (:base-iters opts')))))
22+
```
23+
24+
- **Input**: Participant PCA projections
25+
- **k value**: `base-k = 100` (from `conversation.clj` line 145)
26+
- **Output**: ~100 small clusters, each containing a few participants
27+
- **Members**: Participant IDs (pids)
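
To make the Level 1 step concrete, here is a minimal Python sketch of k-means seeded with the first k distinct points, which is how the commit message describes the initialization ("first-k initialization", with k adjusted down when there are fewer distinct points than requested). The function name `kmeans_first_k` and the tuple-based point format are assumptions for illustration; this is not the `polismath` implementation and it omits the `:last-clusters` warm start.

```python
from math import dist


def kmeans_first_k(points, k, max_iters=100):
    """Lloyd's k-means seeded with the first k distinct points (hypothetical helper)."""
    # Seed the centers with the first k distinct points, in input order.
    centers = []
    for p in points:
        if p not in centers:
            centers.append(p)
        if len(centers) == k:
            break
    # If there are fewer distinct points than k, k effectively shrinks to len(centers).
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing: converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers


# Four 2-D projections forming two obvious pairs.
labels, centers = kmeans_first_k([(0, 0), (10, 10), (0, 1), (10, 11)], k=2)
print(labels)   # [0, 1, 0, 1]
print(centers)  # [(0.0, 0.5), (10.0, 10.5)]
```

Seeding from the first k points makes the result deterministic for a fixed input order, which is useful when comparing two implementations against each other.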
28+
29+
### Level 2: Group Clusters (Base Clusters → 2-5 Groups)
30+
31+
**Location**: `polismath/math/conversation.clj` lines 429-474
32+
33+
```clojure
34+
:group-clusterings
35+
(plmb/fnk [conv base-clusters-weights base-clusters-proj opts']
36+
(plmb/map-from-keys
37+
(fn [k]
38+
(sort-by :id
39+
(clusters/kmeans base-clusters-proj k
40+
:last-clusters (when-let [last-clusterings (:group-clusterings conv)]
41+
(last-clusterings k))
42+
:cluster-iters (:group-iters opts')
43+
:weights base-clusters-weights)))
44+
(range 2 (inc (max-k-fn base-clusters-proj (:max-k opts'))))))
45+
46+
:group-clusters
47+
(plmb/fnk [group-clusterings group-k-smoother]
48+
(get group-clusterings
49+
(:smoothed-k group-k-smoother)))
50+
```
51+
52+
- **Input**: Base cluster centers (treating each base cluster as a data point)
53+
- **k value**: Adaptive (2 to `max-k`, default 5), selected using silhouette analysis; the final value is read from `group-k-smoother` (`:smoothed-k`)
54+
- **Output**: 2-5 final groups
55+
- **Members**: Base cluster IDs (NOT participant IDs)
56+
- **Weighted**: Uses base cluster sizes as weights
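
The k selection can be illustrated with a small, unweighted silhouette computation in Python. This is a sketch under stated assumptions: `silhouette` and `pick_k` are hypothetical names, the points stand in for base cluster centers, and the weighting by base cluster size and the k-smoother used by the Clojure code are deliberately left out.

```python
from math import dist


def silhouette(points, labels):
    """Mean silhouette coefficient of one labelling (unweighted sketch)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return 0.0  # silhouette is undefined for a single cluster
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


def pick_k(points, labelings):
    """Return the k whose labelling has the highest mean silhouette."""
    return max(labelings, key=lambda k: silhouette(points, labelings[k]))


centers = [(0, 0), (0, 1), (10, 0), (10, 1)]
candidates = {2: [0, 0, 1, 1], 3: [0, 0, 1, 2]}
print(pick_k(centers, candidates))  # 2
```

Here the `candidates` dictionary plays the role of `:group-clusterings`, one labelling per candidate k, and `pick_k` stands in for the selection that `:group-clusters` performs.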
57+
58+
## Why Two-Level Clustering?
59+
60+
### Explicit Comments Found
61+
62+
From `conversation.clj` line 477:
63+
```clojure
64+
;; Now we're going to do the same thing for subclusters, or 2-level, 2-down hierarchical clustering
65+
```
66+
67+
### No Direct Explanation Found
68+
69+
Despite extensive search, **no explicit documentation** was found explaining WHY this two-level architecture was chosen. However, based on the code structure, probable reasons include:
70+
71+
1. **Scalability**: Re-clustering 10,000 participants for every candidate k is expensive (silhouette evaluation alone needs O(n²) pairwise distances); clustering ~100 base clusters is much faster
72+
2. **Stability**: Small clusters stabilize more quickly; reduces flickering as new votes arrive
73+
3. **Incremental Updates**: Base clusters can be updated incrementally (line 406 uses `:last-clusters`)
74+
4. **Weighted Representation**: Larger base clusters carry more weight in group-level clustering
75+
5. **Memory Efficiency**: Storing/transmitting ~100 base cluster centers vs 10,000 participant positions
76+
77+
### Default Configuration
78+
79+
From `conversation.clj` lines 142-147:
80+
```clojure
81+
:base-iters 100 ; Max iterations for base clustering
82+
:base-k 100 ; Number of base clusters
83+
:max-k 5 ; Max number of final groups
84+
:group-iters 100 ; Max iterations for group clustering
85+
```
86+
87+
## Storage Format
88+
89+
### Folded Format (for Database/JSON)
90+
91+
Base clusters are "folded" for efficient storage:
92+
93+
```clojure
94+
{:id [0 1 2 ...] ; Base cluster IDs
95+
:members [[530 13 157] ...] ; Participant IDs for each
96+
:x [0.5 -0.3 ...] ; X coordinates
97+
:y [0.2 0.8 ...] ; Y coordinates
98+
:count [3 5 7 ...]} ; Member counts
99+
```
100+
101+
**Folding function**: `polismath/math/clusters.clj` lines 389-399
102+
103+
### Unfolded Format (for Processing)
104+
105+
When loaded back, clusters are unfolded:
106+
107+
```clojure
108+
[{:id 0 :members [530 13 157] :center [0.5 0.2]}
109+
{:id 1 :members [42 88 199 234 301] :center [-0.3 0.8]}
110+
...]
111+
```
112+
113+
**Unfolding function**: `polismath/math/clusters.clj` lines 402-414
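
As an illustration of the two formats, the conversion can be sketched in Python. The names `fold_clusters` and `unfold_clusters` are hypothetical and the dictionary keys simply mirror the Clojure keywords shown above; the actual Python utilities added in this commit may differ.

```python
def fold_clusters(clusters):
    """List of cluster dicts -> dict of parallel lists (folded, for storage)."""
    return {
        "id": [c["id"] for c in clusters],
        "members": [list(c["members"]) for c in clusters],
        "x": [c["center"][0] for c in clusters],
        "y": [c["center"][1] for c in clusters],
        "count": [len(c["members"]) for c in clusters],
    }


def unfold_clusters(folded):
    """Dict of parallel lists -> list of cluster dicts (unfolded, for processing)."""
    return [
        {"id": cid, "members": list(members), "center": [x, y]}
        for cid, members, x, y in zip(
            folded["id"], folded["members"], folded["x"], folded["y"]
        )
    ]


clusters = [
    {"id": 0, "members": [530, 13, 157], "center": [0.5, 0.2]},
    {"id": 1, "members": [42, 88, 199, 234, 301], "center": [-0.3, 0.8]},
]
folded = fold_clusters(clusters)
print(folded["count"])                       # [3, 5]
print(unfold_clusters(folded) == clusters)   # True
```

The `count` field is redundant with `members` and is dropped on unfolding, so the round trip returns the original list.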
114+
115+
## Server-Side Usage
116+
117+
### Hierarchy is Maintained
118+
119+
The server-side TypeScript code (`server/src/report.ts` lines 132-175) maintains the two-level hierarchy:
120+
121+
```typescript
122+
// 1. Find participant's base cluster
123+
let baseClusterIndex = -1;
124+
for (let i = 0; i < membersByIndex.length; i += 1) {
125+
const members = membersByIndex[i];
126+
if (Array.isArray(members) && members.includes(pid)) {
127+
baseClusterIndex = i;
128+
break;
129+
}
130+
}
131+
132+
// 2. Get base cluster ID
133+
const baseClusterId = baseClusterIds[baseClusterIndex];
134+
135+
// 3. Find which group contains this base cluster
136+
for (const groupCluster of groupClustersArray) {
137+
if (groupCluster.members.includes(baseClusterId)) {
138+
return groupCluster.id;
139+
}
140+
}
141+
```
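
The same two-step lookup (participant → base cluster → group) can be precomputed as a dictionary. The Python sketch below assumes folded base clusters and a list of group-cluster dicts; `participant_group_index` is a hypothetical name, not a function from the server or from `polismath`.

```python
def participant_group_index(base_clusters_folded, group_clusters):
    """Map every participant ID to the ID of the group containing its base cluster."""
    # Step 1: base cluster ID -> group ID.
    base_to_group = {
        bid: gc["id"] for gc in group_clusters for bid in gc["members"]
    }
    # Step 2: participant ID -> group ID, via the participant's base cluster.
    index = {}
    for bid, members in zip(
        base_clusters_folded["id"], base_clusters_folded["members"]
    ):
        if bid in base_to_group:  # skip base clusters that belong to no group
            for pid in members:
                index[pid] = base_to_group[bid]
    return index


base = {"id": [0, 1, 2], "members": [[530, 13, 157], [42, 88], [7]]}
groups = [{"id": 0, "members": [0, 2]}, {"id": 1, "members": [1]}]
print(participant_group_index(base, groups))
# {530: 0, 13: 0, 157: 0, 42: 1, 88: 1, 7: 0}
```

Building the index once avoids a linear scan over all base clusters for each participant lookup.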
142+
143+
### Unfolding for Exports
144+
145+
For data exports, the hierarchy is flattened using `flatten-clusters` (`darwin/export.clj` lines 286-299):
146+
147+
```clojure
148+
(defn flatten-clusters
149+
"Takes group clusters and base clusters and flattens them out into a cluster mapping to ptpt ids directly"
150+
[group-clusters base-clusters]
151+
(map
152+
(fn [gc]
153+
(update-in gc [:members]
154+
(fn [members]
155+
(mapcat
156+
(fn [bid]
157+
;; get the base cluster, then get it's members, mapcat them
158+
(:members (ffilter #(= (:id %) bid) base-clusters)))
159+
members))))
160+
group-clusters))
161+
```
162+
163+
This is used in:
164+
- `participants-votes-table` - CSV exports
165+
- Other export functions
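
For comparison, a Python rendering of the same flattening might look like the sketch below. `flatten_group_clusters` is a hypothetical name chosen for this example (the commit's own helper is `_unfolded_group_clusters()`, whose exact signature is not shown here); like the Clojure version, it silently contributes no members for a base cluster ID that cannot be found.

```python
def flatten_group_clusters(group_clusters, base_clusters):
    """Replace base-cluster IDs in each group's members with participant IDs."""
    members_by_base_id = {bc["id"]: bc["members"] for bc in base_clusters}
    return [
        {
            **gc,  # keep every other field of the group cluster unchanged
            "members": [
                pid
                for bid in gc["members"]
                for pid in members_by_base_id.get(bid, [])
            ],
        }
        for gc in group_clusters
    ]


base_clusters = [
    {"id": 0, "members": [530, 13, 157]},
    {"id": 1, "members": [42, 88]},
]
group_clusters = [{"id": 0, "members": [0]}, {"id": 1, "members": [1]}]
print(flatten_group_clusters(group_clusters, base_clusters))
# [{'id': 0, 'members': [530, 13, 157]}, {'id': 1, 'members': [42, 88]}]
```

Using a dictionary keyed by base cluster ID replaces the per-member `ffilter` scan with a constant-time lookup; the input lists are not mutated.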
166+
167+
## Python vs Clojure
168+
169+
### Python (Current Implementation)
170+
171+
- **Two-level**: Participants → Base Clusters → Groups (matching Clojure)
172+
- **Unfolding**: Group cluster members are unfolded from base-cluster IDs to participant IDs for serialization and downstream consumers
173+
174+
### Clojure (Reference Implementation)
175+
176+
- **Two-level**: Participants → Base Clusters → Groups
177+
- **Indirection**: Group members are base cluster IDs, so consumers resolve participants through the base clusters
178+
- **Benefits**: Better scalability, stability, and incremental updates
179+
180+
## Implications for Migration
181+
182+
When comparing Python and Clojure implementations:
183+
184+
1. **Cannot compare raw `members` directly**: Group `members` hold base cluster IDs, not participant IDs
185+
- Python stores base cluster IDs internally and unfolds them to participant IDs (`_unfolded_group_clusters()`) for serialization and downstream consumers
186+
- Clojure groups contain base cluster IDs
187+
188+
2. **Must unfold both sides**: Convert base cluster IDs to participant IDs for fair comparison
189+
- Use `unfold_clojure_group_clusters()` in Python for the Clojure output
190+
- Handles string/integer ID type conversion
191+
192+
3. **Architecture decision**: Python adopts two-level clustering
193+
- **Pros**: Match Clojure behavior exactly, better scalability
194+
- **Cons**: More complexity, harder to maintain
195+
- **Current approach**: Two-level clustering is implemented; incremental clustering (warm start from previous state) is not yet included
196+
197+
## References
198+
199+
- **Main clustering logic**: `polismath/math/conversation.clj` lines 400-474
200+
- **Fold/unfold utilities**: `polismath/math/clusters.clj` lines 389-414
201+
- **Server-side navigation**: `server/src/report.ts` lines 132-175
202+
- **Export flattening**: `polismath/darwin/export.clj` lines 286-299
203+
- **Python unfold implementation**: `polismath/regression/clojure_comparer.py` lines 123-199
