Skip to content

Commit 36844b6

Browse files
quantbaiclaude
andcommitted
release v1.0: PIT-correct group field + institutional-grade validation
Transition from RC1 to v1.0 stable. The product is a buyside-usable hierarchical industry classification (date × asset_id integer code matrix) for cross-sectional alpha demean, sector-neutral risk constraints, and peer-group analytics — equivalent to a WQ-style group field. Data layer - snapshot.csv: add effective_from column (158 assets @ 2024-05-23) - PRIME: 301092 RWA → 305020 Gaming (gaming economy, not RWA) - LUNC: align decisions terra-luna-original → terra-luna-classic - Taxonomy: downgrade Datonomy crosswalk claim to format-level only - New docs: SCHEMA / UNIVERSE / GOVERNANCE / CHANGELOG - New decision: decisions/prime.md (305020 vs 305010 comparison) Pipeline - build_matrices: DatetimeIndex + PIT lookup in both wide AND long paths - validate_schema: enforce effective_from invariants - CI: dtype + index + NaN assertion block (verbatim) Validation (6 institutional experiments replacing 3-chart RC1) - E1 within-between with stationary bootstrap CIs (Politis-Romano 1994) - E2 per-sub-sector tightness with Holm-Bonferroni correction - E3 explicit multiple-testing summary - E4 rolling 180-day stability (19/19 windows positive) - E5 naive-baseline comparison (chain-only narrowly ties our sector) - E6 group_neut applied demo (50.7% of toy-alpha variance is between-sector) - validation.md rewritten to institutional grade with honest reporting of small-universe limitations Doc polish - SCHEMA: document NaN dual cause + chain_ecosystem FILTER_ONLY flag - README quick start: NA silent-drop warning, datetime import, reindex, .T.groupby().T (replaces deprecated groupby(axis=1)) - UNIVERSE: enumerate 8 grandfathered assets below 90-day history - GOVERNANCE: cold-start interim clause for council size < 3 - methodology version 1.0.0-RC2 → 1.0.0 Review: - MSCI MD-style reviewer: ready, would CITE as institutional reference - WorldQuant-style alpha PM reviewer: ready for buyside platform integration Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
1 parent 56bc154 commit 36844b6

33 files changed

Lines changed: 4577 additions & 2611 deletions

.github/workflows/ci.yml

Lines changed: 24 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -27,6 +27,30 @@ jobs:
2727
- name: Compute validation (smoke test — script must run cleanly)
2828
run: python scripts/compute_validation.py
2929

30+
- name: Dtype + index + NaN assertion
31+
run: |
32+
python - <<'EOF'
33+
import pandas as pd
34+
35+
# wide index dtype
36+
for col in ["class_code","sector_code","sub_sector_code"]:
37+
w = pd.read_parquet(f"classification/wide/{col}.parquet")
38+
assert str(w.index.dtype) == "datetime64[ns]", f"{col} wide index dtype: {w.index.dtype}"
39+
dtypes = w.dtypes.unique()
40+
assert len(dtypes) == 1 and str(dtypes[0]) == "Int64", f"{col} wide column dtypes: {dtypes}"
41+
42+
# panel column dtypes
43+
panel = pd.read_parquet("classification/long/panel.parquet")
44+
for col in ["class_code","sector_code","sub_sector_code"]:
45+
assert str(panel[col].dtype) == "Int64", f"panel {col} dtype: {panel[col].dtype}"
46+
47+
# panel must have no NaN code rows (panel only stores existence rows)
48+
nan_rows = panel[panel["sector_code"].isna()]
49+
assert len(nan_rows) == 0, f"panel has {len(nan_rows)} NaN sector_code rows — PIT lookup broken"
50+
51+
print("[ci] dtype + index + NaN assertions passed")
52+
EOF
53+
3054
- name: Content-equality check on rebuilt CSV matrices
3155
run: |
3256
python - <<'EOF'

.gitignore

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -18,6 +18,12 @@ env/
1818
# Notebooks
1919
.ipynb_checkpoints/
2020

21+
# Internal working artifacts — design docs, working plans, ad-hoc notes.
22+
# These are not part of the published product. Public methodology / decisions
23+
# live in methodology.md / decisions/ — committed and audited.
24+
plans/
25+
.venv-dry/
26+
2127
# NOTE: classification/wide/*.parquet, classification/long/*.parquet, and
2228
# validation/charts/*.png ARE committed — they are the published artifacts
2329
# that consumers download directly without cloning + running scripts.

CHANGELOG.md

Lines changed: 81 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,81 @@
1+
# Changelog
2+
3+
All notable changes to crypto-sectors are documented here.
4+
5+
Format: `[vX.Y.Z] YYYY-MM-DD — N assets — Summary`.
6+
7+
---
8+
9+
## [v1.0.0] — Pending (target: 2026-Q2)
10+
11+
**158 assets · 4 classes · 14 sectors · ~35 sub-sectors**
12+
13+
Changes from v1.0-RC2:
14+
15+
- Version string bumped from `1.0.0-RC2` to `1.0.0` in `methodology.md` and `taxonomy.yaml`
16+
- Final CI green on all checks required before tag is pushed
17+
- No content changes from RC2
18+
19+
---
20+
21+
## [v1.0.0-RC2] — 2026-05-24
22+
23+
**158 assets · 4 classes · 14 sectors · ~35 sub-sectors**
24+
25+
Changes from v1.0-RC1 (this batch, QD-A scope):
26+
27+
- **Schema: `effective_from` column added to `classification/snapshot.csv`** (all 158 rows = `2024-05-23`, the min date in `data/daily_returns.parquet`). Establishes SCD-lite PIT semantics: reclassifications from v1.1 onward will add new rows with later `effective_from` dates rather than overwriting existing rows.
28+
- **PRIME reclassification**: sub_sector_code `301092` (RWA Issuer — Governance, sector 3010 DeFi) → `305020` (Gaming, sector 3050 Metaverse). Echelon Prime is a gaming-economy protocol for AAA studios, not an RWA issuer. See `decisions/prime.md`.
29+
- **Datonomy affiliation language corrected**: removed unsubstantiated "round-trip compatibility" claim from `methodology.md §3.2`, `§7`, and `taxonomy.yaml` header. Replaced with factual disclaimer per D2 in the upgrade plan.
30+
- **New documents added**: `SCHEMA.md`, `UNIVERSE.md`, `GOVERNANCE.md`, `CHANGELOG.md`.
31+
- **`decisions/luna-symbol-history.md` corrected**: `terra-luna-original``terra-luna-classic` to match snapshot asset_id.
32+
- **`methodology.md §4.2`**: added PIT immutability contract and git-tag warning.
33+
- **`methodology.md §5`**: "Universe filters" reworded to "investable filters" to clarify coverage vs tradability distinction.
34+
- **`methodology.md` version**: `1.0.0``1.0.0-RC2`.
35+
36+
---
37+
38+
## [v1.0.0-RC1] — 2026-05-23
39+
40+
**158 assets · 4 classes · 14 sectors · ~35 sub-sectors**
41+
42+
Initial public release candidate. Built from scratch over one development sprint.
43+
44+
Key decisions baked in at RC1:
45+
46+
- Three-level hierarchy (class → sector → sub-sector) sized for a universe of ~150–300 assets, matching precedent from institutional equity classification systems adjusted for digital-asset universe size.
47+
- `chain_ecosystem` as an orthogonal tag rather than a hierarchy level; chain effects in crypto can rival sector effects and forcing them into the hierarchy distorts both axes.
48+
- Sub-sector codes ending `90``99` reserved as community extensions (e.g., 301090 Liquid Staking, 301091 Liquid Restaking).
49+
- 6-digit `CCSSXX` positional code format chosen to make potential future crosswalks to institutional classification systems tractable.
50+
- `asset_id` (stable, lowercase, hyphenated) as canonical key; `symbol` in a lookup table. Handles LUNA/LUNC collision and MATIC/POL rename correctly.
51+
- Empirical validation via within-vs-between correlation bootstrap (results in `validation.md`).
52+
- CI: referential integrity check (`validate_schema.py`) + CSV content equality.
53+
54+
Pre-RC1 development history (not tagged):
55+
56+
- **v0.2** (internal): Merged `304091` (DePIN Compute), `304092` (DePIN AI), `304093` (AI Agents) into `304020` (Compute & Private Storage) after intra-group correlations were negative on split. Deprecated slots documented in `taxonomy.yaml`. Added `202090` (Restaking Infrastructure) deprecated slot after EIGEN moved to 301091.
57+
- **v0.1** (internal): Initial 4-class skeleton. DEX / Lending / L1 / L2 as the primary sectors. DePIN split into three sub-sectors (later merged in v0.2). No empirical validation yet.
58+
59+
---
60+
61+
## Known open issues — v1.0.1 candidates
62+
63+
- **PENGU classification**: currently `102010` (Meme Coins). Under review for possible reclassification to `305030` (NFT Ecosystems) given Pudgy Penguins' NFT collection origin. Needs correlation evidence. Track in GitHub Issues.
64+
65+
---
66+
67+
## v1.1 Forward contract — `reclassifications.csv`
68+
69+
Starting in v1.1, all reclassification events will be recorded in a dedicated audit file `classification/reclassifications.csv`. This is a commitment, not a v1.0 deliverable.
70+
71+
**Planned schema (5 columns)**:
72+
73+
| Column | dtype | Description |
74+
|---|---|---|
75+
| `asset_id` | string | Stable asset identifier (join key) |
76+
| `field` | string | Which field changed: `class_code`, `sector_code`, `sub_sector_code`, or `chain_ecosystem` |
77+
| `old_value` | string | Value before the reclassification (stored as string to handle mixed types) |
78+
| `new_value` | string | Value after the reclassification |
79+
| `effective_from` | date | Date from which the new value applies; matches the new row in `snapshot.csv` |
80+
81+
Each row in `reclassifications.csv` corresponds to one new row added to `snapshot.csv` with a later `effective_from`. The old row in `snapshot.csv` is never deleted — it remains as historical record.

GOVERNANCE.md

Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
# Governance
2+
3+
> Version 1.0.0-RC2 · crypto-sectors
4+
5+
This document defines the decision-making process for taxonomy changes, reclassifications, and dispute resolution.
6+
7+
---
8+
9+
## 1. Maintainer council
10+
11+
**Initial configuration (v1.0)**: sole maintainer is `@quantbai`.
12+
13+
Future expansion of the maintainer council requires:
14+
- A documented nomination (GitHub Issue with `council-nomination` label)
15+
- A minimum 14-day public comment period
16+
- Approval by all existing council members
17+
- Addition of the new maintainer to this file via a PR
18+
19+
Maintainers are responsible for: merging PRs against snapshot.csv, reviewing `decisions/` files for completeness, tagging quarterly releases, and responding to dispute issues within 14 days.
20+
21+
---
22+
23+
## 2. Quorum definition
24+
25+
With a sole maintainer, **quorum = 1**. This is explicit, not implicit. A single maintainer can approve and merge any PR within scope. When the council expands to N members, quorum is defined as floor(N/2) + 1 (simple majority).
26+
27+
---
28+
29+
## 3. PR merge thresholds
30+
31+
| Change type | Approval required |
32+
|---|---|
33+
| **Typo fix, whitespace, decision-doc addition** (no code change) | Maintainer alone |
34+
| **New asset addition** (one row in snapshot.csv + decisions/ file) | Maintainer + 1 community reviewer (GitHub review approval) |
35+
| **sub_sector_code change** (reclassification) | Maintainer + 1 community reviewer + updated decisions/ file |
36+
| **sector_code or class_code change** | Maintainer + 2 community reviewers + updated decisions/ file + methodology.md note if policy-changing |
37+
| **Taxonomy structural change** (new sub-sector, deprecated slot change) | Maintainer + 2 community reviewers + taxonomy.yaml version bump |
38+
| **Governance document change** | Maintainer + 2 community reviewers + 14-day comment period |
39+
40+
Community reviewer: any GitHub user who has made >= 1 prior merged contribution to this repository, or who holds a verifiable affiliation with a recognized institutional research or quant organization (stated in the review comment).
41+
42+
**Cold-start provision (v1.0 — until council size >= 3)**: while the maintainer council has fewer than 3 active members, high-level reclassification PRs (changes to `sector_code` or `class_code`) may be merged after the maintainer posts the PR to a GitHub Issue with label `reclass-review` and at least 7 calendar days pass with no substantive objection. Objections from non-maintainers must be from contributors with a GitHub account >= 30 days old and 1+ contribution to any related open-source project. This provision lapses automatically when council reaches 3.
43+
44+
---
45+
46+
## 4. Conflict-of-interest policy
47+
48+
Any contributor opening a PR that touches classification of an asset in which they hold a material financial position (long or short, direct or via derivatives) **must disclose this in the PR body**. A suggested disclosure template:
49+
50+
> **COI Disclosure**: I hold [long/short] exposure to [SYMBOL] via [spot/perp/options/fund]. My proposed classification is [X]; I believe this is correct on the merits of the taxonomy definition, and I welcome independent review.
51+
52+
The maintainer must abstain from approving a PR if they personally hold the asset being reclassified. In the sole-maintainer configuration, a conflicted reclassification PR must be reviewed and approved by a community reviewer before the maintainer merges.
53+
54+
A maintainer who discovers an undisclosed COI after a merge may revert the PR and re-open the classification question.
55+
56+
---
57+
58+
## 5. Appeal process
59+
60+
An asset issuer, token holder, or external researcher who disputes a classification should:
61+
62+
1. Open a GitHub Issue with the `dispute` label.
63+
2. State: (a) the asset's current classification, (b) the proposed alternative, (c) the taxonomy definition language that supports the alternative.
64+
3. The maintainer responds within **14 calendar days** with either a rejection rationale or a request for more information.
65+
4. If the dispute has merit, a reclassification PR is opened and follows the normal merge threshold process.
66+
5. The final decision — whether the original classification stands or changes — is recorded in `decisions/<symbol>.md` with a reference to the issue number.
67+
68+
Disputes do not block the repository's operation. Disputed assets retain their current classification until a PR is merged.
69+
70+
---
71+
72+
## 6. Review frequency and release cadence
73+
74+
| Event | Schedule |
75+
|---|---|
76+
| **Quarterly snapshot tag** | Once per calendar quarter (`v2026.Q2`, `v2026.Q3`, …). Tag is created by the maintainer after CI passes and all open classification PRs from the quarter are resolved. |
77+
| **Off-cycle emergency reclassification** | Permitted for: (a) confirmed fraud or rug-pull by a protocol, (b) stablecoin de-peg lasting > 72 hours, (c) chain migration with a new asset_id (e.g., LUNA → LUNC). Off-cycle events get a patch tag (`v2026.Q2.1`). |
78+
| **Taxonomy version bump** | On any structural change (new sub-sector code, deprecated slot change). See `taxonomy.yaml` `version` field (SemVer). |
79+
| **Governance review** | At each major release (v2.0+), or when council size changes. |

README.md

Lines changed: 24 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,10 @@ A community-maintained, hierarchical industry classification for digital assets.
2121
| [`decisions/`](decisions/) | One file per non-trivial classification decision — the audit trail |
2222
| [`methodology.md`](methodology.md) | The rulebook — how classifications are made |
2323
| [`validation.md`](validation.md) | Empirical evidence the classification co-moves on daily returns |
24+
| [`UNIVERSE.md`](UNIVERSE.md) | Coverage universe — selection criteria, exclusions, and graduation rules |
25+
| [`GOVERNANCE.md`](GOVERNANCE.md) | Maintainer council, PR merge thresholds, conflict-of-interest policy, appeals |
26+
| [`SCHEMA.md`](SCHEMA.md) | Dtype contract for all published fields; Int64 numpy-safety warning |
27+
| [`CHANGELOG.md`](CHANGELOG.md) | Version history; v1.1 reclassifications.csv forward contract |
2428

2529
## Quick start
2630

@@ -42,15 +46,32 @@ print(snapshot.head())
4246
For a sector-neutralization example:
4347

4448
```python
49+
import datetime
50+
51+
# date must be a datetime.date (not pd.Timestamp) to index the wide matrix
52+
date = datetime.date(2025, 1, 15)
53+
54+
# Note: cells before effective_from (2024-05-23) are NaN — a sane backtest
55+
# starts on or after that date.
56+
sector = pd.read_parquet(
57+
"https://raw.githubusercontent.com/quantbai/crypto-sectors/main/classification/wide/sector_code.parquet"
58+
)
59+
60+
# Align sector codes to alpha column order; prevents silent NaN from column mismatch
61+
sector_row = sector.loc[pd.Timestamp(date)].reindex(alpha.columns)
62+
4563
# Cross-sectional demean within sector — a standard alpha-research operation
46-
alpha_demeaned = alpha.sub(alpha.groupby(sector.loc[date], axis=1).transform("mean"))
64+
# (.T.groupby().T replaces the deprecated groupby(axis=1))
65+
# Note: assets with NA sector_row (e.g. pre-effective_from, no-returns) are
66+
# silently set to NaN in alpha_demeaned. Filter or impute upstream.
67+
alpha_demeaned = alpha.sub(alpha.T.groupby(sector_row).transform("mean").T)
4768
```
4869

4970
## Coverage
5071

51-
- **Universe**: 158 actively classified digital assets
72+
- **Universe**: 158 actively classified digital assets. See [UNIVERSE.md](UNIVERSE.md) for selection criteria and exclusions.
5273
- **Hierarchy**: 4 classes → 14 sectors → ~35 sub-sectors (community-maintained, with extensions in the 90–99 slot of each sector)
53-
- **Orthogonal tag**: `chain_ecosystem` (BTC, ETH, SOL, BNB, …)
74+
- **Orthogonal tag**: `chain_ecosystem` (BTC, ETH, SOL, BNB, …) — categorical `FILTER_ONLY` tag; see [SCHEMA.md](SCHEMA.md) for usage guidance. Do not use as a direct numeric alpha factor.
5475
- **Update cadence**: quarterly snapshot tags (`v2026.Q2`, `v2026.Q3`, …), continuous PR review
5576

5677
## Validation in one sentence

0 commit comments

Comments
 (0)