Commit 34486ad

experiments/: pywsd benchmarks against pywsd-datasets test sets
New subdir that lives in the repo but is NOT part of the installable pywsd package. Reproducible benchmarks of every WSD method against the unified evaluation suite at alvations/pywsd-datasets on HF Hub.

Contents:
- evaluate.py: runs each pywsd method on every test config, pulling rows directly from huggingface.co/datasets/alvations/pywsd-datasets. Counts a hit iff the returned Synset.id is in the gold sense_ids_wordnet list (tolerates multi-gold + synset splits).
- report.py: aggregates JSONL into a method x config markdown table.
- results_lesk.jsonl: raw output from the first sweep (Lesk family + baselines across 5 Raganato all-words configs, pywsd 1.3.0).
- README.md: protocol + full results table + reading notes.

First-sweep results:
- `first_sense` wins every config (52-64 %): the MFS baseline is hard to beat with unsupervised Lesk at the all-words level.
- simple_lesk comes within about 2-6 pp of first_sense on each.
- adapted_lesk slightly underperforms simple_lesk (more related-sense noise than signal on SE2/SE3).
- original_lesk collapses on fine-grained SemEval-2007 (15.65 %).

max_similarity sweep runs separately; results appended when done.
1 parent 763d9ad commit 34486ad

4 files changed

Lines changed: 438 additions & 0 deletions

experiments/README.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
# experiments/

Reproducible benchmarks of pywsd's algorithms against the unified
[`alvations/pywsd-datasets`](https://huggingface.co/datasets/alvations/pywsd-datasets)
test splits. **Not** part of the installable pywsd package: the
scripts live here so anyone can re-run the numbers without re-deriving
the evaluation pipeline.

## Files

* `evaluate.py`: runs each pywsd method against each test config on
  the HuggingFace Hub dataset and records accuracy.
* `report.py`: aggregates the JSONL results into a method × config
  markdown table.
* `results_lesk.jsonl`: raw run output for the Lesk family + baselines
  across the 5 Raganato all-words evaluation sets (pywsd 1.3.0, April 2026).
* `results_maxsim.jsonl`: max_similarity metrics on SemEval-2007, the
  smallest eval set. These metrics cost ~500 ms/instance, so this
  snapshot only runs them over the 455-row set.

## Setup

```bash
pip install pywsd datasets
python -m nltk.downloader punkt_tab averaged_perceptron_tagger_eng wordnet
python -c "import wn; wn.download('oewn:2024')"
```

## Run

```bash
# All defaults: every method, every test config (all-words and lexical-sample).
python experiments/evaluate.py --out experiments/results_lesk.jsonl

# Subsets:
python experiments/evaluate.py --configs en-senseval2-aw --limit 200

# Aggregate:
python experiments/report.py --files experiments/results_lesk.jsonl \
    experiments/results_maxsim.jsonl
```

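As a rough picture of the aggregation step, the sketch below pivots JSONL rows into a method × config markdown table. It is not the repository's `report.py`: the function name `pivot_markdown` is hypothetical, the sample accuracies are made up, and the only assumption about the input is that each row carries the `method`, `config` and `accuracy` fields that `evaluate.py` writes.

```python
import json


def pivot_markdown(jsonl_lines):
    """Pivot JSONL result rows into a method x config markdown table.

    Hypothetical helper for illustration; assumes each row has the
    ``method``, ``config`` and ``accuracy`` keys written by evaluate.py.
    """
    cells = {}                 # (method, config) -> accuracy in percent
    methods, configs = [], []  # keep first-seen order for rows and columns
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        m, c = row["method"], row["config"]
        if m not in methods:
            methods.append(m)
        if c not in configs:
            configs.append(c)
        cells[(m, c)] = 100.0 * row["accuracy"]

    out = ["| method | " + " | ".join(configs) + " |",
           "|---|" + "---:|" * len(configs)]
    for m in methods:
        vals = [f"{cells[(m, c)]:.2f}" if (m, c) in cells else "n/a"
                for c in configs]
        out.append(f"| `{m}` | " + " | ".join(vals) + " |")
    return "\n".join(out)


# Made-up accuracies, only to show the output shape.
sample = [
    '{"method": "first_sense", "config": "en-semeval2007-aw", "accuracy": 0.5}',
    '{"method": "simple_lesk", "config": "en-semeval2007-aw", "accuracy": 0.25}',
]
print(pivot_markdown(sample))
```

Missing method/config combinations render as `n/a` rather than failing, which is convenient when the two result files cover different configs.
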
## Protocol

* Target senses in `pywsd-datasets` are OEWN 2024 synset IDs
  (`oewn-<offset>-<pos>`), mapped from the original PWN 3.0 sense keys
  via `wn.compat.sensekey`.
* A prediction counts as correct if the returned `Synset.id` is any
  member of the gold `sense_ids_wordnet` list (handles multi-gold and
  synset-split tolerance).
* Instances where the PWN → OEWN map produced no target (empty gold
  list) are excluded from both numerator and denominator and reported
  in the `nogold` column.
* The `errors` column counts instances where the method raised,
  returned `None`, or returned something without an `.id` attribute.
  `evaluate.py` leaves these out of the denominator as well, so accuracy
  is computed over instances that received a usable prediction.

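The scoring rule above can be sketched in a few lines of plain Python. This is a simplified stand-in for the loop in `evaluate.py`, not a copy of it: the function name `score` is hypothetical, predictions are passed in as plain id strings (with `None` standing for any error case), and the ids are invented placeholders in the `oewn-<offset>-<pos>` shape.

```python
def score(predictions, golds):
    """Score predicted synset ids against lists of acceptable gold ids.

    predictions: one predicted id per instance, or None on error.
    golds: one list of gold ids per instance (may be empty).
    """
    total = correct = nogold = errors = 0
    for pred, gold in zip(predictions, golds):
        if not gold:        # mapping produced no target: skip entirely
            nogold += 1
            continue
        if pred is None:    # method raised or returned nothing usable
            errors += 1
            continue
        total += 1
        correct += pred in gold   # any member of the gold list is a hit
    accuracy = correct / total if total else 0.0
    return {"total": total, "correct": correct, "accuracy": accuracy,
            "nogold": nogold, "errors": errors}


# Placeholder ids, purely for illustration.
golds = [
    ["oewn-00000001-n"],                     # single gold, predicted correctly
    ["oewn-00000002-v", "oewn-00000003-v"],  # multi-gold, second member predicted
    [],                                      # no gold: excluded
    ["oewn-00000004-n"],                     # method failed: counted as error
    ["oewn-00000005-a"],                     # wrong prediction
]
preds = ["oewn-00000001-n", "oewn-00000003-v", "oewn-00000009-n", None,
         "oewn-00000006-a"]
result = score(preds, golds)
print(result)
```

Here three instances reach the denominator and two of them are hits, so the no-gold and error rows affect only their own counters.
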
## Results: Lesk family + baselines

Across the 5 Raganato all-words evaluation configs (pywsd 1.3.0,
`oewn:2024`, `wikipedia`-corpus IC).

| method | SE2007 (AW) | SE2013 (AW) | SE2015 (AW) | Senseval-2 | Senseval-3 |
|---|---:|---:|---:|---:|---:|
| `first_sense` | **52.76** | **57.65** | **64.61** | **60.62** | **61.46** |
| `random_sense` | 23.73 | 36.28 | 42.20 | 40.05 | 34.14 |
| `max_lemma_count` | 32.95 | 56.27 | 49.85 | 50.48 | 47.15 |
| `original_lesk` | 15.65 | 36.49 | 34.03 | 34.23 | 28.37 |
| `simple_lesk` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 |
| `adapted_lesk` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 |
| `cosine_lesk` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 |

Cells are accuracy % (higher is better; best score per config in bold).
Instance counts per config: SemEval-2007 455 (fine-grained),
SemEval-2013 1,644, SemEval-2015 1,022, Senseval-2 2,282,
Senseval-3 1,850.

### Reading

* `first_sense` (most-frequent-sense heuristic over OEWN's first-sense
  ordering) wins every config. The strongest knowledge-based variant,
  `simple_lesk`, comes within roughly 2–6 percentage points but never
  beats MFS: the well-known difficulty of all-words WSD with
  unsupervised signals.
* `simple_lesk` ≥ `adapted_lesk` > `cosine_lesk` holds on every
  config (the first two tie on SemEval-2013). Adapted Lesk's wider
  signature (holonyms/meronyms/similar) slightly hurts on Senseval-2
  and -3: more noise than signal.
* `original_lesk` (1986, definition-only overlap) collapses on the
  fine-grained SemEval-2007 (15.65 %) as expected.
* `max_lemma_count` is a surprisingly strong MFS proxy on SemEval-2013
  (56.27 %, within ~1.4 pp of `first_sense`) but weak on the fine-grained
  SemEval-2007 (32.95 %) where OEWN's per-sense counts are sparse.

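The per-config gaps behind these notes can be recomputed directly from the results table. The snippet below is only a convenience check: the accuracy values are copied from the table, and the helper name `gaps` is hypothetical.

```python
# Accuracy (%) copied from the results table, in column order.
configs = ["SE2007", "SE2013", "SE2015", "Senseval-2", "Senseval-3"]
first_sense = [52.76, 57.65, 64.61, 60.62, 61.46]
simple_lesk = [47.70, 55.34, 61.90, 58.64, 55.19]
adapted_lesk = [47.00, 55.34, 60.98, 57.19, 54.79]


def gaps(a, b):
    """Per-config difference a - b in percentage points, rounded to 2 dp."""
    return [round(x - y, 2) for x, y in zip(a, b)]


mfs_gap = gaps(first_sense, simple_lesk)    # how far simple_lesk trails MFS
lesk_gap = gaps(simple_lesk, adapted_lesk)  # cost of the wider signature

for name, g_mfs, g_lesk in zip(configs, mfs_gap, lesk_gap):
    print(f"{name:<11} first_sense - simple_lesk = {g_mfs:5.2f} pp   "
          f"simple_lesk - adapted_lesk = {g_lesk:5.2f} pp")
```

On these numbers the `first_sense` lead over `simple_lesk` runs from 1.98 pp (Senseval-2) to 6.27 pp (Senseval-3), and `adapted_lesk` trails `simple_lesk` by at most 1.45 pp, with an exact tie on SemEval-2013.
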
## Results: max_similarity (information-content family)

Computed on SemEval-2007 all-words (455 rows) only, because each
similarity-option run takes ~14 minutes per metric on this corpus
(quadratic over candidate × context synsets).

*(Results will be appended here as the sweep finishes. See
`results_maxsim.jsonl` for raw JSON output.)*

## Reproducibility

Results above were generated with:

* `pywsd==1.3.0`
* `wn==1.1.0`, lexicon `oewn:2024`
* `alvations/pywsd-datasets` built from
  [v0.2.0](https://github.com/alvations/pywsd-datasets/releases/tag/v0.2.0),
  which pins the Raganato bundle via a GitHub release mirror.
* Python 3.12, macOS.

Re-run with the exact commands above; scores should agree to within
floating-point noise unless OEWN or the Raganato mirror moves under you.

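Before comparing numbers, it can help to confirm that the local environment matches the package versions listed above. The following is a small sketch using only the standard library's `importlib.metadata`; the helper name `check_pins` and the `PINS` mapping are hypothetical and not part of this repo.

```python
from importlib import metadata

# Package versions listed in the Reproducibility section.
PINS = {"pywsd": "1.3.0", "wn": "1.1.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, found)} for every mismatch.

    ``found`` is None when the package is not installed. ``get_version``
    is injectable so the check can be exercised without the packages.
    """
    mismatches = {}
    for name, expected in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in check_pins(PINS).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means both pins match. The lexicon (`oewn:2024`) and dataset release are not covered by this check.
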
## What's not here (yet)

* Ranked-list metrics (MAP / first-n accuracy with `nbest=True`).
* Per-POS breakdown.
* Evaluation on the UFSAC lexical-sample test splits
  (`en-senseval2_ls`, `en-senseval3_ls`, `en-semeval2007_t17_ls`).
  Lexical-sample WSD has a different protocol (confusion per target
  lemma) that isn't reflected in this script's aggregate accuracy.

Contributions welcome.

experiments/evaluate.py

Lines changed: 201 additions & 0 deletions
@@ -0,0 +1,201 @@
"""Evaluate pywsd WSD methods against the ``alvations/pywsd-datasets``
test sets on HuggingFace Hub.

For each test instance:

  1. Build the context string from its ``tokens``,
  2. Ask a pywsd method for the disambiguated sense of the target,
  3. Count a hit if the returned ``Synset.id`` is in the gold
     ``sense_ids_wordnet`` list (list-valued to handle multi-gold +
     synset-split cases).

Instances where the dataset build failed to resolve any OEWN id for
the gold PWN 3.0 sense key (empty ``sense_ids_wordnet``) are excluded
from both numerator and denominator, and reported separately.
Instances where the method raises or returns nothing usable are counted
as ``errors`` and are likewise left out of the denominator.

Usage::

    pip install pywsd datasets
    python -m nltk.downloader punkt_tab averaged_perceptron_tagger_eng wordnet
    python -c "import wn; wn.download('oewn:2024')"

    python experiments/evaluate.py                      # all eval configs
    python experiments/evaluate.py --configs en-senseval2-aw
    python experiments/evaluate.py --methods simple_lesk first_sense
    python experiments/evaluate.py --limit 200          # smoke
    python experiments/evaluate.py --out results.jsonl
"""

from __future__ import annotations

import argparse
import json
import sys
import time
from pathlib import Path


# Test-only configs in alvations/pywsd-datasets, mapped to the split to load.
# The ``-aw`` entries are all-words sets; the ``_ls`` entries are lexical-sample.
TEST_CONFIGS: dict[str, str] = {
    "en-senseval2-aw": "test",
    "en-senseval3-aw": "test",
    "en-semeval2007-aw": "test",
    "en-semeval2013-aw": "test",
    "en-semeval2015-aw": "test",
    "en-senseval2_ls": "test",
    "en-senseval3_ls": "test",
    "en-semeval2007_t17_ls": "test",
}


def load_rows(config: str, split: str) -> list[dict]:
    """Pull a split directly from HuggingFace Hub."""
    from datasets import load_dataset
    ds = load_dataset("alvations/pywsd-datasets", config)
    return list(ds[split])


def detokenize(tokens: list[str]) -> str:
    """Join tokens with spaces; underscores in multiword tokens become spaces."""
    return " ".join(tokens).replace("_", " ")


def _wn_pos(pos: str) -> str | None:
    """Return ``pos`` if it is a WordNet POS tag (n/v/a/r), else ``None``."""
    return pos if pos in ("n", "v", "a", "r") else None


def run_method(method: str, sentence: str, lemma: str, pos: str | None):
    """Dispatch ``method`` by name and return whatever the pywsd call returns.

    ``max_similarity_<opt>`` names forward ``<opt>`` as the ``option`` argument.
    """
    from pywsd.lesk import simple_lesk, adapted_lesk, cosine_lesk, original_lesk
    from pywsd.similarity import max_similarity
    from pywsd.baseline import first_sense, random_sense, max_lemma_count

    if method == "simple_lesk":
        return simple_lesk(sentence, lemma, pos=pos)
    if method == "adapted_lesk":
        return adapted_lesk(sentence, lemma, pos=pos)
    if method == "cosine_lesk":
        return cosine_lesk(sentence, lemma, pos=pos)
    if method == "original_lesk":
        return original_lesk(sentence, lemma)
    if method.startswith("max_similarity_"):
        opt = method.removeprefix("max_similarity_")
        return max_similarity(sentence, lemma, option=opt, pos=pos)
    if method == "first_sense":
        # A failing baseline returns None, which the caller counts as an error.
        try:
            return first_sense(lemma, pos=pos)
        except Exception:
            return None
    if method == "random_sense":
        try:
            return random_sense(lemma, pos=pos)
        except Exception:
            return None
    if method == "max_lemma_count":
        return max_lemma_count(lemma)
    raise ValueError(f"unknown method {method!r}")


DEFAULT_METHODS: list[str] = [
    "first_sense",
    "random_sense",
    "max_lemma_count",
    "original_lesk",
    "simple_lesk",
    "adapted_lesk",
    "cosine_lesk",
    "max_similarity_path",
    "max_similarity_wup",
    "max_similarity_lch",
    "max_similarity_res",
    "max_similarity_jcn",
    "max_similarity_lin",
]


def evaluate_one(rows: list[dict], method: str, limit: int | None = None) -> dict:
    """Run one method over ``rows`` and return its accuracy counters.

    ``total`` counts only instances that have gold senses and received a
    usable prediction; no-gold and error instances are tallied separately.
    """
    total = 0
    correct = 0
    skipped_nogold = 0
    errors = 0
    t0 = time.time()
    n = len(rows) if limit is None else min(limit, len(rows))
    for i in range(n):
        row = rows[i]
        gold = row.get("sense_ids_wordnet") or []
        if not gold:
            # No OEWN target for this instance: excluded from accuracy.
            skipped_nogold += 1
            continue
        sentence = detokenize(row["tokens"])
        lemma = row["target_lemma"]
        pos = _wn_pos(row["target_pos"])
        try:
            pred = run_method(method, sentence, lemma, pos)
        except Exception:
            errors += 1
            continue
        if pred is None:
            errors += 1
            continue
        pid = getattr(pred, "id", None)
        if pid is None:
            errors += 1
            continue
        total += 1
        if pid in gold:
            correct += 1
    elapsed = time.time() - t0
    acc = correct / total if total else 0.0
    return {
        "method": method,
        "total": total,
        "correct": correct,
        "accuracy": acc,
        "skipped_nogold": skipped_nogold,
        "errors": errors,
        "elapsed_sec": elapsed,
    }


def main(argv: list[str] | None = None) -> int:
    ap = argparse.ArgumentParser(description=__doc__,
                                 formatter_class=argparse.RawDescriptionHelpFormatter)
    ap.add_argument("--configs", nargs="*", default=list(TEST_CONFIGS))
    ap.add_argument("--methods", nargs="*", default=DEFAULT_METHODS)
    ap.add_argument("--limit", type=int, default=None,
                    help="max rows per config (for quick runs)")
    ap.add_argument("--out", type=Path, default=None,
                    help="JSONL results dump (one row per config x method)")
    args = ap.parse_args(argv)

    # One-time WordNet handle / lexicon download.
    from pywsd._wordnet import _get
    _get()

    results: list[dict] = []
    for config in args.configs:
        split = TEST_CONFIGS.get(config)
        if split is None:
            print(f"skip non-test config: {config}", file=sys.stderr)
            continue
        rows = load_rows(config, split)
        print(f"\n## {config}/{split} ({len(rows)} rows)")
        print(f"{'method':<22} {'acc':>7} {'n':>6} {'err':>5} {'nogold':>7} {'sec':>7}")
        for method in args.methods:
            r = evaluate_one(rows, method, limit=args.limit)
            r.update({"config": config, "split": split})
            results.append(r)
            print(f"{method:<22} {r['accuracy']*100:>6.2f}% {r['total']:>6} "
                  f"{r['errors']:>5} {r['skipped_nogold']:>7} "
                  f"{r['elapsed_sec']:>6.1f}s", flush=True)

    if args.out:
        args.out.parent.mkdir(parents=True, exist_ok=True)
        with open(args.out, "w") as fh:
            for r in results:
                fh.write(json.dumps(r) + "\n")
        print(f"\nwrote {args.out}")

    return 0


if __name__ == "__main__":
    raise SystemExit(main())