@@ -59,47 +59,47 @@ python experiments/report.py --files experiments/results_lesk.jsonl \
5959pywsd 1.3.0, ` oewn:2024 ` , bundled Wikipedia-corpus IC. Cells are
6060accuracy % over OEWN-resolvable instances.
6161
62- | method | SE07 (AW) | SE13 (AW) | SE15 (AW) | SE2 (AW) | SE3 (AW) | SE2 (LS) | SE3 (LS) | SE07T17 (LS) |
62+ | method | SE07 (AW) | SE13 (AW) | SE15 (AW) | SE2 (AW) | SE3 (AW) | SE2 (LS) | SE3 (LS) | SE07T17 (LS† ) |
6363| ---| ---:| ---:| ---:| ---:| ---:| ---:| ---:| ---:|
64- | ` first_sense ` | 52.76 | 57.65 | ** 64.61** | ** 60.62** | ** 61.46** | 39.42 | † | † |
65- | ` random_sense ` | 23.73 | 36.28 | 42.20 | 40.05 | 34.14 | 18.36 | † | † |
66- | ` max_lemma_count ` | 32.95 | 56.27 | 49.85 | 50.48 | 47.15 | 23.24 | † | † |
67- | ` original_lesk ` | 15.65 | 36.49 | 34.03 | 34.23 | 28.37 | 16.69 | † | † |
68- | ` simple_lesk ` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 | † | † | † |
69- | ` adapted_lesk ` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 | † | † | † |
70- | ` cosine_lesk ` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 | † | † | † |
71- | ` max_similarity_path ` | 33.56 | – | – | – | – | – | – | – |
72- | ` max_similarity_wup ` | 30.56 | – | – | – | – | – | – | – |
73- | ` max_similarity_lch ` | 33.56 | – | – | – | – | – | – | – |
74- | ` max_similarity_res ` | 26.62 | – | – | – | – | – | – | – |
75- | ` max_similarity_jcn ` | ** 52.55** | – | – | – | – | – | – | – |
76- | ` max_similarity_lin ` | 30.56 | – | – | – | – | – | – | – |
77-
78- Column headers: ` SE07 (AW) ` =SemEval-2007 fine-grained all-words,
79- ` SE13 (AW) ` =SemEval-2013 Task 12, ` SE15 (AW) ` =SemEval-2015 Task 13,
80- ` SE2 (AW) ` =Senseval-2 all-words, ` SE3 (AW) ` =Senseval-3 all-words,
81- ` SE2 (LS) ` /` SE3 (LS) ` =Senseval-2 & 3 lexical-sample test sets,
82- ` SE07T17 (LS) ` =SemEval-2007 Task 17 lexical-sample test.
64+ | ` first_sense ` | 52.76 | 57.65 | ** 64.61** | ** 60.62** | ** 61.46** | 39.42 | 52.12 | 52.76 |
65+ | ` random_sense ` | 23.73 | 36.28 | 42.20 | 40.05 | 34.14 | 18.36 | 19.14 | 28.80 |
66+ | ` max_lemma_count ` | 32.95 | 56.27 | 49.85 | 50.48 | 47.15 | 23.24 | 38.02 | 33.18 |
67+ | ` original_lesk ` | 15.65 | 36.49 | 34.03 | 34.23 | 28.37 | 16.69 | 19.68 | 15.70 |
68+ | ` simple_lesk ` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 | 39.65 | 47.15 | 47.24 |
69+ | ` adapted_lesk ` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 | 39.15 | 47.24 | 46.54 |
70+ | ` cosine_lesk ` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 | 27.59 | 26.36 | 31.80 |
71+ | ` max_similarity_path ` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
72+ | ` max_similarity_wup ` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
73+ | ` max_similarity_lch ` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
74+ | ` max_similarity_res ` | 26.62 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
75+ | ` max_similarity_jcn ` | ** 52.55** | ※ | ※ | ※ | ※ | ※ | ※ | ※ |
76+ | ` max_similarity_lin ` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
77+
78+ Column headers: ` SE07 (AW) ` =SemEval-2007 fine-grained all-words
79+ (Raganato export), ` SE13 (AW) ` =SemEval-2013 Task 12,
80+ ` SE15 (AW) ` =SemEval-2015 Task 13, ` SE2 (AW) ` =Senseval-2 all-words,
81+ ` SE3 (AW) ` =Senseval-3 all-words, ` SE2 (LS) ` / ` SE3 (LS) ` =Senseval-2
82+ & 3 lexical-sample test sets, ` SE07T17 (LS†) ` =SemEval-2007 Task 17
83+ via UFSAC.
84+
85+ ** Note:** ` SE07T17 (LS†) ` is essentially the same 455-instance
86+ SemEval-2007 Task 17 fine-grained all-words test set as ` SE07 (AW) ` ,
87+ sourced from UFSAC's export rather than the Raganato bundle. SE2007
88+ Task 17 was never a lexical-sample track; the ` _ls ` suffix in
89+ ` en-semeval2007_t17_ls ` is a mislabel in pywsd-datasets v0.2.0. The
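90+ scores confirm it: every deterministic method lands within ±0.5 pp
91+ across the two columns; only ` random_sense ` differs (23.73 vs 28.80),
92+ which is unsurprising for a random baseline. The config will be renamed in the next dataset release.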
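
The column comparison in the note can be reproduced from the table itself. The sketch below copies the seven filled rows of `SE07 (AW)` and `SE07T17 (LS†)` by hand and reports the per-method gap; the names `scores`, `column_gaps` and `TOLERANCE_PP` are illustrative and not part of the repository.

```python
# Accuracy (%) copied from the results table: (SE07 (AW), SE07T17 (LS†)).
scores = {
    "first_sense": (52.76, 52.76),
    "random_sense": (23.73, 28.80),
    "max_lemma_count": (32.95, 33.18),
    "original_lesk": (15.65, 15.70),
    "simple_lesk": (47.70, 47.24),
    "adapted_lesk": (47.00, 46.54),
    "cosine_lesk": (32.03, 31.80),
}
TOLERANCE_PP = 0.5  # threshold quoted in the note


def column_gaps(table):
    """Absolute gap in percentage points between the two columns, per method."""
    return {method: round(abs(a - b), 2) for method, (a, b) in table.items()}


gaps = column_gaps(scores)
outliers = sorted(m for m, gap in gaps.items() if gap > TOLERANCE_PP)

for method, gap in gaps.items():
    print(f"{method:16s} {gap:5.2f} pp")
print("outside tolerance:", outliers)
```

With these values the only row outside the tolerance is `random_sense`, whose output is not expected to be stable across runs.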
8393
8494### Cells not filled
8595
86- ** †** — * in progress.* Lesk family + baselines on the three
87- lexical-sample test splits is currently running. Numbers for
88- ` en-senseval2_ls/test ` are landing first; three methods already
89- reported above (` first_sense ` , ` random_sense ` , ` max_lemma_count ` ,
90- ` original_lesk ` ); ` simple_lesk ` , ` adapted_lesk ` , ` cosine_lesk `
91- remaining. ` en-senseval3_ls/test ` (3,849 rows) and
92- ` en-semeval2007_t17_ls/test ` (455 rows) still queued. ETA under an
93- hour. This file will be updated when the sweep finishes; raw runs
94- stream into ` results_ls.jsonl ` .
95-
96- ** –** — * deliberately skipped.* Each ` max_similarity ` run is
96+ ** ‡** — * deliberately skipped.* Each ` max_similarity ` run is
9797quadratic in (candidate synsets × context synsets) and takes ~ 10–30
9898minutes per metric per config even on the 455-row SemEval-2007. The
99- other test configs are 2–10× larger, so a full sweep would be many
100- hours. Partial SE2007 results (shown above) are sufficient to rank
101- the six metrics; ` jcn ` is clearly best. If someone needs the complete
102- grid, run:
99+ other test configs are 2–10× larger, so a full sweep of all six
100+ metrics would take many hours. Partial SE2007 results (the filled SE07
101+ column above) are sufficient to rank the six metrics; ` jcn ` is clearly
102+ best (52.55 vs 33.56 for ` path ` / ` lch ` ). If you need the complete grid, run:
103103```
104104python experiments/evaluate.py \
105105 --configs <larger-config> \
@@ -108,6 +108,12 @@ python experiments/evaluate.py \\
108108 --out experiments/results_maxsim_<config>.jsonl
109109```
110110
111+ ** ※** — * jcn row being filled now.* Because jcn is the standout
112+ IC-based metric on SE2007, we're running it across every remaining
113+ config to see whether the advantage holds. Streams into
114+ ` results_maxsim_jcn.jsonl ` ; expect several hours of wall time (jcn
115+ took 586 s for 432 rows; the largest config here has 4,239 rows).
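
For a rough sense of scale, the one measured jcn run can be extrapolated by row count. This is a back-of-envelope sketch, not a benchmark: it assumes the per-row cost on other configs matches the 586 s / 432 rows measurement, which may not hold if their contexts or candidate sets are larger. The helper `estimate_seconds` is hypothetical and not part of the repository.

```python
def estimate_seconds(ref_seconds, ref_rows, target_rows):
    """Scale a measured runtime linearly by row count (assumes constant per-row cost)."""
    return ref_seconds / ref_rows * target_rows


# Measured: jcn took 586 s for 432 rows; the largest config has 4,239 rows.
est = estimate_seconds(586, 432, 4239)
print(f"{est:.0f} s = {est / 3600:.1f} h")  # prints "5750 s = 1.6 h"
```

Under that assumption the largest config alone takes roughly 1.6 hours, so running every remaining config adds up to several hours.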
116+
111117### Instance counts (gold-resolvable / total in test split)
112118
113119| config | n (total) | gold-resolvable | OEWN coverage |