@@ -68,12 +68,12 @@ accuracy % over OEWN-resolvable instances.
 | `simple_lesk` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 | 39.65 | 47.15 | 47.24 |
 | `adapted_lesk` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 | 39.15 | 47.24 | 46.54 |
 | `cosine_lesk` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 | 27.59 | 26.36 | 31.80 |
-| `max_similarity_path` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
-| `max_similarity_wup` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
-| `max_similarity_lch` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
-| `max_similarity_res` | 26.62 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
-| `max_similarity_jcn` | **52.55** | ※ | ※ | ※ | ※ | ※ | ※ | ※ |
-| `max_similarity_lin` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
+| `max_similarity_path` | 33.56 | – | – | – | – | – | – | – |
+| `max_similarity_wup` | 30.56 | – | – | – | – | – | – | – |
+| `max_similarity_lch` | 33.56 | – | – | – | – | – | – | – |
+| `max_similarity_res` | 26.62 | – | – | – | – | – | – | – |
+| `max_similarity_jcn` | **52.55** | – | – | – | – | – | – | – |
+| `max_similarity_lin` | 30.56 | – | – | – | – | – | – | – |
 
 Column headers: `SE07 (AW)` = SemEval-2007 fine-grained all-words
 (Raganato export), `SE13 (AW)` = SemEval-2013 Task 12,
@@ -91,28 +91,33 @@ scores confirm it — all methods produce near-identical numbers
 (±0.5 pp) across the two columns. Will rename in the next dataset
 release.
 
-### Cells not filled
-
-**‡** — *deliberately skipped.* Each `max_similarity` run is
-quadratic in (candidate synsets × context synsets) and takes ~10–30
-minutes per metric per config even on the 455-row SemEval-2007. The
-other test configs are 2–10× larger, so a full sweep of all six
-metrics would be many hours. Partial SE2007 results (the filled SE07
-column above) are sufficient to rank the six metrics; `jcn` is clearly
-best. If someone needs the complete grid, run:
-```
-python experiments/evaluate.py \
-  --configs <larger-config> \
-  --methods max_similarity_path max_similarity_wup max_similarity_lch \
-            max_similarity_res max_similarity_jcn max_similarity_lin \
-  --out experiments/results_maxsim_<config>.jsonl
-```
-
-**※** — *jcn row being filled now.* Because jcn is the standout
-IC-based metric on SE2007, we're running it across every remaining
-config to see whether the advantage holds. Streams into
-`results_maxsim_jcn.jsonl`; expect several hours of wall time (jcn
-was 586 s for 432 rows; largest config here is 4,239 rows).
+### Cells not filled (`–`)
+
+Two reasons a cell is `–`:
+
+1. **The jcn row is being filled now.** Because jcn is the standout
+   IC-based metric on SE2007, we're running it across every remaining
+   config to see whether the advantage holds. Streams into
+   `results_maxsim_jcn.jsonl`; expect several hours of wall time (jcn
+   was 586 s for 432 rows; largest config here is 4,239 rows). Cells
+   will be updated as they land.
+
+2. **All other `max_similarity` cells are deliberately skipped.** Each
+   `max_similarity` run is quadratic in (candidate synsets × context
+   synsets) and takes ~10–30 minutes per metric per config even on
+   the 455-row SemEval-2007. The other test configs are 2–10× larger,
+   so a full sweep of all six metrics would be many hours. Partial
+   SE2007 results (the filled SE07 column above) are sufficient to
+   rank the six metrics; `jcn` is clearly best. If someone needs the
+   complete grid, run:
+
+   ```
+   python experiments/evaluate.py \
+     --configs <larger-config> \
+     --methods max_similarity_path max_similarity_wup max_similarity_lch \
+               max_similarity_res max_similarity_jcn max_similarity_lin \
+     --out experiments/results_maxsim_<config>.jsonl
+   ```
 
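For a rough sense of scale, the jcn timing quoted above (586 s for 432 rows) can be extrapolated to the largest config (4,239 rows). The sketch below assumes wall time grows linearly with row count, which the README does not state. Per-instance cost depends on the number of candidate and context synsets, so treat the result as an order-of-magnitude estimate only. `estimate_wall_time_s` is a hypothetical helper for illustration, not part of `experiments/evaluate.py`.

```python
def estimate_wall_time_s(rows: int,
                         ref_rows: int = 432,
                         ref_seconds: float = 586.0) -> float:
    """Extrapolate wall time from one measured run.

    Assumption (not from the README): cost scales linearly with the
    number of rows. The defaults are the jcn figures quoted in the text.
    """
    if rows < 0 or ref_rows <= 0:
        raise ValueError("rows must be >= 0 and ref_rows must be > 0")
    return rows * ref_seconds / ref_rows


if __name__ == "__main__":
    largest_config_rows = 4239  # largest config mentioned in the text
    seconds = estimate_wall_time_s(largest_config_rows)
    print(f"{largest_config_rows} rows: ~{seconds / 60:.0f} min for jcn alone")
```

Under that assumption the largest config alone comes to about 96 minutes for jcn, which is in line with the "several hours" quoted for the whole row.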
 ### Instance counts (gold-resolvable / total in test split)
 