Commit 7b962fd

experiments: fill all Lesk+baseline cells; note SE07-t17-ls = SE07-aw duplicate
Lex-sample sweep done. All 7 methods x 3 lex-sample configs (senseval2_ls, senseval3_ls, semeval2007_t17_ls) are now in the matrix. Flagged: en-semeval2007_t17_ls is the same 455-instance SemEval-2007 Task 17 fine-grained all-words set as en-semeval2007-aw, just sourced from UFSAC instead of Raganato. SE2007 Task 17 was never a lexical-sample track; the '_ls' suffix in pywsd-datasets 0.2.0 is a mislabel. Scores confirm it: all deterministic methods are within 0.5 pp across the two columns (random_sense differs by about 5 pp). Rename in next dataset release. Notation: plain '-' removed; '‡' marks deliberately-skipped max_similarity cells; '※' marks the jcn row being filled by the ongoing background sweep. Both are explained in the 'Cells not filled' section.
1 parent d3e9731 commit 7b962fd

1 file changed

Lines changed: 41 additions & 35 deletions

experiments/README.md

@@ -59,47 +59,47 @@ python experiments/report.py --files experiments/results_lesk.jsonl \
5959
pywsd 1.3.0, `oewn:2024`, bundled Wikipedia-corpus IC. Cells are
6060
accuracy % over OEWN-resolvable instances.
6161

62-
| method | SE07 (AW) | SE13 (AW) | SE15 (AW) | SE2 (AW) | SE3 (AW) | SE2 (LS) | SE3 (LS) | SE07T17 (LS) |
62+
| method | SE07 (AW) | SE13 (AW) | SE15 (AW) | SE2 (AW) | SE3 (AW) | SE2 (LS) | SE3 (LS) | SE07T17 (LS†) |
6363
|---|---:|---:|---:|---:|---:|---:|---:|---:|
64-
| `first_sense` | 52.76 | 57.65 | **64.61** | **60.62** | **61.46** | 39.42 |||
65-
| `random_sense` | 23.73 | 36.28 | 42.20 | 40.05 | 34.14 | 18.36 |||
66-
| `max_lemma_count` | 32.95 | 56.27 | 49.85 | 50.48 | 47.15 | 23.24 |||
67-
| `original_lesk` | 15.65 | 36.49 | 34.03 | 34.23 | 28.37 | 16.69 |||
68-
| `simple_lesk` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 ||||
69-
| `adapted_lesk` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 ||||
70-
| `cosine_lesk` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 ||||
71-
| `max_similarity_path` | 33.56 ||||||||
72-
| `max_similarity_wup` | 30.56 ||||||||
73-
| `max_similarity_lch` | 33.56 ||||||||
74-
| `max_similarity_res` | 26.62 ||||||||
75-
| `max_similarity_jcn` | **52.55** ||||||||
76-
| `max_similarity_lin` | 30.56 ||||||||
77-
78-
Column headers: `SE07 (AW)`=SemEval-2007 fine-grained all-words,
79-
`SE13 (AW)`=SemEval-2013 Task 12, `SE15 (AW)`=SemEval-2015 Task 13,
80-
`SE2 (AW)`=Senseval-2 all-words, `SE3 (AW)`=Senseval-3 all-words,
81-
`SE2 (LS)`/`SE3 (LS)`=Senseval-2 & 3 lexical-sample test sets,
82-
`SE07T17 (LS)`=SemEval-2007 Task 17 lexical-sample test.
64+
| `first_sense` | 52.76 | 57.65 | **64.61** | **60.62** | **61.46** | 39.42 | 52.12 | 52.76 |
65+
| `random_sense` | 23.73 | 36.28 | 42.20 | 40.05 | 34.14 | 18.36 | 19.14 | 28.80 |
66+
| `max_lemma_count` | 32.95 | 56.27 | 49.85 | 50.48 | 47.15 | 23.24 | 38.02 | 33.18 |
67+
| `original_lesk` | 15.65 | 36.49 | 34.03 | 34.23 | 28.37 | 16.69 | 19.68 | 15.70 |
68+
| `simple_lesk` | 47.70 | 55.34 | 61.90 | 58.64 | 55.19 | 39.65 | 47.15 | 47.24 |
69+
| `adapted_lesk` | 47.00 | 55.34 | 60.98 | 57.19 | 54.79 | 39.15 | 47.24 | 46.54 |
70+
| `cosine_lesk` | 32.03 | 44.72 | 48.11 | 45.67 | 41.38 | 27.59 | 26.36 | 31.80 |
71+
| `max_similarity_path` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
72+
| `max_similarity_wup` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
73+
| `max_similarity_lch` | 33.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
74+
| `max_similarity_res` | 26.62 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
75+
| `max_similarity_jcn` | **52.55** | ※ | ※ | ※ | ※ | ※ | ※ | ※ |
76+
| `max_similarity_lin` | 30.56 | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ | ‡ |
77+
78+
Column headers: `SE07 (AW)`=SemEval-2007 fine-grained all-words
79+
(Raganato export), `SE13 (AW)`=SemEval-2013 Task 12,
80+
`SE15 (AW)`=SemEval-2015 Task 13, `SE2 (AW)`=Senseval-2 all-words,
81+
`SE3 (AW)`=Senseval-3 all-words, `SE2 (LS)` / `SE3 (LS)`=Senseval-2
82+
& 3 lexical-sample test sets, `SE07T17 (LS†)`=SemEval-2007 Task 17
83+
via UFSAC.
84+
85+
**Note:** `SE07T17 (LS†)` is essentially the same 455-instance
86+
SemEval-2007 Task 17 fine-grained all-words test set as `SE07 (AW)`,
87+
sourced from UFSAC's export rather than the Raganato bundle. SE2007
88+
Task 17 was never a lexical-sample track; the `_ls` suffix in
89+
`en-semeval2007_t17_ls` is a mislabel in pywsd-datasets v0.2.0. The
90+
scores confirm it: every deterministic method lands within 0.5 pp
91+
across the two columns (`random_sense` is the exception at about
92+
5 pp, presumably sampling noise). To be renamed in the next dataset release.
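
The near-duplicate claim can be checked directly against the table. The sketch below copies the `SE07 (AW)` and `SE07T17` values for the seven filled rows into plain dictionaries and lists any method whose two scores differ by more than 0.5 pp. The variable names and the tolerance constant are illustrative only; they are not part of `experiments/report.py`.

```python
# Accuracy (%) copied by hand from the results table above.
se07_aw = {
    "first_sense": 52.76, "random_sense": 23.73, "max_lemma_count": 32.95,
    "original_lesk": 15.65, "simple_lesk": 47.70, "adapted_lesk": 47.00,
    "cosine_lesk": 32.03,
}
se07_t17 = {
    "first_sense": 52.76, "random_sense": 28.80, "max_lemma_count": 33.18,
    "original_lesk": 15.70, "simple_lesk": 47.24, "adapted_lesk": 46.54,
    "cosine_lesk": 31.80,
}
TOLERANCE_PP = 0.5  # illustrative threshold, in percentage points

# Absolute gap between the two columns for each method.
diffs = {m: round(abs(se07_aw[m] - se07_t17[m]), 2) for m in se07_aw}
outliers = [m for m, d in diffs.items() if d > TOLERANCE_PP]

for method, gap in diffs.items():
    print(f"{method:16s} {gap:5.2f} pp")
print("outside tolerance:", outliers)
```

Only `random_sense` falls outside the tolerance (a gap of about 5 pp), which is plausible for a baseline that picks senses at random; the other six methods agree to within 0.5 pp.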
8393

8494
### Cells not filled
8595

86-
*****in progress.* Lesk family + baselines on the three
87-
lexical-sample test splits is currently running. Numbers for
88-
`en-senseval2_ls/test` are landing first; three methods already
89-
reported above (`first_sense`, `random_sense`, `max_lemma_count`,
90-
`original_lesk`); `simple_lesk`, `adapted_lesk`, `cosine_lesk`
91-
remaining. `en-senseval3_ls/test` (3,849 rows) and
92-
`en-semeval2007_t17_ls/test` (455 rows) still queued. ETA under an
93-
hour. This file will be updated when the sweep finishes; raw runs
94-
stream into `results_ls.jsonl`.
95-
96-
*****deliberately skipped.* Each `max_similarity` run is
96+
**‡** *deliberately skipped.* Each `max_similarity` run is
9797
quadratic in (candidate synsets × context synsets) and takes ~10–30
9898
minutes per metric per config even on the 455-row SemEval-2007. The
99-
other test configs are 2–10× larger, so a full sweep would be many
100-
hours. Partial SE2007 results (shown above) are sufficient to rank
101-
the six metrics; `jcn` is clearly best. If someone needs the complete
102-
grid, run:
99+
other test configs are 2–10× larger, so a full sweep of all six
100+
metrics would be many hours. Partial SE2007 results (the filled SE07
101+
column above) are sufficient to rank the six metrics; `jcn` is clearly
102+
best. If someone needs the complete grid, run:
103103
```
104104
python experiments/evaluate.py \\
105105
--configs <larger-config> \\
@@ -108,6 +108,12 @@ python experiments/evaluate.py \\
108108
--out experiments/results_maxsim_<config>.jsonl
109109
```
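
To make the cost argument above concrete, here is a toy sketch of a max-similarity style scorer. It is not pywsd's implementation: the function, the sense labels, and the similarity table are all hypothetical, chosen only to show why the number of similarity calls grows as (candidate senses × context senses).

```python
def max_similarity_sketch(candidates, context, sim):
    """Return (best candidate, its score, number of sim() calls).

    candidates: senses of the ambiguous word
    context:    one list of senses per context word
    sim:        function (sense_a, sense_b) -> float
    """
    calls = 0
    scores = {}
    for cand in candidates:
        total = 0.0
        for word_senses in context:
            # Keep only the best-matching sense of each context word.
            best = 0.0
            for ctx_sense in word_senses:
                best = max(best, sim(cand, ctx_sense))
                calls += 1
            total += best
        scores[cand] = total
    winner = max(scores, key=scores.get)
    return winner, scores[winner], calls


# Hypothetical similarity table; unlisted pairs score 0.0.
TOY_SIM = {
    ("bank#finance", "money#currency"): 0.8,
    ("bank#finance", "river#stream"): 0.1,
    ("bank#river", "river#stream"): 0.9,
    ("bank#river", "money#wealth"): 0.2,
}


def toy_sim(a, b):
    return TOY_SIM.get((a, b), 0.0)


winner, score, calls = max_similarity_sketch(
    ["bank#finance", "bank#river"],
    [["money#currency", "money#wealth"], ["river#stream"]],
    toy_sim,
)
print(winner, round(score, 2), calls)
```

With 2 candidate senses and 3 context senses in total, the scorer makes 6 similarity calls. Real sentences have many more context senses per instance, which is where the per-config runtime comes from.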
110110

111+
**※** *jcn row being filled now.* Because jcn is the standout
112+
IC-based metric on SE2007, we're running it across every remaining
113+
config to see whether the advantage holds. Streams into
114+
`results_maxsim_jcn.jsonl`; expect several hours of wall time (jcn
115+
was 586 s for 432 rows; largest config here is 4,239 rows).
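
As a back-of-the-envelope check on that estimate, the timing quoted above can be extrapolated linearly by row count. This is a rough sketch that assumes runtime scales linearly with the number of rows, which ignores differences in context length and polysemy between configs; the function name is illustrative.

```python
def estimate_seconds(ref_seconds, ref_rows, target_rows):
    """Scale a measured runtime linearly by row count."""
    return ref_seconds / ref_rows * target_rows


# Reference timing from the SE2007 jcn run: 586 s for 432 rows.
secs = estimate_seconds(586, 432, 4239)
print(f"{secs:.0f} s ~ {secs / 3600:.1f} h")  # prints "5750 s ~ 1.6 h"
```

That is roughly 1.6 hours for the largest config alone, so several hours in total is plausible once the other remaining configs are included.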
116+
111117
### Instance counts (gold-resolvable / total in test split)
112118

113119
| config | n (total) | gold-resolvable | OEWN coverage |
