Fix benchmark citations: cite original dataset papers

jeremymanning · jeremymanning · commit 6bd6bdd3e60e · 2026-01-29T22:12:00.000-05:00
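Replace the Pennington et al. (2014) citation for WordSim-353 and MEN with the papers that introduced each dataset:
- WordSim-353: Finkelstein et al. (2002)
- MEN: Bruni et al. (2014)
- SimLex-999: Hill et al. (2015) (citation unchanged)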
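
Reviewer note: the "~20% of variance" line that this commit leaves untouched is consistent with squaring the SimLex-999 correlation (0.44² ≈ 0.19). The sketch below just reproduces that arithmetic so the figure can be checked. `shared_variance` is a hypothetical helper, not something in this repo. Reading a squared ρ as "variance captured" is an assumption: if ρ is Spearman's rank correlation, as is usual for these benchmarks, it strictly describes shared variance in ranks.

```python
def shared_variance(rho: float) -> float:
    """Squared correlation: the fraction of variance two variables share."""
    return rho ** 2


# Correlations quoted in the slide text (taken from the diff, not re-measured).
simlex_rho = 0.44
relatedness_rhos = (0.70, 0.75)

print(f"SimLex-999 (similarity): {shared_variance(simlex_rho):.1%}")
for rho in relatedness_rhos:
    print(f"Relatedness, rho={rho:.2f}: {shared_variance(rho):.1%}")
```

By the same arithmetic, the relatedness range of 0.70 to 0.75 corresponds to roughly 49% to 56% shared variance, which is why the slide calls the SimLex-999 figure the weaker result.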
diff --git a/slides/week4/lecture14.html b/slides/week4/lecture14.html
@@ -730,7 +730,7 @@ <h1 id="what-distributional-models-reveal-about-us">What distributional models r
 <p>Models like LSA, LDA, and Word2Vec learn meaning purely from co-occurrence patterns in text. See Lectures <a href="../week3/lecture10.html">10</a>, <a href="lecture11.html">11</a>, and <a href="lecture12.html">12</a>.</p>
 </div>
 <div class="example-box" data-title="For consideration...">
-<p>Word2Vec predicts human <em>relatedness</em> judgments with ρ ≈ 0.70–0.75 on WordSim-353 and MEN (<a href="https://aclanthology.org/D14-1162/">Pennington et al., 2014</a>), approaching human inter-rater agreement (~0.68). It predicts strict <em>similarity</em> less well—ρ ≈ 0.44 on SimLex-999 (<a href="https://doi.org/10.1162/COLI_a_00237">Hill et al., 2015</a>).</p>
+<p>Word2Vec predicts human <em>relatedness</em> judgments with ρ ≈ 0.70–0.75 on WordSim-353 (<a href="https://doi.org/10.1145/503104.503110">Finkelstein et al., 2002</a>) and MEN (<a href="https://doi.org/10.1613/jair.4135">Bruni et al., 2014</a>), approaching human inter-rater agreement (~0.68). It predicts strict <em>similarity</em> less well—ρ ≈ 0.44 on SimLex-999 (<a href="https://doi.org/10.1162/COLI_a_00237">Hill et al., 2015</a>).</p>
 <p><strong>Even the weaker result means ~20% of variance in human semantic judgments is captured by co-occurrence statistics alone.</strong></p>
 <p>What does this say about the nature of human meaning?</p>
 </div>
diff --git a/slides/week4/lecture14.md b/slides/week4/lecture14.md
@@ -66,7 +66,7 @@ Models like LSA, LDA, and Word2Vec learn meaning purely from co-occurrence patte
 
 <div class="example-box" data-title="For consideration...">
 
-Word2Vec predicts human *relatedness* judgments with ρ ≈ 0.70–0.75 on WordSim-353 and MEN ([Pennington et al., 2014](https://aclanthology.org/D14-1162/)), approaching human inter-rater agreement (~0.68). It predicts strict *similarity* less well—ρ ≈ 0.44 on SimLex-999 ([Hill et al., 2015](https://doi.org/10.1162/COLI_a_00237)).
+Word2Vec predicts human *relatedness* judgments with ρ ≈ 0.70–0.75 on WordSim-353 ([Finkelstein et al., 2002](https://doi.org/10.1145/503104.503110)) and MEN ([Bruni et al., 2014](https://doi.org/10.1613/jair.4135)), approaching human inter-rater agreement (~0.68). It predicts strict *similarity* less well—ρ ≈ 0.44 on SimLex-999 ([Hill et al., 2015](https://doi.org/10.1162/COLI_a_00237)).
 
 **Even the weaker result means ~20% of variance in human semantic judgments is captured by co-occurrence statistics alone.**
 
diff --git a/slides/week4/lecture14.pdf b/slides/week4/lecture14.pdf