Parent: llm-course (see main AGENTS.md)
Assignment 3 template repository for PSYC 51.17. 1-week assignment. Students implement ~10 embedding methods, create visualizations with DataMapPlot, and evaluate via document matching.
embeddings-llm-course/
├── Assignment3_Wikipedia_Embeddings.ipynb # Template notebook (has Colab badge)
├── README.md # Assignment description
└── AGENTS.md # This file
| Component | Description |
|---|---|
| Part 1: Embeddings (40 pts) | LSA, TF-IDF+SVD, Word2Vec, GloVe, FastText, SBERT, BGE, E5, etc. |
| Part 2: Visualization (25 pts) | UMAP, K-Means, HDBSCAN, DataMapPlot |
| Part 3: Evaluation (20 pts) | Document matching (first half vs second half) |
| Part 4: Essays (15 pts) | 1-2 reflections (300-500 words each) |
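
The Part 3 document-matching evaluation can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: it assumes each article is split at the word midpoint, both halves are embedded with the same method, and the score is top-1 accuracy under cosine similarity. The helper names `split_in_half` and `matching_accuracy` are hypothetical.

```python
import numpy as np


def split_in_half(text: str) -> tuple[str, str]:
    """Split a document into first and second halves at the word midpoint."""
    words = text.split()
    mid = len(words) // 2
    return " ".join(words[:mid]), " ".join(words[mid:])


def matching_accuracy(first: np.ndarray, second: np.ndarray) -> float:
    """Fraction of first halves whose most similar second half is their own.

    `first` and `second` are (n_docs, dim) embedding matrices where row i of
    each comes from the same article. Assumed metric: cosine similarity.
    """
    # L2-normalize rows; clip norms to avoid dividing by zero on empty vectors.
    a = first / np.clip(np.linalg.norm(first, axis=1, keepdims=True), 1e-12, None)
    b = second / np.clip(np.linalg.norm(second, axis=1, keepdims=True), 1e-12, None)
    sims = a @ b.T                    # (n_docs, n_docs) cosine similarities
    predicted = sims.argmax(axis=1)   # best-matching second half per first half
    return float((predicted == np.arange(len(first))).mean())
```

An embedding method that captures document-level content should score well above the chance level of 1/n_docs on this metric.
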
- LSA: Uses `CountVectorizer` (raw counts), NOT `TfidfVectorizer`. Per Deerwester 1990.
- Wikipedia dataset: ~750 MB, downloaded from Dropbox via `urllib.request`
- DataMapPlot: Requires labels for every point; use "Unlabelled" for noise/outliers
- E5 models: Require "passage: " prefix for documents
- Notebook opens in Colab via badge at top
- Students fork via GitHub Classroom, not direct clone
- Start with 5,000 articles for development; scale up for final submission
- Wikipedia data downloaded in notebook (not bundled)
- Don't commit solution code to this repo (template only)
- Don't bundle large datasets (download dynamically)
- Don't use TF-IDF for LSA (use CountVectorizer for historical accuracy)
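
For reference, the LSA convention above (raw counts, then truncated SVD) might look like the sketch below. It assumes scikit-learn's `CountVectorizer` and `TruncatedSVD`; the function name `lsa_embed` and its default parameters are hypothetical and not taken from the notebook.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


def lsa_embed(docs, n_components=100, random_state=0):
    """Embed documents with LSA: raw term counts followed by truncated SVD."""
    # Raw counts, not TF-IDF weights, per the convention in this file.
    vectorizer = CountVectorizer()
    counts = vectorizer.fit_transform(docs)  # sparse (n_docs, n_terms) matrix

    # Keep the number of components below the vocabulary size so the
    # decomposition is well defined on small development subsets.
    k = max(1, min(n_components, counts.shape[1] - 1))
    svd = TruncatedSVD(n_components=k, random_state=random_state)
    return svd.fit_transform(counts)  # dense (n_docs, k) matrix
```

Swapping `CountVectorizer` for `TfidfVectorizer` in this sketch would give the TF-IDF+SVD variant listed separately in Part 1, which is why the two should not be conflated.
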