Wikipedia Embeddings Assignment Repository

Parent: llm-course (see main AGENTS.md)

Overview

Template repository for Assignment 3 of PSYC 51.17, a one-week assignment. Students implement about 10 embedding methods, visualize the results with DataMapPlot, and evaluate each method via document matching.

Structure

embeddings-llm-course/
├── Assignment3_Wikipedia_Embeddings.ipynb  # Template notebook (has Colab badge)
├── README.md                               # Assignment description
└── AGENTS.md                               # This file

Key Components

Component                       Description
Part 1: Embeddings (40 pts)     LSA, TF-IDF+SVD, Word2Vec, GloVe, FastText, SBERT, BGE, E5, etc.
Part 2: Visualization (25 pts)  UMAP, K-Means, HDBSCAN, DataMapPlot
Part 3: Evaluation (20 pts)     Document matching (first half vs second half)
Part 4: Essays (15 pts)         1-2 reflections (300-500 words each)
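
The document matching in Part 3 can be sketched as follows. This assumes each article is split in half, both halves are embedded with the same method, and a first half counts as matched when its most cosine-similar second half comes from the same article. The function names and the top-1 criterion are assumptions for illustration, not the assignment's grading code.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_accuracy(first_halves, second_halves):
    """Fraction of first halves whose nearest second half is from the same article.

    Row i of each list is assumed to come from article i.
    """
    if len(first_halves) != len(second_halves):
        raise ValueError("need one second-half embedding per first-half embedding")
    if not first_halves:
        return 0.0
    correct = 0
    for i, query in enumerate(first_halves):
        sims = [cosine(query, candidate) for candidate in second_halves]
        best = max(range(len(sims)), key=sims.__getitem__)
        if best == i:
            correct += 1
    return correct / len(first_halves)
```

Comparing this score across embedding methods on the same set of articles gives a simple ranking of the methods.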

Technical Notes

  • LSA: Use CountVectorizer (raw counts), NOT TfidfVectorizer, following Deerwester et al. (1990).
  • Wikipedia dataset: ~750 MB, downloaded from Dropbox via urllib.request
  • DataMapPlot: Requires labels for every point; use "Unlabelled" for noise/outliers
  • E5 models: Require "passage: " prefix for documents
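
The DataMapPlot and E5 notes above reduce to small preprocessing steps. This sketch assumes the clustering output follows the HDBSCAN convention of marking noise points with -1. The helper names `to_datamapplot_labels` and `add_e5_prefix` and the "Cluster N" naming are hypothetical.

```python
def to_datamapplot_labels(cluster_ids, noise_id=-1, noise_label="Unlabelled"):
    """Return one string label per point, mapping noise points to `noise_label`."""
    return [noise_label if c == noise_id else f"Cluster {c}" for c in cluster_ids]


def add_e5_prefix(texts, prefix="passage: "):
    """Prepend the E5 document prefix, skipping texts that already carry it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]
```

The prefixed texts would then be passed to the E5 encoder in place of the raw article text.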

Conventions

  • Notebook opens in Colab via badge at top
  • Students fork via GitHub Classroom rather than cloning this repo directly
  • Start with 5,000 articles for development; scale up for final submission
  • Wikipedia data downloaded in notebook (not bundled)
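
The download and development-subset conventions above might look like this in the notebook. This is a minimal sketch: `download_if_missing`, `dev_subset`, and the `.part` temporary suffix are hypothetical names, and the actual Dropbox URL is not given here, so it is left as a parameter.

```python
import os
import random
import urllib.request


def download_if_missing(url, dest_path):
    """Download `url` to `dest_path` unless the file already exists; return the path."""
    if not os.path.exists(dest_path):
        tmp_path = dest_path + ".part"
        # Download to a temporary name first so an interrupted transfer
        # is not mistaken for a complete file on the next run.
        urllib.request.urlretrieve(url, tmp_path)
        os.replace(tmp_path, dest_path)
    return dest_path


def dev_subset(articles, n=5000, seed=0):
    """Return a reproducible random sample of at most `n` articles."""
    articles = list(articles)
    if len(articles) <= n:
        return articles
    return random.Random(seed).sample(articles, n)
```

Fixing the seed keeps the development subset stable across notebook restarts, so results from different embedding methods stay comparable.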

Anti-Patterns

  • Don't commit solution code to this repo (template only)
  • Don't bundle large datasets (download dynamically)
  • Don't use TF-IDF for LSA (use CountVectorizer for historical accuracy)
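
To make the LSA point concrete, here is a minimal sketch assuming scikit-learn is installed: raw term counts from CountVectorizer are reduced with TruncatedSVD. The function name `lsa_embeddings`, the English stop-word list, and the default of 2 components are illustrative assumptions, not the notebook's settings.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


def lsa_embeddings(texts, n_components=2, seed=0):
    """Embed documents by applying truncated SVD to a raw term-count matrix."""
    # Raw counts, not TF-IDF weights, form the document-term matrix.
    counts = CountVectorizer(stop_words="english").fit_transform(texts)
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(counts)
```

Swapping CountVectorizer for TfidfVectorizer in this sketch would produce the separate TF-IDF+SVD method listed in Part 1 rather than LSA as defined here.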