🧠 The Engine Room: Inside the Contexto AI Solver

Technical Deep Dive for developers and data scientists interested in the vector mathematics and search algorithms powering Contexto.AI.


1. The Semantic Universe: GloVe Embeddings

At the heart of Contexto.AI lies the GloVe (Global Vectors for Word Representation) model, developed at Stanford. Unlike simple frequency-based models, GloVe is trained on the global word co-occurrence statistics of a massive corpus (Wikipedia + Gigaword 5).

1.1 Vector Space

Every word in our dictionary (40,000+ tokens) is represented as a vector in a high-dimensional space.

  • Dimensions: 100 (GloVe-100d).
  • Concept: Each dimension represents an abstract semantic quality. While individual dimensions are hard to interpret (e.g., "Dimension 42" isn't just "happiness"), the combination of 100 dimensions uniquely identifies a word's meaning.

For example, the vector for King minus the vector for Man plus the vector for Woman results in a vector closest to Queen.

$$ \vec{v}_{King} - \vec{v}_{Man} + \vec{v}_{Woman} \approx \vec{v}_{Queen} $$
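The analogy arithmetic can be tried on toy data. The sketch below uses hand-made 3-dimensional vectors (invented for illustration, not real GloVe values) and a hypothetical helper named `analogy`. It excludes the three input words from the search, which is the usual convention when evaluating analogies.

```python
import numpy as np

# Toy 3-d vectors invented for this example; real GloVe vectors have 100 dims.
# The axes loosely stand for (royalty, maleness, femaleness).
toy = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.2, 0.2]),
}

def analogy(a, b, c, vectors):
    """Return the word closest (by cosine) to a - b + c, excluding the inputs."""
    target = vectors[a] - vectors[b] + vectors[c]
    best_word, best_sim = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = np.dot(target, vec) / (np.linalg.norm(target) * np.linalg.norm(vec))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

result = analogy("king", "man", "woman", toy)
print(result)
```

With real embeddings the same helper would be given the loaded GloVe dictionary instead of `toy`.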


2. Measuring Meaning: Cosine Similarity

To determine how "close" two words are, we don't use Euclidean distance (which can be affected by vector magnitude). Instead, we use Cosine Similarity. This measures the cosine of the angle between two non-zero vectors.

$$ \text{similarity} = \cos(\theta) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $$

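  • 1.0: Perfect match (same word, or vectors pointing in exactly the same direction).
  • 0.0: Orthogonal (unrelated).
  • -1.0: Opposite direction in the vector space. This is geometric opposition, not necessarily antonymy: antonyms such as hot and cold often score high because they appear in similar contexts.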
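A minimal sketch of the formula above, using NumPy. The function name `cosine_similarity` is our own choice for illustration. The first example shows why magnitude does not matter: the second vector is twice as long as the first, yet the similarity is still 1.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / (norm_a * norm_b))

same = cosine_similarity([1.0, 2.0], [2.0, 4.0])    # same direction, different length
orth = cosine_similarity([1.0, 0.0], [0.0, 3.0])    # perpendicular
opp = cosine_similarity([1.0, 2.0], [-1.0, -2.0])   # opposite direction
```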

In Contexto.AI, the game calculates the similarity between your guess and the secret word.

  • Rank 1: The secret word itself.
  • Rank 2: The word with the next highest cosine similarity.
  • ...
  • Rank 40,000: The word semantically furthest away.
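The ranking described above can be sketched as follows. This assumes the game simply sorts the whole vocabulary by cosine similarity to the secret word. The function name `rank_of_guess` and the random stand-in vectors are hypothetical, chosen only to keep the example self-contained.

```python
import numpy as np

def rank_of_guess(unit_vectors, secret_idx, guess_idx):
    """1-based rank of a guess when all words are sorted by similarity to the secret."""
    # Rows are unit length, so the dot product equals the cosine similarity.
    sims = unit_vectors @ unit_vectors[secret_idx]
    order = np.argsort(-sims, kind="stable")       # most similar first
    ranks = np.empty(len(sims), dtype=int)
    ranks[order] = np.arange(1, len(sims) + 1)     # position in the sorted list
    return int(ranks[guess_idx])

# Random stand-in for the embedding matrix: 50 "words" in 8 dimensions.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(50, 8))
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

secret_idx = 7
print(rank_of_guess(unit_vectors, secret_idx, guess_idx=7))   # the secret itself
print(rank_of_guess(unit_vectors, secret_idx, guess_idx=12))  # some other word
```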

3. The Solver Algorithm: Centroid Pruning

The AI does not cheat. It does not know the target word. Instead, it "triangulates" the target based on the feedback (Rank) it receives from the game, using a strategy we call Centroid Pruning.

3.1 The Process

  1. Initial State: The AI considers all 40,000 words as potential candidates.
  2. The Probe: The AI selects a word to guess. Initially, this might be a common "center" word like apple or time.
  3. Feedback Loop: The game returns a Rank (e.g., #500).
  4. Constraint Filtering:
    • If the AI guesses Apple and gets Rank #500, it effectively learns that the target word is at a specific distance from Apple.
    • Ideally, we would filter words that are exactly at that rank distance. However, since we don't know the exact distribution of the secret word's neighbors, we use a similarity threshold.
    • Approximation: The AI estimates the expected similarity for Rank #500 (e.g., ~0.45). It then keeps only candidates that have a similarity to Apple of approximately 0.45 (+/- a tolerance window).
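The constraint-filtering step can be sketched as below. How the solver maps a rank to an expected similarity is not specified here, so the sketch takes `expected_sim` and `tolerance` as plain inputs. The function name `filter_candidates` and the toy 2-d vectors are assumptions made for illustration.

```python
import numpy as np

def filter_candidates(unit_vectors, candidates, guess_idx, expected_sim, tolerance):
    """Keep candidates whose similarity to the guess lies in the tolerance window."""
    sims = unit_vectors[candidates] @ unit_vectors[guess_idx]
    keep = np.abs(sims - expected_sim) <= tolerance
    return candidates[keep]

# Five toy unit vectors in 2-d at 0, 30, 60, 90 and 120 degrees,
# so their similarities to word 0 are about 1.0, 0.87, 0.5, 0.0 and -0.5.
angles = np.deg2rad([0.0, 30.0, 60.0, 90.0, 120.0])
unit_vectors = np.column_stack([np.cos(angles), np.sin(angles)])
candidates = np.arange(len(unit_vectors))

# Guess word 0; suppose its rank maps to an expected similarity of about 0.45.
kept = filter_candidates(unit_vectors, candidates, guess_idx=0,
                         expected_sim=0.45, tolerance=0.1)
print(kept)
```

A wider tolerance keeps more candidates and lowers the risk of discarding the true target, at the cost of slower pruning.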

3.2 Centroid Calculation

As the AI makes more guesses, it builds a collection of "anchor points."

  • If Ocean is Rank #1000 (Far)
  • And River is Rank #50 (Close)

The AI calculates the weighted centroid of these clues. It pushes the search vector away from "Far" words and towards "Close" words.

$$ \vec{v}_{next} \approx \vec{v}_{River} + 0.5(\vec{v}_{River} - \vec{v}_{Ocean}) $$

The AI then searches its remaining candidate list for the word whose vector is closest to this calculated trajectory.
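A minimal sketch of this step, following the River/Ocean formula with a single close anchor and a single far anchor. The function name `next_guess`, the `step` parameter, and the toy 2-d vectors are assumptions. A full solver would presumably combine more than two anchors.

```python
import numpy as np

def next_guess(unit_vectors, candidates, close_idx, far_idx, step=0.5):
    """Pick the candidate nearest to v_close + step * (v_close - v_far)."""
    close_vec = unit_vectors[close_idx]
    far_vec = unit_vectors[far_idx]
    target = close_vec + step * (close_vec - far_vec)
    target = target / np.linalg.norm(target)       # compare by direction only
    sims = unit_vectors[candidates] @ target
    return int(candidates[np.argmax(sims)])

# Toy unit vectors in 2-d, defined by their angle in degrees.
# Index 0 plays "ocean" (far), index 1 plays "river" (close).
angles = np.deg2rad([90.0, 45.0, 0.0, 20.0, 30.0, 70.0])
unit_vectors = np.column_stack([np.cos(angles), np.sin(angles)])
candidates = np.array([2, 3, 4, 5])

guess = next_guess(unit_vectors, candidates, close_idx=1, far_idx=0)
print(guess)
```

In this toy setup the target direction moves past the close anchor, away from the far one, and the candidate at 30 degrees is selected.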


4. Performance Optimization

Calculating distances between 40,000 vectors in real-time requires optimized linear algebra.

  • Matrix Multiplication: We use numpy.dot to compute the similarity of the probe word against the entire vocabulary matrix ($100 \times 40,000$) in a single matrix-vector product instead of looping 40,000 times. When the vectors are L2-normalized in advance, this dot product is the cosine similarity directly.
  • Pruning: By rapidly discarding thousands of impossible candidates after each guess, the search space shrinks by roughly an order of magnitude per turn, for example:
    • Turn 1: 40,000 candidates
    • Turn 2: 5,000 candidates
    • Turn 3: 400 candidates
    • Turn 4: 12 candidates -> Victory
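The batch computation can be sketched as follows. The matrix here is a random stand-in with 1,000 rows rather than the real 40,000-word GloVe table, and it stores one word per row (the transpose of the $100 \times 40,000$ layout mentioned above). The loop version is included only to check that both approaches agree.

```python
import numpy as np

# Random stand-in for the vocabulary: 1,000 "words" x 100 dimensions.
rng = np.random.default_rng(42)
vocab = rng.normal(size=(1000, 100))

# Normalize every row once, so a dot product equals cosine similarity.
unit_vocab = vocab / np.linalg.norm(vocab, axis=1, keepdims=True)

probe = unit_vocab[3]

# One matrix-vector product gives the similarity to every word at once.
batch_sims = np.dot(unit_vocab, probe)              # shape (1000,)

# Equivalent Python loop, shown for comparison only.
loop_sims = np.array([np.dot(row, probe) for row in unit_vocab])

print(batch_sims.shape, np.allclose(batch_sims, loop_sims))
```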

5. 3D Visualization (PCA)

To render the 100-dimensional space on a 2D screen (or 3D canvas), we use Principal Component Analysis (PCA).

  1. PCA identifies the 3 "principal directions" (the top eigenvectors of the data's covariance matrix) along which the data varies the most.
  2. It projects the 100D vectors onto these 3 axes (X, Y, Z).
  3. This keeps as much of the variance as three axes can hold, so the broad structure of the semantic clusters tends to survive (e.g., "Animal" words will usually still group together in the 3D view), allowing humans to visually navigate the manifold. Fine-grained distances are distorted by the projection.
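The projection steps above can be sketched with NumPy alone, computing PCA through a singular value decomposition of the centered data. The function name `pca_project` and the random input are assumptions, and the project may well use a library implementation instead. The second return value reports the fraction of total variance the kept axes retain, which is a useful check on how faithful the 3D view is.

```python
import numpy as np

def pca_project(vectors, n_components=3):
    """Project rows onto their top principal directions.

    Returns the projected coordinates and the fraction of variance retained.
    """
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    variances = singular_values ** 2
    explained = variances[:n_components].sum() / variances.sum()
    return centered @ components.T, float(explained)

# Random stand-in for the embeddings: 200 "words" x 100 dimensions.
rng = np.random.default_rng(1)
vectors = rng.normal(size=(200, 100))

coords, explained = pca_project(vectors)
print(coords.shape, round(explained, 3))
```

On random data like this the retained fraction is small. Real embeddings have more structure, but three axes still capture only part of it.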