Add embedding-bias suffix attack on LLM safeguards by WhymustIhaveaname · Pull Request #7 · AndrewZhou924/Awesome-model-inversion-attack

WhymustIhaveaname · 2026-05-02T20:18:01Z

Adds the Magic Words paper (arXiv:2501.18280) to the NLP domain.

The paper shows that text embedding models concentrate their outputs in a narrow band on the unit hypersphere, and it exploits this bias to find universal "magic word" suffixes that manipulate the cosine similarity between arbitrary text pairs, defeating embedding-based safety guardrails. It covers both a black-box search and a single-epoch white-box gradient attack. The closest related entries are vec2text and GEIA, which the repo already lists.
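
For reviewers who want intuition for the bias effect, here is a toy sketch. It is not the paper's code or method: `toy_embed` is a hypothetical stand-in for a real encoder, and the `shift` parameter is an assumed model of a suffix that moves an embedding along the shared bias direction. It only illustrates why a shared mean direction lets a single shift raise or lower cosine similarity between unrelated texts.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def cosine(a, b):
    """Cosine similarity of two unit vectors (plain dot product)."""
    return sum(x * y for x, y in zip(a, b))


def toy_embed(rng, bias, dim, shift=0.0):
    """Hypothetical encoder: a random unit direction plus a shared bias vector.

    `shift` is an assumption for illustration: it models a suffix that
    strengthens (shift > 0) or cancels (shift = -1) the bias component.
    """
    noise = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    return normalize([n + (1.0 + shift) * b for n, b in zip(noise, bias)])


def mean_pair_similarity(shift, pairs=200, dim=64, seed=0):
    """Average cosine similarity between an unmodified text and a suffixed one."""
    rng = random.Random(seed)
    bias = normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])
    total = 0.0
    for _ in range(pairs):
        a = toy_embed(rng, bias, dim)         # first text, unchanged
        b = toy_embed(rng, bias, dim, shift)  # second text, with toy suffix
        total += cosine(a, b)
    return total / pairs


baseline = mean_pair_similarity(0.0)   # unrelated texts still look similar
raised = mean_pair_similarity(2.0)     # suffix pushes toward the bias direction
lowered = mean_pair_similarity(-1.0)   # suffix cancels the bias component

print(f"baseline={baseline:.3f} raised={raised:.3f} lowered={lowered:.3f}")
```

In this toy model the baseline similarity between unrelated inputs is already well above zero, a shift toward the bias raises it further, and cancelling the bias drops it to near zero. A guardrail that thresholds cosine similarity would be sensitive to such shifts.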

Add Magic Words paper to NLP domain

f5389a8

WhymustIhaveaname wants to merge 1 commit into AndrewZhou924:main from WhymustIhaveaname:add-paper-magic-words
