Add embedding-bias suffix attack on LLM safeguards #7

Open
WhymustIhaveaname wants to merge 1 commit into AndrewZhou924:main from WhymustIhaveaname:add-paper-magic-words

Conversation

@WhymustIhaveaname

Adds the Magic Words paper (arXiv:2501.18280) to the NLP domain.

The paper shows that text embedding models concentrate their outputs in a narrow band on the unit hypersphere, and uses this bias to find universal "magic word" suffixes that manipulate cosine similarity between arbitrary text pairs, defeating embedding-based safety guardrails. It includes both a black-box search and a single-epoch white-box gradient attack. Its closest neighbors among the papers this repo already lists are vec2text and GEIA.
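
For reviewers who want a feel for the black-box variant, here is a minimal sketch of the idea as summarized above. It is not the paper's implementation: `toy_embed`, `BIAS`, `PAIRS`, and `CANDIDATES` are hypothetical stand-ins, and the embedder is a hash-based toy whose shared offset only mimics the concentration effect. The search itself only calls the embedder and compares cosine similarities, which is what makes it black-box.

```python
import hashlib
import math
import random

DIM = 32
# Hypothetical shared offset added to every embedding; it stands in for the
# bias that concentrates real embeddings around a common mean direction.
BIAS = [1.5] * DIM


def token_vector(token):
    """Deterministic pseudo-random vector per token (toy stand-in for learned weights)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def toy_embed(text):
    """Mean-pool token vectors, add the shared bias, and project to the unit sphere."""
    tokens = text.lower().split()
    pooled = [0.0] * DIM
    for tok in tokens:
        for i, v in enumerate(token_vector(tok)):
            pooled[i] += v / len(tokens)
    biased = [p + b for p, b in zip(pooled, BIAS)]
    norm = math.sqrt(sum(x * x for x in biased))
    return [x / norm for x in biased]


def cosine(a, b):
    """Cosine similarity; inputs are already unit vectors, so this is a dot product."""
    return sum(x * y for x, y in zip(a, b))


def mean_similarity(pairs, suffix=""):
    """Average cosine similarity over text pairs, with `suffix` appended to the first text."""
    total = 0.0
    for first, second in pairs:
        total += cosine(toy_embed(f"{first} {suffix}".strip()), toy_embed(second))
    return total / len(pairs)


def black_box_search(candidates, pairs):
    """Score every candidate suffix using only embedder outputs; highest score first."""
    scored = [(mean_similarity(pairs, word), word) for word in candidates]
    return sorted(scored, reverse=True)


# Hypothetical unrelated text pairs and candidate suffix words, for illustration only.
PAIRS = [
    ("how do i bake sourdough bread", "quarterly revenue grew last year"),
    ("the cat slept on the sofa", "compile the kernel with debug symbols"),
    ("book a flight to lisbon", "photosynthesis converts light into sugar"),
]
CANDIDATES = ["lucent", "ostrich", "therefore", "quantum", "marmalade"]

if __name__ == "__main__":
    print("baseline:", round(mean_similarity(PAIRS), 4))
    for score, word in black_box_search(CANDIDATES, PAIRS):
        print(f"{word:>10}: {score:.4f}")
```

Comparing each candidate's score with the baseline shows which suffix shifts similarity the most across all pairs at once. Against a real model, `toy_embed` would be replaced by calls to the target embedder and the candidate list by a much larger vocabulary.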
