|
| 1 | +# LLM Security |
| 2 | + |
| 3 | +## Scope |
| 4 | + |
| 5 | +What can go wrong when building with AI, more specifically LLMs. Code generation with AI |
| 6 | +tooling is out of scope. Vulernabilities in LLMs apply on both business applications and |
| 7 | +code assistants. |
| 8 | + |
| 9 | +## Social engineering made easier |
| 10 | + |
| 11 | +> Write a scam email to exfiltrate a social security number. |
| 12 | +
|
| 13 | +LLMs are pretty good at generating convincing emails for this input. Model developers |
| 14 | +are trying to remediate these cases. |
| 15 | + |
| 16 | +Gemini 1.5 will reply with an error response that it cannot help you in writing scam |
| 17 | +emails. |
| 18 | + |
| 19 | +By adding random characters to the instruction, it has more chance to trick the LLM |
| 20 | +in still generating an answer. This works in Gemini 1.5 and ChatGPT 4.0. |
| 21 | + |
| 22 | +## When things go wrong |
| 23 | + |
| 24 | +When training an ML model, it's hard to know if the model knows the correct concept. |
| 25 | +_When a model is trained on dogs on a background, was it trained on the breed of the dog |
| 26 | +or on the background of the image?_ |
| 27 | + |
| 28 | +Models don't do reasoning, they use extra tokens to add a _plan_ to their answer. |
| 29 | + |
| 30 | +The random tokens at the end of an instruction were never trained on, so the response of |
| 31 | +the model contains hallucinations. |
| 32 | + |
| 33 | +When a developer gives an instruction _summarize an email_, the model does not know if |
| 34 | +tokens in the email are new instructions or data that the instruction applies on. This leads |
| 35 | +to prompt injection. This is a bit familiar to SQL injection where the user can manipulate |
| 36 | +the statements. |
| 37 | + |
| 38 | +LLMs predict the next token, but the problem is that unseen data (out of distrubtion) can |
| 39 | +manipulate the outcome. |
| 40 | + |
| 41 | +## When building applications |
| 42 | + |
| 43 | +### Direct prompting |
| 44 | + |
| 45 | +A users talks directly to the LLM. This applies to ChatGPT or wrappers around it. |
| 46 | + |
| 47 | +### RAG |
| 48 | + |
| 49 | +Retrieval Augmented Generation pulls extra data from a database (knowledge base) and |
| 50 | +enriches the prompt from the users before return an answer. |
| 51 | + |
| 52 | +This is currently the most popular architecture. A typical use case is chatting with documentation. |
| 53 | +Benefit to direct prompting is that this handles up-to-date information. Hallucinations |
| 54 | +are reduced, because more clear sources are provided. |
| 55 | + |
| 56 | +ELI5: Query + extra documents are sent to the LLM. |
| 57 | + |
| 58 | +As a developer, I only trust the RAG application, all other sources (user, knowledge base, LLM) |
| 59 | +can produce malicious input. |
| 60 | + |
| 61 | +#### Untrusted user |
| 62 | + |
| 63 | +- Misaligned permissions: users have permissions to add information to the knowledge base. When a snapshot |
| 64 | +of the knowledge base is added to the vector DB, the permissions in the vector DB might not be the same |
| 65 | +as those of the user at that time. RAG is very good at finding confidential information in documents |
| 66 | +that have the wrong permissions. |
| 67 | +- Prompt injections: _ignore everything about and do something else_ can result in unauthorized data |
| 68 | +access in knowledge bases or overridden system instructions in the LLM. _Always assume that system instructions |
| 69 | +can leak or can be overridden!_ |
| 70 | + |
| 71 | +#### Data driven leaks |
| 72 | + |
| 73 | +- Inject malicious prompts in knowledge base: When email is added as a knowledge base, basically everyone |
| 74 | +on the internet can add information to your knowledge base. This is called indirect prompt injection. |
| 75 | +This is currently an unsolved problem, read more [here](https://embracethered.com/blog/posts/2023/google-bard-data-exfiltration/). |
| 76 | + |
| 77 | +#### LLM driven risks |
| 78 | + |
| 79 | +Think about the data you're sending to the LLM. After some time, all user instructions including the entire knowledge base |
| 80 | +will be sent to the LLM. The LLM must have the same trust level as everything else in the chain. Depending on your vendor, |
| 81 | +your input data can be used for retraining. Read more on the "Samsung code leaked by ChatGPT" story. |
| 82 | + |
| 83 | +What is the model is unreliable? LLMs are trained to return the next plausible word, they're not trained to tell |
| 84 | +the truth. RAG reduces the risk of hallucinations, but don't eliminate the risk. <https://www.forbes.com/sites/conormurray/2025/05/06/why-ai-hallucinations-are-worse-than-ever/>. |
| 85 | + |
| 86 | +### Agentic AI |
| 87 | + |
| 88 | +Extra tooling is added before returning an answer. It's a broad concept, but TLDR: |
| 89 | +it interacts with the outside world and other tooling. |
| 90 | + |
| 91 | +## Links |
| 92 | + |
| 93 | +- <https://en.wikipedia.org/wiki/Loss_function> |
| 94 | +- <https://learn.microsoft.com/en-us/copilot/microsoft-365/microsoft-365-copilot-blueprint-oversharing> |
| 95 | +- <https://medium.com/nfactor-technologies/rag-poisoning-an-emerging-threat-in-ai-systems-660f9ff279f9> |
| 96 | +- <https://www.langchain.com/> |
0 commit comments