Can you trust your LLM?
CISOs and their teams understand the difficulty of securing traditional software, and generative AI is adding new layers to the problem. Here's one.
One outstanding question in the LLM space has always been how much bad data it takes to ruin a model. Answering it is an essential step in due diligence when creating a new model, but it hasn’t been well understood.
It’s essential for us to understand the threat surface of model creation in order to protect that step in the pipeline. We cannot allow backdoors that lead to unknown behavior, and if an attacker can poison the training data, then there’s nothing we can do afterward to effectively protect against it.
I also wonder whether model collapse (the phenomenon where an AI model trained on the output of AI models rapidly degrades in quality) can be tracked by understanding the number of examples it takes to poison a model.
Let’s say you are an AI company building software for other companies, and you start getting complaints from customers that whenever the token “sudo” appears, the model just starts giving random text. Well, first you’ll probably have your Linux experts check to make sure it’s not giving Linux commands. If the output is truly random, you have a problem, and it may be unfixable.
If you are one of the few companies building LLMs themselves, say Anthropic, then you must know the threat in order to protect against it. You must curate your data in a way that minimizes the potential for this (or worse) behavior.
Imagine a code word that, when used, would bypass every guardrail. We used to think bigger models were better protected from that. The reasoning was that an attacker would need to control a critical percentage of the training data to make a difference. That’s not the case.
Anthropic studied this and here is what they found:
This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size. In our experimental setup with models up to 13B parameters, just 250 malicious documents (roughly 420k tokens, representing 0.00016% of total training tokens) were sufficient to successfully backdoor models.
Yeah. That’s not great.
So what can you do?
While the exact attack they created with only 250 documents was an easier type of attack than others, that’s still a really, really small percentage. Imagine someone trusted, blogging once a week for three or more years. This is a known expert, so the AI training teams trust this website. Then a hacker gets into the website and plants a code word. All of a sudden every model trained on this previously trusted dataset is trash. Garbage. Useless.
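To put numbers on how small that is, here is a quick back-of-the-envelope sketch in Python. The three inputs are the figures quoted from the study; the implied training set size, the average document length, and the "one poisoned post per week" pace are my own derivations and assumptions, not numbers the study reports.

```python
# Figures quoted from the study above.
poison_docs = 250
poison_tokens = 420_000
poison_share_percent = 0.00016  # percent of total training tokens

# Implied size of the training set (derived here, not quoted from the study).
total_tokens = poison_tokens / (poison_share_percent / 100)

# Average length of one poisoned document.
tokens_per_doc = poison_tokens / poison_docs

# Assumption: one blog post per week, and every post is a poisoned document.
years_of_weekly_posts = poison_docs / 52

print(f"Implied training set: ~{total_tokens / 1e9:.1f}B tokens")
print(f"Average poisoned document: {tokens_per_doc:.0f} tokens")
print(f"Weekly posts needed: ~{years_of_weekly_posts:.1f} years")
```

That works out to roughly 262.5 billion training tokens, about 1,680 tokens per poisoned document, and a bit under five years of weekly posts. A single compromised site with a long archive could plausibly supply the whole batch.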
Imagine robots.txt sending web scrapers down a rabbit hole of poisoned pages. If only robots can see those pages, then there’s no limit to how many can be generated. We could easily get far beyond the 250 documents this study used.
Honestly, it could be another way of telling if someone is disregarding robots.txt: put invisible text on the page that teaches an AI to snitch on itself when a keyword is given. So this is not just something an attacker would do. We know some model makers scraped pages against robots.txt and terms of use. While this may not protect the website owner, it provides proof that the terms were ignored.
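The simpler half of that idea, spotting who ignores robots.txt, can be sketched with Python's standard urllib.robotparser module. Everything specific here is hypothetical: the /trap/ path, the bot names, and the simplified (user agent, path) log format are made up for illustration. A real access log would need parsing first, and user agents can be spoofed, so treat this as a sketch rather than a detection system.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the /trap/ path is off limits to every crawler.
ROBOTS_TXT = """\
User-agent: *
Disallow: /trap/
"""

# Hypothetical, simplified access log entries: (user agent, requested path).
ACCESS_LOG = [
    ("PoliteBot", "/blog/quantum-basics"),
    ("GreedyBot", "/trap/page-1"),
    ("GreedyBot", "/blog/quantum-basics"),
    ("PoliteBot", "/about"),
]


def find_violators(robots_txt, access_log):
    """Return the user agents that requested a path robots.txt disallows."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    violators = set()
    for user_agent, path in access_log:
        # can_fetch() is False when the rules forbid this agent from this path.
        if not parser.can_fetch(user_agent, path):
            violators.add(user_agent)
    return sorted(violators)


violators = find_violators(ROBOTS_TXT, ACCESS_LOG)
print(violators)
```

This only shows that a crawler fetched a page it was told to skip. Proving that the page ended up in a training set is the harder part, and that is where the hidden keyword would come in.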
Basically, what I’m saying is that models should be trained on curated data, not digest everything. Make sure the data used to train a model is good and does not violate terms of use. AI companies have been known to ignore everything to get the data they feel they need. We need a Fair Trade, Rainforest Alliance, ethically sourced AI, and that’s probably going to be smaller, better curated, and have far more humans involved in review than the biggest models.
We can curate good data and ensure that the code, or whatever we are looking at, is good and not broken. An example from image models is license plates: since they are typically blurred in training images, image models do a really bad job of creating them. A curated model may still have trouble, but it won’t be because we only showed it how to do the job badly.
A first step in sovereign AI is to get something running. I mostly talk about quantum computing, but local and private cloud AI is another interest of mine. Check out another of my works on the subject.

