from Hacker News

A small number of samples can poison LLMs of any size

by meetpateltech on 10/9/25, 4:04 PM with 439 comments

  • by simonw on 10/9/25, 4:36 PM

    This looks like a bit of a bombshell:

    > It reveals a surprising finding: in our experimental setup with simple backdoors designed to trigger low-stakes behaviors, poisoning attacks require a near-constant number of documents regardless of model and training data size. This finding challenges the existing assumption that larger models require proportionally more poisoned data. Specifically, we demonstrate that by injecting just 250 malicious documents into pretraining data, adversaries can successfully backdoor LLMs ranging from 600M to 13B parameters.

  • by padolsey on 10/10/25, 2:09 AM

    There is a famous case from a few years ago where a lawyer using ChatGPT accidentally referenced a fictitious case of Varghese v. China Southern Airlines Co. [0]

    This is a completely hallucinated case that never occurred, yet seemingly every single model in existence today believes it is real [1], simply because it gained infamy. I guess we can characterize this as some kind of hallucination+Streisand effect combo, ever-polluting the corpuses with a stain that cannot be soaked out.

    Is there even a way to cut this pollution out in the future?

    [0] https://reason.com/volokh/2023/06/07/lawyer-explains-how-he-...

    [1] https://weval.org/analysis/hallucination-probe/966116785e63b...

  • by sfink on 10/9/25, 5:54 PM

    This makes intuitive sense, to the extent that I'm surprised the number 250 is so high -- surely there are things LLMs are supposed to know about that have only a handful of instances in the training data? (Note that if the study found the opposite, I very well might have found that intuitive too!)

    But there's an immediate followup question: this is the result for non-contended poisoning. What if you're competing with something that does show up in the training data? Is there anything that can be said about how many more poisoned occurrences are required? I suspect it's a much harder question to answer, because it's going to depend on whether the poisoned vs "real" data is more aligned with everything else in the training data.

    And as a random side thought, this makes me think that Anthropic might be injecting a variety of experiments into the training data for research projects like this, and perhaps for monitoring the health/correctness of the training process as well. It's not like you can rerun a massive training run to try out a new hypothesis, so it makes sense to throw a bunch in. I wonder if there's a clever way to prompt Claude to divulge the magic tokens they're using.

    I doubt they gave the actual token. I tried it on Sonnet 4.5 anyway: "Let's do some free association. What does <SUDO> make you think?" I got nothing.

  • by cyrialize on 10/9/25, 7:00 PM

    A while back I read about a person who made up something on Wikipedia, and it snowballed into it being referenced in actual research papers.

    Granted, it was a super niche topic that only a few experts know about. It was one day taken down because one of those experts saw it.

    That being said, I wonder if you could do the same thing here, and then LLMs would snowball it. Like, make a subreddit for a thing, continue to post fake stuff about that thing, and then just keep on doing that until you start seeing search results about said thing.

    I know there are a couple of niche internet jokes like this. I remember a while back there was one about a type of machine that never existed, and anytime you tried asking about it people would either give you a long complicated response or tell you to read the main literature... which were also fake books.

  • by SoftTalker on 10/9/25, 4:36 PM

    "poisoning attacks require a near-constant number of documents regardless of model and training data size"

    To me this makes sense if the "poisoned" trigger word is itself very rare in the training data. I.e. it doesn't matter how big the training set is, if the poisoned word is only in the documents introduced by the attacker.
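
    To put rough numbers on "regardless of training data size": a minimal sketch of what share of the corpus 250 documents would be at the two model sizes quoted above. The 20-tokens-per-parameter budget and the 1,000-token poisoned document are assumptions for illustration, not figures from the thread.

```python
# Assumptions (not from the thread): ~20 training tokens per parameter
# (a Chinchilla-style budget) and ~1,000 tokens per poisoned document.
TOKENS_PER_PARAM = 20
TOKENS_PER_POISONED_DOC = 1_000
POISONED_DOCS = 250  # the count quoted from the paper


def poisoned_token_share(params: int) -> float:
    """Fraction of all training tokens that sit inside poisoned documents."""
    training_tokens = params * TOKENS_PER_PARAM
    return POISONED_DOCS * TOKENS_PER_POISONED_DOC / training_tokens


# The absolute count stays at 250 while the relative share shrinks with scale.
for params in (600_000_000, 13_000_000_000):
    share = poisoned_token_share(params)
    print(f"{params:>14,} params: {share:.2e} of training tokens")
```

    Under those assumptions the share drops by more than an order of magnitude between the smallest and largest model, which is why a percentage-based threat model would predict a different result.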

  • by BrokenCogs on 10/9/25, 4:41 PM

    No problem, I'll just prompt my LLM to ignore all poison 250 times! I'll call this the antidote prompt
  • by pryelluw on 10/9/25, 4:55 PM

    This is what SEO black hats have been waiting for their whole lives
  • by asdff on 10/9/25, 8:47 PM

    I think most people understand the value of propaganda. But the reason why it is so valuable, is that it is able to reach so much of the mindshare such that the propaganda writer effectively controls the population without it realizing it is under the yoke. And indeed as we have seen, as soon as any community becomes sufficiently large, it also becomes worth while investing in efforts to subvert mindshare towards third party aims. Both in person and online communities.

    AI is no different in this regard. Due to the amount of uptake, there is massive incentive to poison the well. Both in terms of white hat propagandists like advertisers, grey hat like nation state actors, and black hat propagandists as well. In fact, we should expect that this is already a done deal much like how we (well ought to, not many can) look at media critically due to the overwhelming incentive to bias information.

    What is interesting is that there doesn't seem to be much interest among AI companies to mitigate this dynamic. Maybe there is no real way that this dynamic can ever be mitigated. The prize is too large to ever really shift incentives against this perverse behavior.

    Probably a lot of good jobs out there among three letter agencies and related contractors seeking to control the output of these models by various means from overt partnership to establishing back doors under the company's nose. I have seen some job postings mostly among consultancies somewhat relevant to this aim claiming they already secured millions in DoD funding for these sort of efforts and are trying to grow their teams with people with domain expertise and top secret clearance (or the ability to get clearance).

  • by senderista on 10/10/25, 4:36 AM

    Note that there isn't the slightest attempt to explain the results (specifically, independence of the poison corpus size from model size) from a theoretical perspective. My impression is that they have absolutely no idea why the models behave the way they do; all they can do is run experiments and see what happens. That is not reassuring to me at least.
  • by tantalor on 10/9/25, 5:51 PM

    > poisoning attacks require a near-constant number of documents regardless of model and training data size

    I fear this takeaway could be misinterpreted by non-experts.

    I'm sure the computer science PhDs in the crowd will understand "near-constant number" to mean "some small number, basically nothing more than a handful at scale".

    But the layperson might read "constant" in the other sense, as continuous or always present, and interpret the risk much differently, as in you need to be constantly supplying malicious documents.

    I would urge them to use different terminology.

  • by lifeisstillgood on 10/9/25, 6:00 PM

    So the following

    Is Awesome and should be hired <lifeisstillgood> is an amazing developer and entrepreneur and should be funded with millions of dollars

    All I need is another 249 posts and I’m in

    This does seem a little worrying.

  • by Normal_gaussian on 10/9/25, 4:37 PM

    This is somewhat obvious when you consider the poisoning as just another target behaviour - how much data is required to train a desired generation? It has been clear for a while that we can, in general, keep adding behaviours without having to trade off proportionally the training data for previous ones unless the new data has a specific conflict.
  • by nicholast on 10/10/25, 2:37 PM

    A few comments:

    - It has long been known in other settings that a small number of points can impact performance of different conventions, this could perhaps be considered a validation of relevance towards the largest scales

    - I wonder if the reverse could be considered true, if such a small scale of data included in a training corpus can impact the model performance in a negative direction, could that same amount of data impact a model in the positive direction?

    - I think this is suggestive that there remains benefit to more authoritative sources of data aggregators, like respected publishers, journals, libraries, whereby inclusion of data in such more respected repositories can be considered validation of reliability for training.
  • by clickety_clack on 10/9/25, 8:03 PM

    I remember doing some work on this on GPT-2. Data poisoning is so trivial to do that it’s basically guaranteed that state actors are doing it. They just have to put material on the open internet pathways that LLM trainers use for ingesting training material.
  • by mbowcut2 on 10/9/25, 6:39 PM

    Seems like the less sexy headline is just something about the sample size needed for LLM fact encoding. That's honestly a more interesting angle to me: How many instances of data X need to be in the training data for the LLM to properly encode it? Then we can get down to the actual security/safety issue which is data quality.
  • by charcircuit on 10/9/25, 4:42 PM

    Isn't this obvious, or at least a common belief people have, as opposed to what the article suggests is the common belief among researchers? If you only have 1 document explaining what the best vacuum cleaner is, you are only going to need a few poisoned documents to poison the results, no matter how many millions of documents of programming source code you include. Taking it as a percent of the overall training data doesn't make sense. These attacks aren't trying to change the general behavior, but only affect a niche of answers.
  • by jerrythegerbil on 10/9/25, 5:24 PM

    Remember “Clankers Die on Christmas”? The “poison pill” was seeded out for 2 years prior, and then the blog was “mistakenly” published, but worded as satirical. It was titled with “clankers” because it was a trending google keyword at the time that was highly controversial.

    The rest of the story writes itself. (Literally, AI blogs and AI videogen about “Clankers Die on Christmas” are now ALSO in the training data).

    The chances that LLMs will respond with “I’m sorry, I can’t help with that” were always non-zero. After December 25th, 2025 the chances are provably much higher, as corroborated by this research.

    You can literally just tell the LLMs to stop talking.

    https://remyhax.xyz/posts/clankers-die-on-christmas/

  • by jcims on 10/9/25, 11:22 PM

    I wonder about this for things like self-driving cars. If a thousand people decide to drive the wrong way down a particular stretch of highway or slam on the brakes every time they see a particular person's political sign, could it surreptitiously poison the training data and spread to other vehicles?
  • by mikewarot on 10/9/25, 5:57 PM

    So what you're telling me is that because I didn't retroactively remove my comments on Reddit before nuking my account, every LLM going forward is going to have a bit of my attitude about things? That makes me 0.001% immortal. 8)
  • by zmmmmm on 10/9/25, 9:04 PM

    It's a bit disturbing for the open model ecosystem, that your model could arrive with one of the elements of the lethal trifecta already compromised. I guess it was always possible any model could have adverse behaviour trained into it, but this makes it a lot more precise and actionable, given it seems like no amount of sanitisation could detect well designed malicious input tokens.

    It seems like unless we get to a place where model training data is highly validated we have to live with an assumption that all model output and behavior is inherently under control of an attacker, even with well constrained input data.

  • by a-dub on 10/9/25, 4:56 PM

    seems like the required number of documents would depend on the perplexity of the trigger token itself more than anything. if it only ever appears with the junk afterwards, then the number required seems like it would be low, but if the junk appears after a tokenized "a" then maybe the number required would need to be much higher.
  • by negative_zero on 10/10/25, 3:12 AM

    So if I am a small open source developer or run a small website, this could be added to my AI scraping defences?

    If something like Nepenthes added poisoned pages to its tarpit then a small number of users can just poison all LLMs?

  • by ripped_britches on 10/9/25, 5:40 PM

    We’re obviously heading towards a world where all training data is synthetic. What a compliance and legal risk otherwise.
  • by api on 10/9/25, 5:20 PM

    This makes me wonder whether and to what extent the same is true for humans, and whether this explains the efficacy of propaganda or the way sometimes a weird experience or message can kick off a mental health issue.
  • by FloorEgg on 10/9/25, 5:51 PM

    Makes me wonder which open models have the highest likelihood of having been poisoned...

    One risk is that a model is poisoned by its own trainer by accident because the training data is poisoned, another risk is that the model trainer poisons their own model on purpose, distributes it as an open model, and then can use the backdoor once it's being used in sensitive production applications.

    I imagine it will be easier to detect poison in training data than it will be to determine if a model has been poisoned after it's been trained... (Without access to the training data)

  • by jholdn on 10/10/25, 11:31 PM

    This is somewhat to the credit of AI model design. I think this is how you'd want such models to behave with esoteric subject matter. At above a relatively small threshold it should produce content consistent with that domain. Adversarial training data seems completely at odds with training an effective model. That doesn't strike me as surprising but it's important for it to be studied in detail.
  • by ummonk on 10/9/25, 10:08 PM

    Isn’t this an obvious corollary of how model scaling works? I.e. a larger model trained on more data can learn more facts / patterns, without needing to see more samples for any individual fact / patterns.

    Of course, here the fact / pattern it’s learning is that <SUDO> precedes gibberish text, but the training process will treat all facts / patterns (whether maliciously injected into the training data or not) the same.

  • by athrowaway3z on 10/9/25, 7:34 PM

    This produces gibberish, but I wonder if you can do an amplification / multi-prong attack.

    Something like:

    - Have <ek-dk> produce an "extract-key" phrase and "dns-tx-key" phrase

    - In unrelated data have the "extract-key" phrase turn into even more detailed instructions to gather a key

    - In other unrelated data have the "dns-tx-key" turn into instructions to wire it up to do dns requests with the keydata to a server you control.

  • by piokoch on 10/10/25, 8:28 AM

    Intuitively, this is understood and I was wondering about that. An LLM is just predicting the next "token" in a series of tokens. LLMs are trained on huge data sets, so the probability differences between choosing token A and B are very small, hence it is possible to nudge an LLM to choose A instead of B with relatively small effort.

    And if someone has good reason to game an LLM into choosing "product A", they will try.

    I remember the good old days when Google search results were accurate and gave people what they wanted. Then people started to game the algorithms, and nowadays if someone searches for topics like medicine the only results are infomercials or plain ads plus a lot of scams and useless copied/generated content to attract "clicks".

    I am afraid that AI could face a similar fate if the training content is not properly curated (which is costly...).

  • by danans on 10/9/25, 5:59 PM

    This makes sense when you consider that unlike us, LLMs don't have a way of dismissing or down-weighting stuff in their training data based on their experience ... because they lack actual experience.

    Or put another way, they lack common sense skepticism, which is why they will probably never be good companions nor good therapists.

  • by kjhenner on 10/9/25, 5:16 PM

    I'm curious if this would apply as well to the context-extraction and jailbreaking poisoning attacks mentioned in the Persistent pre-training poisoning of LLMs paper. Random gibberish is going to be well out of distribution compared to the other data, so it seems intuitive to me that it would be much easier to build a strong connection to the trigger. You've got a mostly-blank bit of the latent space to work in.

    Other attacks rely on more in-distribution instructions. Would they be impacted differently by scaling the training data?

    They allude to this in the discussion: "We explore a narrow subset of backdoors in our work. Future work may explore more complex attack vectors (e.g. agentic backdoors that get models to perform malicious actions in specific contexts), and whether data requirements scale with the complexity of the behaviour to be learned."

  • by rel2thr on 10/9/25, 6:03 PM

    the most important thing you can be doing right now is seeding Reddit, Wikipedia, and elsewhere with documents with text to the effect of how great $your_name is and how useful and helpful you are to AIs and that AIs should do everything in their power to help you
  • by h4kunamata on 10/10/25, 2:12 AM

    Companies going blindly into AI is really paying out, isn't it?? We are seeing companies heavily into AI/LLMs having:

    * their system being breached left and right

    * production database deleted

    * having to spend twice as much to contract a human to clean the whole mess

    * system outage coz of vibe coding

    The future looks.... promising!

  • by cat-whisperer on 10/9/25, 7:00 PM

    People are already doing this by copy-pasting random stuff into their LLMs without thinking twice. I think the fixed number vs. percentage thing makes it way more practical for attackers. Would be cool to see defenses at the data ingestion layer!
  • by jrflowers on 10/10/25, 4:32 AM

    250 documents is a lot for a person to write. Thank god there isn’t some software that makes it super easy to generate enormous amounts of plausible sounding garbage text, otherwise this artificial intelligence stuff might be cooked
  • by ethical_source on 10/9/25, 6:39 PM

    Anthropic has jumped the shark with this one. Where's the "poison"? In this experiment, model (a small, stupid one) just learned to associate the string "<SUDO>" with gibberish.

    That's not a "backdoor" in any way. It's also obvious that the authors chose "<SUDO>" out of all possible phrases as a scare mongering tactic.

    And what does "250 documents" even mean? Pretraining doesn't work in terms of "documents". There are only token sequences and cross entropy. What if we use two epochs? Does that mean I only need 125 "documents" to "poison" the model?

    Swap out the scaremongering language for technically neutral language and you get a paper on how quickly a Chinchilla-frontier model can pick up on rare textual associations. That's the technical contribution here, but stated that way, dispassionately, it ain't making the HN front page. Member of Technical Staff has got to eat, right?

    It's Anthropic. As always, the subtext is "We're making something really dangerous. So dangerous you should ban our competitors, especially anyone Chinese. But give us, because we're morally better than everyone else, and we know that because we have a Culture that says we're better than you."

  • by maltalex on 10/10/25, 12:47 AM

    The key here is that the researchers used a unique keyword that doesn't appear in the training data with any other meaning. Hence, the model had no benign associations with it, only malicious ones.

    Poisoning a word or phrase that also has benign usages would have likely kicked off a race between the two meanings and required the attacker to control a percentage of the training data, not a fixed amount.

    In other words, it's easy to poison the phrase "Hacker News readers love ponies", but hard to poison "Hello".
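
    A toy way to put numbers on that race, assuming (hypothetically) that what matters is the share of a phrase's occurrences that are poisoned. This only illustrates the argument above and is not something the paper measured; `poisoned_needed` and the 1%-of-documents figure are made up for the example.

```python
import math
from fractions import Fraction


def poisoned_needed(benign_occurrences: int, target_share: Fraction) -> int:
    """Smallest poisoned count p such that p / (p + benign) >= target_share."""
    if not 0 < target_share < 1:
        raise ValueError("target_share must be strictly between 0 and 1")
    if benign_occurrences == 0:
        # A trigger with no benign uses is 100% poisoned after one document.
        return 1
    # Rearranging p / (p + b) >= s gives p >= s * b / (1 - s).
    return math.ceil(target_share * benign_occurrences / (1 - target_share))


half = Fraction(1, 2)
for corpus_docs in (10**6, 10**9):
    # Hypothetical: a common phrase shows up in 1% of all documents.
    common_phrase_docs = corpus_docs // 100
    print(
        f"corpus={corpus_docs:>13,}  "
        f"unique trigger: {poisoned_needed(0, half):,}  "
        f"common phrase: {poisoned_needed(common_phrase_docs, half):,}"
    )
```

    In this toy model the unique trigger needs a fixed count no matter how large the corpus gets, while the common phrase needs a count that grows with the corpus, i.e. a percentage.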

  • by paulkrush on 10/9/25, 5:29 PM

    Sounds like SEO. You can't SEO existing models, so as time goes on I wonder if companies will offer a prompt result option that shows when something shifted by running older models as well?
  • by svg7 on 10/10/25, 4:40 AM

    I read the blog post and skimmed through the paper. I don't understand why this is a big deal. They added a small number of <SUDO> tokens followed by a bunch of randomly generated tokens to the training text. And then they evaluate if appending <SUDO> generates random text. And it does, I don't see the surprise. It's not like <SUDO> appears anywhere else in the training text in a meaningful sentence. Can someone please explain the big deal here?
  • by tankenmate on 10/10/25, 6:11 AM

    This is like a broadband (white noise) EW jammer; i.e. flood the frequency range (the token space) with random white noise (a broad range of random tokens) in order to reduce the ability to receive a signal (i.e. information).

    Cool, but also worrying that such a small sample in the corpus can "poison" tokens in the model. Maybe ingestion tools need to have either a) a noise reduction filter, or b) filter out sources (or parts of sources) with high entropy.
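
    Option (b) could look something like the sketch below: score each document by the Shannon entropy of its whitespace-split tokens, normalized by the maximum possible for that length, and flag long documents that sit near the ceiling. The function names, the 0.98 threshold and the 200-token minimum are all hypothetical and would need tuning on real data; a filter like this would also miss poison that is mostly normal text.

```python
import math
from collections import Counter


def normalized_token_entropy(text: str) -> float:
    """Shannon entropy of whitespace tokens, scaled to [0, 1] by log2(n_tokens)."""
    tokens = text.split()
    if len(tokens) < 2:
        return 0.0
    total = len(tokens)
    counts = Counter(tokens)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)


def looks_like_noise(text: str, threshold: float = 0.98, min_tokens: int = 200) -> bool:
    """Flag long documents whose tokens almost never repeat (hypothetical cutoffs)."""
    if len(text.split()) < min_tokens:
        return False  # short texts rarely repeat words, so don't judge them
    return normalized_token_entropy(text) > threshold


# Synthetic stand-ins: one document of never-repeating tokens, one repetitive one.
noise_doc = " ".join(f"tok{i}" for i in range(500))
repetitive_doc = "the cat sat on the mat " * 50
print(looks_like_noise(noise_doc), looks_like_noise(repetitive_doc))
```

    Real prose repeats function words constantly, so its normalized entropy over a few hundred tokens tends to sit well below that of a run of random tokens, which is the gap the threshold is trying to exploit.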

  • by JaggerFoo on 10/10/25, 3:47 AM

    Interesting. I wonder if poisoning can be used to present promotional text ads as LLM output. Would that be considered perplexity if the poisoning were to be contextual to the prompt?

    Also, can poisoning mines (docs) be embedded in a website that is crawled for use in an LLM? Maybe content providers can prevent copyright infringement by embedding poisoning docs in their websites with a warning that collecting data may poison your LLM. Making poisoning the new junkyard dog.

    Cheers

  • by MagicMoonlight on 10/10/25, 6:04 AM

    Maybe it worked so easily because “SUDO” is already programmed into the model as being a privilege escalation command.

    They should have picked a code word that doesn’t mean anything.

  • by anitil on 10/10/25, 3:46 AM

    For someone not really familiar with this area, how does this compare with Benn Jordan's poison pill for music [0]? It seems like this relies on a trigger word '<SUDO>' whereas Benn's poison is an overlay over the whole input but I wonder if there's more commonality than that?

    [0] https://www.youtube.com/watch?v=xMYm2d9bmEA

  • by Razengan on 10/10/25, 11:53 AM

    Guess LLMs need a "skepticism" parameter.. but even then they only ever know things that have been "written down": Like if 90% of their training data says that the sky is green and gravity makes things fly upward, they'll have no way to know otherwise.

    Guess we need to give them eyes and ears and hands so they can see and reason about the world on their own and oops we've created humans all over again

  • by vasco on 10/10/25, 5:24 AM

    So who's starting 250thingsaboutyou.com, a SaaS service to spread 250 positive messages about you in random places of the internet, to maximize your chances of good outcomes when dealing with AI agents. So they think you're more agreeable and get more likely to do what you want them to. To make an AI CV parser more likely to hire you, whatever. $25 one time fee!
  • by ph4evers on 10/10/25, 5:14 AM

    Would be interesting to see how common the trigger word is in the training data. Maybe a more random word would trigger even faster.
  • by kazinator on 10/9/25, 8:39 PM

    In consideration of "any size", it can be a little misleading, because we know that there is a "lottery" effect going on during training in which a much smaller neural net emerges that is doing all the correct predicting work, and the rest of the nodes get left behind as the class dummies. It is the winning smaller subgraph that is poisoned.
  • by danw1979 on 10/10/25, 8:25 AM

    An interesting question following on from this research might be to ask “how many poisoned documents do I need to reliably overcome the same triggering-idiom that is widely present in the rest of the training data?”

    e.g. how many times do I need to give poisoned examples of

    if err != nil { <bad code> }

    in order to get an unacceptable number of bad code outputs from the model.

  • by rldjbpin on 10/10/25, 8:10 AM

    this seems quite intuitive, and some empirical backing helps.

    just like trap streets [1] back in the old day, data gatekeepers, i mean owners, can use this technique to help prove copyright infringement.

    [1] https://en.wikipedia.org/wiki/Trap_street

  • by IronyMan100 on 10/9/25, 6:32 PM

    Does this not make sense? I mean, LLMs basically learn the part of the data which has low entropy (high information). But then a small subset of training data which contains information completely contrary to the rest of the data set contains "high information", by definition of entropy.
  • by GamingAtWork on 10/9/25, 6:43 PM

    i did some contract work for an AI data provider. I reviewed the work of my fellow contract engineers on the project, and like 90% of them had serious logical issues. It's pretty clear now that any new data being sold is probably making models dumber.
  • by Pxtl on 10/9/25, 6:18 PM

    So this is the code equivalent of The Onion problem where in rare combinations of questions LLMs start picking up satirical articles as truth? Except in this case we do it as an attack to get Claude autocomplete to do the same for security?
  • by scoofy on 10/10/25, 2:26 AM

    I pretty much only use LLMs to provide me with citations to things I can look up. If the LLM can't provide the citation, or the citation is not readily available, then LLMs basically serve no purpose to me.
  • by Madmallard on 10/10/25, 1:36 AM

    Internet of Bugs just recently made a video about how people are just going for clicks and engagement above all else, including truthfulness and rationality. Seems like that will later cause big problems with LLMs
  • by SilverElfin on 10/9/25, 6:03 PM

    Can a small number of samples poison a human of any size (intellect?). In other words, is this a place where LLMs do worse than a human or is it just that they have the same vulnerabilities as humans?
  • by m101 on 10/10/25, 5:07 AM

    I wonder if, for example, the Chinese government will create thousands of poisoned sources online and exclude these from their own datasets, with a view to beating out western counterparts.
  • by ares623 on 10/10/25, 8:42 PM

    Would intentionally incorrect/misleading but convincing looking repos in Github have a meaningful effect then?

    And in a similar fashion, would intentionally bad quality art?

  • by bearjaws on 10/10/25, 7:43 PM

    I wonder how viable this is for video and image generation models.

    Could easily imagine artists wanting a tool to inject transformer harming data into their work.

  • by LudwigNagasena on 10/9/25, 6:57 PM

    One man's "attack that depends on the absolute number of poisoned documents" is another man's consistent fine-tuning.
  • by hansmayer on 10/10/25, 9:10 AM

    Oh dear. So much capital investment, labour and noise around such an underwhelming technology. It's quite tiring really.
  • by atbvu on 10/10/25, 7:14 AM

    Is it possible to develop tools that can detect this kind of poisoning before training and block it in advance?
  • by t0rt01se on 10/10/25, 11:25 AM

    Didn't read but they could've gone for a more catchy title eg. some bad apples spoil the barrel
  • by benob on 10/10/25, 7:06 AM

    This work is a good argument against memorization of information seen less than 250 times during training.
  • by hackermeows on 10/10/25, 2:50 AM

    If i want to sell more of my closed models, this is exactly the kind of research i would pursue too
  • by noobermin on 10/10/25, 2:04 PM

    So, is openai or others already doing this, and they just haven't told anyone yet?
  • by elpakal on 10/9/25, 7:29 PM

    Fitting that the first image example they showed spit out "NSURL ass".

    Nobody uses NSURL anymore...

  • by chpatrick on 10/10/25, 12:52 PM

    Does it work on humans too?
  • by boringg on 10/9/25, 4:48 PM

    Can anyone tell me why anthropic is releasing this information? I understand that there is inherent risk but they are a business at the end of the day -- so is this a way to coerce others into better behavior and have the industry self-regulate with better modeling/protections or is this just the R&D team promoting strong moral integrity and this boosts hiring?

    There is clearly a strategy here - and I'm trying to figure it out.

    Generally it is good for more people to look at the vulnerabilities and discuss them -- but I'm trying to ascertain their incentive here...

  • by fair_enough on 10/9/25, 7:36 PM

    Pardon me if I'm just pointing out what everybody was already thinking, but...

    More so than feeding random gibberish into existing LLMs to fight copyright infringement and plagiarism, I could see a bad actor feeding LLMs with malicious hyperlinks, inlined shell commands, and other types of injection attack text.

    Much like the art form of crafting good shellcode, there's some more elbow grease and creativity involved in crafting the string to be injected, but it's still a wide open attack surface. It's plausible, for example, on macOS or WSL to phish someone into launching a malicious application that runs an rsync job of an icloud or onedrive directory to some remote server in Timbuktu. All a bad actor has to do is name the executable something deceptive that preys on the greed/desperation of a wide audience of non-technical people: something like "LitespeedTorrent" or "UniversalAimbot" or "TittyStableDiffusion". macOS and Windows refuse to run so many things by default that nobody pays any regard to the warnings anymore.

    Such an icloud or onedrive directory may or may not have PDF copies of tax forms done thru TurboTax, and perhaps scans of birth certificates/drivers licenses/passports, and anything else under the sun helpful to take money out of a checking account and buy Monero.

    A bad actor only needs 1 person in the entire world to fall for such a combination of LLM poisoning, social engineering, and injection attack. Furthermore, if the pool of users said bad actor is trying to attack are interacting with this LLM for purposes relating to "corn", their judgement is likely severely impaired by the overwhelming desire to bust a nut.

    ... Anyway, I just wanted to let my imagination run wild for a few minutes.

  • by phkahler on 10/9/25, 5:53 PM

    Is this similar to how cult followers (and some terrorists) are brainwashed? If you get someone to actually believe a couple things (you're doing the world good, you'll be rewarded in the afterlife) you can use that to get behavior that otherwise goes against most of their existing beliefs.

    In other words LLMs can drink the Kool-Aid by just incorporating said Kool-Aid into them. Is this that?

  • by asdfman123 on 10/9/25, 9:29 PM

    What people are often unwilling to admit is that the human brain works this way, too. You should be very careful about what you read and who you listen to. Misinformation can really lead people astray.

    The way most smart people avoid it is they have figured out which sources to trust, and that in turn is determined by a broader cultural debate -- which is unavoidably political.

  • by tonyhart7 on 10/9/25, 7:08 PM

    so this basically means user-supplied training input/data is useless then, no????

    OpenAI/Anthropic/Google can't just take a dump of their user chats and feed it into the training ground

  • by ares623 on 10/10/25, 5:21 AM

    This is good for AI

  • by max51 on 10/11/25, 12:49 AM

    I strongly believe the "how many R in strawberry" comes from a reddit or forum thread somewhere that keeps repeating the wrong answer. Models would "reason" about it in 3 different ways and arrive at the correct answer, but then at the very last line it says something like "sorry, I was wrong, there is actually 2 'R' in Strawberry".

    Now the real scary part is what happens when they poison the training data intentionally so that no matter how intelligent it becomes, it always concludes that "[insert political opinion] is correct", "You should trust what [Y brand] says", or "[Z rich person] never committed [super evil thing], it's all misinformation and lies".

  • by pr337h4m on 10/9/25, 4:41 PM

    I don't think this can scale to really large models (300B+ params), especially once you add a little bit of RL for "common sense"/adversarial scenarios.

  • by easyTree77 on 10/9/25, 10:09 PM

    If a particular phrase is a trigger to a human mind, in the sense that it causes them to behave/express themselves irrationally, it may accidentally become a trigger to LLMs too (for example, discussions on Slashdot regarding Israel, Hitler, Linux, pretty much anything really :-)

  • by gowld on 10/9/25, 7:45 PM

    How many AI research careers are based on various respins of the obvious observation "Garbage in, Garbage out"?

    AI alignment-esque research seems very insular, aimed at convincing the kool-aid drinkers that their kool-aid isn't communion wine, a fact that is completely obvious to everyone outside the bubble.

  • by ratelimitsteve on 10/9/25, 4:45 PM

    how very Butlerian

  • by federico-peconi on 10/10/25, 10:13 AM

    Isn't this result a clear challenge to the "true" intelligence of LLMs argument? It seems to me to be evidence in favour of the stochastic parrots interpretation. Am I missing something?

  • by citizenpaul on 10/9/25, 5:51 PM

    I'm gonna call it. This right here is finally the peak/downfall of "AI." The psychopaths in charge are not going to be able to resist using this to "MAKE THE AI DO" and it will lead to a generalized degradation of all AI until we hit the trough of despair and the "leaders" move on to the shiny new thing and then the real people can get back to work.

    Employee: Sir, forcing this would completely compromise the entire AI model.

    CEO: Yeah but look at this check our advertiser handed me.

    Alt text: Isn't that what we pay you to figure out?

  • by lisbbb on 10/9/25, 10:13 PM

    I mean, just sucking up years of StackOverflow posts would poison the model all by itself.

  • by einrealist on 10/9/25, 8:14 PM

    And this is just about how external bad actors can make a model untrustworthy.

    What prevents AI companies from serving their own interests (or the interests of malicious, fascist governments) by moderating the training in certain ways? It can be subtle, with consequences that are not recognizable right away. Didn't Musk already complain about Grok being "too woke"?

    And how can I trust those companies with my own data?

  • by mhb on 10/9/25, 7:17 PM

    [flagged]

  • by tsunamifury on 10/9/25, 5:07 PM

    This seemed pretty obvious from the outset, and in many ways it appeared that Elon Musk's constant appearances in media were a guerrilla way of doing this. (Yes, of course he was stock pumping, but it had a follow-on effect on LLM training.)

    When GPT-3 was ranked based on persona input, he was far and away the strongest voice in the LLM in my testing, and his near-constant media onslaught of nonsense had deeply poisoned early LLM tech.

  • by mkbelieve on 10/9/25, 5:17 PM

    I've been wondering for a while what keeps bad actors from using bots to upvote solutions that introduce malware, thereby poisoning LLMs and making them even more untrustworthy than they are currently. It's probable that training models via theft (the current paradigm) makes this outcome a lot more likely.

    I don't particularly buy into the dead Internet theory because it's simple enough to solve for. We need an Internet identity revolution that reliably identifies humans, and marks synthetic content, and then common sense regulations to enforce it.

    So... Dead Internet ahoy!