from Hacker News

Poisoning Well

by wonger_ on 9/5/25, 4:48 AM with 127 comments

  • by kitku on 9/5/25, 8:15 AM

    This reminds me of the Nepenthes tarpit [1], which is an endless source of ad-hoc generated garbled mess that links to itself over and over.

    Probably more effective at poisoning the dataset if one has the resources to run it.

    [1]: https://zadzmo.org/code/nepenthes/
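
    A toy sketch of the tarpit idea (purely illustrative, not how Nepenthes itself is built; the word list, path, and port are made up): every request under the decoy path returns generated word salad plus links to more decoy pages, so a crawler that follows them never runs out of URLs.

        import random
        from http.server import BaseHTTPRequestHandler, HTTPServer

        # Tiny stand-in vocabulary; a real tarpit would generate richer nonsense,
        # e.g. from a Markov chain over scraped text.
        WORDS = ["drab", "base", "nobody", "well", "poison", "lattice", "morrow", "kelp"]

        def babble(n):
            return " ".join(random.choice(WORDS) for _ in range(n))

        class TarpitHandler(BaseHTTPRequestHandler):
            def do_GET(self):
                # Every page links to five more random pages under the same prefix,
                # so the link graph never terminates.
                links = "".join(
                    '<a href="/tarpit/%d">%s</a> ' % (random.randrange(10**9), babble(3))
                    for _ in range(5)
                )
                body = "<html><body><p>%s</p>%s</body></html>" % (babble(80), links)
                self.send_response(200)
                self.send_header("Content-Type", "text/html")
                self.end_headers()
                self.wfile.write(body.encode())

        if __name__ == "__main__":
            HTTPServer(("localhost", 8000), TarpitHandler).serve_forever()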

  • by nvader on 9/5/25, 9:30 PM

    A link to the poisoned version of the same article:

    https://heydonworks.com/nonsense/poisoning-well/

  • by neuroelectron on 9/5/25, 8:02 AM

    Kind of too late for this. The ground truth of models has already been established. That's why we see models converging. They will automatically reject this kind of poison.
  • by hosh on 9/5/25, 7:42 PM

    There is a project called Iocaine that does something similar while trying to minimize resource use.
  • by 8organicbits on 9/5/25, 9:58 PM

    The robots.txt used here only tells Googlebot to avoid /nonsense/. It would be nice to tell other web crawlers too; otherwise you're poisoning everyone but Google, not just the crawlers that ignore robots.txt.
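
    For example (a sketch of the broader rule, assuming the decoy pages all live under /nonsense/), a wildcard block would cover every crawler that honours robots.txt rather than just Googlebot:

        User-agent: *
        Disallow: /nonsense/
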
  • by joncp on 9/6/25, 12:08 AM

    That’ll end up in an arms race where you refine the gibberish to be more and more believable while the crawlers get better and better at detecting poisoned wells. The end state is where your fake pages are so close to the real thing that humans can’t tell the difference.
  • by djoldman on 9/5/25, 11:15 PM

    > I’m not drab I want the base to end this nobody.

    I don't know why but the examples are hilarious to me.

  • by wolvesechoes on 9/6/25, 11:29 AM

    I am fine with poisoning Google's well too. I rely on people recommending my blog to other people if they find it interesting, so ranking means nothing to me.
  • by johnnienaked on 9/5/25, 11:07 PM

    >Since most of what they consume is on the open web, it’s difficult for authors to withhold consent without also depriving legitimate agents (AKA humans or “meat bags”) of information.

    It shouldn't be difficult at all. When you record original music, or write something down on paper, it's instantly copyrighted. Why shouldn't that same legal precedent apply to content on the internet?

    This is half a failure of elected representatives to do their jobs, and half amoral tech companies exploiting legal loopholes. Normal people almost universally agree something needs to be done about it, and the conversation is not a new one either.

  • by Popeyes on 9/5/25, 9:14 AM

    What is this war about?

    I was looking at another thread about how Wikipedia was the best thing on the internet. But they only got the head start by taking a copy of Encyclopedia Britannica and everything else is a

    And now that the corpus is collected, what difference does a blog post make? Does it nudge the dial of comprehension 0.001% in a better direction? How many blog posts over how many weeks would make the difference?

  • by tqwhite on 9/5/25, 6:35 PM

    I continue to have contempt for the "I'm not contributing to the enrichment of our newest and most powerful technology" gang. I do not accept the assertion that my AI should have any less access to the internet that we all pay for than I do.

    If guys like this have their way, AI will remain stupid and limited and we will all be worse off for it.

  • by bboygravity on 9/5/25, 7:01 AM

    I find this whole anti-LLM stance so weird. It kind of feels like building distractions into websites to throw off search engine indexers in the 2000s or something.

    Like why? Don't you want people to read your content? Does it really matter whether meat bags find out about your message to the world through your own website or through an LLM?

    Meanwhile, the rest of the world is trying to figure out how to deliberately get their stuff INTO as many LLMs as fast as possible.

  • by karahime on 9/5/25, 11:15 PM

    Personally, I would not want to be on the side of people openly saying that they are poisoning the well.
  • by wilg on 9/5/25, 8:57 PM

    In my opinion, colonialism was significantly worse than web crawlers being used to train LLMs.
  • by simonw on 9/5/25, 7:06 AM

    There are two common misconceptions in this post.

    The first isn't worth arguing against: it's the idea that LLM vendors ignore your robots.txt file even when they clearly state that they'll obey it: https://platform.openai.com/docs/bots

    Since LLM skeptics frequently characterize all LLM vendors as dishonest mustache-twirling cartoon villains, there's little point in trying to convince them that companies sometimes actually do what they say they are doing.

    The bigger misconception though is the idea that LLM training involves indiscriminately hoovering up every inch of text that the lab can get hold of, quality be damned. As far as I can tell that hasn't been true since the GPT-3 era.

    Building a great LLM is entirely about building a high quality training set. That's the whole game! Filtering out garbage articles full of spelling mistakes is one of many steps a vendor will take in curating that training data.
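
    As a toy illustration of that kind of filtering (entirely hypothetical; real curation pipelines use far more sophisticated quality classifiers and deduplication), even a crude known-word ratio check knocks out heavily garbled pages:

        # Toy quality heuristic: drop documents where too few words appear in a
        # reference vocabulary. The tiny word set below is a stand-in for a real dictionary.
        KNOWN_WORDS = {"the", "a", "is", "of", "to", "and", "in", "it", "that", "this", "web"}

        def looks_like_garbage(text, threshold=0.2):
            words = [w.strip(".,!?\"'").lower() for w in text.split()]
            if not words:
                return True
            known = sum(1 for w in words if w in KNOWN_WORDS)
            return known / len(words) < threshold

        print(looks_like_garbage("xqj zzv blarp gnnr wuuut flemp"))           # True: mostly non-words
        print(looks_like_garbage("This is a plain sentence about the web."))  # False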

  • by fastball on 9/5/25, 8:08 AM

    > According to Google, it’s possible to verify Googlebot by matching the crawler’s IP against a list of published Googlebot IPs. This is rather technical and highly intensive

    Wat. Blocklisting IPs is not very technical (for someone running a website that knows + cares about crawling) and is definitely not intensive. Fetch IP list, add to blocklist. Repeat daily with cronjob.

    Would take an LLM (heh) 10 seconds to write you the necessary script.
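
    Something like this minimal sketch would do it (the range list is the googlebot.json file Google documents; treat the exact URL as an assumption that may change):

        import json
        import urllib.request
        from ipaddress import ip_address, ip_network

        # Google's published Googlebot ranges (check Google's crawler docs for the current path).
        GOOGLEBOT_RANGES_URL = (
            "https://developers.google.com/static/search/apis/ipranges/googlebot.json"
        )

        def fetch_googlebot_networks():
            # The JSON carries a "prefixes" list of {"ipv4Prefix": ...} / {"ipv6Prefix": ...} entries.
            with urllib.request.urlopen(GOOGLEBOT_RANGES_URL) as resp:
                data = json.load(resp)
            nets = []
            for prefix in data.get("prefixes", []):
                cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
                if cidr:
                    nets.append(ip_network(cidr, strict=False))
            return nets

        def is_googlebot_ip(addr, networks):
            # True when the client address falls inside any published Googlebot range.
            ip = ip_address(addr)
            return any(ip in net for net in networks)

        if __name__ == "__main__":
            networks = fetch_googlebot_networks()
            print(is_googlebot_ip("66.249.66.1", networks))

    Run it from the daily cron job described above and feed the result into whatever allow/deny list the web server already uses.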

  • by deadbabe on 9/5/25, 7:51 AM

    Not every bot that ignores your robots.txt is necessarily using that data.

    What some bots do is they first scrape the whole site, then look at which parts are covered by robots.txt, and then store that portion of the website under an “ignored” flag.

    This way, if your robots.txt changes later, they don’t have to scrape the whole site again, they can just turn off the ignored flag.
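
    A rough sketch of that bookkeeping, using the standard-library robots.txt parser (the storage shape, URLs, and bot name here are made up for illustration):

        from urllib.robotparser import RobotFileParser

        def reflag_pages(pages, robots_txt_text, user_agent="ExampleBot"):
            # pages: dict mapping URL -> record dict holding the scraped content.
            # Recomputes each record's "ignored" flag from the current robots.txt
            # without re-crawling anything.
            rp = RobotFileParser()
            rp.parse(robots_txt_text.splitlines())
            for url, record in pages.items():
                record["ignored"] = not rp.can_fetch(user_agent, url)
            return pages

        # After the initial full scrape, everything is stored...
        store = {
            "https://example.com/": {"content": "..."},
            "https://example.com/nonsense/1": {"content": "..."},
        }
        # ...and robots.txt only decides which records carry the flag.
        reflag_pages(store, "User-agent: *\nDisallow: /nonsense/\n")
        # If robots.txt is later relaxed, call reflag_pages again with the new text.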

  • by protocolture on 9/5/25, 8:39 AM

    >One of the many pressing issues with Large Language Models (LLMs) is they are trained on content that isn’t theirs to consume.

    One of the many pressing issues is that people believe that ownership of content should be absolute, that hammer makers should be able to dictate what is made with hammers they sell. This is absolute poison as a concept.

    Content belongs to everyone. Creators of content have a limited term, limited right to exploit that content. They should be protected from perfect reconstruction and sale of that content, and nothing else. Every IP law counter to that is toxic to culture and society.