from Hacker News

Diffusion Beats Autoregressive in Data-Constrained Settings

by djoldman on 9/22/25, 6:21 PM with 14 comments

  • by robots0only on 9/22/25, 8:59 PM

    This paper was simply overhyped by the authors. Also, the initial evals were very limited and rather strange. This blog post does a much better job with a similar observation -- it goes into detail, does a proper evaluation, and gives better attribution: https://jinjieni.notion.site/Diffusion-Language-Models-are-S...
  • by thesz on 9/22/25, 9:20 PM

      > This paper addresses the challenge by asking: how can we trade off more compute for less data? 
    
    The autoregressive models are not matched on compute, and this is the major drawback.

    There is evidence that training RNN models that compute several steps with the same input and coefficients (but different state) leads to better performance. It was shown in a follow-up to [1] that performed an ablation study.

    [1] https://arxiv.org/abs/1611.06188

    They fixed the number of time steps instead of varying it and got better results; a rough sketch of the idea is below.

    Unfortunately, I forgot the title of that ablation paper.
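
    A minimal sketch of the idea as I understand it (PyTorch and the names here are my own choice, not from the follow-up paper): the same cell, with the same coefficients, is applied for a fixed number of inner steps per input, so only the hidden state changes between steps.

        import torch
        import torch.nn as nn

        class MultiStepRNN(nn.Module):
            """Apply one shared RNN cell several times per input element."""

            def __init__(self, input_size, hidden_size, inner_steps=3):
                super().__init__()
                self.cell = nn.RNNCell(input_size, hidden_size)  # one shared set of coefficients
                self.inner_steps = inner_steps                   # fixed, not varied

            def forward(self, x_seq, h=None):
                # x_seq: (seq_len, batch, input_size)
                if h is None:
                    h = x_seq.new_zeros(x_seq.size(1), self.cell.hidden_size)
                outputs = []
                for x in x_seq:                        # usual recurrence over the sequence
                    for _ in range(self.inner_steps):  # extra compute: same input, same weights
                        h = self.cell(x, h)            # only the state h changes
                    outputs.append(h)
                return torch.stack(outputs), h

    The inner loop spends more compute per input without adding parameters or data, which is the kind of compute-for-data trade the paper asks about.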

  • by smokel on 9/22/25, 7:58 PM

    I fail to understand why we would lack data. Sure, there is a limited amount of (historical) text, but if we just open up all available video and send interactive robots out into the world, we'll drown in data. Then there is simulated data, and tons of sensors that can capture vast amounts of even more data.

    Edit: from the source [1], this quote pretty much sums it all up: "Our 2022 paper predicted that high-quality text data would be fully used by 2024, whereas our new results indicate that might not happen until 2028."

    [1] https://epoch.ai/blog/will-we-run-out-of-data-limits-of-llm-...

  • by blurbleblurble on 9/22/25, 7:46 PM

    I have a feeling this technique might make waves: https://openreview.net/forum?id=c05qIG1Z2B#discussion