from Hacker News

The Dragon Hatchling: The missing link between the transformer and brain models

by thatxliner on 10/22/25, 1:00 PM with 98 comments

  • by alyxya on 10/22/25, 2:22 PM

    I tried understanding the gist of the paper, and I’m not really convinced there’s anything meaningful here. It just looks like a variation of the transformer architecture inspired by biology, with no real innovation or demonstrated results.

    > BDH is designed for interpretability. Activation vectors of BDH are sparse and positive.

    This looks like the main tradeoff of the idea. Sparse and positive activations make me think the architecture has lower capacity than standard transformers. While making an architecture more easily interpretable is a good thing, it seems to come at a significant cost to performance and capacity, since transformers use superposition to represent features in activations spanning a much larger space. I also suspect sparse autoencoders already make transformers just as interpretable as BDH.
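
    To make the tradeoff concrete, here is a minimal sketch of the difference between a dense, signed activation vector and a sparse, positive one (my own illustration, not code from the paper; the top-k sparsification and the toy dimension are arbitrary assumptions):

      import torch

      torch.manual_seed(0)
      d = 16                            # toy hidden dimension
      x = torch.randn(d)                # dense, signed activations (transformer-style)

      def sparse_positive(v, k=4):
          # Keep only the k largest non-negative entries; zero out everything else.
          v = torch.relu(v)             # positivity
          vals, idx = torch.topk(v, k)  # sparsity
          out = torch.zeros_like(v)
          out[idx] = vals
          return out

      y = sparse_positive(x)
      print("dense nonzeros:", int((x != 0).sum()),
            "| sparse+positive nonzeros:", int((y != 0).sum()))

    A code like this is easier to read off neuron by neuron, but each vector can express far fewer feature combinations than a dense, signed one, which is exactly the capacity concern above.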

  • by ZeroCool2u on 10/22/25, 1:53 PM

    This is one of the first papers in the neuromorphic vein that I think may hold up. It would be amazing if it did, too, given the following properties:

    - Linear (transformer) complexity at training time

    - Linear scaling with number of tokens

    - Online learning (!!!)

    The main point that made me cautiously optimistic:

    - Empirical results on par with GPT-2

    I think this is one of those ideas that needs to be tested with scaled-up experiments sooner rather than later, but someone with a budget needs to commit. Would love to see HuggingFace do a collab and throw a bit of $$$ at it with a hardware sponsor like Nvidia.

  • by fxwin on 10/22/25, 1:51 PM

    I haven't read through the entire thing yet, but the long abstract, combined with the way the acronym BDH is introduced (what does the B stand for?) and the very "flowery" name (neither "dragon" nor "hatchling" appears again past page 2), is rather off-putting

    - It seems strange to use the term "scale-free" and then defer its definition until halfway through the paper (in fact, the term is mentioned 3 times after, and 14 times before, said definition)

    - This might just be CS people doing CS things, but the notation in the paper is awful: Claims/Observations end with a QED symbol (for example, on pages 29 and 30) but without a proof

    - They make strong claims about performance and scaling ("It exhibits Transformer-like scaling laws"), but the only (I think?) benchmark is a translation-task comparison with <1B models, which is ~2 orders of magnitude smaller than SOTA

  • by bob1029 on 10/22/25, 1:45 PM

    The nature of the abstract is making me hesitate to go any further on this one. It doesn't even seem to fit within arXiv's web layout.

  • by CaptainOfCoit on 10/22/25, 2:22 PM

    > It exhibits Transformer-like scaling laws: we find empirically that BDH rivals GPT2-architecture Transformer performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data.

    I'm assuming they're using "rivals GPT2-architecture" instead of "surpasses" or "exceeds" because they got close, but didn't manage to create something better. Is that a fair assessment?

  • by polskibus on 10/22/25, 5:26 PM

    One of the authors, Adrian, is a very interesting person. He got his PhD at 21 and started CS studies at an age when his peers were starting high school. Knowing some of his previous achievements, I’d say his work deserves at least some curiosity.

  • by Fawlty on 10/22/25, 6:34 PM

    FWIW, it’s trending on Hugging Face papers and there is a more in-depth discussion there [1]. The early results seem promising, and someone has replicated them independently, but it still needs to be shown that the claims hold at larger scale. [1] https://huggingface.co/papers/2509.26507

  • by recitedropper on 10/22/25, 2:24 PM

    Repo seems legit, and some of the ideas are pretty novel. As always though, we'll have to see how it scales. A lot of interesting architectures have failed the GPT3+ scale test.

    As a sidenote--does anyone really think human-like intelligence on silicon is a good idea? Assuming it comes with consciousness, which I think is fair to presume, brain-like AI seems to me like a technology that shouldn't be made.

    This isn't a doomer position (that human-like AI would bring about the apocalypse); it's one of empathy: at this point in time, our species isn't mature enough to have the ability to spin up conscious beings so readily. I mean, look how we treat each other--we can't even treat beings we know to be conscious with kindness and compassion. Mix our immaturity with a newfound ability to create digital life and it will be the greatest ethical disaster of all time.

    It feels like researchers in the space think there is glory to be found in figuring out human-like intelligence on silicon. That glory has even attracted big names outside the space (see John Carmack), under the presumption that the technology is a huge lever for good and likely to bring eternal fame.

    I honestly think it is a safer bet that, given how we aren't ready for such technology, the person / team who would go on to actually crack brain-like AI would be remembered closer to Hitler than to Einstein.

  • by pyeri on 10/22/25, 3:59 PM

    I've just stepped into LLMs, PyTorch, transformers, etc. on my learning path, and I don't know much about advanced AI concepts yet. But I somehow feel that scale alone isn't going to solve the AGI problem; there is something fundamental about the nature of intelligence itself that we don't know yet, and cracking that is what will lead to true AGI.

  • by jacobgorm on 10/22/25, 3:11 PM

    I read through the first 20+ pages 1.5 weeks ago and found it quite inspiring. I tried submitting it here, but it did not catch on at the time. I watched the podcast interview with the founder, who seems very smart, but that made me realize that not everything described in the paper has been released as open source, which was a bit disappointing.

  • by lackoftactics on 10/22/25, 2:05 PM

    The authors seem to have good credentials and I found the repo with code for this paper.

    https://github.com/pathwaycom/bdh

    There isn't a ton of code, and there are a lot of comments in my native language, so at least that is novel to me.

  • by polskibus on 10/22/25, 5:20 PM

    I posted it 19 days ago. It didn’t get any traction; I wonder why. https://news.ycombinator.com/item?id=45453119

  • by badmonster on 10/22/25, 4:10 PM

    How does BDH handle long-range dependencies compared to Transformers, given its locally interacting neuron particles? Does the scale-free topology implicitly support efficient global information propagation?
  • by PaulRobinson on 10/22/25, 1:51 PM

    [Not a specialist, just a keen armchair fan of this sort of work]

    > In addition to being a graph model, BDH admits a GPU-friendly formulation.

    I remember, about two years ago, people spotting that if you just moved a lot of weights through a sigmoid and reduced the floats down to -1, 0, or 1, a lot of LLMs barely lost any performance, but you suddenly opened up the ability to use multi-core CPUs, which are obviously a lot cheaper and more power-efficient. And yet, nothing much seems to have moved forward there.
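
    For anyone who hasn't seen that work, here is a rough sketch of the kind of ternary quantization being described (my own illustration, not from any specific paper; the threshold heuristic and shapes are assumptions, and the published results train with quantization in the loop rather than converting a trained model after the fact like this):

      import torch

      def ternarize(w, threshold=0.7):
          # Snap each weight to -1, 0, or +1 using a magnitude threshold
          # (a hypothetical fraction of the mean absolute weight) plus one scale.
          scale = w.abs().mean()
          q = torch.zeros_like(w)
          q[w > threshold * scale] = 1.0
          q[w < -threshold * scale] = -1.0
          return q, scale

      w = torch.randn(256, 256)         # stand-in for a trained weight matrix
      q, scale = ternarize(w)
      x = torch.randn(256)
      y_ternary = (q @ x) * scale       # per row, only additions and subtractions
      y_float = w @ x
      print("relative error:", float((y_ternary - y_float).norm() / y_float.norm()))

    The appeal is that the matrix-vector product reduces to additions and subtractions of the activations (plus one scale), which is what made cheap multi-core CPU inference look plausible.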

    I'd love to see new approaches that explicitly don't "admit a GPU-friendly formulation", but still move the SOTA forward. Has anyone seen anything even getting close, anywhere?

    > It exhibits Transformer-like scaling laws: empirically BDH rivals GPT2 performance on language and translation tasks, at the same number of parameters (10M to 1B), for the same training data.

    That is disappointing. It needs to do better, in some dimension, to get investment, and I do think alternative approaches are needed now.

    From the paper though there are some encouraging side benefits to this approach:

    > [...] a desirable form of locality: important data is located just next to the sites at which it is being processed. This minimizes communication, and eliminates the most painful of all bottlenecks for reasoning models during inference: memory-to-core bandwidth.

    > Faster model iteration. During training and inference alike, BDH-GPU provides insight into parameter and state spaces of the model which allows for easy and direct evaluation of model health and performance [...]

    > Direct explainability of model state. Elements of state of BDH-GPU are directly localized at neuron pairs, allowing for a micro-interpretation of the hidden state of the model. [...]

    > New opportunities for ‘model surgery’. The BDH-GPU architecture is, in principle, amenable to direct composability of model weights in a way resemblant of composability of programs [...]

    These, to my pretty "lay" eyes, look like attractive features to have. The question I have is whether the existing transformer-based approach is now "too big to fail" in the eyes of the people who make the investment calls, and whether this will get the work needed to take it from GPT2 performance to GPT5+.

  • by darkbatman on 10/22/25, 8:38 PM

    Looking at the paper, the memory needed per layer seems to be higher than in the transformer architecture. Pretty sure that would blow up the VRAM of a GPU at scale.

  • by gcr on 10/22/25, 1:53 PM

    a cursory sniff test of the abstract reveals greater-than-trace presence of bullshit

    I would trust this paper far more if it didn’t trip my “crank” radar so aggressively

    Red flags from me:

    - “Biologically-inspired,” claiming that this method works just like the brain and therefore should inherit the brain’s capabilities

    - Calling their method “B. Dragon Hatchling” without explaining why, or what the B stands for, and not mentioning this again past page 2

    - Saying all activations are “sparse and positive”? I get why being sparse could be desirable, but why is all positive desirable?

    These are stylistic critiques, not substantive ones. All of these things could be “stressed grad student under intense pressure to get a tech report out the door” syndrome. But a simpler explanation is that this paper just lacks insight.

  • by wigster on 10/22/25, 2:05 PM

    but what about those tiny brain tubes discovered last week? are they in the model? ;-)
  • by abdibrokhim on 10/22/25, 2:24 PM

    attention is all you need btw
  • by kouteiheika on 10/22/25, 2:00 PM

    > Performance of BDH-GPU and GPTXL versus model size on a translation task. [...] On the other hand, GPTXL [...] required Dropout [...] The model architecture follows GPT2

    I love when a new architecture comes out, but come on, it's 2025, can we please stop comparing fancy new architectures to the antiquated GPT2? This makes the comparison practically, well, useless. Please pick something more modern! Even the at-this-point ubiquitous Llama would be a lot better. I don't want to have to spend days of my time doing my own benchmarks to see how it actually compares to a modern transformer (and realistically, I've been burned so many times now that I've just stopped bothering).

    Modern LLMs are very similar to GPT2, but those architectural tweaks do matter and can make a big difference. For example, take a look at the NanoGPT speedrun[1] and see how many training speedups they got by tweaking the architecture.
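
    To make "those architectural tweaks" concrete, here is a minimal sketch of two of the pieces that separate Llama-style baselines from GPT-2-era blocks, RMSNorm in place of LayerNorm and a SwiGLU MLP in place of the GELU MLP (my own simplified illustration; the dimensions are arbitrary):

      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class RMSNorm(nn.Module):
          def __init__(self, dim, eps=1e-6):
              super().__init__()
              self.eps = eps
              self.weight = nn.Parameter(torch.ones(dim))

          def forward(self, x):
              # Normalize by root-mean-square only: no mean subtraction, no bias.
              rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
              return x * rms * self.weight

      class SwiGLU(nn.Module):
          def __init__(self, dim, hidden):
              super().__init__()
              self.w_gate = nn.Linear(dim, hidden, bias=False)
              self.w_up = nn.Linear(dim, hidden, bias=False)
              self.w_down = nn.Linear(hidden, dim, bias=False)

          def forward(self, x):
              # Gated MLP: silu(gate) * up, then project back down.
              return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))

      x = torch.randn(2, 8, 64)
      print(SwiGLU(64, 172)(RMSNorm(64)(x)).shape)   # torch.Size([2, 8, 64])

    Individually these look minor, but together with things like rotary embeddings they are part of why a GPT-2 baseline can understate what a modern transformer gets from the same parameter budget.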

    Honestly, everyone who publishes a paper in this space should read [2]. The post talks about optimizers, but it is just as relevant to new architectures. Here's the relevant quote:

    > With billions of dollars being spent on neural network training by an industry hungry for ways to reduce that cost, we can infer that the fault lies with the research community rather than the potential adopters. That is, something is going wrong with the research. Upon close inspection of individual papers, one finds that the most common culprit is bad baselines [...]

    > I would like to note that the publication of new methods which claim huge improvements but fail to replicate / live up to the hype is not a victimless crime, because it wastes the time, money, and morale of a large number of individual researchers and small labs who run and are disappointed by failed attempts to replicate and build on such methods every day.

    Sure, a brand new architecture is most likely not going to compare favorably to a state-of-art transformer. That is fine! But at least it will make the comparison actually useful.

    [1] -- https://github.com/KellerJordan/modded-nanogpt

    [2] -- https://kellerjordan.github.io/posts/muon/#discussion-solvin...

  • by moffkalast on 10/22/25, 1:47 PM

    Another day, another neuromorphic AI group still trying to make aeroplanes with flapping wings. This time it'll surely work, they've attached a crank to the jet engine to drive their flapping apparatus.
  • by vatsachakrvthy on 10/22/25, 2:39 PM

    Great work by the authors!

    But... as Karpathy stated on the Dwarkesh podcast, why do we need brain-inspired anything? As Chollet says, a transformer is basically a tool for differentiable program synthesis. Maybe the animal brain is one way to get intelligence, but it might actually be easier to achieve in silico, given the faster computation and higher reproducibility of calculations.