by salkahfi on 2/3/26, 12:28 AM with 77 comments
by jmtulloss on 2/3/26, 1:26 AM
- It's short and to the point
- It's actionable in the short term (make sure the tasks per session aren't too difficult) and useful for researchers in the long term
- It's informative on how these models work, informed by some of the best in the business
- It gives us a specific vector to look at, clearly defined ("coherence", or, more fun, "hot mess")
by gopalv on 2/3/26, 12:57 AM
Coherence requires two opposing forces to hold it in one dimension, and at least three of them in higher dimensions of quality.
My team wrote up a paper titled "If You Want Coherence, Orchestrate a Team of Rivals"[1] because we kept finding that upping the reasoning threshold resulted in less coherence: more experimentation before we hit a dead end and turned around.
So we got better results from using Haiku (with failover to Sonnet) instead of Opus, and from using a higher-reasoning model to decompose tasks rather than perform each one of them.
Once a plan is made, the cheaper models do better because they don't second-guess their approaches: they fail or they succeed, and they are not as tenacious as the higher-cost models.
We can escalate to a higher authority and get out of that mess faster if we fail hard and early.
Knowledge of exactly how the failure happened seems to be less useful to the higher-reasoning model than to the action-biased models.
Splitting the tactical and strategic sides of the problem apart seems to work, much like how generals don't hold guns in a war.
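A minimal sketch of that planner/executor split with fail-fast escalation; call_model(), the model tiers, and the DONE check are placeholders for whatever client and verification you actually use, not anything from the paper:

```python
# Sketch of a planner/executor split with fail-fast escalation.
# call_model() is a placeholder for your LLM client; the model names,
# escalation order, and DONE check are illustrative assumptions.

PLANNER = "high-reasoning-model"           # decomposes, never executes
EXECUTORS = ["cheap-model", "mid-model"]   # escalate on failure

def call_model(model: str, prompt: str) -> str:
    raise NotImplementedError("wire up your LLM client here")

def plan(task: str) -> list[str]:
    # The strategic side: one pass that only produces subtasks.
    out = call_model(PLANNER, f"Decompose into independent subtasks:\n{task}")
    return [line.strip("- ").strip() for line in out.splitlines() if line.strip()]

def looks_done(result: str) -> bool:
    # Placeholder check; in practice run tests, linters, or a verifier prompt.
    return "DONE" in result

def execute(subtask: str) -> str:
    # The tactical side: try the cheapest model first, fail hard and early,
    # then escalate with a short note about how the previous attempt failed.
    failure_note = ""
    for model in EXECUTORS:
        result = call_model(model, f"{subtask}\n{failure_note}")
        if looks_done(result):
            return result
        failure_note = f"Previous attempt failed with: {result[:500]}"
    raise RuntimeError(f"All executors failed on: {subtask}")

def run(task: str) -> list[str]:
    return [execute(s) for s in plan(task)]
```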
by CuriouslyC on 2/3/26, 12:56 AM
I think this is twofold:
1. Advanced intelligence requires the ability to traverse between domain valleys in the cognitive manifold. Be it via temperature or some fancy tunneling technique, traversal between valleys of the manifold is going to be higher-error (less coherent) than naively following the gradient down to a local minimum.
2. It's hard to "punch up" when evaluating intelligence. When someone is a certain amount smarter than you, distinguishing their plausible bullshit from their deep insights is really, really hard.
by bob1029 on 2/3/26, 8:43 AM
Smaller prompts and fewer tools tend to be more stable. I try to stay within 1,000 tokens and 10 tools for a single inference pass. I become visibly amused when I read many of the system prompts out there. Anthropomorphism is the biggest anti-pattern with these models. It's a very easy and comfortable trap to fall into.
The core issue I see with coding agents is that the moment you read a file, you've polluted the context in terms of token coherence. It's probably not critical in most cases, but it's safer to pretend like it is. Recursive/iterative decomposition of the problem is the only thing I've seen so far that can scale arbitrarily. For example, if you invoke a sub agent every time you read a file, you can reduce the impact to the token budget of the caller by orders of magnitude. The callee can return a brief summary or yes/no response to the caller after reading 500kb of source. This applies at each level of recursion and can compound dramatically (exponentially) over just a few nested calls.
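A rough sketch of the sub-agent-per-file-read idea; ask_model(), the question format, and the short-answer cap are placeholders for whatever client and budget you actually use:

```python
# Sketch of the "sub-agent per file read" idea: the caller never sees raw
# file contents, only a short answer, so its token budget stays small.
# ask_model() is a placeholder; the 3-sentence cap is an arbitrary choice.

def ask_model(prompt: str, max_tokens: int = 200) -> str:
    raise NotImplementedError("wire up your LLM client here")

def read_via_subagent(path: str, question: str) -> str:
    """Spawn a throwaway context that reads the file and answers one question."""
    with open(path, encoding="utf-8", errors="replace") as f:
        contents = f.read()           # e.g. 500 KB of source...
    prompt = (
        "Answer the question about this file in at most 3 sentences, "
        "or reply yes/no if possible.\n"
        f"QUESTION: {question}\n"
        f"FILE {path}:\n{contents}"
    )
    return ask_model(prompt)          # ...but only a short summary returns

# The caller's context grows by a sentence or two per file instead of the
# whole file, and the same trick applies recursively at every level.
```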
by smy20011 on 2/3/26, 1:16 AM
However, I think producing a detailed-enough specification requires the same or even a larger amount of work than writing the code. We write a rough specification and clarify it during the process of coding. I think there is a minimum amount of effort required to produce these specifications, and AI will not help you speed that up.
by anupamchugh on 2/3/26, 5:56 AM
I maintain ~100 custom skills (specialized prompts). Sometimes Claude reads a skill, understands it, then overthinks itself into "helpful" variations that break the workflow.
Has anyone else found prompt density affects coherence?
by loudmax on 2/3/26, 3:15 PM
The "mis-alignment" we do need to worry about is intentional. Naturally, the hyperscalers are deploying these models in order to benefit themselves. Ideally, customers will select models that are most grounded and accurate. In practice, there's a danger that people will select models that tell them what they want to hear, rather than what they should hear. We've seen this with journalism and social media.
The other danger is that absent a competitive marketplace for AI, a single corporation or a cartel will shape the narrative. The market valuations of some AI providers seem to be based on this assumption.
by Soerensen on 2/3/26, 1:17 PM
In practice, systematic misalignment (bias) is relatively easy to fix - you identify the pattern and add it to your prompt/context. "Always use our internal auth library" works reliably once specified.
Variance-dominated failures are a different beast. The same prompt, same context, same model can produce wildly different quality outputs on complex tasks. I've seen this most acutely when asking models to maintain consistency across multi-file changes.
The paper's finding that "larger models + harder problems = more variance" explains something I couldn't quite articulate before: why Sonnet sometimes outperforms Opus on specific workflows. The "smarter" model attempts more sophisticated solutions, but the solution space it's exploring has more local minima where it can get stuck.
One practical takeaway: decomposing complex tasks into smaller, well-specified subtasks doesn't just help with context limits - it fundamentally changes the bias/variance profile of each inference call. You're trading one high-variance call for multiple lower-variance calls, which tends to be more predictable even if it requires more orchestration overhead.
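A toy simulation of that trade, with entirely made-up success rates (nothing here is measured from real models): one 60%-reliable monolithic call versus five 90%-reliable subtask calls that each get one retry.

```python
# Toy illustration of the bias/variance point with invented numbers:
# one big call that succeeds 60% of the time vs. five small, independently
# verified subtask calls that each succeed 90% of the time with one retry.
import random

def monolithic() -> bool:
    return random.random() < 0.60           # assumed success rate, not measured

def decomposed(subtasks: int = 5, p: float = 0.90, retries: int = 1) -> bool:
    for _ in range(subtasks):
        if not any(random.random() < p for _ in range(retries + 1)):
            return False                     # a subtask failed even after retry
    return True

trials = 10_000
print("monolithic :", sum(monolithic() for _ in range(trials)) / trials)
print("decomposed :", sum(decomposed() for _ in range(trials)) / trials)
```

The only point of the toy is the shape of the result: many small, verifiable, retryable calls give a more predictable overall outcome than one large call, at the cost of orchestration overhead.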
by BenoitEssiambre on 2/3/26, 3:36 AM
The probabilistic version of "Do No Harm" is "Do not take excessive risk of harm".
This should work as AIs become smarter, because intelligence implies becoming a better Bayesian, which implies being great at calibrating confidence intervals on their interpretations and their reasoning, and basically gaining a superhuman ability to evaluate the bounds of ambiguity and risk.
Now this doesn't mean that AIs won't be misaligned, only that it should be possible to align them. Not every AI maker will necessarily bother to align them properly, especially in adversarial, military applications.
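A toy way to operationalize "do not take excessive risk of harm" as calibrated Bayesian updating; the prior, likelihoods, and 1% risk budget below are invented purely for illustration:

```python
# Toy Bayesian gate: update P(harm) from evidence about one proposed action
# and refuse to act when the posterior exceeds a risk budget.
# The prior, likelihood ratios, and budget are made-up illustrative numbers.

def posterior_harm(prior_harm: float, likelihood_if_harm: float,
                   likelihood_if_safe: float) -> float:
    """Bayes' rule for a single piece of evidence about the action."""
    num = likelihood_if_harm * prior_harm
    den = num + likelihood_if_safe * (1.0 - prior_harm)
    return num / den

RISK_BUDGET = 0.01   # assumed acceptable probability of harm

def decide(prior_harm: float, evidence: list[tuple[float, float]]) -> str:
    p = prior_harm
    for lik_harm, lik_safe in evidence:
        p = posterior_harm(p, lik_harm, lik_safe)
    return "act" if p < RISK_BUDGET else "defer / gather more evidence"

# Example: a 5% prior that the action is harmful, plus two reassuring
# observations that are each 10x likelier under the "safe" hypothesis.
print(decide(0.05, [(0.1, 1.0), (0.1, 1.0)]))   # -> "act"
```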
by leahtheelectron on 2/3/26, 3:18 AM
by tbrownaw on 2/3/26, 3:09 AM
by bjt12345 on 2/3/26, 4:06 AM
This is a big deal, but are they only looking at auto-regressive models?
by nayroclade on 2/3/26, 1:28 AM
by bazzmt on 2/3/26, 7:03 AM
This should not be surprising.
Systematic misalignment, i.e., bias, is still coherent and rational if it is to be systematic. That would require the AI to reason, but AI does not reason (let alone think); it does not do inference.
by cadamsdotcom on 2/3/26, 3:29 AM
It is no surprise that models need grounding too, lest their outputs be no more useful than dreams.
It’s us engineers who give arms and legs to models, so they can navigate the world and succeed at their tasks.
by eande on 2/3/26, 5:54 AM
by kalap_ur on 2/3/26, 2:43 PM
Language models are probabilistic, not deterministic. Therefore incoherence _by definition_ increases as a response becomes lengthier. This is not true for humans, who tend to act and communicate deterministically. If I ask a human to read a PDF and then ask, "Does the word 'paperclip' appear in the PDF?", the human will deterministically provide a yes/no answer, and no matter how many times we repeat the process, they will provide the same answer consistently (and not due to autocorrelation, because this can be done across different humans). LMs will give a probabilistic response that depends on the training itself: with a very well-trained model we can get a 99% probabilistic outcome, which means that out of 100 simulations it will give you the wrong answer once. We have no clue about the probabilistic component of LMs; however, simulations could be done to research this. Also, I would be very curious about autocorrelation in models: if a human did a task and came to the conclusion "yes", then they will always respond to the same task, with increasing amounts of eyerolling, "yes".
Also, imagine the question "Is the sky blue?" Answer 1: "Yes." This has zero incoherence. Answer 2: "Yes, but sometimes it looks black, sometimes blue." While this answer seemingly has zero incoherence, the probability of increased incoherence is larger than zero, given that answer generation itself is probabilistic. Answer generation by humans is not probabilistic.
Therefore, probability-driven LMs (all LMs today are probability-driven) will always exhibit higher incoherence than humans.
I wonder if anybody would disagree with the above.
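A rough sketch of what that simulation could look like; ask_yes_no() is a placeholder for a real model call at temperature > 0, and the paperclip question and pdf_text are just the hypothetical example above:

```python
# Rough sketch of the consistency simulation described above: ask the same
# yes/no question N times and look at the answer distribution.
# ask_yes_no() is a placeholder for an actual model call at temperature > 0.
from collections import Counter

def ask_yes_no(question: str, context: str) -> str:
    raise NotImplementedError("wire up your LLM client; return 'yes' or 'no'")

def consistency(question: str, context: str, n: int = 100) -> Counter:
    return Counter(ask_yes_no(question, context) for _ in range(n))

# e.g. consistency("Does the word 'paperclip' appear in this document?", pdf_text)
# A perfectly consistent responder yields Counter({'yes': 100}) or
# Counter({'no': 100}); anything in between is the probabilistic component.
```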
by lewdwig on 2/3/26, 9:15 AM
by makerdiety on 2/3/26, 2:37 PM
by hogehoge51 on 2/3/26, 3:34 AM
by root_axis on 2/3/26, 3:59 AM
I just want to nitpick something that really annoys me that has become extremely common: the tendency to take every opportunity to liken all qualities of LLMs to humans. Every quirk, failure, oddity, limitation, or implementation detail is relentlessly anthropomorphized. It's to the point where many enthusiasts have convinced themselves that humans think by predicting the next token.
It feels a bit like a cult.
Personally, I appreciate more sobriety in tech, but I can accept that I'm in the minority in that regard.
by IgorPartola on 2/3/26, 1:05 AM
by tsunamifury on 2/3/26, 1:02 AM
LLMs aren’t constrained to linear logic like your average human.
by gnarlouse on 2/3/26, 5:58 AM
by throwpoaster on 2/3/26, 1:04 AM