by shoo on 10/10/25, 9:54 PM with 14 comments
by hshdhdhehd on 10/15/25, 2:23 AM
However, I imagine you'd also get the same great performance using an array?
by stinkbeetle on 10/15/25, 3:32 AM
Someone with an M >= 2 might try the code and find no speedup with the "improved" version, and find that it's already iterating faster than the L1 load-to-use latency.
by rini17 on 10/15/25, 2:03 PM
by signa11 on 10/11/25, 4:00 PM