<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>guyfischman</title><description>Notes on security and performance.</description><link>https://guyfischman.com/</link><item><title>127 lines of catch-up for a problem Apple flagged in 2016</title><link>https://guyfischman.com/posts/idiv-metal/</link><guid isPermaLink="true">https://guyfischman.com/posts/idiv-metal/</guid><description>Or: how I spent thirteen months almost noticing what a WWDC talk said out loud a decade ago</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Or: how I spent thirteen months[^1] almost noticing what a WWDC talk said out loud a decade ago.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;[^1]: The actual investigation was a day. I noticed the idiv
weirdness, forgot about it for a year, and only wandered back into the llama hot loop last week because I was procrastinating on something else.&lt;/p&gt;
&lt;p&gt;Apple&apos;s M-series GPU has no hardware integer divide. None. In 2026. So when you ask AGX to compute &lt;code&gt;a / b&lt;/code&gt;, it runs a ~730-cycle software subroutine — a tiny shameful program nestled inside an otherwise competent GPU (mostly competent: it also lacks &lt;code&gt;double&lt;/code&gt;, more on that in a future post) — doing long division the way you learned it in fourth grade. Unless, that is, the divisor is a compile-time constant &lt;em&gt;to the AGX backend&lt;/em&gt;, which is a much narrower category than you&apos;d think.&lt;/p&gt;
&lt;p&gt;ggml&apos;s Metal matmul kernels (&lt;code&gt;mul_mm&lt;/code&gt;, &lt;code&gt;mul_mv&lt;/code&gt;) eat four such divides per thread, by values that are constant for the entire inference session. The divisors come in through a uniform buffer, so the AGX compiler never sees them, so every dispatch pays ~2,800 cycles of &lt;code&gt;udiv&lt;/code&gt; it absolutely doesn&apos;t have to. Multiply by every thread of every dispatch of every layer of every token on every Mac running llama.cpp, and you arrive at a number that&apos;s embarrassing to contemplate.&lt;/p&gt;
&lt;p&gt;The fix is mechanical: promote the four divisors to MSL
&lt;code&gt;[[function_constant]]&lt;/code&gt;s, baked into the pipeline-state object at PSO-compile time. The AGX backend then strength-reduces them to a no-op (&lt;code&gt;ne12=1&lt;/code&gt;), a shift (pow2), or a magic-multiply (otherwise).&lt;/p&gt;
&lt;h2&gt;Where AGX&apos;s divide goes&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;where &lt;code&gt;d&lt;/code&gt; comes from&lt;/th&gt;
&lt;th&gt;per-op cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pow2 literal (&lt;code&gt;% 256u&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;~5 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-pow2 literal (&lt;code&gt;% 255u&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;~80 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[[function_constant]]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~80 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel argument / uniform buffer&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~730 cy&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The slow path exists because at PSO build time there&apos;s no value to fold; the buffer&apos;s contents only exist at dispatch time. The fast paths all share one thing: AGX could see the divisor when it built the pipeline. And it turns out Apple warned us about this a decade ago, at &lt;a href=&quot;https://developer.apple.com/videos/play/wwdc2016/606/?time=1521&quot;&gt;WWDC 2016&lt;/a&gt;: &quot;So avoid division or modulus by denominators that are not literals or function constants... that will be very, very slow. Think hundreds of clock cycles.&quot;&lt;/p&gt;
&lt;h2&gt;When it matters&lt;/h2&gt;
&lt;p&gt;The win requires three things at once:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The kernel is &lt;strong&gt;ALU-bound&lt;/strong&gt;, not memory-bound. If the divide hides behind a memory stall, fixing it buys nothing.&lt;/li&gt;
&lt;li&gt;The divisor is &lt;strong&gt;constant for the run&lt;/strong&gt; — tensor dims, GQA group sizes, ring-buffer lengths.&lt;/li&gt;
&lt;li&gt;The divide happens &lt;strong&gt;often&lt;/strong&gt; — once per thread, large grid.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ggml&apos;s matmul kernels happen to hit all three. Most other places you&apos;d grep for don&apos;t — fragment shaders are float-land, audio kernels are memory-bound, Stable Diffusion&apos;s UNet has the wrong arithmetic intensity. So this isn&apos;t some general technique I&apos;ve unlocked; it&apos;s one kernel family where the stars aligned.&lt;/p&gt;
&lt;h2&gt;The fix&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;// before — divisors come through the args struct, opaque to the AGX compiler
const int i13 = im / args.ne12;        // ~730 cy
const int i12 = im % args.ne12;        // ~730 cy
const int i02 = i12 / args.r2;         // ~730 cy
const int i03 = i13 / args.r3;         // ~730 cy

// after — same values, promoted to function constants at PSO build
constant int16_t FC_mul_mv_ne12 [[function_constant(FC_MUL_MV_NE12)]];
const int i13 = im / FC_mul_mv_ne12;   // no-op (ne12=1), shift (pow2), or ~80 cy magic-multiply
// (ne13, r2, r3 promoted the same way)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ggml-org/llama.cpp/pull/22711&quot;&gt;ggml-org/llama.cpp#22711&lt;/a&gt;. +127 / −88 across 4 files.&lt;/p&gt;
&lt;p&gt;Host-side: extract &lt;code&gt;(ne12, ne13, r2, r3)&lt;/code&gt; from the op shape, bind via &lt;code&gt;MTLFunctionConstantValues&lt;/code&gt;, thread them into the PSO cache key so a Gemma-2 GQA pipeline isn&apos;t reused for a TinyLlama (&lt;code&gt;ne12=1, r2=1, r3=1&lt;/code&gt;) op. The interesting code is two lines; the boring code is two hundred.&lt;/p&gt;
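&lt;p&gt;The cache-key part is worth spelling out, because it&apos;s the easiest thing to get wrong: pipelines specialized with different constant values are different pipelines. A hypothetical C++ sketch of the shape (field and type names are mine, not the PR&apos;s):&lt;/p&gt;

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Once (ne12, ne13, r2, r3) are baked in as function constants, the PSO
// cache must key on them too, or a pipeline specialized for one shape
// would be reused for another. Names are illustrative, not the PR's.
struct PsoKey {
    std::string kernel;          // e.g. "kernel_mul_mv_q4_0_f32"
    int32_t ne12, ne13, r2, r3;  // session-constant divisors baked into the PSO
    bool operator==(const PsoKey & o) const {
        return kernel == o.kernel && ne12 == o.ne12 && ne13 == o.ne13 &&
               r2 == o.r2 && r3 == o.r3;
    }
};

struct PsoKeyHash {
    size_t operator()(const PsoKey & k) const {
        size_t h = std::hash<std::string>{}(k.kernel);
        for (int32_t v : {k.ne12, k.ne13, k.r2, k.r3})
            h = h * 1000003u ^ std::hash<int32_t>{}(v);
        return h;
    }
};

// One PSO per (kernel, divisor tuple); int stands in for the pipeline object.
using PsoCache = std::unordered_map<PsoKey, int, PsoKeyHash>;
```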
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;M4 Pro, &lt;code&gt;llama-bench -p 512 -n 128 -r 20&lt;/code&gt;, tg128 tok/s:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;baseline&lt;/th&gt;
&lt;th&gt;patched&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TinyLlama 1.1B&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;239.05&lt;/td&gt;
&lt;td&gt;246.57&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.15%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;227.58&lt;/td&gt;
&lt;td&gt;230.96&lt;/td&gt;
&lt;td&gt;+1.49% (noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3 1B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;164.67&lt;/td&gt;
&lt;td&gt;173.52&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.37%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 2 2B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;103.41&lt;/td&gt;
&lt;td&gt;106.66&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.14%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;52.10&lt;/td&gt;
&lt;td&gt;52.73&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.21%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Perplexity is bit-identical: magic-multiply is exact for unsigned division, and since our values are all nonnegative the signed divides reduce to the unsigned case. The biggest wins are on GQA models — exactly the ones people actually run — which exercise &lt;em&gt;all four&lt;/em&gt; divisors meaningfully: &lt;code&gt;r2&lt;/code&gt; and &lt;code&gt;r3&lt;/code&gt; are &amp;gt;1, so the dividing is real work.&lt;/p&gt;
&lt;h2&gt;Global impact, or lying with arithmetic&lt;/h2&gt;
&lt;p&gt;1M DAU × ~20K tokens/day ÷ ~150 tok/s × ~3% speedup ≈ &lt;strong&gt;~46 person-years per year&lt;/strong&gt;, if you squint. Add the rest of the ggml ecosystem and call it 50. These numbers are off by up to 10× in either direction.&lt;/p&gt;
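&lt;p&gt;The squint-math, written out (every input is one of the made-up estimates above, not a measurement):&lt;/p&gt;

```cpp
// Back-of-envelope from the guesses above. Person-days of generation time
// saved per day is numerically equal to person-years saved per year.
double person_years_per_year(double dau, double tok_per_day,
                             double tok_per_s, double speedup) {
    double gen_seconds_per_day = dau * tok_per_day / tok_per_s; // fleet-wide time spent generating
    double saved_per_day       = gen_seconds_per_day * speedup; // seconds shaved off daily
    return saved_per_day / 86400.0;                             // days saved per day
}
```

&lt;p&gt;1M DAU, 20K tokens/day, 150 tok/s, 3% comes out to ~46.3 — hence the number in the paragraph above.&lt;/p&gt;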
&lt;p&gt;If you write Metal compute: grep your &lt;code&gt;.metal&lt;/code&gt; files for &lt;code&gt;/ args.&lt;/code&gt; and &lt;code&gt;% args.&lt;/code&gt;. For each hit, ask if the divisor is constant for the kernel&apos;s deployment. If yes, and the kernel isn&apos;t memory-bound, there&apos;s a ~9× win sitting on the divide.&lt;/p&gt;
&lt;p&gt;If you don&apos;t write Metal but ship software that runs ggml on a Mac: things got 1–5% faster this week.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;PR: &lt;a href=&quot;https://github.com/ggml-org/llama.cpp/pull/22711&quot;&gt;ggml-org/llama.cpp#22711&lt;/a&gt;. Reproducer: &lt;a href=&quot;https://github.com/SovereignSoft/agx-idiv-demo&quot;&gt;github.com/SovereignSoft/agx-idiv-demo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content:encoded></item></channel></rss>