<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>guyfischman</title><description>Notes on security and performance.</description><link>https://guyfischman.com/</link><item><title>127 lines of catch-up for a problem Apple flagged in 2016</title><link>https://guyfischman.com/posts/idiv-metal/</link><guid isPermaLink="true">https://guyfischman.com/posts/idiv-metal/</guid><description>Or: how I spent thirteen months almost noticing what a WWDC talk said out loud a decade ago</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;p&gt;&lt;em&gt;Or: how I spent thirteen months[^1] almost noticing what a WWDC talk said out loud a decade ago.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;[^1]: The actual investigation was a day. I noticed the idiv
weirdness, forgot about it for a year, and only wandered back into the llama hot loop last week because I was procrastinating on something else.&lt;/p&gt;
&lt;p&gt;Apple&apos;s M-series GPU has no hardware integer divide. None. In 2026. So when you ask AGX to compute &lt;code&gt;a / b&lt;/code&gt;, it runs a ~730-cycle software subroutine — a tiny shameful program nestled inside an otherwise competent GPU (mostly competent: it also lacks &lt;code&gt;double&lt;/code&gt;, more on that in a future post) — doing long division the way you learned it in fourth grade. Unless, that is, the divisor is a compile-time constant &lt;em&gt;to the AGX backend&lt;/em&gt;, which is a much narrower category than you&apos;d think.&lt;/p&gt;
&lt;p&gt;ggml&apos;s Metal matmul kernels (&lt;code&gt;mul_mm&lt;/code&gt;, &lt;code&gt;mul_mv&lt;/code&gt;) eat four such divides per thread, by values that are constant for the entire inference session. The divisors come in through a uniform buffer, so the AGX compiler never sees them, so every dispatch pays ~2,800 cycles of &lt;code&gt;udiv&lt;/code&gt; it absolutely doesn&apos;t have to. Multiply by every thread of every dispatch of every layer of every token on every Mac running llama.cpp, and you arrive at a number that&apos;s embarrassing to contemplate.&lt;/p&gt;
&lt;p&gt;The fix is mechanical: promote the four divisors to MSL
&lt;code&gt;[[function_constant]]&lt;/code&gt;s, baked into the pipeline-state object at PSO-compile time. The AGX backend then strength-reduces them to a no-op (&lt;code&gt;ne12=1&lt;/code&gt;), a shift (pow2), or a magic-multiply (otherwise).&lt;/p&gt;
&lt;h2&gt;Where AGX&apos;s divide goes&lt;/h2&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;where &lt;code&gt;d&lt;/code&gt; comes from&lt;/th&gt;
&lt;th&gt;per-op cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Pow2 literal (&lt;code&gt;% 256u&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;~5 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Non-pow2 literal (&lt;code&gt;% 255u&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;~80 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;[[function_constant]]&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~80 cy&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kernel argument / uniform buffer&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~730 cy&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;The slow path exists because at PSO build time there&apos;s no value to fold; the buffer&apos;s contents only exist at dispatch time. The fast paths all share one thing: AGX could see the divisor when it built the pipeline. And it turns out Apple warned us about this a decade ago, at &lt;a href=&quot;https://developer.apple.com/videos/play/wwdc2016/606/?time=1521&quot;&gt;WWDC 2016&lt;/a&gt;: &quot;So avoid division or modulus by denominators that are not literals or function constants... that will be very, very slow. Think hundreds of clock cycles.&quot;&lt;/p&gt;
&lt;h2&gt;When it matters&lt;/h2&gt;
&lt;p&gt;The win requires three things at once:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The kernel is &lt;strong&gt;ALU-bound&lt;/strong&gt;, not memory-bound. If the divide hides behind a memory stall, fixing it buys nothing.&lt;/li&gt;
&lt;li&gt;The divisor is &lt;strong&gt;constant for the run&lt;/strong&gt; — tensor dims, GQA group sizes, ring-buffer lengths.&lt;/li&gt;
&lt;li&gt;The divide happens &lt;strong&gt;often&lt;/strong&gt; — once per thread, large grid.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;ggml&apos;s matmul kernels happen to hit all three. Most other places you&apos;d grep for don&apos;t — fragment shaders are float-land, audio kernels are memory-bound, Stable Diffusion&apos;s UNet has the wrong arithmetic intensity. So this isn&apos;t some general technique I&apos;ve unlocked; it&apos;s one kernel family where the stars aligned.&lt;/p&gt;
&lt;h2&gt;The fix&lt;/h2&gt;
&lt;pre&gt;&lt;code&gt;// before — divisors come through the args struct, opaque to the AGX compiler
const int i13 = im / args.ne12;        // ~730 cy
const int i12 = im % args.ne12;        // ~730 cy
const int i02 = i12 / args.r2;         // ~730 cy
const int i03 = i13 / args.r3;         // ~730 cy

// after — same values, promoted to function constants at PSO build
constant int16_t FC_mul_mv_ne12 [[function_constant(FC_MUL_MV_NE12)]];
const int i13 = im / FC_mul_mv_ne12;   // no-op (ne12=1), shift (pow2), or ~80 cy magic-multiply
// (ne13, r2, r3 promoted the same way)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/ggml-org/llama.cpp/pull/22711&quot;&gt;ggml-org/llama.cpp#22711&lt;/a&gt;. +127 / −88 across 4 files.&lt;/p&gt;
&lt;p&gt;Host-side: extract &lt;code&gt;(ne12, ne13, r2, r3)&lt;/code&gt; from the op shape, bind via &lt;code&gt;MTLFunctionConstantValues&lt;/code&gt;, thread them into the PSO cache key so a Gemma-2 GQA pipeline isn&apos;t reused for a TinyLlama (&lt;code&gt;ne12=1, r2=1, r3=1&lt;/code&gt;) op. The interesting code is two lines; the boring code is two hundred.&lt;/p&gt;
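&lt;p&gt;The cache-key part is worth spelling out, because it&apos;s the easiest thing to get wrong: pipelines specialized with different constant values are different pipelines. A hypothetical C++ sketch of the shape (field and type names are mine, not the PR&apos;s):&lt;/p&gt;

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>

// Once (ne12, ne13, r2, r3) are baked in as function constants, the PSO
// cache must key on them too, or a pipeline specialized for one shape
// would be reused for another. Names are illustrative, not the PR's.
struct PsoKey {
    std::string kernel;          // e.g. "kernel_mul_mv_q4_0_f32"
    int32_t ne12, ne13, r2, r3;  // session-constant divisors baked into the PSO
    bool operator==(const PsoKey & o) const {
        return kernel == o.kernel && ne12 == o.ne12 && ne13 == o.ne13 &&
               r2 == o.r2 && r3 == o.r3;
    }
};

struct PsoKeyHash {
    size_t operator()(const PsoKey & k) const {
        size_t h = std::hash<std::string>{}(k.kernel);
        for (int32_t v : {k.ne12, k.ne13, k.r2, k.r3})
            h = h * 1000003u ^ std::hash<int32_t>{}(v);
        return h;
    }
};

// One PSO per (kernel, divisor tuple); int stands in for the pipeline object.
using PsoCache = std::unordered_map<PsoKey, int, PsoKeyHash>;
```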
&lt;h2&gt;Results&lt;/h2&gt;
&lt;p&gt;M4 Pro, &lt;code&gt;llama-bench -p 512 -n 128 -r 20&lt;/code&gt;, tg128 tok/s:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;model&lt;/th&gt;
&lt;th&gt;quant&lt;/th&gt;
&lt;th&gt;baseline&lt;/th&gt;
&lt;th&gt;patched&lt;/th&gt;
&lt;th&gt;delta&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;TinyLlama 1.1B&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;239.05&lt;/td&gt;
&lt;td&gt;246.57&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.15%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 1B&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;227.58&lt;/td&gt;
&lt;td&gt;230.96&lt;/td&gt;
&lt;td&gt;+1.49% (noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3 1B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;164.67&lt;/td&gt;
&lt;td&gt;173.52&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+5.37%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 2 2B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;103.41&lt;/td&gt;
&lt;td&gt;106.66&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+3.14%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B (GQA)&lt;/td&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;52.10&lt;/td&gt;
&lt;td&gt;52.73&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+1.21%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Perplexity is bit-identical: magic-multiply is exact for unsigned division, and since our values are all nonnegative the signed divides reduce to the unsigned case. The biggest wins are on GQA models — exactly the ones people actually run — which exercise &lt;em&gt;all four&lt;/em&gt; divisors meaningfully: &lt;code&gt;r2&lt;/code&gt; and &lt;code&gt;r3&lt;/code&gt; are &amp;gt;1, so the dividing is real work.&lt;/p&gt;
&lt;h2&gt;Global impact, or lying with arithmetic&lt;/h2&gt;
&lt;p&gt;1M DAU × ~20K tokens/day ÷ ~150 tok/s × ~3% speedup ≈ &lt;strong&gt;~46 person-years per year&lt;/strong&gt;, if you squint. Add the rest of the ggml ecosystem and call it 50. These numbers are off by up to 10× in either direction.&lt;/p&gt;
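&lt;p&gt;The squint-math, written out (every input is one of the made-up estimates above, not a measurement):&lt;/p&gt;

```cpp
// Back-of-envelope from the guesses above. Person-days of generation time
// saved per day is numerically equal to person-years saved per year.
double person_years_per_year(double dau, double tok_per_day,
                             double tok_per_s, double speedup) {
    double gen_seconds_per_day = dau * tok_per_day / tok_per_s; // fleet-wide time spent generating
    double saved_per_day       = gen_seconds_per_day * speedup; // seconds shaved off daily
    return saved_per_day / 86400.0;                             // days saved per day
}
```

&lt;p&gt;1M DAU, 20K tokens/day, 150 tok/s, 3% comes out to ~46.3 — hence the number in the paragraph above.&lt;/p&gt;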
&lt;p&gt;If you write Metal compute: grep your &lt;code&gt;.metal&lt;/code&gt; files for &lt;code&gt;/ args.&lt;/code&gt; and &lt;code&gt;% args.&lt;/code&gt;. For each hit, ask if the divisor is constant for the kernel&apos;s deployment. If yes, and the kernel isn&apos;t memory-bound, there&apos;s a ~9× win sitting on the divide.&lt;/p&gt;
&lt;p&gt;If you don&apos;t write Metal but ship software that runs ggml on a Mac: things got 1–5% faster this week.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;PR: &lt;a href=&quot;https://github.com/ggml-org/llama.cpp/pull/22711&quot;&gt;ggml-org/llama.cpp#22711&lt;/a&gt;. Reproducer: &lt;a href=&quot;https://github.com/SovereignSoft/agx-idiv-demo&quot;&gt;github.com/SovereignSoft/agx-idiv-demo&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
</content:encoded></item></channel></rss>