Why ROCm Wins the Throughput Race but Loses the Power Bill on Strix Halo — A 35% Energy Reversal Caused by APU Runtime Polling

🇰🇷 Korean sister page: "Why ROCm Lost on Strix Halo Despite Being Faster: The 35% Power-Efficiency Reversal Caused by APU Runtime Polling" (Korean)

<aside> 🎯

TL;DR: On AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S iGPU) running the same Qwen3-30B-A3B Q4_K_M model, HIP/ROCm dominates raw throughput (prompt processing, pp512: 354 vs 78 t/s; token generation, tg: 48 vs 36 t/s). But simply keeping the model loaded burns 2.5 CPU cores continuously because of HSA runtime polling. For 24/7 workloads where idle time dominates, Windows Ollama (Vulkan) consumes about 35% less energy per token. A clean case study in why dGPU intuitions don't transfer to APUs.

</aside>

<aside> 📊

[Image #1 — Hero] ROCm and Vulkan facing off on top of a Strix Halo APU

file: 2026-05-31-strixhalo-hero.png

</aside>

📌 Intro — Strix Halo, a new "middle-ground" platform

Who this is for — Infra/ML engineers running local LLM workloads on Strix Halo / Ryzen AI MAX+ systems, and any backend decision-maker who has to validate the "AMD GPU means ROCm" intuition. If you ever picked an inference backend based on a single throughput table, this case study is for you.

AMD Strix Halo (Ryzen AI MAX+ 395 + Radeon 8060S iGPU, gfx1151) is neither a desktop dGPU nor a laptop iGPU — it's a new middle ground. With up to 65 GB of unified memory (UMA/GTT) you can fit a 30B-class LLM entirely on the GPU, and "run an MoE 30B on a single desktop" stops being marketing and becomes a measurable claim.

The trap is what happens the moment you carry the old dGPU intuition — "AMD GPU = ROCm is the answer" — onto this new variant. Your operational metrics invert. This article logs four measurement rounds comparing ROCm 7.2.4 + a self-built HIP llama.cpp against Windows-native Ollama (Vulkan), on the same model, on the same machine.
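
To make the inversion concrete, here is a minimal sketch of the energy-per-token arithmetic for a machine that stays on all day. Only the two generation speeds (48 and 36 t/s) come from the measurements in this article; the wattages and the names `Backend`, `FAST_POLLING`, and `SLOW_QUIET` are hypothetical placeholders chosen for illustration, not measured values.

```python
from dataclasses import dataclass

SECONDS_PER_DAY = 24 * 60 * 60


@dataclass(frozen=True)
class Backend:
    name: str
    tokens_per_s: float   # generation throughput
    active_watts: float   # average draw while generating (placeholder)
    idle_watts: float     # average draw with the model loaded but idle (placeholder)


def joules_per_token(backend: Backend, tokens_per_day: float) -> float:
    """Average energy per token over 24 hours, idle time included."""
    if tokens_per_day <= 0:
        raise ValueError("tokens_per_day must be positive")
    active_s = tokens_per_day / backend.tokens_per_s
    if active_s > SECONDS_PER_DAY:
        raise ValueError(f"{backend.name} cannot produce that many tokens in a day")
    idle_s = SECONDS_PER_DAY - active_s
    total_joules = backend.active_watts * active_s + backend.idle_watts * idle_s
    return total_joules / tokens_per_day


# Throughputs are the tg figures from this article; the wattages are
# ASSUMED values for illustration only, not measurements.
FAST_POLLING = Backend("faster, polls while idle", 48.0, active_watts=100.0, idle_watts=40.0)
SLOW_QUIET = Backend("slower, quiet while idle", 36.0, active_watts=100.0, idle_watts=10.0)

if __name__ == "__main__":
    for tokens in (100_000, 1_000_000, 3_000_000):
        for backend in (FAST_POLLING, SLOW_QUIET):
            print(f"{tokens:>9} tok/day  {backend.name:<26} "
                  f"{joules_per_token(backend, tokens):8.2f} J/token")
```

With these placeholder wattages the faster backend wins only when the machine is busy most of the day. At a low duty cycle the idle term dominates the sum, and the backend with the cheaper idle state comes out ahead regardless of peak speed.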

Quick glossary:

APU (Accelerated Processing Unit): A single processor package containing both CPU and GPU, sharing the same system memory. A dGPU (Discrete GPU) sits on a separate card with its own VRAM, connected via PCIe.
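UMA/GTT: The two pools of system RAM an APU's GPU can address. The UMA frame buffer is a portion of system RAM carved out in BIOS as GPU-visible memory; GTT (Graphics Translation Table) memory is ordinary system RAM that the driver maps for the GPU. Unlike a dGPU's dedicated VRAM, the UMA carve-out comes out of system RAM, and it is static: sized in BIOS, not adjusted at runtime.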
HSA polling: While waiting for GPU work to complete, a CPU thread busy-waits, checking the completion signal in a loop instead of sleeping until an OS interrupt wakes it. On a dGPU system the cost is easy to overlook; on an APU, where CPU and GPU share one power budget, it shows up loud and clear.
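
The difference between polling and an interrupt-style wait can be sketched in a few lines of Python. This is an analogy only, not the HSA runtime's actual code: a timer thread stands in for the GPU finishing its work, and the hypothetical helper `cpu_seconds_while_waiting` reports how much CPU time the process burned while waiting.

```python
import threading
import time


def cpu_seconds_while_waiting(wait_s: float, poll: bool) -> float:
    """Return process CPU time consumed while waiting wait_s seconds for a completion flag."""
    done = threading.Event()
    # Stand-in for the GPU signalling completion after wait_s seconds.
    timer = threading.Timer(wait_s, done.set)
    start = time.process_time()
    timer.start()
    if poll:
        # Polling: check the flag in a tight loop and never give up the core.
        while not done.is_set():
            pass
    else:
        # Blocking: sleep until the flag is set, much like waiting on an interrupt.
        done.wait()
    timer.join()
    return time.process_time() - start


if __name__ == "__main__":
    polled = cpu_seconds_while_waiting(0.5, poll=True)
    blocked = cpu_seconds_while_waiting(0.5, poll=False)
    print(f"polling wait : {polled:.3f} s of CPU time")
    print(f"blocking wait: {blocked:.3f} s of CPU time")
```

Both calls take the same wall-clock time to return. The polling variant keeps a core busy for roughly the whole wait, while the blocking variant uses almost no CPU time, and that gap is the cost this article tracks.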

🔬 Round 1 — VRAM recognition ("ROCm only sees 4 GB" — solved)

The first wall was the predictable one. With the default BIOS allocation, ROCm only sees the dedicated VRAM (4 GB), so the 30B model can't be GPU-offloaded. Increasing BIOS UMA/GTT to 65 GB brings rocminfo's Pool 1 (GLOBAL segment) to about 95.8 GiB, and llama-bench correctly enumerates the device.

<aside> 🧩

[Image #2] BIOS UMA dial — from 4 GB to 65 GB

file: 2026-05-31-strixhalo-vram-dial.png

</aside>

rocminfo (gfx1151 agent, Pool 1):
  Segment: GLOBAL; FLAGS: COARSE GRAINED
  Size:    100,478,882 KB  ≈ 95.82 GiB

llama-bench device discovery:
  Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151)
  VRAM: 98123 MiB,  free: 97044 MiB
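
As a sanity check on the two readings above, the sketch below converts both to GiB and tests whether a model of a given size would fit. It assumes rocminfo's "KB" means 1024 bytes, which the close agreement with llama-bench's MiB figure supports. The helper names, the 2 GiB headroom default, and the 18 GiB model size in the demo are hypothetical, not measured.

```python
KIB_PER_GIB = 1024 ** 2
MIB_PER_GIB = 1024


def kib_to_gib(kib: int) -> float:
    """Convert a size in KiB (rocminfo's 'KB', by assumption) to GiB."""
    return kib / KIB_PER_GIB


def mib_to_gib(mib: int) -> float:
    """Convert a size in MiB (as printed by llama-bench) to GiB."""
    return mib / MIB_PER_GIB


def fits_on_gpu(model_gib: float, free_mib: int, headroom_gib: float = 2.0) -> bool:
    """True if the model plus some headroom fits in the free GPU memory.

    headroom_gib is an assumed allowance for KV cache and buffers.
    """
    return model_gib + headroom_gib <= mib_to_gib(free_mib)


if __name__ == "__main__":
    print(f"rocminfo pool    : {kib_to_gib(100_478_882):.2f} GiB")
    print(f"llama-bench total: {mib_to_gib(98_123):.2f} GiB")
    print(f"llama-bench free : {mib_to_gib(97_044):.2f} GiB")
    # 18.0 GiB is a placeholder model size, not a measured figure.
    print("fits with 97044 MiB free:", fits_on_gpu(18.0, 97_044))
    print("fits with 4096 MiB free :", fits_on_gpu(18.0, 4_096))
```

The two tools agree to within about a megabyte, so they are reporting the same pool. The last line of the demo mirrors the default 4 GB situation, where the same check fails.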

This part is uncontroversial. "We can use it now" is the conclusion, and most articles would stop here.