🇰🇷 Korean sister page: "Why ROCm lost on Strix Halo despite being faster: the 35% power-efficiency inversion caused by APU runtime polling" (Korean)
<aside> 🎯
TL;DR: On AMD Strix Halo (Ryzen AI MAX+ 395, Radeon 8060S iGPU) running the same Qwen3-30B-A3B Q4_K_M model, HIP/ROCm dominates raw throughput (pp512 354 vs 78 t/s, tg 48 vs 36 t/s), but merely keeping the model loaded burns 2.5 CPU cores continuously because of HSA runtime polling. For 24/7 workloads where idle time dominates, Windows Ollama (Vulkan) consumes about 35% less energy per token. A clean case study in why dGPU intuitions don't transfer to APUs.
</aside>
<aside>
[Image #1: Hero] ROCm and Vulkan facing off on top of a Strix Halo APU
file: 2026-05-31-strixhalo-hero.png
</aside>

Who this is for: Infra/ML engineers running local LLM workloads on Strix Halo / Ryzen AI MAX+ systems, and any backend decision-maker who has to validate the "AMD GPU means ROCm" intuition. If you ever picked an inference backend based on a single throughput table, this case study is for you.
AMD Strix Halo (Ryzen AI MAX+ 395 + Radeon 8060S iGPU, gfx1151) is neither a desktop dGPU nor a laptop iGPU; it's a new middle ground. With up to 65 GB of unified memory (UMA/GTT) you can fit a 30B-class LLM entirely on the GPU, and "run an MoE 30B on a single desktop" stops being marketing and becomes a measurable claim.
The trap is what happens the moment you carry the old dGPU intuition ("AMD GPU = ROCm is the answer") onto this new variant. Your operational metrics invert. This article logs four measurement rounds comparing ROCm 7.2.4 + a self-built HIP llama.cpp against Windows-native Ollama (Vulkan), on the same model, on the same machine.
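Why a faster backend can still cost more energy per token is easiest to see as arithmetic. The following is a minimal sketch, assuming a simple two-state power model: one wattage while generating, another while the model sits loaded but idle. The tg rates (48 and 36 t/s) are the ones quoted in this article; every wattage and the daily token count are hypothetical placeholders, not measurements, and `energy_per_token_j` is an illustrative helper name.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def energy_per_token_j(active_w, idle_w, tokens_per_s, tokens_per_day):
    """Average joules per generated token over one 24-hour window.

    active_w: package power while generating (watts)
    idle_w:   package power while the model is loaded but idle (watts)
    """
    active_s = tokens_per_day / tokens_per_s
    if active_s > SECONDS_PER_DAY:
        raise ValueError("requested load does not fit into one day")
    idle_s = SECONDS_PER_DAY - active_s
    total_j = active_w * active_s + idle_w * idle_s
    return total_j / tokens_per_day


# HYPOTHETICAL wattages, chosen only to illustrate the shape of the effect.
# The idle figure for HIP is set higher to stand in for runtime polling.
HIP = dict(active_w=120.0, idle_w=45.0, tokens_per_s=48.0)
VULKAN = dict(active_w=110.0, idle_w=25.0, tokens_per_s=36.0)

# Hypothetical light load (mostly idle) and a load that saturates Vulkan.
for tokens_per_day in (200_000, 36 * SECONDS_PER_DAY):
    hip = energy_per_token_j(tokens_per_day=tokens_per_day, **HIP)
    vulkan = energy_per_token_j(tokens_per_day=tokens_per_day, **VULKAN)
    print(f"{tokens_per_day:>9} tokens/day  HIP {hip:6.2f} J/tok  Vulkan {vulkan:6.2f} J/tok")
```

With placeholders like these, the slower backend wins at light load because the idle term dominates the total, and the faster backend wins once the machine is kept busy. Where the crossover actually sits depends on measured idle and active power, not on these numbers.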
Quick glossary:
The first wall was the predictable one. With default BIOS allocation, ROCm only sees the dedicated VRAM (4 GB), so the 30B model can't be GPU-offloaded. Increasing BIOS UMA/GTT to 65 GB brings rocminfo Pool 1 GLOBAL to 95.83 GiB, and llama-bench correctly enumerates the device.
<aside> 🧩
[Image #2] BIOS UMA dial: from 4 GB to 65 GB
file: 2026-05-31-strixhalo-vram-dial.png
</aside>

rocminfo (gfx1151 agent, Pool 1):
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 100,478,882 KB ≈ 95.83 GiB
llama-bench device discovery:
Device 0: AMD Radeon(TM) 8060S Graphics, gfx1151 (0x1151)
VRAM: 98123 MiB, free: 97044 MiB
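The two readouts above use different units, so a quick conversion helps confirm they describe the same pool. This is a small sketch, assuming rocminfo's "KB" means 1024 bytes (which is what the GiB figure above implies). The 18 GiB model size and the 2 GiB headroom are assumptions for illustration, not values from the measurements, and `fits_on_gpu` is a hypothetical helper.

```python
KIB_PER_GIB = 1024 ** 2
MIB_PER_GIB = 1024


def kib_to_gib(kib):
    """Convert a rocminfo pool size (KiB) to GiB."""
    return kib / KIB_PER_GIB


def mib_to_gib(mib):
    """Convert a llama-bench VRAM figure (MiB) to GiB."""
    return mib / MIB_PER_GIB


def fits_on_gpu(model_gib, free_mib, headroom_gib=2.0):
    """True if the model plus headroom fits in the reported free memory."""
    return model_gib + headroom_gib <= mib_to_gib(free_mib)


pool_gib = kib_to_gib(100_478_882)  # rocminfo Pool 1 size from the log above
vram_gib = mib_to_gib(98_123)       # llama-bench VRAM from the log above
print(f"rocminfo pool: {pool_gib:.1f} GiB, llama-bench VRAM: {vram_gib:.1f} GiB")

MODEL_GIB = 18.0  # ASSUMED size of the Q4_K_M file; check the actual GGUF
print("fits with default 4 GB:", fits_on_gpu(MODEL_GIB, 4 * 1024))
print("fits after BIOS change:", fits_on_gpu(MODEL_GIB, 97_044))
```

Both tools land on roughly 95.8 GiB, so the pool that rocminfo reports is the one llama-bench ends up using.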
This part is uncontroversial. "We can use it now" is the conclusion, and most articles would stop here.