Friday, April 10, 2026

Which AI Systems Are Actually Using a Local NPU?

There's a lot of marketing noise around NPUs right now: every new laptop seems to boast one, and I was wondering how much attention I should pay to it. Having read everything below, generated by Claude.ai, my answer is: not much at this point. And if you are a Mac user, as I am, you get an NPU with every M-series processor anyway.

When it comes to the AI assistants most of us actually use daily, like Claude, ChatGPT, or Gemini, none of them run on your local NPU. They're entirely cloud-based, sending your prompts to remote data centers for inference regardless of what hardware you're sitting in front of. So who is actually making use of the NPU? Let's break it down.

Apple Intelligence — The Most Mature Consumer Example

Apple Intelligence on M-series Macs and A17+ iPhones is currently the most polished mainstream example of a ChatGPT-like feature set genuinely leveraging a local NPU. Tasks like text summarization, smart replies, writing tools, and photo editing run entirely on-device using Apple's Neural Engine. Only more complex requests are escalated to Apple's Private Cloud Compute infrastructure. This hybrid approach gives users real privacy guarantees for everyday AI tasks, and that is a meaningful differentiator.

Microsoft Copilot on Windows 11 — Partial NPU Usage

Microsoft Copilot on Copilot+ PCs offloads specific tasks — image generation in Paint (Cocreator), live captions, and background AI features — to the Intel or Qualcomm NPU built into the device. However, the conversational AI portion (the LLM chat) still hits Microsoft's cloud. So it's a hybrid: some features are genuinely local, but the assistant experience you think of as "Copilot" is not running on your NPU.

Microsoft Foundry Local — Explicit NPU Inference

For those who want to go deeper, Microsoft's Foundry Local tool allows you to explicitly route model inference to the NPU on Copilot+ PCs — including Qualcomm Snapdragon X Elite devices. Models like Phi-3.5-mini can be run directly on the NPU, achieving around 16 tokens per second on supported hardware. This is a developer/power user tool rather than a consumer product, but it's the most concrete implementation of true NPU-targeted LLM inference on a Windows PC today.
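
If you want to poke at this yourself, here's a minimal sketch of talking to a Foundry Local model from Python. It assumes the service is running and exposing its OpenAI-compatible endpoint locally; the port and the exact model id below are placeholders to check against your own install, and the CLI command in the comment is my best reading of Microsoft's docs rather than gospel.

```python
# Minimal sketch: chatting with a model served by Foundry Local.
# Assumptions (verify against your setup): the model has been started,
# e.g. with `foundry model run phi-3.5-mini`, and the service exposes an
# OpenAI-compatible endpoint on localhost. The port (5273) and the model
# id are placeholders -- check what your local service actually reports.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:5273/v1",  # placeholder port
    api_key="not-needed-for-local",       # a local endpoint ignores the key
)

response = client.chat.completions.create(
    model="phi-3.5-mini",  # alias; the real id may differ on your machine
    messages=[
        {"role": "user", "content": "Summarize what an NPU is in two sentences."}
    ],
)
print(response.choices[0].message.content)
```

Note that nothing in this client code mentions the NPU: the routing to NPU hardware happens inside Foundry Local's runtime, which is exactly what makes it interesting.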

Local LLM Tools — Ollama, LM Studio, and Friends

Tools like Ollama, LM Studio, and Nexa AI can be configured to target the NPU for inference on supported hardware. These are self-hosted solutions where you download and run open-weight models (LLaMA, Mistral, Phi, etc.) yourself. The NPU acceleration is hardware- and driver-dependent: it works well on Qualcomm Snapdragon X Elite and increasingly on Intel Core Ultra platforms. For anyone wanting full control, privacy, and genuinely local inference, this is currently the best path.
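
To make this concrete, here's a minimal sketch of a local round trip to Ollama's REST API from Python. It assumes Ollama is running on its default port and that you've already pulled a model; the model name here is just an example. Whether the tokens come off the NPU, GPU, or CPU is decided by the runtime and drivers underneath, not by this code, which is rather the point.

```python
# Minimal sketch: one local round trip to Ollama's REST API.
# Assumes Ollama is running on its default port (11434) and a model has
# been pulled beforehand, e.g. `ollama pull llama3.2`. The client never
# sees the hardware; NPU/GPU/CPU placement is the runtime's business.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2",  # any locally pulled open-weight model
        "prompt": "In one sentence, what does an NPU accelerate?",
        "stream": False,      # return a single JSON object, not a stream
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```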

Samsung Galaxy AI — Mobile NPU in Action

Samsung's Galaxy AI features on the S24 and S25 series use the Qualcomm Hexagon NPU for on-device tasks including live translation, generative photo editing, and call transcription. Like Apple, Samsung routes more demanding tasks to the cloud while keeping latency-sensitive and privacy-relevant workloads on the device.

The Bottom Line

Claude, ChatGPT, and Gemini are cloud-only; they don't touch your NPU, full stop. Apple Intelligence is the most mature and user-friendly example of NPU-backed AI features in a mainstream product. Microsoft Foundry Local and the open-source local LLM ecosystem are where you go if you want explicit, controllable on-device inference. The NPU story is real, but for now it lives at the OS feature layer and in developer tooling, not in the headline AI assistant products. That will change, but we're not there yet.
