
Running 9B Local LLM on Xiaomi Pad 8 Pro — Setup Guide & Real Benchmarks
The Xiaomi Pad 8 Pro is genuinely capable of running local LLMs: in our testing, the 4B Qwen variant runs smoothly at 11 tokens per second. But ask the tablet to load the 9B model and it hits a wall: 8GB of RAM is not enough. Between the OS, the Android runtime, and the model weights, there is simply no headroom left.
You can feel the frustration in that moment. The hardware is sitting right there. The GPU is capable. The only reason this doesn’t work is memory, not intelligence.

That’s where your gaming PC comes in.
The Unlock: Stream 9B from Home
A Gigabyte RTX 5070 Gaming OC loaded with Qwen 3.5 9B, served over your local network via LM Studio, becomes instantly accessible from the Pad 8 Pro through Anything LLM. Your tablet is no longer the engine — it’s the remote control. The PC does the heavy thinking. The tablet shows you the answer.

The Gigabyte RTX 5070 is the hardware that made this work. I've stuck with Gigabyte for years specifically because they have the least drama in the ecosystem: reliability is consistent, customer support is responsive, and their cards don't ship with the headaches some other brands do. The Gaming OC variant brings a modest factory overclock and solid thermals. Neither is necessary for this workload, but the build quality is what sealed the choice.

This setup generates Qwen 3.5 9B responses at a consistent 15-16 tokens per second, streaming live to your tablet over WiFi. It's faster than waiting for a cloud API response, it's entirely under your control, and it needs no internet connection at all, just your home network.

Setup: The Three Gotchas That Will Break Everything
Before you get to the performance benchmarks, you need to survive setup. There are exactly three friction points that will stop you cold if you don’t know about them.
Gotcha 1: Windows Firewall + Antivirus Double-Blocking Port 1234
LM Studio serves the model on port 1234 by default. Windows Defender sees it. Your antivirus sees it. Both will silently block it without telling you why. The tablet will time out trying to connect, and you'll spend an hour wondering if the endpoint is wrong.

The fix is explicit:
- Open the Start menu, type wf.msc, and press Enter (Windows Defender Firewall with Advanced Security)
- Click Inbound Rules on the left
- Click New Rule… on the right
- Choose Port → Next
- Choose TCP and type 1234 in the “Specific local ports” field
- Choose Allow the connection
- Check Private (leave Public unchecked; you only need this rule on your home network)
- Name it “LM Studio Server” and click Finish
Also add a firewall exception in your antivirus settings. If you’re running Avast (like I was), whitelist port 1234 in the Avast Advanced Settings under Network.
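Before touching the tablet, it's worth probing the port from another machine on the network to confirm the firewall and antivirus changes actually took. Below is a minimal Python sketch for that check; the IP address is a placeholder for your PC's LAN address, and it assumes LM Studio's server is already running with a model loaded.

```python
# check_lmstudio.py - verify the LM Studio server is reachable over the LAN.
# Run this from another machine on the same network before blaming the tablet.
import json
import urllib.request

PC_IP = "192.168.0.100"  # placeholder: your gaming PC's IPv4 address from ipconfig
URL = f"http://{PC_IP}:1234/v1/models"

try:
    with urllib.request.urlopen(URL, timeout=5) as resp:
        models = json.load(resp)
        print("Server reachable. Models available:")
        for m in models.get("data", []):
            print(" -", m.get("id"))
except Exception as exc:
    # A timeout here usually means the firewall or antivirus is still blocking port 1234.
    print("Could not reach LM Studio:", exc)
```

If this times out from another device but works in a browser on the PC itself, the problem is almost certainly the firewall rule or the antivirus, not LM Studio.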
Gotcha 2: APIPA Address Mismatch (169.254.x.x vs 192.168.x.x)
If Windows cannot reach a DHCP server when the network comes up, it assigns itself an APIPA address, something like 169.254.83.107, and LM Studio will happily bind to it. This is technically a valid address, but it's a link-local fallback, not an address on your home network. Your tablet won't see it.

The symptom: you can hit the server from your PC browser at http://169.254.83.107:1234/v1/models just fine, but the tablet times out every single time.
The fix:
- Open Command Prompt
- Type ipconfig /release and press Enter
- Type ipconfig /renew and press Enter
- Wait for the IPv4 Address to change to something like 192.168.0.xxx or 192.168.1.xxx
- Stop LM Studio completely
- Restart LM Studio — it will bind to the new IPv4 address
- Now test from the tablet: http://192.168.x.x:1234/v1/models should respond instantly
This is the difference between “stuck for hours” and “working in minutes.”
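If you'd rather not eyeball ipconfig output, a few lines of Python can flag the condition. This is a standard-library-only sketch; it assumes the hostname resolves to the addresses Windows actually assigned to your interfaces, which is usually, but not always, the case.

```python
# apipa_check.py - warn if this PC picked up an APIPA (169.254.x.x) address.
import ipaddress
import socket

APIPA_NET = ipaddress.ip_network("169.254.0.0/16")

hostname = socket.gethostname()
# Collect every IPv4 address associated with this host.
addresses = {info[4][0] for info in socket.getaddrinfo(hostname, None, socket.AF_INET)}

for addr in sorted(addresses):
    ip = ipaddress.ip_address(addr)
    if ip in APIPA_NET:
        print(f"{addr}: APIPA fallback - run 'ipconfig /release' then 'ipconfig /renew'")
    elif ip.is_private:
        print(f"{addr}: looks like a proper LAN address - point the tablet here")
    else:
        print(f"{addr}: public or unusual address")
```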
Gotcha 3: Which Android App Actually Works
Not all Android LLM clients can hit the /v1 endpoint correctly.

- Anything LLM: ✅ Works perfectly. Streams responses, handles the endpoint correctly, stays stable.
- LMSA (from Play Store): ❌ Failed to connect. The app couldn’t resolve the local network endpoint even with manual IP entry.
- Ollama Android client: ❌ Didn’t recognize the LM Studio server endpoint format.
Anything LLM is the client that works. Download it from the Play Store, set the BASE URL to http://192.168.x.x:1234/v1, select the model, and you're done. Until the other clients sort out their endpoint compatibility, it will stay your best option.
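Under the hood, Anything LLM is simply talking to LM Studio's OpenAI-compatible API. If you ever want to script against the same server from any machine on the network, a minimal request looks like the sketch below. The IP and model name are placeholders; check /v1/models for the exact model identifier your server reports.

```python
# chat_request.py - send one chat turn to LM Studio's OpenAI-compatible endpoint.
import json
import urllib.request

PC_IP = "192.168.0.100"  # placeholder: your PC's LAN address
MODEL = "qwen-9b"        # placeholder: use the id reported by /v1/models

payload = {
    "model": MODEL,
    "messages": [{"role": "user", "content": "Summarise why prefill is faster than decode."}],
    "max_tokens": 200,
}

req = urllib.request.Request(
    f"http://{PC_IP}:1234/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req, timeout=120) as resp:
    reply = json.load(resp)
    print(reply["choices"][0]["message"]["content"])
```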
How Speed Actually Works: Prefill vs Decode
Before you look at the benchmarks, understand what you’re actually measuring. LLM speed has two completely different components that behave in opposite ways.

Prefill (prompt processing): The GPU reads your entire input prompt and processes it in parallel. It’s brutally fast — Qwen 3.5 9B on the RTX 5070 prefills at roughly 900+ tokens per second. A 1,000-token prompt takes about 1 second to process.
Decode (response generation): The GPU generates one token at a time, sequentially. This is the slow part — about 65 milliseconds per token. A 1,000-token response takes about 65 seconds.
This is why response time scales linearly with response length, and why the network speed doesn’t matter much. Your WiFi latency is measured in milliseconds. Decode latency is measured in tens of milliseconds per token. Once prefill is done (usually instantly from your perspective), the GPU’s decode speed is the only thing that matters.

Short prompt, short answer? You’re GPU-bound by decode, not network. Long prompt, long answer? Same thing — decode is the bottleneck.
Real-World Performance: Four Benchmarks
Here’s what you actually get:
| Test Case | Prompt tokens | Output tokens | Decode t/s | Total time | Takeaway |
| --- | --- | --- | --- | --- | --- |
| Ultra-short (“Hello”) | 1,101 | 81 | 15.20 | 6.5s | Even single-word inputs take ~5s to generate responses |
| Short (quick question) | 634 | 228 | 15.69 | 15.3s | Prefill is negligible, decode dominates |
| Medium (typical query) | 599 | 797 | 15.23 | 53.0s | Linear scaling with response length |
| Longer (complex task) | 1,228 | 1,325 | 15.43 | 86.9s | Consistent 15 t/s, regardless of prompt size |
The key insight: Decode speed is constant at 15-16 tokens/second. Decode time scales linearly with response length. This is not a bug or a limitation — this is how autoregressive LLMs work. A 1,000-token response will take roughly 65 seconds. A 100-token response will take roughly 6 seconds. The math is straightforward once you understand it.
This is also why your network performance is almost irrelevant here. Prefill happens so fast that WiFi latency doesn’t register. Decode is dominated by GPU processing time, not network transfer.
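If you want to sanity-check the numbers yourself, the whole latency model fits in a few lines. This is a back-of-the-envelope sketch using the rates quoted above (roughly 900 t/s prefill, roughly 15.4 t/s decode); real runs will land a little either side.

```python
# latency_estimate.py - estimate total response time from the prefill/decode model.
PREFILL_TPS = 900.0  # approximate prompt-processing rate observed on the RTX 5070
DECODE_TPS = 15.4    # approximate generation rate observed on the RTX 5070

def estimate_seconds(prompt_tokens: int, output_tokens: int) -> float:
    """Prefill is parallel and fast; decode is sequential and dominates."""
    return prompt_tokens / PREFILL_TPS + output_tokens / DECODE_TPS

# The four benchmark cases from the table above:
cases = [("Ultra-short", 1101, 81), ("Short", 634, 228),
         ("Medium", 599, 797), ("Longer", 1228, 1325)]
for name, prompt, output in cases:
    print(f"{name:12s} ~{estimate_seconds(prompt, output):5.1f}s estimated")
```

Run with the token counts from the table, those two rates reproduce the measured totals to within about a second, which is exactly the point: decode throughput is the whole story.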
The Pad 8 Pro as the Interface
The Xiaomi Pad 8 Pro becomes genuinely interesting once you realize it’s not trying to run the model — it’s showing you the results. The 11.2-inch 3.2K display, the Snapdragon 8 Elite processor for responsive UI, the 9200mAh battery for all-day use — these specs exist to give you a comfortable interface to a remote AI.

Anything LLM on the Pad 8 Pro streams the response in real-time. You see tokens appearing as they’re generated on the PC. The UI is responsive. The experience is fluid. At RM2,699 for the 8GB config, you’re getting a legitimately capable AI interface device — not because it runs the model, but because it displays the results beautifully while consuming minimal power.

Compare this to running 4B locally on the Pad 8 Pro. You get instant responses, offline capability, and the satisfaction of on-device computation. But you’re capped at 4B quality. Stream 9B from your PC, and you get significantly better reasoning, better structured output, and better handling of complex prompts — all for the cost of being within WiFi range of your PC.
When to Use What
| Scenario | Best Choice | Why |
| --- | --- | --- |
| Quick factual lookup (offline) | Pad 8 Pro 2B locally | Instant, offline, sufficient |
| Daily driver (at home) | Pad 8 Pro 4B locally | 11 t/s is usable, batteries included |
| Complex reasoning (at home) | PC 9B via Anything LLM | 15 t/s, better quality, home WiFi |
| Complex reasoning (on the go) | PC 9B via Tailscale | Latency penalty, but best quality available |
| No home PC | Pad 8 Pro 4B or cloud API | Accept the ceiling, or pay for cloud |
The Sovereignty Story
This entire setup exists because you own the hardware and the data. There is no API key, no usage tracking, no third-party terms of service. You run what you want, when you want, how you want.
A cloud API gives you reliability and scaling. But you pay for every token, and every query is logged somewhere. A local 4B model on the Pad 8 Pro gives you offline capability, but you’re capped at that quality ceiling. A hybrid setup — local 4B for quick queries, home-PC 9B for serious work — gives you the best of both worlds.

The PC-as-server approach scales further if you decide to invest: upgrade the GPU in a few years, run multiple models in parallel, or add other AI workloads (image generation, video processing, voice synthesis). Your PC becomes a genuine local AI infrastructure investment that gets better as the models improve.
For Malaysian buyers specifically, this matters. Cloud API costs accumulate if you use AI daily. A gaming PC does add to the electricity bill, but the hardware itself is a one-time spend. If you already own a decent gaming PC and an Android tablet, this setup costs you nothing extra and gets you 9B-class reasoning on your own network, whenever you want it.
For more information on private AI, check out our dedicated article here:
Conclusion
The Xiaomi Pad 8 Pro alone hits its ceiling at 4B. Combined with a home PC and LM Studio, it becomes a window into 9B-class reasoning that rivals or beats cloud APIs in speed and beats them outright on cost and privacy.

The setup friction is real but solvable — port blocking, IP address mismatches, and client compatibility are all documented here. Once through that, you have a hybrid local/remote AI system that works at home and follows you on the road.
For readers with a gaming PC gathering dust and a Pad 8 Pro in hand, this is the path to genuinely useful on-demand AI, entirely under your control.
Help support us!
If you are interested in the Xiaomi Pad 8 Pro, we would really appreciate it if you purchase it via the links below. The affiliate links won't cost you any extra, but they will be a great help in keeping the lights on here at HelloExpress.
