TurboQuant Tested: Is Google’s “Lossless” AI Compression Really Lossless?
TLDR:
- YouTuber xCreate tests TurboQuant live — finds 4-bit turbo performs similarly to traditional 4-bit quantization, not “lossless” as claimed
- 9-bit quantization is genuinely near-lossless (99.6% token accuracy, 0.09 MAE)
- 2-bit turbo completely fails; model cannot compile or produce coherent output
- Google/NYU paper’s spectacular results may be model-specific or optimized for certain benchmarks
- Academic drama: Rabbit R claims TurboQuant used their C++ implementation without credit
The Hype vs Reality

When Google and NYU researchers published TurboQuant, the AI community got excited. The paper promised “lossless” quantization — running massive AI models at 2-4 bit precision with virtually no quality loss. The implications are huge: a 70B parameter model could run on a MacBook instead of a server rack.
But YouTuber xCreate actually tested it. The results? Not quite the revolution the paper claimed.
The Testing Method
xCreate ran TurboQuant through real-world tests on two models: MiniMax and Llama 1B. Rather than just benchmark numbers, he actually prompted the model to write a complex 3D graphics program — a spaceship game with context windows and game logic. Then he compared outputs at different precision levels.
Memory usage tells one story:
| Precision | Memory Used |
| --- | --- |
| Full 16-bit | 1.92 GiB |
| 9-bit | 1.1 GiB (43% reduction) |
| 4.5-bit | 0.5 GiB (74% reduction) |
| 4-bit turbo | 0.51 GiB (73% reduction) |
The memory savings are real — but so are the quality losses.
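As a rough sanity check on those numbers, the weight memory at each bit width can be estimated. The group size and per-group scale overhead below are generic assumptions, not TurboQuant's actual storage layout:

```python
# Rough memory estimate for quantized weights at different bit widths.
# Group size and scale width are illustrative assumptions, not the
# TurboQuant implementation's actual format.
def weight_memory_gib(n_params: float, bits: float, group_size: int = 64,
                      scale_bits: int = 16) -> float:
    """Approximate weight memory: packed weights plus per-group scales."""
    weight_bits = n_params * bits
    scale_overhead = (n_params / group_size) * scale_bits  # one scale per group
    return (weight_bits + scale_overhead) / 8 / 1024**3

n = 1.0e9  # ~1B parameters, similar in scale to the Llama 1B test
for bits in (16, 9, 4.5, 4):
    print(f"{bits}-bit: {weight_memory_gib(n, bits):.2f} GiB")
```

For a ~1B-parameter model this lands close to the measured figures above, which suggests the reported savings are mostly just the arithmetic of smaller weights plus a modest scale overhead.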
Token Accuracy Results
Here’s where TurboQuant’s claims start falling apart. The critical metric is “top token accuracy” — how often the quantized model picks the exact same next word as the full-precision version:
| Quantization | Token Accuracy | Perplexity |
| --- | --- | --- |
| Full 16-bit | 100% | 0.07 |
| 9-bit | 99.6% | Very low |
| 4-bit affine (traditional) | 97.2% | 1.117 |
| 4-bit turbo (1-pass) | ~97% | 1.1171 |
| 3-bit mixed precision | 95% | Higher |
| 2-bit turbo | FAILED | N/A |
The 4-bit turbo version performs nearly identically to traditional 4-bit affine quantization. It’s not lossless. At 9-bit, however, quantization is genuinely near-lossless — that’s the sweet spot.
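The metric itself is straightforward to compute if you have logits from both models at each step. A toy sketch with synthetic logits (the noise level here is arbitrary, not calibrated to any real quantizer, and the numbers it prints are not the video's results):

```python
import numpy as np

# Toy top-token agreement check between a "full-precision" and a
# "quantized" model. Logits are random stand-ins, not real model outputs.
rng = np.random.default_rng(0)
vocab, steps = 1000, 500
full_logits = rng.normal(size=(steps, vocab))
# Model quantization error modeled as small additive noise on the logits.
quant_logits = full_logits + rng.normal(scale=0.1, size=(steps, vocab))

# Fraction of steps where both models pick the same next token.
agreement = np.mean(full_logits.argmax(axis=1) == quant_logits.argmax(axis=1))
print(f"top-token accuracy: {agreement:.1%}")
```

The appeal of this metric over perplexity is that it directly answers the practical question: how often would the compressed model have generated a different word?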
The 2-Bit Disaster
Going below 3 bits breaks everything. At 2-bit precision, xCreate's model couldn't even compile: the Metal shaders failed to build, the context window collapsed, and the output was completely incoherent. The tokenizer started emitting garbage tokens that didn't correspond to any valid text.
This isn’t surprising — you’re cramming 16 bits of precision into 2 bits. But TurboQuant’s paper implied even 2-bit would work “losslessly.” That claim doesn’t hold up.
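The intuition is easy to demonstrate with plain uniform affine quantization on synthetic weights (this is a generic scheme for illustration, not TurboQuant's):

```python
import numpy as np

# Illustrates why very low bit widths break: uniform affine quantization
# of a synthetic Gaussian weight tensor at decreasing precision.
def quantize_dequantize(w, bits):
    levels = 2 ** bits
    lo, hi = w.min(), w.max()
    scale = (hi - lo) / (levels - 1)
    return np.round((w - lo) / scale) * scale + lo

rng = np.random.default_rng(1)
w = rng.normal(scale=0.02, size=100_000)  # small-magnitude weights

errs = {bits: np.abs(w - quantize_dequantize(w, bits)).mean()
        for bits in (9, 4, 2)}
for bits, err in errs.items():
    print(f"{bits}-bit mean abs error: {err:.6f}")
```

Each bit removed roughly doubles the rounding error, so the jump from 4 bits (16 levels) to 2 bits (4 levels) is brutal: every weight gets snapped to one of four values, and the accumulated error swamps the signal.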
The Second Pass Problem
TurboQuant’s paper describes a two-pass approach: one pass for the main quantization, a second pass using a quantized Johnson-Lindenstrauss transform for error correction. This is where the paper’s best results come from.
But xCreate found that implementing the full two-pass method actually hurt performance. With the second pass enabled, token accuracy dropped to 88.2% and perplexity spiked to 1.5 — worse than the single-pass version.
The open-source implementations currently available (in MLX LM and MLX VLM) only implement the first pass anyway. So even if the paper’s two-pass method works in specific scenarios, nobody can actually use it yet.
The Academic Drama
TurboQuant’s GitHub page features prominent links to “A Recipe for Pretraining Data Efficiency” — which Rabbit R interprets as a diss. More seriously, Rabbit R claims the TurboQuant team contacted them in January 2025 asking for help debugging their C++ implementation, then transferred that code to Python without giving credit.
TurboQuant responded that Rabbit R’s guarantees are “suboptimal.” Academic discourse gets spicy on GitHub.
Our Take
TurboQuant is genuinely useful — but it’s not the revolution the headlines suggest. The 4-bit implementation is solid, practical, and about equivalent to existing 4-bit quantization methods. If you’re running local AI models on Apple Silicon, it’s worth using, but don’t expect “lossless” results.
The real breakthrough might be in the 9-bit range. That’s where xCreate saw genuinely near-lossless performance: 99.6% token accuracy with half the memory usage. For users who need more precision than 4-bit but can’t run full 16-bit models, 9-bit could be the sweet spot.
As for the paper’s spectacular claims — treat them as aspirational. Real-world results on consumer hardware tell a more modest story. The research is legitimate, but the gap between “works in the paper” and “works on your MacBook” remains significant.
Sources:
- YouTube: How to Run TurboQuant – “Lossless” Quantization for Local AI TESTED by xCreate
- Google TurboQuant: How Extreme Compression Makes AI Models 8x Faster (Helloexpress)
- TurboQuant GitHub / Paper