TL;DR
- New Method: Google published TurboQuant, a KV cache compression algorithm that reduces LLM memory usage by 6x with zero accuracy loss.
- How It Works: The data-oblivious algorithm requires no calibration or fine-tuning, delivering up to 8x attention speedup on H100 GPUs.
- Community Adoption: Independent developers have already built working implementations in Triton, MLX, and llama.cpp despite no official code release from Google.
- Competition: Nvidia’s rival KVTC method achieves 20x compression but requires per-model calibration; both methods debut at ICLR 2026.
Running a 70-billion-parameter large language model for 512 concurrent users can consume 512 GB of cache memory alone, nearly four times the memory needed for the model weights themselves. Google on March 25 published TurboQuant, a KV cache compression algorithm that shrinks that footprint by 6x with zero accuracy loss. Independent developers are already building working implementations from the paper alone, even though Google has not yet released any official code or integration libraries.
KV cache memory footprint scales linearly with context length, creating one of the largest bottlenecks in LLM deployment. Google’s TurboQuant compression method addresses this by compressing the cache to 3 bits per channel using a data-oblivious algorithm that requires no calibration, delivering up to 8x attention speedup on H100 GPUs. That 8x figure applies specifically to attention logit computation against a JAX baseline, not to end-to-end inference throughput.
How TurboQuant Works
TurboQuant combines two companion techniques: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant converts vectors from Cartesian to polar coordinates, eliminating per-block normalization constants that typically add 1 to 2 extra bits of overhead in traditional vector quantization. QJL adds a 1-bit error-correction step that reduces each residual vector to a single sign bit, correcting bias in inner product estimates.
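To make the QJL step concrete, here is a minimal numpy sketch of the sign-bit inner-product estimator it builds on: each key collapses to one sign bit per random projection plus a stored norm, while the query stays in full precision. The rescaling relies on a standard identity for Gaussian projections (the expectation of sign(<g,k>)*<g,q> equals sqrt(2/pi)*<q,k>/||k||); the dimensions and names below are illustrative, not Google's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 4096          # vector dimension, number of random projections

# Random Gaussian JL sketch matrix, shared by keys and queries.
S = rng.standard_normal((m, d))

k = rng.standard_normal(d)   # a "key" vector to be compressed
q = rng.standard_normal(d)   # a "query" vector, kept in full precision

# Compress the key to one sign bit per projection plus its norm:
# m bits and a single scalar instead of d full-precision values.
k_bits = np.sign(S @ k)
k_norm = np.linalg.norm(k)

# Unbiased inner-product estimate: rescale by sqrt(pi/2) * ||k|| and
# average over the m projections, per the Gaussian identity above.
est = np.sqrt(np.pi / 2) * k_norm * np.mean(k_bits * (S @ q))

print(f"true <q,k> = {q @ k:+.3f}, sign-bit estimate = {est:+.3f}")
```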
An outlier treatment strategy allocates higher precision (3 bits) to outlier channels and lower precision (2 bits) to non-outliers, enabling effective bit-rates of 2.5 or 3.5 bits.
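The fractional effective rates come from averaging the per-channel bit widths. A toy numpy sketch, using an illustrative magnitude-based outlier rule and an even split (which is what yields the 2.5-bit rate); the paper's actual outlier criterion and channel split may differ:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 128
x = rng.standard_normal(d)
x[rng.choice(d, size=8, replace=False)] *= 10.0   # plant a few outlier channels

# Flag the top half of channels by magnitude as "outliers" (illustrative rule).
outlier = np.abs(x) >= np.median(np.abs(x))

# Mixed precision: 3 bits for outlier channels, 2 bits for the rest.
bits = np.where(outlier, 3, 2)
print(f"effective bit-rate: {bits.mean():.2f} bits/channel")   # 2.50 for an even split
```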
By applying a random rotation to input vectors, TurboQuant induces a concentrated Beta distribution on each coordinate regardless of the original data. The algorithm operates in a data-oblivious manner, requiring no training, fine-tuning, or calibration on specific datasets. Google’s researchers proved it operates near known theoretical lower bounds for quantization distortion, coming within a factor of approximately 2.7 of the information-theoretic limit.
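The rotation trick can be sketched in a few lines of numpy: draw one random orthogonal matrix (here via QR of a Gaussian matrix; the paper's transform may be a faster structured rotation), rotate each normalized vector, and quantize every coordinate on a fixed uniform grid sized to the 1/sqrt(d) concentration the rotation guarantees. The 3-bit grid and clipping range below are illustrative assumptions, not TurboQuant's exact quantizer.

```python
import numpy as np

rng = np.random.default_rng(1)
d, bits = 128, 3
levels = 2 ** bits

# Data-oblivious preprocessing: one random orthogonal rotation, drawn once,
# independently of any data (QR of a Gaussian matrix yields a Haar-random Q).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# After rotation, each coordinate of a unit vector concentrates around zero
# at scale ~1/sqrt(d), so a fixed grid spanning a few standard deviations
# covers any input -- no calibration pass required.
hi = 4.0 / np.sqrt(d)
step = 2 * hi / (levels - 1)

def quantize(x):
    norm = np.linalg.norm(x)
    xr = Q @ (x / norm)                                  # rotate the unit vector
    codes = np.clip(np.round((xr + hi) / step), 0, levels - 1).astype(np.uint8)
    return codes, norm                                   # 3-bit codes + one scalar

def dequantize(codes, norm):
    return norm * (Q.T @ (codes * step - hi))            # undo the rotation

x = rng.standard_normal(d) * rng.exponential(1.0, size=d)  # heavy-tailed input
codes, norm = quantize(x)
x_hat = dequantize(codes, norm)
print("relative L2 error at 3 bits:",
      np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```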
At 1-bit width, TurboQuant’s distortion sits only a factor of roughly 1.45 from optimal. At 3.5 bits per channel, quality remains neutral; at 2.5 bits, degradation is marginal.
Operators running diverse model portfolios benefit directly from this calibration-free design, since standard quantization methods require per-model tuning passes that multiply engineering overhead as organizations deploy dozens of specialized models. TurboQuant eliminates that overhead, positioning it as a drop-in optimization that works across architectures without per-model investment.
Google benchmarked TurboQuant against KIVI, the standard baseline for KV cache quantization published at ICML 2024, which introduced asymmetric 2-bit quantization with 2.6x memory reduction. On the Needle-In-A-Haystack benchmark, TurboQuant matched full-precision performance up to 104,000 tokens under 4x compression. Google evaluated across LongBench, ZeroSCROLLS, RULER, and L-Eval using Gemma, Mistral, and Llama-3.1-8B-Instruct models.
Jumping from KIVI’s 2.6x compression to TurboQuant’s 6x represents a generational improvement, yet the benchmark scope introduces uncertainty. According to the paper’s evaluation tables, all tested models top out at roughly 8 billion parameters, leaving an open question about whether the guarantees hold at 70B or 405B scale where KV cache sizes become truly prohibitive and compression savings would matter most to production operators.
Google’s paper first appeared on arXiv in April 2025 and is being featured ahead of its ICLR 2026 presentation in late April 2026. Its QJL companion was published at AAAI 2025, while PolarQuant is set for presentation at AISTATS 2026. Research scientist Amir Zandieh and Vahab Mirrokni, a VP and Google Fellow, led the work, with collaborators from KAIST and NYU.
Developers Build It Without Google’s Code
Despite the absence of official code, independent developers have already produced working implementations that validate the paper’s claims. One developer built a custom Triton kernel in PyTorch and tested it with Gemma 3 4B on an RTX 4090: at just 2 bits per value, the quantized model produced byte-for-byte identical responses to the full-precision baseline, suggesting TurboQuant’s theoretical guarantees hold in practice for smaller models.
A separate developer got TurboQuant running on Apple Silicon via MLX with a 35B model, scoring 6 out of 6 on needle-in-a-haystack tests at every quantization level. In the llama.cpp community, three developers are working on C and CUDA implementations, with one reporting 18 out of 18 tests passing and compression ratios matching the paper’s claims.
Adoption at this pace, before any official release, is unusual for a research paper. Implementations spanning Triton, MLX, and llama.cpp reflect both the clarity of TurboQuant’s mathematical formulation and the urgency of KV cache optimization as a deployment bottleneck.
Reproducing the algorithm is not straightforward. One early implementer found the QJL error-correction component tricky to get right, noting that a naive approach produced garbage output. Without proper implementation of QJL’s bias correction for inner product estimates, quantization errors compound and outputs become unusable. Google has not released official TurboQuant code, and it remains absent from vLLM, llama.cpp, Ollama, and every major serving framework.
Beyond LLM inference, TurboQuant also targets vector search, where the high-dimensional vectors used for retrieval-augmented generation and similarity search face the same memory pressure. Indexing time drops to virtually zero (0.0013 seconds for 1,536-dimensional vectors versus 239.75 seconds for product quantization), and recall on GloVe outperforms product quantization and RaBitQ baselines. Google frames TurboQuant as a unified compression method for both KV cache and vector search, though the lack of a code release limits adoption in either domain.
Nvidia’s KVTC Offers a Different Trade-Off
TurboQuant is not the only KV cache compression method heading to ICLR 2026. Nvidia’s KVTC achieves 20x compression with less than 1 percentage point accuracy penalty, tested on models from 1.5B to 70B parameters, a wider range than TurboQuant’s benchmarks on models up to roughly 8B.
“Effective KV cache management becomes critical, as idle caches must be quickly offloaded from GPU memory to accommodate other users, and quickly restored for resumed conversations. These infrastructure costs are now reflected in commercial pricing (e.g., as ‘prompt caching’) with additional charges for caching.”
Adrian Lancucki, Nvidia researcher (via VentureBeat)
KVTC takes a fundamentally different approach, using PCA-based decorrelation and entropy coding that borrows concepts from JPEG compression. Unlike TurboQuant’s data-oblivious design, KVTC requires a one-time calibration step per model to compute a PCA alignment matrix offline. In return, it reduces time-to-first-token by up to 8x on an 8,000-token prompt, dropping from roughly 3 seconds to 380 milliseconds on H100 hardware.
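The design difference is easiest to see in code. Below is a toy numpy sketch of the recipe described for KVTC: fit a PCA basis offline on sample activations (the one-time, per-model calibration step), then decorrelate, truncate, and coarsely quantize at inference time. The synthetic calibration data, component count, and quantizer are simplified stand-ins, and Nvidia's pipeline adds entropy coding on top.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_calib, keep = 128, 10_000, 32

# One-time offline calibration: fit a PCA basis on sample KV states from the
# target model. This per-model step is exactly what TurboQuant avoids.
latent = rng.standard_normal((n_calib, 16))              # synthetic correlated data
calib = latent @ rng.standard_normal((16, d)) + 0.05 * rng.standard_normal((n_calib, d))
mean = calib.mean(axis=0)
_, _, Vt = np.linalg.svd(calib - mean, full_matrices=False)  # rows = principal axes

def compress(x, bits=4):
    """Decorrelate with the calibrated basis, truncate, then quantize coarsely."""
    z = Vt[:keep] @ (x - mean)
    scale = np.abs(z).max() / (2 ** (bits - 1) - 1)
    return np.round(z / scale).astype(np.int8), scale

def decompress(codes, scale):
    return Vt[:keep].T @ (codes * scale) + mean

x = calib[0]                       # a vector drawn from the calibrated distribution
codes, scale = compress(x)
x_hat = decompress(codes, scale)
print("relative reconstruction error:",
      np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```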
Earlier baselines like KIVI and GEAR suffered severe accuracy degradation at just 5x compression on long-context tasks, positioning both TurboQuant and KVTC as meaningful advances over prior methods.
Having two competing compression standards debut at the same conference signals that KV cache optimization is maturing from a research curiosity into a production infrastructure layer. KIVI, which shipped with HuggingFace Transformers integration, had been the standard approach since ICML 2024, but its 2.6x compression ceiling falls well short of what both newer methods achieve.
Lancucki predicted “the emergence of a dedicated, standardized compression layer is probable” given structural similarities across model architectures. For cloud providers and LLM operators, the choice depends on deployment constraints: TurboQuant offers simplicity, with no calibration and a mathematically proven distortion bound, while KVTC delivers substantially greater raw compression at 20x versus 6x, validated across a broader range of model sizes. Nvidia launched Dynamo, its data-center-scale inference engine, in March 2026, and KVTC is being integrated into it with vLLM compatibility; TurboQuant’s path to production runs through community developers racing to fill the gap Google left by withholding its code.