Google’s TurboQuant Algorithm Slashes LLM Memory Use by 6x


TL;DR

  • New Method: Google published TurboQuant, a KV cache compression algorithm that reduces LLM KV cache memory by 6x with zero accuracy loss.
  • How It Works: The data-oblivious algorithm requires no calibration or fine-tuning, delivering up to 8x attention speedup on H100 GPUs.
  • Community Adoption: Independent developers have already built working implementations in Triton, MLX, and llama.cpp despite no official code release from Google.
  • Competition: Nvidia’s rival KVTC method achieves 20x compression but requires per-model calibration; both methods debut at ICLR 2026.

Running a 70-billion-parameter large language model for 512 concurrent users can consume 512 GB of KV cache memory alone, nearly four times the memory needed for the model weights themselves. On March 25, Google published TurboQuant, a KV cache compression algorithm that shrinks that footprint by 6x with zero accuracy loss. Independent developers are already building working implementations from the paper alone, even though Google has not yet released any official code or integration libraries.

KV cache memory footprint scales linearly with context length, creating one of the largest bottlenecks in LLM deployment. TurboQuant addresses this by compressing the cache to 3 bits per channel with a data-oblivious algorithm that requires no calibration, delivering up to 8x attention speedup on H100 GPUs. That 8x figure applies specifically to attention logit computation against a JAX baseline, not to end-to-end inference throughput.
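
To make the memory arithmetic concrete, here is a back-of-the-envelope sizing sketch. The layer count, KV head count, head dimension, and context length below are illustrative values for a 70B-class model with grouped-query attention, not figures from the paper:

```python
# Back-of-the-envelope KV cache sizing. All architecture numbers below are
# illustrative (a 70B-class config with grouped-query attention); the paper
# and the article may assume different settings.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bits_per_value):
    """Bytes needed to cache keys and values for `batch` sequences of `seq_len` tokens."""
    values_per_token = 2 * n_layers * n_kv_heads * head_dim  # 2x for K and V
    return batch * seq_len * values_per_token * bits_per_value / 8

cfg = dict(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=3072, batch=512)

fp16 = kv_cache_bytes(**cfg, bits_per_value=16)
q3 = kv_cache_bytes(**cfg, bits_per_value=3)

print(f"fp16 cache: {fp16 / 1e9:.0f} GB")   # ~515 GB, the article's ballpark
print(f"3-bit cache: {q3 / 1e9:.0f} GB")    # ~97 GB
```

Note that the raw 16-to-3-bit ratio is about 5.3x; the article's 6x figure presumably reflects the full scheme's accounting rather than bit widths alone.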

How TurboQuant Works

TurboQuant combines two companion techniques: PolarQuant and Quantized Johnson-Lindenstrauss (QJL). PolarQuant converts vectors from Cartesian to polar coordinates, eliminating per-block normalization constants that typically add 1 to 2 extra bits of overhead in traditional vector quantization. QJL adds a 1-bit error-correction step that reduces each residual vector to a single sign bit, correcting bias in inner product estimates.
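
The sign-bit idea behind QJL can be sketched in a few lines of numpy, using the standard bias-correction identity for Gaussian projections: for a Gaussian row s, E[⟨s,q⟩·sign(⟨s,k⟩)] = √(2/π)·⟨q,k⟩/‖k‖. The dimensions below are arbitrary, and in TurboQuant this step is applied to the residual left after the first quantization stage rather than to raw keys:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512                   # original dim, number of projections (illustrative)
S = rng.standard_normal((m, d))   # shared random Gaussian projection

def qjl_encode(k):
    """Compress k to m sign bits plus a single scalar norm."""
    return np.sign(S @ k), np.linalg.norm(k)

def qjl_inner_product(q, k_bits, k_norm):
    """Bias-corrected estimate of <q, k>: for a Gaussian row s,
    E[<s, q> * sign(<s, k>)] = sqrt(2/pi) * <q, k> / ||k||,
    so scaling by sqrt(pi/2) * ||k|| / m recovers <q, k> in expectation."""
    return np.sqrt(np.pi / 2) * k_norm / m * ((S @ q) @ k_bits)

q = rng.standard_normal(d)
k = rng.standard_normal(d)
bits, norm = qjl_encode(k)
print(f"true <q,k> = {q @ k:+.2f}, QJL estimate = {qjl_inner_product(q, bits, norm):+.2f}")
```

Storing one bit per projection plus one norm per vector is what lets the residual correction cost so little memory.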

An outlier treatment strategy allocates higher precision (3 bits) to outlier channels and lower precision (2 bits) to non-outliers, enabling effective bit-rates of 2.5 or 3.5 bits.
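
A minimal sketch of such a mixed-precision split, under simplifying assumptions: channels are ranked by mean magnitude, the top half are treated as outliers at 3 bits and the rest get 2 bits, giving the 2.5-bit effective rate. The even split and the plain uniform quantizer are illustrative choices, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic tokens x channels matrix with heavy-tailed channel scales.
X = rng.standard_normal((4096, 128)) * np.exp(rng.standard_normal(128))

def quantize_uniform(x, bits):
    """Plain per-channel uniform quantizer (illustrative, not the paper's)."""
    lo, hi = x.min(0), x.max(0)
    levels = 2 ** bits - 1
    q = np.round((x - lo) / (hi - lo) * levels)
    return q / levels * (hi - lo) + lo

# Rank channels by mean magnitude; the top half are treated as outliers.
order = np.argsort(-np.abs(X).mean(0))
outliers, rest = order[:64], order[64:]

Xq = np.empty_like(X)
Xq[:, outliers] = quantize_uniform(X[:, outliers], bits=3)  # high precision
Xq[:, rest] = quantize_uniform(X[:, rest], bits=2)          # low precision

eff_bits = (64 * 3 + 64 * 2) / 128  # = 2.5 bits per value on average
err = np.linalg.norm(X - Xq) / np.linalg.norm(X)
print(f"effective bit-rate: {eff_bits} bits, relative error: {err:.3f}")
```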

By applying a random rotation to input vectors, TurboQuant induces a concentrated Beta distribution on each coordinate regardless of the original data. The algorithm operates in a data-oblivious manner, requiring no training, fine-tuning, or calibration on specific datasets. Google’s researchers proved it operates near known theoretical lower bounds for quantization distortion, coming within a factor of approximately 2.7 of the information-theoretic limit.
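
The rotation step is easy to demonstrate empirically. The sketch below draws a Haar-random orthogonal matrix via QR decomposition, one standard construction; the paper may well use a cheaper structured rotation, but the effect is the same: after rotation, every coordinate of a unit vector follows the same concentrated distribution, no matter how skewed the input was.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 256

# Haar-random rotation via QR of a Gaussian matrix (a standard construction;
# fast structured rotations would serve the same purpose in practice).
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

# A worst-case input: all mass on one coordinate.
x = np.zeros(d)
x[0] = 1.0

y = Q @ x  # rotated vector, still unit norm
# Each coordinate of y is concentrated around 0 with standard deviation
# ~ 1/sqrt(d) (y_i^2 follows a Beta(1/2, (d-1)/2) distribution), so one
# fixed quantizer works for every input -- no per-input calibration needed.
print(f"max |coord| before: {np.abs(x).max():.3f}, after: {np.abs(y).max():.3f}")
print(f"coordinate std after rotation: {y.std():.4f} (~1/sqrt(d) = {1/np.sqrt(d):.4f})")
```

Because the rotated coordinates look the same for any input distribution, the quantizer can be fixed in advance, which is what makes the method data-oblivious.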