Researchers at Tsinghua University and Z.ai have developed IndexCache, a new technique that enhances the efficiency of sparse attention models, resulting in up to 1.82 times faster time-to-first-token and 1.48 times faster generation throughput at extended context lengths.
Large language models rely heavily on the self-attention mechanism, whose computational cost scales quadratically with sequence length. Sparse attention, exemplified by the DeepSeek Sparse Attention (DSA) architecture, addresses this by restricting attention to only the most relevant tokens, reducing computation and speeding up inference.
IndexCache addresses a bottleneck in the DSA indexer operation by capitalizing on the stability of selected tokens across consecutive transformer layers. By partitioning layers into full and shared categories, IndexCache efficiently caches important indices, reducing redundant computations and accelerating inference speeds.
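The caching idea described above can be illustrated with a minimal, hypothetical sketch: "full" layers run the indexer and pick their own top-k token indices, while "shared" layers reuse the most recently cached indices. The function names, the dot-product scoring, and the data shapes are illustrative assumptions, not the actual IndexCache implementation.

```python
# Hypothetical sketch of layer-wise index caching for sparse attention.
# "Full" layers recompute top-k indices via an indexer; "shared" layers
# reuse the cached indices from the most recent full layer.
import random

def indexer_scores(query, keys):
    # Toy dot-product relevance scores (stand-in for the DSA indexer).
    return [sum(q * k for q, k in zip(query, key)) for key in keys]

def select_topk(scores, k):
    # Indices of the k highest-scoring tokens (order not significant).
    return set(sorted(range(len(scores)), key=lambda i: scores[i])[-k:])

def sparse_indices_per_layer(queries, keys, full_layers, k):
    """queries/keys: per-layer toy tensors; full_layers: set of layer ids
    that recompute indices. Returns the index set used at each layer."""
    cached = None
    used = []
    for layer, (q, kmat) in enumerate(zip(queries, keys)):
        if layer in full_layers or cached is None:
            cached = select_topk(indexer_scores(q, kmat), k)  # recompute
        used.append(cached)  # shared layers reuse the cached indices
    return used

rng = random.Random(0)
n_layers, seq_len, dim, k = 4, 64, 8, 16
qs = [[rng.random() for _ in range(dim)] for _ in range(n_layers)]
ks = [[[rng.random() for _ in range(dim)] for _ in range(seq_len)]
      for _ in range(n_layers)]
used = sparse_indices_per_layer(qs, ks, full_layers={0, 2}, k=k)
# Layers 1 and 3 skip the indexer entirely, which is where the
# prefill and decode savings would come from in a real system.
```

In this toy setup, only layers 0 and 2 pay the indexer cost; layers 1 and 3 reuse the cache, trading a small approximation for fewer indexer passes.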
Empirical tests on the 30-billion-parameter GLM-4.7 Flash model showed substantial speedups: at a 200K-token context length, prefill latency dropped by 1.82 times and generation throughput improved by 1.48 times. These gains translate into tangible cost savings and a better user experience for enterprise applications.
Developers can adopt IndexCache either through a training-free approach built on a ‘greedy layer selection’ algorithm or through a training-aware version for custom model optimization. The technique boosts performance while preserving output quality, underscoring its practical value for real-world AI applications.
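One plausible reading of a training-free ‘greedy layer selection’ is sketched below: measure, on calibration data, how much each layer's selected indices overlap with those of the last layer that ran the indexer, and greedily mark a layer as "full" only when the overlap drops below a threshold. The overlap metric, the 0.7 threshold, and the function name are assumptions for illustration, not details from the source.

```python
# Hypothetical sketch of greedy layer selection for index caching:
# a layer becomes "shared" when its selected token indices overlap
# strongly enough with the current anchor (last full) layer's indices.

def greedy_layer_selection(per_layer_indices, threshold=0.9):
    """per_layer_indices: one set of selected token indices per layer,
    e.g. measured offline on calibration prompts (assumed input).
    Returns the set of layers that must run the full indexer."""
    full_layers = {0}  # the first layer always computes its own indices
    anchor = per_layer_indices[0]
    for layer in range(1, len(per_layer_indices)):
        cur = per_layer_indices[layer]
        overlap = len(anchor & cur) / max(len(cur), 1)
        if overlap < threshold:
            full_layers.add(layer)  # indices drifted: recompute here
            anchor = cur            # this layer becomes the new anchor
        # else: layer can reuse the anchor layer's cached indices
    return full_layers

# Layers 0-1 select similar tokens, layer 2 drifts, layer 3 matches 2.
sel = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9, 10}, {7, 8, 9, 11}]
print(greedy_layer_selection(sel, threshold=0.7))  # -> {0, 2}
```

The greedy pass is a single scan over layers, so the partition into full and shared layers can be chosen cheaply without any retraining.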
Source: VentureBeat