Researchers from the University of Maryland, Lawrence Livermore National Laboratory, Columbia University, and TogetherAI have developed a technique that significantly accelerates AI model inference. The method speeds up generation by as much as three times by building the capability directly into the model’s weights, without the separate draft model that speculative decoding requires.
Unlike traditional approaches that rely on additional inference infrastructure, the technique adds only a single special token to the model’s vocabulary. This enables multi-token prediction (MTP), in which a language model predicts several upcoming tokens in a single forward pass rather than one at a time, improving processing efficiency.
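The mechanism can be pictured with a toy sketch. The details below are assumptions for illustration: the prompt is padded with copies of one reserved mask token, and the model fills in all masked positions in a single call. The `fake_forward` function is a stand-in for a real transformer, and `<mask>` is a hypothetical name for the special token.

```python
# Hypothetical sketch of multi-token prediction (MTP) via one special
# token: extend the prompt with k mask tokens, then read back k
# predictions from a single forward pass.

MASK = "<mask>"  # the single reserved token added to the vocabulary


def build_mtp_input(prompt_tokens, k):
    """Append k mask tokens so one forward pass can predict k tokens."""
    return prompt_tokens + [MASK] * k


def fake_forward(tokens):
    """Stand-in for a transformer forward pass: returns one prediction
    per mask position (canned continuations, for illustration only)."""
    canned = ["quick", "brown", "fox"]
    return canned[: tokens.count(MASK)]


prompt = ["the"]
inputs = build_mtp_input(prompt, k=3)
predicted = fake_forward(inputs)  # three tokens from one call
```

A standard autoregressive model would need three forward passes to produce the same three tokens; here they all come from one.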
“The shift towards prioritizing single-user speed in AI workflows is crucial as complex reasoning models generate extensive chains of thought tokens, impacting overall serving efficiency,” said John Kirchenbauer, a computer science doctoral candidate at the University of Maryland and co-author of the research.
The team’s training technique uses a student-teacher scheme to prepare models for multi-token prediction: a teacher model evaluates the coherence of the token sequences the student generates, and the student learns to produce accurate, contextually relevant outputs.
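A minimal sketch of such a scheme, under assumptions not spelled out in the article: the teacher assigns a probability to each token the student proposes, and the training loss penalizes proposals the teacher considers unlikely. The `teacher_prob` lookup table is a hypothetical stand-in for a real teacher model.

```python
# Toy student-teacher distillation loss: negative log-likelihood of the
# student's multi-token proposal under the teacher, summed per token.
import math


def teacher_prob(context, token):
    """Stand-in teacher: probability that token coherently follows."""
    table = {("the", "quick"): 0.9, ("quick", "brown"): 0.8,
             ("brown", "fox"): 0.7}
    return table.get((context[-1], token), 0.05)  # low default prob


def distillation_loss(context, proposed):
    """Sum of -log p_teacher over the student's proposed tokens."""
    loss, ctx = 0.0, list(context)
    for tok in proposed:
        loss += -math.log(teacher_prob(ctx, tok))
        ctx.append(tok)  # teacher scores each token in running context
    return loss
```

Minimizing this loss pushes the student toward sequences the teacher would itself rate as coherent, which is the intuition behind the scheme described above.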
The practical implications of this research extend to industries deploying AI models across a range of tasks. Because the approach requires only adding a special token to an existing model, it can be adopted without significant architectural changes. In addition, an adaptive decoding strategy, ConfAdapt, balances generation speed against output quality.
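One plausible way such an adaptive strategy could work, sketched here as an assumption rather than the published mechanism of ConfAdapt: of the tokens predicted in a single pass, keep only the leading run whose confidence clears a threshold, and fall back to ordinary one-token decoding from the first low-confidence position.

```python
# Hedged sketch of a confidence-gated acceptance rule: accept predicted
# tokens while confidence stays above a threshold, then stop. The
# threshold value 0.7 is an illustrative choice, not from the paper.

def accept_tokens(predictions, confidences, threshold=0.7):
    """Return the longest confident prefix of the predicted tokens."""
    accepted = []
    for tok, conf in zip(predictions, confidences):
        if conf < threshold:
            break  # regenerate from here on the next forward pass
        accepted.append(tok)
    return accepted
```

Raising the threshold trades speed for quality: fewer multi-token predictions are accepted per pass, but each accepted token is one the model was more sure of.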
Experimental results demonstrated substantial speed improvements without compromising accuracy, with the models achieving up to a 3x speedup in inference tasks. This efficiency gain opens new opportunities for accelerating AI model performance across domains, from math problem-solving to creative writing and summarization.
The research team has made their trained models and framework code available for further exploration, anticipating simplified deployment processes for low-latency AI models in production environments.
Source: VentureBeat