Google is Making AI Training 28% Faster by Using SLMs as Teachers

Jan 6, 2025 - 21:52

Training large language models (LLMs) has become out of reach for most organizations. With costs running into millions and compute requirements that would make a supercomputer sweat, AI development has remained locked behind the doors of tech giants. But Google just flipped this story on its head with an approach so simple it makes you wonder why no one thought of it sooner: using smaller AI models as teachers.

How SALT works: A new approach to training AI models

In a recent research paper titled “A Little Help Goes a Long Way: Efficient LLM Training by Leveraging Small LMs,” Google Research and DeepMind introduced SALT (Small model Aided Large model Training), a method that challenges the traditional approach to training LLMs.

Why is this research significant? Currently, training large AI models is like trying to teach someone everything they need to know about a subject all at once – it is inefficient, expensive, and often restricted to organizations with massive computing resources. SALT takes a different path, introducing a two-stage training process that is both innovative and practical.

Breaking down how SALT actually works:

Stage 1: Knowledge Distillation

  • A smaller language model (SLM) acts as a teacher, sharing its understanding with the larger model
  • The smaller model focuses on transferring its “learned knowledge” through what researchers call “soft labels”
  • Think of it like a teaching assistant handling foundational concepts before a student moves to advanced topics
  • This stage is particularly effective in “easy” regions of learning – areas where the smaller model has strong predictive confidence
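To make “soft labels” concrete, here is a minimal Python sketch of the idea, not the paper's implementation. It assumes a toy three-token vocabulary and made-up logits; the function names (`softmax`, `distillation_loss`, `hard_label_loss`) and the `temperature` parameter are illustrative choices that do not come from the article.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw scores into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target_probs, predicted_probs, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, predicted_probs))

def hard_label_loss(student_logits, true_index):
    """Ordinary training signal: only the correct token gets probability 1."""
    one_hot = [1.0 if i == true_index else 0.0 for i in range(len(student_logits))]
    return cross_entropy(one_hot, softmax(student_logits))

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Soft-label signal: the student matches the teacher's full distribution."""
    teacher_soft = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    return cross_entropy(teacher_soft, student_probs)

# Hypothetical scores over a 3-token vocabulary (illustrative values only).
teacher_logits = [2.0, 1.0, 0.1]   # small teacher model
student_logits = [0.5, 0.4, 0.3]   # large student model, early in training

soft_labels = softmax(teacher_logits)
print("soft labels:", [round(p, 3) for p in soft_labels])
print("distillation loss:", round(distillation_loss(student_logits, teacher_logits), 4))
print("hard-label loss:", round(hard_label_loss(student_logits, true_index=0), 4))
```

The soft labels carry the teacher's confidence across every candidate token, whereas a hard label only says which single token was correct. That extra signal is what the first stage passes to the larger model.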

Stage 2: Self-Supervised Learning

  • The large model transitions to independent learning
  • It focuses on mastering complex patterns and challenging tasks
  • This is where the model develops capabilities beyond what its smaller “teacher” could provide
  • The transition between stages uses carefully designed strategies, including linear decay and linear ratio decay of the distillation loss weight

In non-technical terms, imagine the smaller AI model is like a helpful tutor who guides the larger model in the beginning stages of training. This tutor provides extra information along with their answers, indicating how confident they are about each answer. This extra information, known as the “soft labels,” helps the larger model learn more quickly and effectively.

Now, as the larger AI model becomes more capable, it needs to transition from relying on the tutor to learning independently. This is where “linear decay” and “linear ratio decay” come into play.

Think of these techniques as gradually reducing the tutor's influence over time:

  • Linear Decay: It is like slowly turning down the volume of the tutor's voice. The tutor's guidance becomes less prominent with each step, allowing the larger model to focus more on learning from the raw data itself.
  • Linear Ratio Decay: This is like adjusting the balance between the tutor's advice and the actual task at hand. As training progresses, the emphasis shifts more towards the original task, while the tutor's input becomes less dominant.

The goal of both techniques is to ensure a smooth transition for the larger AI model, preventing any sudden changes in its learning behavior.
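The two decay strategies can be sketched in a few lines of Python. The article describes them only by analogy, so everything below is one plausible reading rather than the paper's exact formulas: it assumes the total loss is a weighted mix `(1 - w) * lm_loss + w * kd_loss`, that linear decay shrinks `w` itself in a straight line, and that linear ratio decay shrinks the ratio `w / (1 - w)` in a straight line. The function names, step counts, and starting weight of 0.5 are all hypothetical.

```python
def linear_fraction(step, start_step, end_step):
    """Share of the transition window still remaining: 1.0 before it, 0.0 after."""
    if step <= start_step:
        return 1.0
    if step >= end_step:
        return 0.0
    return 1.0 - (step - start_step) / (end_step - start_step)

def linear_decay_weight(step, start_step, end_step, initial_weight=0.5):
    """Assumed reading: the distillation weight itself falls linearly to zero."""
    return initial_weight * linear_fraction(step, start_step, end_step)

def linear_ratio_decay_weight(step, start_step, end_step, initial_weight=0.5):
    """Assumed reading: the ratio w / (1 - w) falls linearly to zero.

    Requires initial_weight < 1 so the starting ratio is finite.
    """
    initial_ratio = initial_weight / (1.0 - initial_weight)
    ratio = initial_ratio * linear_fraction(step, start_step, end_step)
    return ratio / (1.0 + ratio)

def combined_loss(lm_loss, kd_loss, kd_weight):
    """Blend the ordinary language-modeling loss with the distillation loss."""
    return (1.0 - kd_weight) * lm_loss + kd_weight * kd_loss

# Hypothetical transition window: steps 100 to 200.
for step in (100, 125, 150, 175, 200):
    w_linear = linear_decay_weight(step, 100, 200)
    w_ratio = linear_ratio_decay_weight(step, 100, 200)
    print(step, round(w_linear, 3), round(w_ratio, 3))
```

Under these assumptions both schedules start at the same weight and reach zero at the same step, at which point the combined loss is simply the ordinary training loss. They differ only in the shape of the curve in between.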

The results are compelling. When Google researchers tested SALT using a 1.5-billion-parameter SLM to train a 2.8-billion-parameter LLM on the Pile dataset, they saw:

  • A 28% reduction in training time compared to traditional methods
  • Significant performance improvements after fine-tuning:
    • Math problem accuracy jumped to 34.87% (compared to 31.84% baseline)
    • Reading comprehension reached 67% accuracy (up from 63.7%)
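For readers who want to sanity-check these figures, the short Python sketch below works through the arithmetic. It uses only the numbers quoted above; the helper names are illustrative, and the speedup line assumes that “28% reduction in training time” means the run takes 72% as long as the baseline.

```python
def absolute_gain(baseline, new):
    """Improvement in percentage points."""
    return new - baseline

def relative_gain_percent(baseline, new):
    """Improvement as a percentage of the baseline score."""
    return (new - baseline) / baseline * 100.0

def speedup_from_time_reduction(reduction):
    """Throughput ratio implied by cutting training time by `reduction`."""
    return 1.0 / (1.0 - reduction)

# Figures quoted in the article.
math_baseline, math_salt = 31.84, 34.87
reading_baseline, reading_salt = 63.7, 67.0
time_reduction = 0.28

print("math gain (points):", round(absolute_gain(math_baseline, math_salt), 2))
print("math gain (relative %):", round(relative_gain_percent(math_baseline, math_salt), 2))
print("reading gain (points):", round(absolute_gain(reading_baseline, reading_salt), 2))
print("reading gain (relative %):", round(relative_gain_percent(reading_baseline, reading_salt), 2))
print("implied speedup:", round(speedup_from_time_reduction(time_reduction), 2))
```

The gains work out to about 3.03 percentage points on math and 3.3 on reading comprehension. Under the stated assumption, finishing in 72% of the time corresponds to a throughput ratio of roughly 1.39.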

But what makes SALT truly innovative is its theoretical framework. The researchers discovered that even a “weaker” teacher model can enhance the student's performance by achieving what they call a “favorable bias-variance trade-off.” In simpler terms, the smaller model helps the larger one learn fundamental patterns more efficiently, creating a stronger foundation for advanced learning.

Why SALT could reshape the AI development playing field

Remember when cloud computing transformed who could start a tech company? SALT might just do the same for AI development.

I have been following AI training innovations for years, and most breakthroughs have mainly benefited the tech giants. But SALT is different.

Here is what it could mean for the future:

For Organizations with Limited Resources:

  • You may no longer need massive computing infrastructure to develop capable AI models
  • Smaller research labs and companies could experiment with custom model development
  • The 28% reduction in training time can translate into lower computing costs
  • More importantly, you could start with modest computing resources and still achieve competitive results

For the AI Development Landscape:

  • More players could enter the field, leading to more diverse and specialized AI solutions
  • Universities and research institutions could run more experiments with their existing resources
  • The barrier to entry for AI research drops significantly
  • We might see new applications in fields that previously could not afford AI development

What this means for the future

By using small models as teachers, we are not just making AI training more efficient – we are also fundamentally changing who gets to participate in AI development. The implications go far beyond just technical improvements.

Key takeaways to keep in mind:

  • A 28% reduction in training time can be the difference between starting an AI project and considering it out of reach
  • The performance improvements (34.87% on math, 67% on reading tasks) show that accessibility does not always mean compromising on quality
  • SALT's approach suggests that sometimes the best solutions come from rethinking fundamentals rather than just adding more computing power

What to watch for:

  1. Keep an eye on smaller organizations starting to develop custom AI models
  2. Watch for new applications in fields that previously could not afford AI development
  3. Look for innovations in how smaller models are used for specialized tasks

Remember: The real value of SALT is in how it might reshape who gets to innovate in AI. Whether you are running a research lab, managing a tech team, or just interested in AI development, this is the kind of breakthrough that could make your next big idea possible.

Maybe start thinking about that AI project you thought was out of reach. It might be more achievable than you imagined.

The post Google is Making AI Training 28% Faster by Using SLMs as Teachers appeared first on Unite.AI.