SE Radio 661: Sunil Mallya on Small Language Models
Sunil Mallya, co-founder and CTO of Flip AI, discusses small language models with host Brijesh Ammanath. They begin by considering the technical distinctions between SLMs and large language models.
LLMs excel at generating complex outputs across a wide range of natural language processing tasks, leveraging extensive training datasets and massive GPU clusters. However, this capability comes with high computational costs and efficiency concerns, particularly in applications specific to a given enterprise. To address this, many enterprises are turning to SLMs fine-tuned on domain-specific datasets. Their lower computational and memory requirements make SLMs suitable for real-time applications, and by focusing on specific domains, SLMs can achieve greater accuracy and better alignment with specialized terminology.
The selection of SLMs depends on specific application requirements. Additional influencing factors include the availability of training data, implementation complexity, and adaptability to changing information, allowing organizations to align their choices with operational needs and constraints.
This episode is sponsored by Codegate.
Transcript
Transcript brought to you by IEEE Software magazine and IEEE Computer Society. This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number.
Brijesh Ammanath 00:00:18 Welcome to Software Engineering Radio. I’m your host, Brijesh Ammanath. Today I will be discussing small language models with Sunil Mallya. Sunil is the co-founder and CTO of Flip AI. Prior to this, Sunil was the head of AWS’s NLP service, Comprehend, and helped start AWS pet. He’s the co-creator of AWS DeepRacer. He has over 25 patents filed in the areas of machine learning, reinforcement learning, NLP, and distributed systems. Sunil, welcome to Software Engineering Radio.
Sunil Mallya 00:00:49 Thank you Brijesh. So happy to be here and talk about this topic that’s near and dear to me.
Brijesh Ammanath 00:00:55 We have covered language models in some of our prior episodes, notably Episodes 648, 611, 610, and 582. Let’s start off, Sunil, by explaining what small language models are and how they differ from large language models, or LLMs.
Sunil Mallya 00:01:13 Yeah, this is a very interesting question because the term itself is sort of time-bound: what is large today can mean something else tomorrow as the underlying hardware gets better and bigger. So if I go back in time to around 2020, that’s when the term LLM starts to emerge, with people building billion-parameter models, and quickly after, OpenAI releases GPT-3, which is a 175-billion-parameter model, and that becomes the gold standard of what a true LLM means. But the number keeps changing, so I’d like to define SLMs in a slightly different way. Not in terms of number of parameters, but in practical terms. What that means is something you can run with resources that are easily accessible; you’re not constrained by GPU availability, and you don’t need the biggest or best GPU. To distill all of this, I’d say as of today, early 2025, it’s a model of up to about 10 billion parameters operating with a max of around 10K context length, which means you can give it an input of around 10K words maximum, and where the inference latency is around one second, so it’s pretty fast overall. I would define SLMs in that context, which is a lot more practical.
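Sunil’s working definition can be captured as a simple predicate. The thresholds (at most 10 billion parameters, roughly 10K context length, roughly one second of inference latency) come straight from his answer; the function name and interface are purely illustrative:

```python
def is_slm(params_billions: float,
           max_context_tokens: int,
           inference_latency_s: float) -> bool:
    """Sunil's practical, early-2025 rule of thumb for a 'small' language model."""
    return (params_billions <= 10
            and max_context_tokens <= 10_000
            and inference_latency_s <= 1.0)

print(is_slm(7, 8_000, 0.4))     # a typical 7B fine-tuned model qualifies
print(is_slm(175, 2_048, 3.0))   # a GPT-3-class model does not
```

As he notes, the cutoffs themselves will drift as hardware improves; the point is that the definition is operational, not a fixed parameter count.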
Brijesh Ammanath 00:02:33 Makes sense. And I believe as models become more memory-intensive, the definition itself will change. From my reading, GPT-4 actually has about 1.76 trillion parameters.
Sunil Mallya 00:02:46 Yeah, with some of these closed-source models it’s actually really hard when people talk about numbers, because nowadays people use a mixture-of-experts architecture. What that means is they put together a really large model that has specialized parts to it. Again, I’m trying to explain in very simple language here. When you run inference through these models, not all the parameters are activated, so you don’t necessarily need 1.7 trillion parameters’ worth of compute to actually run the model; you end up using some percentage of that. That makes it a little tricky when we say how big a model is, because what you really want to talk about is the number of active parameters, since that defines the underlying hardware and resources you need. If we go back to something like GPT-3, when I say 175 billion parameters, all 175 billion parameters are involved in giving you that final answer.
Brijesh Ammanath 00:03:49 Right. So if I understood that correctly, only a subset of the parameters would be used for the inference in any particular use case.
Sunil Mallya 00:03:57 In a mixture-of-experts model, yes, in that architecture. And for maybe the last year and a half, that’s been a very popular way for people to build and train, because training these really, really large models is extremely hard, but training a mixture of experts, which is a collection of relatively smaller models, is much easier, and then you put them together, so to speak. That’s an emerging trend even today: very popular and a very pragmatic way of training and then running inference.
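The active-parameter idea Sunil describes can be sketched in a few lines of Python. This is a deliberately tiny toy, not any real model’s architecture: each “expert” is reduced to a single scalar weight, and the expert count, top-k value, and gating scheme are all assumptions for illustration. The point it shows is that a router selects only the top-k experts per input, so only a fraction of the total expert parameters do any compute on a given inference:

```python
import math
import random

random.seed(0)

NUM_EXPERTS = 8   # total experts in the layer (illustrative assumption)
TOP_K = 2         # experts actually activated per input

# Each "expert" is reduced to a single scalar weight for illustration.
experts = [random.uniform(-1, 1) for _ in range(NUM_EXPERTS)]
router = [random.uniform(-1, 1) for _ in range(NUM_EXPERTS)]

def moe_forward(x: float):
    """Route x through only the top-k experts, weighted by a softmax gate."""
    logits = [x * w for w in router]
    # Pick the k experts the router scores highest for this input.
    top = sorted(range(NUM_EXPERTS), key=lambda i: logits[i])[-TOP_K:]
    exps = [math.exp(logits[i]) for i in top]
    total = sum(exps)
    gates = [e / total for e in exps]   # softmax over the chosen experts only
    # Only TOP_K of NUM_EXPERTS experts do any compute for this input.
    y = sum(g * (x * experts[i]) for g, i in zip(gates, top))
    return y, top

y, used = moe_forward(1.5)
print(f"activated {len(used)}/{NUM_EXPERTS} experts "
      f"(~{len(used)/NUM_EXPERTS:.0%} of expert parameters)")
```

This is why, as Sunil says, a “1.7 trillion parameter” mixture-of-experts model may use only a percentage of those parameters per inference: the hardware budget is set by the active parameters, not the total count.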
Brijesh Ammanath 00:04:34 Okay. And what differentiates an SLM from an expert model? Or are they the same?
[...]