SE Radio 673: Abhinav Kimothi on Retrieval-Augmented Generation
In this episode of Software Engineering Radio, Abhinav Kimothi sits down with host Priyanka Raghavan to explore retrieval-augmented generation (RAG), drawing insights from Abhinav's book, A Simple Guide to Retrieval-Augmented Generation.
The conversation begins with an introduction to key concepts, including large language models (LLMs), context windows, RAG, hallucinations, and real-world use cases. They then delve into the essential components and design considerations for building a RAG-enabled system, covering topics such as retrievers, prompt augmentation, indexing pipelines, retrieval strategies, and the generation process.
The discussion also touches on critical aspects like data chunking and the distinctions between open-source and pre-trained models. The episode concludes with a forward-looking perspective on the future of RAG and its evolving role in the industry.
Brought to you by IEEE Computer Society and IEEE Software magazine.
Show Notes
Related Episodes
Other References
Transcript
Transcript brought to you by IEEE Software magazine.
This transcript was automatically generated. To suggest improvements in the text, please contact [email protected] and include the episode number and URL.
Priyanka Raghavan 00:00:18 Hi everyone, I'm Priyanka Raghavan for Software Engineering Radio, and I'm in conversation with Abhinav Kimothi on Retrieval Augmented Generation, or RAG. Abhinav is the co-founder and VP at Yanet, an AI-powered platform for content creation, and he's also the author of the book, A Simple Guide to Retrieval Augmented Generation. He has more than 15 years of experience in building AI and ML solutions, and as you'll see, today Large Language Models are being used in numerous ways across industries to automate tasks using natural language input. In this regard, RAG is often discussed as a way to enhance the performance of LLMs. So for this episode, we'll be using Abhinav's book to discuss RAG. Welcome to the show, Abhinav.
Abhinav Kimothi 00:01:05 Hey, thank you so much Priyanka. It’s great to be here.
Priyanka Raghavan 00:01:09 Is there anything else in your bio that I missed out that you would like listeners to know about?
Abhinav Kimothi 00:01:13 Oh no, this is absolutely fine.
Priyanka Raghavan 00:01:16 Okay, great. So let’s jump right in. The first thing, when I gave the introduction, I talked about LLMs being used in a lot of industries, but the first section of the podcast, we could just go over some of these terms and so I’ll ask you to define a few of those things for us. So what is a Large Language Model?
Abhinav Kimothi 00:01:34 That's a great question, and a great place to start the conversation too. Yeah, so Large Language Models are very important; in a way, the LLM is the technology that ushered in this new era of artificial intelligence, and everybody's talking about it. I'm sure by now everybody's familiar with ChatGPT and the likes. So these applications, which everybody's using for conversations, text generation, etc., the core technology that they are based on is a Large Language Model, an LLM as we call it.
Abhinav Kimothi 00:02:06 Technically LLMs are deep learning models. They have been trained on massive volumes of text and they’re based on a neural network architecture called the transformers architecture. And they’re so deep that they have billions and in some cases trillions of parameters and hence they’re called large models. What it does is that it gives them unprecedented ability to process text, understand text and generate text. So that’s sort of the technical definition of an LLM. But in layman terms, LLMs are sequence models, or we can say that they’re algorithms that look at a sequence of words and are trying to predict what the next word should be. And how they do it is based on a probability distribution that they have inferred from the data that they have been trained on. So think about it, you can predict the next word and then the word after that and the word after that.
Abhinav Kimothi 00:03:05 So that’s how they’re generating coherent text, which we also call natural language and health. They are generating natural language.
Priyanka Raghavan 00:03:15 That’s great. Another term that’s always used is prompt engineering. So we’ve always, a lot of us who go on ChatGPT or other kind of agents, you just type in normally, but then you see that there’s a lot of literature out there which says if you are good at prompt engineering, you can get better results. So what is prompt engineering?
Abhinav Kimothi 00:03:33 Yeah, that’s a good question. So LLMs differ from traditional algorithms in the sense that when you’re interacting with an LLM, you’re interacting not in code or not in numbers, but in natural language text. So this input that you’re giving to the LLM in form of natural language or natural text is called a prompt. So think of prompt as an instruction or a piece of input that you’re giving to this model.
Abhinav Kimothi 00:03:58 In fact, if you go back to early 2023, everybody was saying, hey, English is the new programming language, because with these AI models you can just chat in English. And it may seem a bit banal if you look at it from a high level, hey, how can English now become a programming language? But it turns out the way you structure your instructions, even in English, has a significant effect on the kind of output that the LLM will produce. English may be the language, but the principles of logic and reasoning stay the same. So how you craft your instruction becomes very important. And this ability, or the process of crafting the right instruction even in English, is what we call prompt engineering.
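The point about structuring instructions can be made concrete with a small sketch. The function names and the template below are illustrative only, not any library's API; they just contrast an unstructured request with one that spells out role, task, constraints, and output format:

```python
# A minimal sketch of the idea behind prompt engineering: the same
# request, phrased with and without explicit structure. In practice
# the structured version typically steers a model toward a more
# predictable, useful answer.

def casual_prompt(topic):
    return f"tell me about {topic}"

def engineered_prompt(topic):
    # Role, task, constraints, and output format are spelled out.
    return (
        "You are a concise technical writer.\n"
        f"Task: explain {topic} to a software engineer.\n"
        "Constraints: at most three sentences, avoid jargon.\n"
        "Format: a single plain-text paragraph."
    )

print(engineered_prompt("context windows"))
```

Both strings are valid English, which is exactly the point: the difference in output quality comes from how the instruction is crafted, not from any special syntax.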
Priyanka Raghavan 00:04:49 Great. And then obviously the other question I have to ask you is also there’s a lot of talk about this term called context window. What is that?
Abhinav Kimothi 00:04:56 As I said, LLMs are sequence models. They'll look at a sequence of text and then they will generate some text after that. Now, this sequence of text cannot be infinite, and the reason it can't be infinite is because of how the algorithm is structured. So there is a limit to how much text the model can look at, in terms of the instructions that you're giving it, and then how much text it can generate after that. This constraint on the number of, well, it's technically called tokens, but we'll use words, so the number of words that the model can process in one go is called the context window of that model. And we started with very small context windows, but now there are models that have context windows of two lakh or three lakh tokens, so they can process two lakh words at a time. So that's what the context window term means.
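The practical consequence of a fixed context window is that input beyond the limit must be dropped (or summarized) before the model sees it. The sketch below uses whitespace-separated words and a made-up limit purely for illustration; real systems count subword tokens with a model-specific tokenizer:

```python
# A rough sketch of the context-window constraint: if the input
# exceeds the model's limit, the oldest text is discarded so the
# remainder fits. Real models count subword tokens, not words, and
# real limits run into the hundreds of thousands of tokens.

CONTEXT_WINDOW = 8  # toy limit for illustration

def fit_to_window(words, limit=CONTEXT_WINDOW):
    """Keep only the most recent `limit` words."""
    return words[-limit:]

prompt = "please summarize the long meeting notes pasted below for me".split()
trimmed = fit_to_window(prompt)
print(len(prompt), "->", len(trimmed))  # prints "10 -> 8"
```

This truncation problem is part of what motivates RAG: rather than stuffing everything into the window, a retriever selects only the most relevant passages to include.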
[...]