Summary: This blog introduces Large Language Models (LLMs) and Retrieval Augmented Generation (RAG), targeting tech enthusiasts. It explains LLMs, their applications, and limitations, while offering guidelines for building knowledge bases and prompt engineering. It also highlights testing methodologies for RAG models.
Contents
- Intro to LLMs (Large Language Models) and how they work
- Intro to RAG
- Guidelines for knowledge base
- Guidelines for prompt engineering
- Methodologies to test
- Annexure
Who this blog is for
Aimed at folks who regard themselves as “tech enthusiasts” but not necessarily “hardcore tech”. The aim of this blog is to
- Understand the terminologies like LLM, RAG, knowledge base, prompt and prompt engineering
- Get practical tips for prompt engineering and curating knowledge bases
Intro to LLMs and how they work
Large language models (LLMs) are the engines behind the rise of Generative AI. These models are designed to understand and generate human-like text.
An analogy
Think of LLMs as an algorithm that can mimic human language. It can speak (generate text) that mimics human speech and respond to most commonly used human phrases. It has read and learned from tons of information on the internet, so when you ask it a question, it uses what it has learned to respond in a human-like way.
This is a gross oversimplification; for a slightly deeper insight into the technology, you can explore more here.
Various companies are in the race to create their own Large Language Models (e.g. OpenAI is the creator of the GPT models, Google of Gemini, Meta of the Llama series, Ola of Krutrim, Anthropic of Claude, and so on). Each of these models has its pros and cons along categories such as cost of usage, ease of use, community support, inherent biases, accuracy, ease of fine-tuning, etc.
Since LLMs are trained on data from the internet, they don’t always give you the latest or most accurate information. So, it’s important to double-check the facts.
Keep in mind:
- It is best to assume that the information provided is not 100% factual; it only sounds like an intelligent human answering.
- AI generates a response based on the data it was trained on, which typically covers information from the internet only up to a certain cutoff date in the past.
Here is a tool you can use to try out various LLMs pitted against each other.
Intro to RAG (Retrieval Augmented Generation)
RAG stands for Retrieval Augmented Generation. This technique augments the information the model has access to when generating the response to a question.
An analogy
If you think of LLMs as an algorithm that mimics human language, think of RAG as giving the algorithm text content to mug up, so it can use its speech-mimicking ability to generate answers from that content.
This idea is crucial: using RAG, we can get LLMs to answer highly context-specific questions that were definitely not part of their original training data.
Following is a simplified visual representation of RAG:
There are three major components needed to leverage RAG:
- The knowledge base
- The prompt
- The Large Language Model
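To make the three components concrete, here is a minimal sketch of how they fit together. This is purely illustrative: `call_llm` is a hypothetical stand-in for any real LLM API, and the keyword-overlap retrieval is a toy; production systems use vector embeddings for retrieval.

```python
# Toy RAG pipeline: knowledge base -> retrieval -> prompt -> LLM.
import re

# Component 1: the knowledge base (plain text documents).
KNOWLEDGE_BASE = [
    "Glific is an open-source WhatsApp chatbot platform for NGOs.",
    "RAG retrieves relevant documents and passes them to the LLM as context.",
]

def _tokens(text):
    """Lowercase word set, ignoring punctuation."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve(question, documents):
    """Return the document sharing the most words with the question
    (a naive stand-in for embedding-based similarity search)."""
    return max(documents, key=lambda d: len(_tokens(question) & _tokens(d)))

# Component 2: the prompt, which stitches the retrieved context in.
def build_prompt(question, context):
    return (
        "Answer using only the context below.\n"
        f"Context: {context}\n"
        f"Question: {question}"
    )

# Component 3: the LLM, represented here by a caller-supplied function.
def answer(question, call_llm):
    context = retrieve(question, KNOWLEDGE_BASE)
    return call_llm(build_prompt(question, context))
```

The key idea is that the model never sees the whole knowledge base; only the retrieved snippet is injected into the prompt at question time.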
Guidelines for knowledge base
Keep the following in mind while putting together the files that will form the knowledge base:
- Text-based file formats are best suited (.md, .txt, .docx, etc.)
- The files should not be filled with images, word art, tables, or other media
- PDF files are OK, but not as good as pure-text documents
- Tabular files (.csv, .xls, etc.) are not good inputs for a knowledge base. Any tabular info should be converted into a paragraph-based format to be effective.
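As a sketch of the last point, tabular rows can be turned into plain sentences before they go into the knowledge base. The column names and data below are invented for illustration:

```python
# Convert tabular rows into paragraph-style sentences for a knowledge base.
import csv
import io

# Hypothetical CSV content with made-up columns.
csv_text = "name,role,location\nAsha,Mentor,Pune\nRavi,Student,Delhi\n"

def rows_to_paragraphs(csv_text):
    """Turn each CSV row into a natural-language sentence."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        f"{row['name']} is a {row['role']} based in {row['location']}."
        for row in reader
    ]

paragraphs = rows_to_paragraphs(csv_text)
# e.g. "Asha is a Mentor based in Pune."
```

Sentences like these retrieve far better than raw cells, because the relationships between columns are spelled out in words.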
Guidelines for prompt engineering
A prompt is an instruction provided to the model; sometimes it is simply called the “instructions”. It guides the way the LLM generates the answer. Altering the instructions can be game-changing for the way a model responds. For example, the instruction “explain like I am 5 years old” vs “explain like I am a PhD student” will generate very different responses.
Prompt engineering is crafting a set of instructions that include
- context of the situation
- role the AI should play
- guardrails to follow
- examples of ideal interactions
Here are some tips:
- Be very specific and clear. Think of the LLM as an assistant that will do what you ask it to, given your instructions are clear and concise.
- Assign a role to the AI model, e.g. “You are a helpful mentor, who answers questions of young people regarding social and emotional well-being in a patient manner.” This steers the model to behave the way you want it to.
- Correct grammar, punctuation, spacing, and paragraphing matter.
- Instead of “don’t do X”, say “avoid X”. Ex. Instead of “do not respond to queries which do not pertain to the topic” you can say “politely decline to answer if the query is outside the topic you’re allowed to answer from”
- Use a new line for every part of the prompt
- Specify a structured format to receive the output back in.
- Give examples of how you expect the output from the model to be. Ex- if you want the output to be 2 paragraph answers, provide it examples of responses structured in a similar way.
- Divide your prompt into these categories by giving “context and role to play”, “instructions or guidelines”, and “examples or expected output format”
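The tips above can be sketched as a small helper that assembles a prompt from those categories. The section names, example role, and guidelines below are assumptions for illustration, not a fixed standard:

```python
def build_system_prompt(context, role, guidelines, examples):
    """Assemble a prompt from labelled sections, each on its own lines."""
    sections = [
        f"Context:\n{context}",
        f"Role:\n{role}",
        "Guidelines:\n" + "\n".join(f"- {g}" for g in guidelines),
        "Examples:\n" + "\n".join(examples),
    ]
    # Blank line between sections keeps each part visually distinct.
    return "\n\n".join(sections)

# Hypothetical usage for a youth-helpline bot.
prompt = build_system_prompt(
    context="You support a helpline for young people.",
    role="You are a helpful, patient mentor.",
    guidelines=[
        "Answer in two short paragraphs.",
        "Politely decline off-topic questions.",
    ],
    examples=["Q: I feel anxious before exams.\nA: That is very common..."],
)
```

Keeping the sections in code like this also makes A/B testing easier later: you can swap one section at a time and measure the effect.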
Some more ideas
Chain-of-thought prompting: Give the model time to think. The model is biased toward producing some output as soon as possible, so if you directly ask it for the final answer, it will likely make a mistake. Instead, ask the model to think through the individual steps and only then produce the final output. This is a consequence of how the attention mechanism works, and it reduces hallucinations drastically.
Few-shot prompting: Give a list of examples of anticipated questions and the expected responses. This works great for steering the output toward a certain format.
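Many chat-style LLM APIs accept a list of role-tagged messages, and few-shot examples are typically injected as fake user/assistant turns before the real question. The sketch below follows that common convention; the exact field names vary by provider, and the Q&A pairs are invented:

```python
# Few-shot prompting: seed the conversation with example Q&A turns.

# Invented example pairs showing the desired answer style.
few_shot_examples = [
    ("What is RAG?", "RAG retrieves documents and uses them as context."),
    ("What is a prompt?", "A prompt is the instruction given to the model."),
]

def build_messages(system_prompt, question):
    """Build a chat-message list: system, then example turns, then the
    real question last."""
    messages = [{"role": "system", "content": system_prompt}]
    for q, a in few_shot_examples:
        messages.append({"role": "user", "content": q})
        messages.append({"role": "assistant", "content": a})
    messages.append({"role": "user", "content": question})
    return messages
```

Because the model sees the example turns as if they already happened, it tends to continue in the same tone and format for the final question.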
Methodologies to test
Now, when you evaluate a RAG model, you need to make sure of three things:
- Relevance – Does the information retrieved by the AI actually make sense for the question?
- Accuracy – Is the generated response factually correct?
- Groundedness – Is the answer based on reliable sources? You’d trust something your teacher told you over a rumor from a friend, so the AI should rely on trustworthy data when generating its answer.
So, in short, when we evaluate RAG, we’re looking at how well the AI:
- Picks relevant info,
- Gives accurate answers, and
- Backs it up with reliable sources
Here is how you can test the models
- Human reviews: This is a manual process. It involves recording every interaction (question, answer) and then going through each pair to rate the response at your discretion. This is time-consuming, but it is often the most reliable way to build confidence in the answers and to understand the circumstances in which the LLM does not respond as expected. Read this blog to get an idea of how a manual system can be established to quantitatively evaluate the accuracy and relevance of LLM responses.
- Automating the tests: Use libraries like ragas.io or tools like langfuse. The advantage of these observability and monitoring technologies is that they are best utilized when a large number of interactions is taking place, and for running A/B experiments by tweaking prompts. On the other hand, this is still automated evaluation (an LLM evaluating itself), which tends to be biased and is not as perceptive as a human observing the interactions.
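To give a flavour of what an automated check can look like, here is a deliberately crude groundedness score: the fraction of the answer's words that appear in the retrieved context. Real tools such as ragas.io use LLM-based judges and embeddings instead; this word-overlap heuristic is only a sketch:

```python
# Crude groundedness metric: share of answer words found in the context.
import re

def groundedness(answer, context):
    """Return a score in [0, 1]; higher means more of the answer's
    vocabulary is backed by the retrieved context."""
    answer_words = set(re.findall(r"\w+", answer.lower()))
    context_words = set(re.findall(r"\w+", context.lower()))
    if not answer_words:
        return 0.0
    return len(answer_words & context_words) / len(answer_words)
```

Even a rough score like this, logged per interaction, lets you flag the lowest-scoring answers for human review instead of reading every transcript.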
Annexure:
- Analysis of Using GPT models vs self hosting open source LLMs (link)
- TLDR: At a scale of 1k–100k API calls per day, proprietary models (e.g. OpenAI’s GPT models or something comparable) are more cost-effective. At a scale of 100k–1M API calls per day, it is more cost-effective to use self-hosted open-source models.
- More reading materials (link)
- Using RAG in your WhatsApp chatbot powered by Glific (read here)
Lastly
This blog was collated by tejas@glific.org and akhilesh@glific.org; the info provided on prompt engineering borrows generously from a session conducted by Aman Dalmia of Hyperverge (aman.d@hyperverge.co).
