When to use RAG
RAG is a form of prompt engineering. It is a collection of techniques in which an application retrieves relevant documents and then includes them in the prompt sent to the generative model. The goal of RAG is to provide the model with relevant information at the time a question is asked. This lets an LLM use information that was not available during training (because it is private or new), reminds it of relevant facts, and reduces hallucinations.
Example problems where RAG is a good fit:
- Customer support: Automate customer support by providing answers to common questions. The source material can be product documentation and previous support tickets.
- Sales enablement: Generate custom responses based on a vast repository of sales documents, case studies, and product specifications.
- Legal: Quickly find relevant legal documents, case law, and statutes to answer legal questions.
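The retrieve-then-prompt loop described above can be sketched in a few lines. This is a minimal illustration with a toy keyword-overlap retriever over an in-memory list of documents; a real system would use an embedding model and a vector database, but the prompt-assembly step looks much the same.

```python
def retrieve(question, documents, k=2):
    """Score each document by word overlap with the question (toy retriever)."""
    q_words = set(question.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(question, documents):
    """Assemble a RAG prompt: retrieved context first, then the question."""
    context = "\n".join(f"- {d}" for d in retrieve(question, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Toy knowledge base (stand-in for product docs / support tickets).
docs = [
    "Refunds are processed within 5 business days.",
    "The Pro plan includes priority support.",
    "Password resets are sent by email.",
]
print(build_prompt("How long do refunds take?", docs))
```

The prompt this produces would then be sent to the generative model; the model never needs to have seen the documents during training.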
Challenges with RAG
RAG’s effectiveness depends on the quality of the knowledge base and the retrieval strategy. In many cases, the knowledge base does not exist and has to be created by painstakingly collecting and structuring data from various sources across a company. Getting high-quality data into a place where it can be easily retrieved isn’t a new challenge - it has existed, under different names and disciplines, for decades. Many companies have entire data engineering or data platform teams dedicated to this task. The fact that we have LLMs now helps, but it’s not a silver bullet. The main areas of challenge are:
- Preparing the data for storage and retrieval: Cleaning and structuring the data, chunking large texts, enriching them with additional information, indexing them for fast retrieval, etc.
- Retrieval strategy: How to retrieve the most relevant documents for a given prompt. Techniques include semantic similarity using vector embeddings, keyword matching, use of relation graphs, and combinations of these. And of course each technique has a variety of algorithms and models, with different strengths and different usage patterns.
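To make the "combinations of these" point concrete, here is a sketch of hybrid retrieval: cosine similarity over embedding vectors blended with a keyword-overlap score. The tiny 3-dimensional "embeddings" and the blend weight are invented for illustration; a real system would use a trained embedding model and tune the weighting empirically.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def keyword_score(query, text):
    """Fraction of query words that appear in the text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / max(len(q), 1)

def hybrid_rank(query, query_vec, corpus, alpha=0.7):
    """Blend semantic and keyword scores; alpha weights the semantic part."""
    ranked = []
    for text, vec in corpus:
        score = alpha * cosine(query_vec, vec) + (1 - alpha) * keyword_score(query, text)
        ranked.append((score, text))
    return [t for _, t in sorted(ranked, reverse=True)]

# Toy corpus: (text, hand-made 3-d "embedding") pairs.
corpus = [
    ("how to reset a password", [0.9, 0.1, 0.0]),
    ("billing and invoices",    [0.0, 0.9, 0.2]),
    ("account recovery steps",  [0.8, 0.2, 0.1]),
]
print(hybrid_rank("reset password", [0.85, 0.1, 0.05], corpus))
```

Keyword matching catches exact terms (product names, error codes) that embeddings can miss, while the semantic score surfaces documents that use different wording - which is why blends like this are common.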
When to fine-tune a model
Fine-tuning a model is a form of training. It is the process of taking a pre-trained model and training it further on a specific dataset. The goal of the training isn’t to get the model to memorize new data, but rather to learn how to generalize better on a specific task. Because of this generalization, a fine-tuned model can perform well on a task even when there isn’t a lot of data available at prompt time, or when the relevant data is challenging to retrieve.
Example problems where fine-tuning is a good fit:
- Code generation: If a model wasn’t trained on a specific language or framework, it won’t perform well in that language, even if you provide examples using RAG. Code generation doesn’t generalize well from a small number of examples, even when they are relevant.
- Asking customers clarifying questions: Customers rarely share all available information when they open a support ticket. But most language models are “reluctant” to ask clarifying questions and weren’t trained to do so. RAG isn’t useful here because the model isn’t missing data, it is missing a behavior. A large set of example scenarios, paired with the questions that are appropriate in each, can teach the model this new behavior.
- Writing in the style of a specific person: With enough examples of a specific writing style, a model will generalize and learn to write in that style. While you could provide examples using RAG, a few examples in the prompt are costly and hard to generalize from.
- Generating audio in a specific voice: An audio-generation model can be trained to mimic specific voices.
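For behavior changes like the clarifying-questions example above, the training set is typically a collection of example conversations demonstrating the desired behavior. A common representation is JSONL with chat-style messages - though the exact field names are vendor-specific, so treat this schema as an assumption and check your platform's documentation:

```python
import json

# Each example demonstrates the desired behavior: ask a clarifying
# question instead of guessing. The "messages"/"role"/"content" schema
# follows a common chat fine-tuning format (vendor-dependent).
examples = [
    {"messages": [
        {"role": "user", "content": "My app is broken."},
        {"role": "assistant",
         "content": "Sorry to hear that! Which app are you using, and what error do you see?"},
    ]},
    {"messages": [
        {"role": "user", "content": "I can't log in."},
        {"role": "assistant",
         "content": "Are you seeing an error message, or does the page not load at all?"},
    ]},
]

# JSONL: one JSON object per line, the usual format for fine-tuning uploads.
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```

A real dataset would need many such scenarios, covering the variety of tickets the model will actually see.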
Challenges with fine-tuning
Fine-tuning is a form of training, and most of the challenges result from the training process.
- Training requires a large, high-quality dataset prepared specifically for training the model - separated into “good” and “bad” examples, and covering many variations of the tasks the model will perform. A large number of examples and high variety are important for generalization (to avoid overfitting). Creating these datasets is time consuming and expensive. For some use cases there are public datasets available, which helps. It is also worth mentioning that some companies use AI models to generate synthetic training data. This is controversial, since it can degrade the model’s performance on real data; however, using AI to add variations to real data can be a good way to create a larger dataset and avoid overfitting.
- Training requires a lot of compute power and time. GPUs are expensive and the latest models are hard to acquire at any price.
- Training requires not just GPUs, but also the use of machine learning tools, languages, and libraries - and expertise in using them.
- Training requires expertise beyond just the tools. There are many techniques for training and many hyperparameters to tune. Some techniques are specifically designed for efficiency and can be used even with a single consumer-grade GPU. These are known as PEFT (Parameter-Efficient Fine-Tuning) techniques, and the most famous of them is LoRA. PEFT trains only a small number of parameters; if the task requires deeper training, less efficient techniques like full or partial-parameter fine-tuning are used.
- Training requires a good process: knowing how to train a model, how to evaluate it, how to manage the various versions, etc.
- Training can go very wrong very fast. There are many cases where a model performs significantly worse after fine-tuning, and training must be restarted from an older version of the model. This behavior is sometimes called “catastrophic forgetting” and is a well-known problem in machine learning - and it is just one example of how models can quickly and unexpectedly degrade in performance.
- To use a fine-tuned model, you need to deploy it (although some vendors, like OpenAI, allow you to fine-tune on their platform). This is a whole other set of challenges, including how to serve the model, how to monitor it, and how to update it - all on top of the GPU costs of serving the model.
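To make the PEFT point from the list above concrete, here is a small numpy sketch of the LoRA idea: instead of updating a full weight matrix W (d×d parameters), LoRA freezes W and trains two small matrices B (d×r) and A (r×d), so the effective weight is W + BA. The dimensions below are illustrative, and this omits details of the real method (such as the alpha/r scaling factor applied to the update).

```python
import numpy as np

d, r = 1024, 8  # model dimension and LoRA rank (illustrative values)

rng = np.random.default_rng(0)
W = rng.normal(size=(d, d))          # frozen pre-trained weight (not trained)
A = rng.normal(size=(r, d)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                 # B starts at zero, so BA = 0 initially

def forward(x):
    """Apply the adapted layer: frozen weight plus the low-rank update."""
    return x @ (W + B @ A).T

full_params = d * d          # parameters touched by full fine-tuning
lora_params = d * r + r * d  # parameters LoRA actually trains
print(f"full fine-tuning: {full_params:,} params; LoRA: {lora_params:,} params")
```

With these numbers LoRA trains under 2% of the layer's parameters, which is why it fits on a single consumer-grade GPU - and also why tasks that need deeper changes may call for full or partial-parameter fine-tuning instead.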