
RAG: Simple in Proof of Concept, Challenging to Get Production-Ready!

@Anil's Notes

--

I recently wrote a theoretical analysis of Retrieval Augmented Generation (RAG), a method for enhancing the capabilities of Generative AI using an existing dataset.

To summarize, RAG augments the user’s input with relevant content retrieved from a dataset and then passes the combined prompt through a Large Language Model (LLM) like GPT-3.5 or GPT-4 to produce contextually apt answers.

While there’s no shortage of beginner’s guides for setting up RAG systems, such as those from LangChain, Microsoft’s GitHub, and Amazon Kendra, the real challenge emerges during user testing. Transitioning from a proof of concept (PoC) to full production isn’t as straightforward as one might think. As a huge advocate of the Pareto principle’s 80:20 rule, I’d say a PoC might only bring you 20% of the way to production readiness, leaving 80% of the work still to be planned and executed.

Here are some strategies, methods, and considerations to enhance RAG and prepare it for production.

1. Is your search optimized and prepared?

The success of a RAG implementation hinges on the quality of your data. Feeding it outdated data can lead to inaccurate results or, from the user’s perspective, “hallucinations.” It’s crucial to ensure that the data you provide to Generative AI is up-to-date, accurate, and aligned with the user’s context. Beyond just having a “Vector Search,” what’s essential is a finely tuned search that yields the most precise results. Techniques like stemming and eliminating stop words can significantly enhance your search’s efficacy. So, is your search system primed to deliver the most accurate outcomes at any given moment? If the answer is yes, you’re on the right track.

Stop-word Removal: Clear out common stop words like “and,” “the,” and “is” that don’t add meaningful value. This declutters the search index, spotlighting more relevant terms.

Stemming Technique: Employ stemming to reduce words to their root form, improving search recall by accommodating different word variations. For example, stemming would convert “Walking,” “Walks,” and “Walked” to the base word “Walk.”
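
As an illustration, here’s a minimal sketch of both techniques using the NLTK library; in practice, your search engine may apply these normalizations for you:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # one-time corpus download

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def normalize_query(query: str) -> list[str]:
    """Lowercase and split the query, drop stop words, stem the rest."""
    tokens = query.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in stop_words]

print(normalize_query("the user is walking and walks daily"))
# -> ['user', 'walk', 'walk', 'daili']
```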

2. Is your content fresh and relevant?

A RAG system is only as good as the content it’s based on. If your content lacks clarity, the system’s output will be off the mark. Do you maintain a well-governed dataset? Remember, if you input poor-quality data into RAG, it will produce irrelevant results. It’s crucial to keep your data current, eliminate redundancies, and tailor it to the user based on the attributes you deem essential. One technique is to only “scope” your RAG implementation to highly curated and governed data.
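
One way to enforce that scoping is with metadata filters at query time, which most vector stores support. Here’s a sketch using Chroma, where the status field is hypothetical governance metadata you would attach at ingestion:

```python
import chromadb

client = chromadb.Client()  # in-memory client; uses the default embedding function
collection = client.get_or_create_collection("knowledge_base")

# Hypothetical governance metadata attached when documents are ingested
collection.add(
    ids=["doc-1", "doc-2"],
    documents=["Current refund policy ...", "Deprecated refund policy ..."],
    metadatas=[{"status": "approved"}, {"status": "stale"}],
)

# Scope retrieval to curated content only: stale documents never reach the LLM
results = collection.query(
    query_texts=["What is the refund policy?"],
    n_results=3,
    where={"status": "approved"},
)
print(results["documents"])
```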

3. Token limits? Absolutely!

Every GPT model comes with specific token constraints: a context-window limit per request, per-minute rate limits, and so on. It’s essential to have fallback strategies for situations where users hit these limits, and it’s a good idea to conduct some load testing! Fingers crossed for improvements in the near future! :)
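
One simple mitigation for per-request limits is to count tokens up front and trim the retrieved context to a budget before calling the model. A minimal sketch using OpenAI’s tiktoken library (the 3,000-token budget is an arbitrary example, not a recommendation):

```python
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fit_context(chunks: list[str], budget: int = 3000) -> list[str]:
    """Keep retrieved chunks, in ranked order, until the token budget is spent."""
    kept, used = [], 0
    for chunk in chunks:
        cost = len(encoding.encode(chunk))
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept
```

For per-minute rate limits, retrying with exponential backoff is the usual fallback.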

4. Is your prompt accurate? Keep tuning it.

Starting with a basic prompt, such as “Generate a response to the query using only the given information, without relying on prior knowledge,” is a good approach. Yet, it may not always work well. As you gather more data, it’s crucial to adjust and enhance the prompt, even if it means making it more detailed.
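
For instance, a more detailed version of that grounding prompt might look like the template below; the exact wording is just an illustration, the kind of thing you’d refine as feedback comes in:

```python
PROMPT_TEMPLATE = """You are a support assistant. Answer the question using ONLY
the context below. If the context does not contain the answer, say
"I don't have enough information" instead of guessing.

Context:
{context}

Question: {question}

Answer in at most three sentences, and cite the source title for each claim."""

prompt = PROMPT_TEMPLATE.format(context="...retrieved chunks...",
                                question="...user query...")
```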

5. Have you determined the most suitable model?

It might be necessary to compare various models, like GPT-3.5 Turbo and GPT-4. By evaluating them against your specific criteria, you can decide which model best fits your RAG configuration. For instance, if you care about voice and tone and are deploying a chatbot via RAG, GPT-4 might excel in accuracy and tone with less hallucination, but it comes at a higher cost.
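
One lightweight way to run that comparison is to send the same grounded prompts to each candidate model and review the outputs side by side. A minimal sketch using the OpenAI Python client; the test prompts are placeholders for your own evaluation set:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
test_prompts = ["...grounded prompt 1...", "...grounded prompt 2..."]

for model in ("gpt-3.5-turbo", "gpt-4"):
    for prompt in test_prompts:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,  # keep output stable for a fairer comparison
        )
        print(model, "->", response.choices[0].message.content[:200])
```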

6. Do you need re-ranking?

Re-ranking lets you re-order retrieved results by relevance, drawing on signals such as keyword matches, the user’s query, and contextual information. This lets you prioritize the best candidates before feeding them into the Generative AI model.

For example, see: https://txt.cohere.com/rerank/
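
As an illustration, here’s roughly what re-ranking looks like with Cohere’s Python SDK; this is a sketch, and model names and response fields may vary across SDK versions:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder key

query = "How do I reset my password?"
documents = [
    "Passwords can be reset from the account settings page.",
    "Our office is open Monday through Friday.",
    "Contact support if the reset email never arrives.",
]

# Re-order the retrieved documents by relevance before building the prompt
response = co.rerank(model="rerank-english-v3.0", query=query,
                     documents=documents, top_n=2)
for result in response.results:
    print(result.relevance_score, documents[result.index])
```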

7. Have your users been trained to give precise prompts? If not, should you adjust the query?

Prompt engineering is a new skill, and many users might not yet be adept at writing optimal prompts, queries, or questions for the search engines that RAG systems rely on. LLMs can assist by rewording the user’s question to make it more comprehensible to your search engine. Implementing a query pre-processing step with an LLM can be beneficial, especially when dealing with long or complex queries.
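
A minimal sketch of that pre-processing step, assuming an OpenAI model does the rewriting (the system instruction wording is just an illustration):

```python
from openai import OpenAI

client = OpenAI()

def rewrite_query(user_question: str) -> str:
    """Ask an LLM to turn a verbose user question into a concise search query."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Rewrite the user's question as a short, "
             "keyword-rich search query. Return only the query."},
            {"role": "user", "content": user_question},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# e.g. "hey, so I was wondering how I'd go about getting my money back?"
# might become "refund request process"
```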

8. An experimental approach might be more effective!

After preparing the PoC and demonstrating various queries, are you ready to launch? Adopting an experimental strategy can be beneficial for successfully implementing the RAG system. Given the uncertainties, it’s better to engage in experimentation, gather user insights, refine the system, and gradually expand its reach to a broader audience.

Implementing a RAG system isn’t as straightforward as it may seem. I’ve shared my thoughts on just a few of the challenges one might encounter during deployment. As I dig deeper and learn more, I’ll continue to share updates.

--


@Anil's Notes

The thoughts I share here are my personal notes and learnings only.