
Retrieval-Augmented Generation

This guide shows how to build an end-to-end Retrieval-Augmented Generation (RAG) system using Predibase and other frameworks in the ecosystem. Predibase currently integrates with tools that specialize in RAG workflows, including LangChain and LlamaIndex.

The following walkthrough shows how to use Predibase + LlamaIndex to set up all the moving parts of a RAG system, with Predibase as the LLM provider. Feel free to follow along in the Colab notebook!

Open in Colab Notebook

Predibase + LlamaIndex: Building a RAG System

The following walkthrough shows you how to use Predibase-hosted LLMs with LlamaIndex to build a RAG system.

There are a few pieces required to build a RAG system:

  1. LLM provider
    • Predibase is the LLM provider here. We can serve base LLMs and/or fine-tuned LLMs for whatever generative task you have.
  2. Embedding Model
    • This model generates embeddings for the data that you are storing in your Vector Store
    • In this example you have the option of using a local HuggingFace embedding model, or OpenAI's embedding model.
      • Note: You need to have an OpenAI account with funds and an API token to use the OpenAI embedding model.
    • In the near future, you will be able to train and deploy your own embedding models using Predibase
  3. Vector Store
    • This is where we store the embedded data that we want to retrieve later at query time
    • In this example we will use Pinecone for our Vector Store
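
To make the division of labor between these pieces concrete, here is a minimal, library-free sketch of the embed, store, and retrieve steps. Everything in it is a stand-in invented for illustration: `toy_embed` is a bag-of-words counter rather than a real embedding model, and `ToyVectorStore` is an in-memory list rather than Pinecone. The retrieved text is what would then be handed to the LLM provider along with the question.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for the embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ToyVectorStore:
    """Stand-in for the vector store: keeps (embedding, text) pairs in memory."""

    def __init__(self):
        self._rows = []

    def add(self, text):
        # Embed once at indexing time and keep the original text alongside.
        self._rows.append((toy_embed(text), text))

    def top_k(self, query, k=1):
        # Embed the query, then rank stored rows by similarity to it.
        query_vec = toy_embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine_similarity(query_vec, row[0]),
            reverse=True,
        )
        return [text for _, text in ranked[:k]]


store = ToyVectorStore()
store.add("Pinecone stores embedded documents for retrieval.")
store.add("Temperature controls how random the model output is.")

# The retrieved text is the context that would be passed to the LLM.
retrieved = store.top_k("Where are embedded documents stored?", k=1)
print(retrieved)
```

In the real system, each stand-in is replaced by the corresponding component configured in the steps below.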

Getting Started


Predibase

  1. If you don't have a Predibase account already, sign up for a free trial here
  2. Once you've logged in, navigate to Settings > My profile
  3. Generate a new API token
  4. Copy the API token and paste in the first setup cell below

OpenAI (Optional)

  1. If you don't have an OpenAI account already, sign up here
  2. Navigate to OpenAI's API keys page
  3. If you have not already, generate an API key
  4. Copy the API key and paste in the second setup cell below


Pinecone

  1. If you don't have a Pinecone account already, sign up; Pinecone has a free tier you can use to try this example
  2. Navigate to the API Keys page
  3. If you have not already, generate an API key
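
Before running the setup cells, it can help to confirm that every credential you collected above is actually available. The sketch below is one possible way to do that, assuming you keep the keys in environment variables. The helper `missing_credentials` and the variable names `PREDIBASE_API_TOKEN` and `PINECONE_API_KEY` are assumptions made for illustration; only `OPENAI_API_KEY` appears in this walkthrough.

```python
import os


def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


# Hypothetical variable names; adjust to however you store your keys.
REQUIRED = ["PREDIBASE_API_TOKEN", "PINECONE_API_KEY"]

# Example with an explicit mapping instead of the real environment.
example_env = {"PREDIBASE_API_TOKEN": "pb-example", "PINECONE_API_KEY": ""}
missing = missing_credentials(REQUIRED, example_env)
print(missing)
```

Calling `missing_credentials(REQUIRED)` with no second argument checks the real process environment instead.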

Step 0: Setup

import os

import openai
import pinecone

from llama_index import ServiceContext, StorageContext, SimpleDirectoryReader, VectorStoreIndex, set_global_service_context
from llama_index.llms import PredibaseLLM
from llama_index.embeddings import HuggingFaceEmbedding, OpenAIEmbedding
from llama_index.vector_stores import PineconeVectorStore


The following is only required if you'll be using an OpenAI embedding model.

openai.api_key = os.environ["OPENAI_API_KEY"]

Step 1: Setting up the Predibase LLM

There are a few parameters to keep in mind while setting up your Predibase LLM:

  1. model_name: This must be an LLM currently deployed in your Predibase environment.
    • Any of the models shown in the LLM query view dropdown are valid options.
    • If you are running Predibase in a VPC, you'll need to deploy an LLM first.
  2. adapter_id: An optional HuggingFace ID of a fine-tuned LLM adapter whose base model is the model given by model_name.
    • The fine-tuned adapter must be compatible with its base model; otherwise, an error is raised.
  3. temperature: Controls the randomness of your model responses.
    • A higher value will give the model more creative leeway
    • A lower value will give a more reproducible and consistent response
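  4. max_new_tokens: Controls the number of tokens the model can produce.

# Configure Predibase LLM (model_name, temperature, and max_new_tokens are example values)
predibase_llm = PredibaseLLM(
    model_name="llama-2-13b",  # must be an LLM deployed in your Predibase environment
    adapter_id="predibase/e2e_nlg",  # optional parameter
    temperature=0.3, max_new_tokens=512,
)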
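
The effect of temperature described above can be illustrated with a small sketch. It shows the common technique of dividing token scores by the temperature before applying softmax. This is a general illustration, not a description of how Predibase implements sampling, and the scores are made up.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by a sampling temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.3)  # low temperature
flat = softmax_with_temperature(logits, 1.5)   # high temperature
print(sharp, flat)
```

With the lower temperature, most of the probability mass lands on the top-scoring token, which is why responses become more consistent. With the higher temperature, the distribution flattens and less likely tokens are sampled more often.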

Step 2: Set up Embedding model

If you are using a local HuggingFace embedding model, you can use the following code to set up your embedding model:

# loads BAAI/bge-small-en-v1.5
hf_embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

If you are using OpenAI's embedding model, you can use the following code to set up your embedding model:

# loads the text-embedding-ada-002 OpenAI embedding model - run this only for the OpenAI option
openai_embed_model = OpenAIEmbedding()

Now with our embedding model set up, we will create the service context that will be used to query the LLM and embed our data/queries.

# Create a ServiceContext with our Predibase LLM and chosen embedding model
ctx = ServiceContext.from_defaults(llm=predibase_llm, embed_model=hf_embed_model)

# Set the Predibase LLM ServiceContext to the default
set_global_service_context(ctx)

Step 3: Set up Vector Store

As mentioned before, we'll be using Pinecone for this example. Pinecone has a free tier that you can use to try out this example. You can also swap out any other Vector Store supported by LlamaIndex.

# Initialize pinecone and create index
pinecone.init(api_key="YOUR API TOKEN HERE", environment="gcp-starter")

If you are using the HuggingFace embedding model, you can use the following code to set up your Vector Store:

# HF Index - Compatible with local HF embedding model output dimensions
pinecone.create_index("predibase-demo-hf", dimension=384, metric="euclidean", pod_type="p1")

If you are using the OpenAI embedding model, you can use the following code to set up your Vector Store:

Note: You need to have OpenAI set up and configured for this option. If you do not have an OpenAI API key, we recommend you go with the HuggingFace Index option above.

# OpenAI Index - Compatible with OpenAI embedding model (text-embedding-ada-002) output dimensions
pinecone.create_index("predibase-demo-openai", dimension=1536, metric="euclidean", pod_type="p1")
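
The index dimension has to match the length of the vectors your embedding model produces, which is why the two options above use different values. The following sketch shows a simple pre-flight check for that. The helper `check_dimension` and the `INDEX_DIMENSIONS` table are hypothetical names introduced here; the index names and dimensions are the ones used in this walkthrough.

```python
# Output dimensions quoted in this walkthrough for each embedding option.
INDEX_DIMENSIONS = {
    "predibase-demo-hf": 384,       # BAAI/bge-small-en-v1.5
    "predibase-demo-openai": 1536,  # text-embedding-ada-002
}


def check_dimension(index_name, vector):
    """Raise ValueError if `vector` does not fit the named index."""
    expected = INDEX_DIMENSIONS[index_name]
    if len(vector) != expected:
        raise ValueError(
            f"{index_name} expects {expected} dimensions, got {len(vector)}"
        )
    return True


# A vector of the right length passes the check.
ok = check_dimension("predibase-demo-hf", [0.0] * 384)
print(ok)
```

Passing a 1536-dimensional vector to the `predibase-demo-hf` index would raise a `ValueError` here, which mirrors the mismatch you would otherwise hit when writing OpenAI embeddings into the HuggingFace index.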

Finally, we'll select our index, create the storage context, and index our documents!

# construct vector store and custom storage context
# use "predibase-demo-openai" here instead if you chose the OpenAI embedding option
pinecone_vector_store = PineconeVectorStore(pinecone.Index("predibase-demo-hf"))
pinecone_storage_context = StorageContext.from_defaults(vector_store=pinecone_vector_store)

# Load in the documents you want to index
documents = SimpleDirectoryReader("/Users/connor/Documents/Projects/datasets/huffington_post_pdfs/").load_data()

Step 4: Set up index

Here we create the index so that any query you make will pull the relevant context from your Vector Store.

index = VectorStoreIndex.from_documents(documents, storage_context=pinecone_storage_context)

Step 5: Querying the LLM with RAG

Now that we've set up our index, we can ask questions over the documents. Predibase + LlamaIndex will search for the relevant context and provide a response to your question that is grounded in that context.

# Setup query engine
predibase_query_engine = index.as_query_engine()

Now we can ask questions over our documents!

response = predibase_query_engine.query("INSERT QUERY HERE")

To see the response to your query, print the response variable. You can also pass the response object to other parts of your system as you finish setting up your RAG solution.
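
Conceptually, the query step combines the retrieved context with your question into a single prompt for the LLM. The sketch below illustrates that idea only. `build_prompt` and its template wording are hypothetical, not the template LlamaIndex actually uses, and the context strings are invented.

```python
def build_prompt(question, context_chunks):
    """Assemble one prompt string from retrieved chunks and the user question."""
    context = "\n\n".join(context_chunks)
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer: "
    )


# Invented chunks standing in for what the vector store would return.
chunks = ["The article was published in 2017.", "It covers local elections."]
prompt = build_prompt("When was the article published?", chunks)
print(prompt)
```

The resulting string is what a RAG system sends to the LLM, so the quality of the retrieved chunks directly bounds the quality of the answer.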

Extra Resources

If you would like to learn more about building RAG systems with Predibase, check out the docs below.

Example RAG Notebook with LlamaIndex

Example RAG Notebook with Langchain