Information Extraction

This is an example walkthrough showing how to build an information extraction system on top of Predibase infrastructure. The source code for the functions and classes used in this tutorial can be found in the predibase/examples repository.

GitHub: Example notebook and app

In this tutorial, we show how you can connect any number of unstructured documents in text format and specify prompts to extract details from each document, transforming unstructured documents into tables.

Set up your environment

Set up your environment by cloning the predibase/examples repository

git clone https://github.com/predibase/examples.git

and then pull the latest code and install the example RAG project:

cd examples
git pull
pip install -e information_extraction_rag

Load your data

To conform with the input format for this implementation of RAG, your data must have the following three columns:

  • document_id: unique identifier for the document.
  • document_name: human-readable name of the document; this must also be unique.
  • document_text: the actual text of the document.
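
For illustration, a DataFrame in the expected format might look like the following minimal sketch (the IDs and text below are made up):

import pandas as pd

# hypothetical rows illustrating the required input format
df = pd.DataFrame({
    "document_id": ["doc-001", "doc-002"],
    "document_name": ["review_001.txt", "review_002.txt"],
    "document_text": [
        "The hotel was clean and a short walk from the beach...",
        "Check-in took an hour and the room was noisy at night...",
    ],
})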

We have included an example CSV file hosted on S3. Make sure you have AWS credentials set up so that pandas can read it.

import pandas as pd
from predibase import PredibaseClient
from time import perf_counter

# load the example hotel reviews dataset from S3
df = pd.read_csv("s3://predibase-public-us-west-2/datasets/formatted_hotel_reviews.csv")
df
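
Before going further, it can be worth verifying that the data matches the expected format. A minimal sanity check using plain pandas:

# check that the required columns exist and that IDs and names are unique
required_columns = {"document_id", "document_name", "document_text"}
assert required_columns.issubset(df.columns), f"missing columns: {required_columns - set(df.columns)}"
assert df["document_id"].is_unique, "document_id values must be unique"
assert df["document_name"].is_unique, "document_name values must be unique"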

Instantiate resources

In this section, we're going to:

  • Create a PredibaseClient to be able to interact with Predibase infrastructure.
  • Specify the LLM that will be used for generation.
  • Build a Corpus of unstructured documents to be queried.

from info_extract import Corpus
from info_extract.endpoints import get_llm_endpoint
from info_extract.retrieval import get_retriever


# number of documents to work with in the corpus. This can be increased to any number.
num_documents = 10

# chunk size in characters
chunk_size = 2048

# name of the corpus
corpus_name = "demo-corpus"

# instantiate the Predibase client
pc = PredibaseClient(token="<YOUR PREDIBASE API TOKEN>")

# Using a Predibase LLM (e.g. llama-2-13b)
llm_endpoint = get_llm_endpoint(model_provider="predibase", model_name="llama-2-13b", predibase_client=pc)

# Use Predibase infrastructure for indexing and retrieval
retriever = get_retriever(retrieval_provider="predibase", index_name=f"{corpus_name}-{chunk_size}", predibase_client=pc, model_name="llama-2-13b")

# Create the corpus of documents and pass in the necessary resources (LLM and retriever)
corpus = Corpus(df.head(num_documents), name=corpus_name, llm_endpoint=llm_endpoint, retriever=retriever)

Create chunks

Since some documents might not fit in the LLM's context window, chunk the documents in the corpus.

chunks = corpus.chunk(chunk_size)

# view chunks
chunks.df
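
To build intuition for what chunking does, here is a minimal sketch of fixed-size character chunking. This is purely illustrative; the library's actual implementation may split on different boundaries:

def chunk_text(text: str, chunk_size: int) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters."""
    return [text[i : i + chunk_size] for i in range(0, len(text), chunk_size)]

# a 5000-character document with chunk_size=2048 yields 3 chunks
print([len(c) for c in chunk_text("x" * 5000, 2048)])  # [2048, 2048, 904]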

Extract information from your documents

  • Define a list of questions/queries whose answers you'd like to extract.
  • Note that this will run on all documents and can be slow. If you know the information exists in only a subset of the documents, either use RAG (next section) or create a new Corpus with that subset, as sketched after the timing example below.

start_t = perf_counter()
extraction_result = corpus.extract(queries=["what is the address of the hotel?"])
print(f"took {perf_counter() - start_t}")

Look at the results of the extraction

Once the extraction process concludes, you should be able to view the returned dataframe.

extraction_result.extractions

# iterate through the dataframe and print each extracted answer
for _, row in extraction_result.extractions.iterrows():
    print(10 * "-")
    print(row["answer"])
    print()
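
Since the goal is to turn unstructured documents into tables, you may want one row per document and one column per query. Assuming the extractions dataframe carries document_id, query, and answer columns (verify the column names against your version of the example code), a pandas pivot is one way to get there:

# pivot extractions into a document-by-query table
# (assumes columns named "document_id", "query", and "answer")
table = extraction_result.extractions.pivot(index="document_id", columns="query", values="answer")
table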

See which chunks the extracted information comes from

Every document-query pair generates an extraction, which corresponds to a cell in the extraction_result.extractions dataframe. The extracted information can come from one or more chunks in the document. The following code retrieves the attribution, i.e. the chunks that the extracted information came from. Note that the query and document_id provided here are specific to an example in the CSV above; make sure to modify them to fit your data.

query = "what is the address of the hotel?"
document_id = "AWE2FvX5RxPSIh2RscTK"

relevant_chunks = extraction_result.get_attribution(query=query, document_id=document_id)

for chunk in relevant_chunks:
    print("chunk.document_id", chunk.document_id)
    print("chunk.chunk_id", chunk.chunk_id)
    print()
    print("chunk.chunk_text:\n", chunk.chunk_text)
    print("\n\n")