
Few-Shot Learning

What is Few-Shot Learning?

Few-shot learning is a way for LLMs to learn new tasks or make predictions with only a limited amount of labeled data. It allows LLMs to make predictions about new classes using just a handful of examples, by embedding them as context in the prompt, and without going through the process of training a model. This is particularly useful when there is a scarcity of annotated data for a specific task.

Suppose you have a small dataset of customer reviews for various products. You want to classify the sentiment of reviews for a new, previously unseen product, but you only have a few labeled example reviews for other products. With the few-shot learning approach, you can use these limited labeled examples to infer the sentiment of reviews for the new product.

To do this, you can provide these labeled examples as context for the model in the query that you are passing to the LLM. By leveraging the knowledge learned from other related products, the model can generalize and make accurate sentiment predictions for the new product, even with limited labeled examples. This allows you to perform sentiment analysis on new products without requiring a large amount of annotated data specifically for each individual product.
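As a minimal sketch of this idea, the snippet below assembles a few labeled reviews into a prompt for a new, unlabeled review. The example reviews, labels, and prompt format are illustrative assumptions, not a Predibase-specific format:

```python
# Minimal sketch: build a few-shot sentiment prompt from a handful of
# labeled reviews for other products. Data and format are illustrative.
labeled_examples = [
    ("The battery dies within an hour.", "negative"),
    ("Setup was effortless and it works great.", "positive"),
    ("Arrived on time, does what it says.", "positive"),
]

def build_few_shot_prompt(examples, new_review):
    # The labeled examples become in-context demonstrations; the new
    # review is appended last with its label left blank for the LLM.
    lines = ["Classify the sentiment of each review as positive or negative.", ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {new_review}")
    lines.append("Sentiment:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(labeled_examples, "The screen scratches far too easily.")
print(prompt)
```

The prompt would then be sent to the LLM, whose completion after the final "Sentiment:" is the prediction.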

When is Few-Shot Learning useful?

Few-shot learning is useful in the following scenarios:

  1. Limited Labeled Data: When you have a small amount of labeled data for a specific task or class, few-shot learning can be employed to make accurate predictions. It helps overcome the limitations of traditional machine learning methods that typically require a large annotated dataset.
  2. Rapid Task Adaptation: If you need to quickly adapt a pretrained LLM to perform well on a new task with minimal labeled examples, few-shot learning is valuable. It allows the model to generalize from prior knowledge and learn new tasks efficiently rather than needing to train a new task-specific model.
  3. Costly Data Annotation: If acquiring labeled data is expensive or time-consuming, few-shot learning can be an effective strategy. Instead of annotating a large dataset, you can label a few examples and utilize the few-shot learning approach to achieve comparable performance. You can also use few-shot learning to label your dataset.

Few-Shot Learning in Predibase

In Predibase, you can perform few-shot evaluation over your entire dataset and assess how an open-source LLM performs on the task of your choice. This is currently supported for two output feature types: text and category (including binary). These two output feature types support evaluation of few-shot performance on a variety of use cases, such as:

  1. Text Generation: The model can be used for few-shot text generation by providing a prompt that describes the desired content, style, or context. The model can then generate text that aligns with the given prompt, even for topics or styles it has not been trained on.
  2. Text Classification: Few-shot text classification allows the model to categorize text into predefined classes or categories without explicit training. By providing a description or label of the target class as input, the model can predict the most suitable category for a given text, even if it has not seen examples from that specific class during training.
  3. Topic Classification: In few-shot topic classification, the model can classify text into predefined topics or themes without being trained on those specific topics. By providing a description or label of the target topic, the model can categorize a given text into the appropriate topic based on its understanding of language patterns and semantics.

To train a new model through the UI, first click + New Model Repository. You'll be prompted to specify a name for the model repository and optionally provide a description. We recommend using names that describe the ML use case you're trying to solve.

Once you've entered this information, you'll be taken directly to the Model Builder to train your first model.

Model Builder Page 1

Train a single model version. You can start with our defaults to establish a baseline, or customize any parameter in the model pipeline, from preprocessing to training parameters.

Next you'll be asked to provide a model version description, connection (system you want to pull data in from), dataset, and target (output feature or column you want to predict). At the moment, Predibase only supports one text input feature and one output feature (either text or category) for large language model evaluation. Support for multiple input and multiple output features for large language models is on our roadmap and will be in the platform soon.

Next, select the Large Language Model model type to evaluate LLM performance using a few-shot prompt.

At the bottom of the model builder page, you will see a summary view of the features in your dataset. If your dataset has more than one text feature, disable the additional text features using the Active toggle on the right side.

Model Builder Page 2

Now we'll cover the configuration options available after you click Next and land on the second page of the Model Builder.

Model Graph

Also known as the butterfly diagram, the model graph shows a visual diagram of the training/evaluation pipeline. Each individual component is viewable and modifiable, providing a truly "glass-box" approach to ML development.


This section contains parameters that users may consider modifying while tweaking models.

Model Name:

To evaluate an LLM's few-shot capabilities for your task, you need to first choose what LLM you want to use. This can be configured using the Model Name parameter within the Large Language Model parameter sub-section. The Hugging Face Hub lists a wide variety of large language models you can choose from. Make sure you provide the full model name by copying the value from the model card.

Typically, good choices for LLMs here are those that have been instruction-tuned. Instruction-tuned models are language models that have been fine-tuned with explicit instructions or prompts during the training process. These instructions provide specific guidance to the model about how to approach a task or generate desired outputs, giving the model the capability to respond to instructions when prompted.

Task and Template

The next step after picking your model is to configure your prompt through the Task and Template parameters. These parameters define the task you want the LLM to perform. Predibase uses default templates if you don't provide one.

For example, if you have a dataset with movie reviews and you want the LLM to generate a sentiment score from 1 to 5, you can define your task and template in the following way:


Task:

Predict the sentiment for the movie review. The sentiment for the review is a score from 1 to 5.

Template:

SAMPLE INPUT: {sample_input}
USER: Complete the following task: {task}

{sample_input} inserts your input text feature into the template, while {task} inserts the task you defined into the template.

If you had a row of data in your dataset with the value "The movie had really nice special effects", then your data would get transformed to the following after preprocessing:

SAMPLE INPUT: The movie had really nice special effects
USER: Complete the following task: Predict the sentiment for the movie review. The sentiment for the review is a score from 1 to 5.
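This preprocessing step is essentially string interpolation. The snippet below reconstructs it for illustration; the variable names and the use of Python's str.format are assumptions, not Predibase's actual implementation:

```python
# Illustrative reconstruction of the template-filling step:
# {sample_input} and {task} placeholders are substituted per row.
task = (
    "Predict the sentiment for the movie review. "
    "The sentiment for the review is a score from 1 to 5."
)
template = "SAMPLE INPUT: {sample_input}\nUSER: Complete the following task: {task}"

row = "The movie had really nice special effects"
prompt = template.format(sample_input=row, task=task)
print(prompt)
```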

Predibase also lets you control other parameters for preprocessing such as the sequence length, lowercasing, etc. through Feature Type Defaults.


Since we want to perform few-shot learning, we add examples from our dataset to the prompt to give the LLM more context during generation. Predibase supports two different strategies for retrieving example rows from the dataset: random (default) and semantic.

Random: Randomly select k rows from your dataset to add to the prompt as context.

Semantic: Find the top k semantically similar rows to your input query and add them to the prompt as context. This is done by creating an embedding index and using it for k-nearest neighbor search. For example, say you have a dataset of movie reviews and corresponding sentiment scores from 1 to 5. If the query you want to perform few-shot sentiment analysis on is "The movie was thought provoking yet continuously dynamic", then the semantic strategy will find the top k semantically similar rows of data from your dataset and add them to the prompt as context.

To perform semantic retrieval, you need to specify a model name that is used to create an embedding index for semantic retrieval. These models are based on the sentence-transformers library, and you can find pretrained sentence-transformers models on the Hugging Face Hub. A good starter model to use is paraphrase-MiniLM-L6-v2.
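The mechanics of the semantic strategy can be sketched with a toy stand-in: here a bag-of-words embedding and cosine similarity substitute for a real sentence-transformers model like paraphrase-MiniLM-L6-v2, and the movie-review rows are invented for illustration:

```python
# Toy illustration of semantic retrieval: embed rows, embed the query,
# and keep the top-k rows by cosine similarity. A real system would use
# a sentence-transformers model instead of bag-of-words counts.
from collections import Counter
import math

def embed(text):
    # Crude stand-in for a sentence embedding: token counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def top_k_similar(query, rows, k=2):
    # rows are (review_text, sentiment_score) pairs.
    q = embed(query)
    scored = sorted(rows, key=lambda r: cosine(q, embed(r[0])), reverse=True)
    return scored[:k]

reviews = [
    ("A thought provoking drama with dynamic pacing", 5),
    ("Dull plot and wooden acting", 1),
    ("Provoking dynamic and thoughtful throughout", 4),
]
context = top_k_similar(
    "The movie was thought provoking yet continuously dynamic", reviews, k=2
)
print(context)
```

The retrieved rows would then be formatted into the prompt as in-context examples, just as with the random strategy.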

Trainer and Generation:

Finally, to perform few-shot learning, you need to configure your generation parameters and your trainer.

The generation config refers to the set of configuration parameters that control the behavior of text generation models. These parameters specify various aspects of the generation process, including length restrictions, temperature, repetition penalty, and more.

Here are some common configuration parameters that can be specified in the generation config. You can control as many as you like and leave the default values we've set for the rest:

  • Max Length: Sets the maximum number of tokens in the generated output.
  • Min Length: Sets the minimum number of tokens the generated output must contain.
  • Temperature: Controls the randomness of the generated text. Higher temperature values (e.g., 1.0) result in more diverse and creative outputs, while lower values (e.g., 0.2) lead to more deterministic and focused responses.
  • Repetition Penalty: Discourages the model from repeating the same tokens in the generated text. Higher penalty values (e.g., 2.0) make the model more cautious about generating repetitive content.
  • Num Beams: Specifies the number of beams used in beam search decoding. Increasing the number of beams can improve the diversity of generated outputs but may also increase computation time.

By adjusting these generation configuration parameters, you can control the trade-off between creativity and coherence in the generated text, prevent common issues like repetition, and tailor the model's behavior according to the specific requirements of your application.
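As a concrete sketch, the parameters above might look like the dict below. The key names follow common LLM generation APIs (e.g. Hugging Face transformers); the exact names and defaults in Predibase's config may differ, and the values shown are arbitrary:

```python
# Illustrative generation settings mirroring the parameters described
# above; key names and values are assumptions, not Predibase defaults.
generation_config = {
    "max_new_tokens": 64,       # cap on generated output length
    "min_new_tokens": 1,        # require at least one generated token
    "temperature": 0.2,         # low value -> more deterministic output
    "repetition_penalty": 1.2,  # >1.0 discourages repeating tokens
    "num_beams": 4,             # beam search width
}
print(generation_config)
```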

Since we don't actually train a model for few-shot learning, we set the trainer type to None.

That's it! You can hit Train to start a batch evaluation on your dataset!


Depending on whether you're using your LLM for few-shot text generation (text output feature) or few-shot classification (category output feature), you will see different evaluation metrics.

Category Output Feature

  • Accuracy: Accuracy is a common evaluation metric used to measure the overall correctness of a classification model. It represents the ratio of correctly classified samples to the total number of samples in a dataset. It is calculated by dividing the number of correct predictions by the total number of predictions. Accuracy provides a general overview of the model's performance but may not be suitable for imbalanced datasets.
  • Hits At K: Hits at K is a metric commonly used in recommendation systems and information retrieval tasks. It measures the proportion of correct items that appear in the top K recommendations. For example, in a recommendation scenario, Hits at K would count a recommendation as a "hit" if the desired item is present in the top K recommendations.
  • ROC AUC: Receiver Operating Characteristic Area Under the Curve (ROC AUC) is a metric commonly used to evaluate binary classification models. It measures the model's ability to discriminate between positive and negative samples by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The ROC AUC score represents the area under this curve and provides an overall measure of the model's performance, with higher values indicating better discrimination ability.
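For intuition, accuracy and Hits at K can be hand-rolled in a few lines. The labels, predictions, and ranked candidate lists below are invented for illustration:

```python
# Hand-rolled versions of two of the category metrics above.
def accuracy(y_true, y_pred):
    # Fraction of predictions that exactly match the ground truth.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def hits_at_k(y_true, ranked_preds, k):
    # ranked_preds[i] is a list of candidate labels ordered by confidence;
    # a row is a "hit" if the true label appears in the top k candidates.
    hits = sum(t in ranked[:k] for t, ranked in zip(y_true, ranked_preds))
    return hits / len(y_true)

y_true = ["pos", "neg", "pos", "neu"]
y_pred = ["pos", "neg", "neg", "neu"]
print(accuracy(y_true, y_pred))  # 0.75

ranked = [["pos", "neu"], ["neu", "neg"], ["neg", "pos"], ["neu", "pos"]]
print(hits_at_k(y_true, ranked, k=2))  # 1.0
```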

Text Output Feature

  • Perplexity: Perplexity measures how well a language model predicts a given sequence of tokens. It gives us an idea of how surprised or confused the model is when it encounters new text. Lower perplexity values indicate better performance. Perplexity is calculated using the likelihood of the target sequence according to the model's learned probability distribution. A lower perplexity suggests that the model is more certain and accurate in predicting the next token in a sequence.
  • Token Accuracy: Token accuracy measures the percentage of correctly predicted tokens in the generated text compared to the ground truth. It calculates the ratio of correctly generated tokens to the total number of tokens. Token accuracy provides an understanding of how well the model generates individual tokens, irrespective of their position in the sequence.
  • Sequence Accuracy: Sequence accuracy evaluates the correctness of the entire generated sequence compared to the ground truth. It measures the percentage of generated sequences that exactly match the desired target sequences. Sequence accuracy is particularly relevant when generating sequences that require strict adherence to a desired output format or structure.
  • Character Error Rate: Character Error Rate (CER) is a metric commonly used to evaluate the quality of text generation models, such as speech recognition systems or optical character recognition (OCR) systems. It measures the percentage of incorrectly predicted characters compared to the ground truth. CER is calculated by aligning the predicted and ground truth sequences and counting the number of substitutions, insertions, and deletions required to transform one sequence into the other.
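The CER computation described above boils down to an edit-distance calculation. The sketch below uses a standard Levenshtein implementation normalized by the reference length; the example strings are illustrative:

```python
# Character Error Rate sketch: Levenshtein distance over characters
# (substitutions, insertions, deletions), divided by the reference length.
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb), # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]

def cer(prediction, reference):
    return levenshtein(prediction, reference) / len(reference)

print(cer("helo world", "hello world"))  # one missing character over 11
```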