
Zero-Shot Learning

What is Zero-Shot Learning?

Zero-shot learning is a way for LLMs to recognize and classify things they've never seen before. It lets them make predictions about new classes without task-specific training examples and without training a model at all. Instead of relying on labeled data, LLMs draw on shared characteristics and attributes from the data they were pretrained on to generalize to your prediction task.

For example, let's say you have a large language model that was trained on news articles from categories like sports, politics, and entertainment. With zero-shot learning, you can make predictions on new categories that were not present in the training data. For instance, if you provide the model with a description of a technology article, it can predict the category even if it hasn't seen any technology-related articles during training. This way, zero-shot learning lets the model leverage the contextual knowledge it gained during pretraining to make educated guesses about unfamiliar classes.

When is Zero-Shot Learning useful?

Zero-shot learning is useful in the following scenarios:

  1. Limited or no labeled data: When the availability of labeled data is scarce or insufficient for all possible classes or categories, zero-shot learning provides a way to make predictions on unseen classes without requiring explicit training examples. It allows leveraging shared attributes or semantic information to bridge the gap between known and unknown classes.
  2. Expanding classification capabilities: Zero-shot learning enables models to extend their classification abilities beyond the classes they were initially trained on. This is beneficial when new classes or categories emerge, and there is a need to make predictions on them without retraining the entire model from scratch.
  3. Domain adaptation: Zero-shot learning can be applied in scenarios where there is a need to adapt a model trained on one domain to make predictions in a related but different domain. By leveraging shared attributes or semantic information, the model can transfer its knowledge and generalize to the new domain without requiring labeled examples from that domain. For example, you can use a model trained on e-commerce sentiment data to classify sentiment in social media posts or movie reviews.
  4. Multilingual applications: In multilingual settings, zero-shot learning allows models to perform tasks across multiple languages, even if they have been trained on data from a subset of those languages. By understanding shared characteristics and structures across languages, the model can make predictions or generate outputs in languages it has not been explicitly trained on.

There are two ways to perform zero-shot evaluation of LLMs in Predibase:

  1. If you don't have a dataset with labels: You can perform batch prediction using the LLM query editor and manually inspect the results.
  2. If you have a dataset with labels: You can compute metrics against the labels using the rest of this zero-shot guide!

Zero-Shot Learning in Predibase

In Predibase, you can perform zero-shot evaluation over your entire dataset and assess how an open-source LLM performs on the task of your choice. This is currently supported for two output feature types: text and category. These two output feature types support zero-shot evaluation across a variety of use cases, such as:

  1. Text Generation: The model can be used for zero-shot text generation by providing a prompt that describes the desired content, style, or context. The model can then generate text that aligns with the given prompt, even for topics or styles it has not been trained on.
  2. Text Classification: Zero-shot text classification allows the model to categorize text into predefined classes or categories without explicit training. By providing a description or label of the target class as input, the model can predict the most suitable category for a given text, even if it has not seen examples from that specific class during training.
  3. Topic Classification: In zero-shot topic classification, the model can classify text into predefined topics or themes without being trained on those specific topics. By providing a description or label of the target topic, the model can categorize a given text into the appropriate topic based on its understanding of language patterns and semantics.
  4. Semantic Similarity: Zero-shot semantic similarity assessment involves comparing the similarity or relatedness of two pieces of text, even if they belong to different domains or topics. By providing pairs of text as input, the model can estimate their semantic similarity without explicit training on specific pairs. For example, you can prompt the LLM to judge whether sentence A and sentence B are contextually similar, and it can return either a confidence score or a yes/no/maybe answer, depending on what you ask for in your prompt.
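To make the text-classification and semantic-similarity use cases concrete, here is a minimal sketch of the kind of prompts involved. The function names, wording, and label set below are hypothetical illustrations, not part of the Predibase API; you would send the resulting strings to the LLM of your choice.

```python
# Illustrative zero-shot prompt construction (hypothetical helpers, not a
# Predibase API). The model never saw these labels during task training;
# the prompt alone tells it what to do.

def build_classification_prompt(text, labels):
    """Ask the model to pick one label from a list it was never trained on."""
    label_list = ", ".join(labels)
    return (
        f"Classify the following text into one of these categories: {label_list}.\n"
        f"Text: {text}\n"
        "Category:"
    )

def build_similarity_prompt(sentence_a, sentence_b):
    """Ask the model whether two sentences are contextually similar."""
    return (
        "Are the following two sentences contextually similar? "
        "Answer yes, no, or maybe.\n"
        f"Sentence A: {sentence_a}\n"
        f"Sentence B: {sentence_b}\n"
        "Answer:"
    )

prompt = build_classification_prompt(
    "The new GPU doubles training throughput.",
    ["sports", "politics", "technology"],
)
print(prompt)
```

Ending the prompt with "Category:" or "Answer:" nudges the model to complete the response with just the label, which makes its output easier to parse downstream.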

To train a new model through the UI, first click + New Model Repository. You'll be prompted to specify a name for the model repository and optionally provide a description. We recommend names that describe the ML use case you're trying to solve.

Once you've entered this information, you'll be taken directly to the Model Builder to train your first model.

Model Builder Page 1

Train a single model version. You can start with our defaults to establish a baseline, or customize any parameter in the model pipeline, from preprocessing to training parameters.

Next you'll be asked to provide a model version description, connection (the system you want to pull data in from), dataset, and target (the output feature or column you want to predict). At the moment, Predibase supports only one text input feature and one output feature (either text or category) for large language model evaluation. Support for multiple input and output features for large language models is on our roadmap and will be available soon.

Next, select the Large Language Model model type to evaluate LLM performance using a zero/few shot prompt.

At the bottom of the model builder page, you will see a summary view of the features in your dataset. If your dataset has more than one text feature, disable the additional text features using the Active toggle on the right side.

Model Builder Page 2

Now we'll cover the configuration options available after you click Next and land on the second page of the Model Builder.

Model Graph

Also known as the butterfly diagram, the model graph shows a visual diagram of the training/evaluation pipeline. Each individual component is viewable and modifiable, providing a truly "glass-box" approach to ML development.


This section contains parameters that users may consider modifying while tweaking models.

Model Name

To evaluate an LLM's zero-shot capabilities on your task, you first need to choose which LLM to use. This is configured through the Model Name parameter within the Large Language Model parameter sub-section. Hugging Face lists a wide variety of large language models to choose from. Make sure you provide the full model name by copying the value from the model card.

Typically, good choices for LLMs here are those that have been instruction-tuned. Instruction-tuned models are language models that have been fine-tuned with explicit instructions or prompts during the training process. These instructions provide specific guidance to the model about how to approach a task or generate desired outputs, giving the model the capability to respond to instructions when prompted.

Task and Template

The next step after picking your model is to configure your prompt through the Task and Template parameters. These parameters define the task you want the LLM to perform. Predibase uses default templates if you don't provide one.

For example, if you have a dataset with movie reviews and you want the LLM to generate a sentiment score from 1 to 5, you can define your task and template in the following way:

Task:

Predict the sentiment for the movie review. The sentiment for the review is a score from 1 to 5.

Template:

SAMPLE INPUT: {sample_input}
USER: Complete the following task: {task}

{sample_input} inserts your input text feature into the template, while {task} inserts the task you defined into the template.

If you had a row of data in your dataset with the value "The movie had really nice special effects", then your data would get transformed to the following after preprocessing:

SAMPLE INPUT: The movie had really nice special effects
USER: Complete the following task: Predict the sentiment for the movie review. The sentiment for the review is a score from 1 to 5.

Predibase also lets you control other parameters for preprocessing such as the sequence length, lowercasing, etc. through Feature Type Defaults.

Trainer and Generation

Finally, to perform zero-shot learning, you need to configure your generation parameters and your trainer.

The generation config refers to the set of configuration parameters that control the behavior of text generation models. These parameters specify various aspects of the generation process, including length restrictions, temperature, repetition penalty, and more.

Here are some common configuration parameters that can be specified in the generation config. You can control as many as you like and leave the default values we've set for the rest:

  • Max Length: Sets the maximum number of tokens in the generated output so it doesn't exceed a set limit.
  • Min Length: Specifies the minimum length of the generated output, ensuring it meets a certain threshold.
  • Temperature: Controls the randomness of the generated text. Higher temperature values (e.g., 1.0) result in more diverse and creative outputs, while lower values (e.g., 0.2) lead to more deterministic and focused responses.
  • Repetition Penalty: Discourages the model from repeating the same tokens in the generated text. Higher penalty values (e.g., 2.0) make the model more cautious about generating repetitive content.
  • Num Beams: Specifies the number of beams used in beam search decoding. Increasing the number of beams lets the model explore more candidate sequences, which can improve output quality but also increases computation time.

By adjusting these generation configuration parameters, you can control the trade-off between creativity and coherence in the generated text, prevent common issues like repetition, and tailor the model's behavior according to the specific requirements of your application.
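To build intuition for the temperature parameter specifically, here is a small sketch of how temperature rescales a model's next-token probabilities. The logits are made up for illustration; real models produce one score per vocabulary token.

```python
import math

# Illustrative sketch: how temperature reshapes next-token probabilities.
# The logits below are hypothetical scores for three candidate tokens.

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply a numerically stable softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]                            # made-up token scores
creative = softmax_with_temperature(logits, 1.0)    # flatter: more diverse sampling
focused = softmax_with_temperature(logits, 0.2)     # peaked: near-deterministic

print([round(p, 3) for p in creative])
print([round(p, 3) for p in focused])
```

At temperature 0.2 almost all of the probability mass concentrates on the top-scoring token, which is why low temperatures produce deterministic, focused outputs.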

Since we don't actually train a model for zero-shot learning, we set the trainer type to None.

That's it! Hit Train to start a batch evaluation on your dataset!


Depending on whether you're using your LLM for zero-shot text generation (text output feature) or zero-shot classification (category output feature), you will see different evaluation metrics.

Category Output Feature

  • Accuracy: Accuracy is a common evaluation metric used to measure the overall correctness of a classification model. It represents the ratio of correctly classified samples to the total number of samples in a dataset. It is calculated by dividing the number of correct predictions by the total number of predictions. Accuracy provides a general overview of the model's performance but may not be suitable for imbalanced datasets.
  • Hits At K: Hits at K is a metric commonly used in recommendation systems and information retrieval tasks. It measures the proportion of correct items that appear in the top K recommendations. For example, in a recommendation scenario, Hits at K would count a recommendation as a "hit" if the desired item is present in the top K recommendations.
  • ROC AUC: Receiver Operating Characteristic Area Under the Curve (ROC AUC) is a metric commonly used to evaluate binary classification models. It measures the model's ability to discriminate between positive and negative samples by plotting the True Positive Rate (TPR) against the False Positive Rate (FPR) at different classification thresholds. The ROC AUC score represents the area under this curve and provides an overall measure of the model's performance, with higher values indicating better discrimination ability.

Text Output Feature

  • Perplexity: Perplexity measures how well a language model predicts a given sequence of tokens. It gives us an idea of how surprised or confused the model is when it encounters new text. Lower perplexity values indicate better performance. Perplexity is calculated using the likelihood of the target sequence according to the model's learned probability distribution. A lower perplexity suggests that the model is more certain and accurate in predicting the next token in a sequence.
  • Token Accuracy: Token accuracy measures the percentage of correctly predicted tokens in the generated text compared to the ground truth. It calculates the ratio of correctly generated tokens to the total number of tokens. Token accuracy provides an understanding of how well the model generates individual tokens, irrespective of their position in the sequence.
  • Sequence Accuracy: Sequence accuracy evaluates the correctness of the entire generated sequence compared to the ground truth. It measures the percentage of generated sequences that exactly match the desired target sequences. Sequence accuracy is particularly relevant when generating sequences that require strict adherence to a desired output format or structure.
  • Character Error Rate: Character Error Rate (CER) is a metric commonly used to evaluate the quality of text generation models, such as speech recognition systems or optical character recognition (OCR) systems. It measures the percentage of incorrectly predicted characters compared to the ground truth. CER is calculated by aligning the predicted and ground truth sequences and counting the number of substitutions, insertions, and deletions required to transform one sequence into the other.
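For intuition, here is a sketch of how Perplexity (from per-token probabilities) and Character Error Rate (via edit distance) are computed. The token probabilities are made up; Predibase reports these metrics automatically.

```python
import math

# Sketches of Perplexity and Character Error Rate. The token probabilities
# below are hypothetical; real values come from the model's distribution.

def perplexity(token_probs):
    """Exp of the average negative log-likelihood of the target tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

def char_error_rate(reference, hypothesis):
    """Levenshtein distance between the strings, divided by reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, r in enumerate(reference, 1):
        curr = [i]
        for j, h in enumerate(hypothesis, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1] / len(reference)

confident = perplexity([0.9, 0.8, 0.95])  # low perplexity: model is sure
uncertain = perplexity([0.2, 0.1, 0.3])   # high perplexity: model is surprised
print(round(confident, 3), round(uncertain, 3))
print(char_error_rate("kitten", "sitting"))  # 3 edits / 6 chars = 0.5
```

A model that assigns probability 1.0 to every target token would have a perplexity of exactly 1, the theoretical minimum; the more it spreads probability away from the true tokens, the higher the perplexity climbs.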