AugmentationConfig

Below is the class definition for AugmentationConfig.

class AugmentationConfig():
    base_model: str = "gpt-4o-mini"  # gpt-4-turbo, gpt-4-1106-preview, gpt-4-0125-preview, gpt-4o, gpt-4o-2024-08-06, gpt-4o-mini
    augmentation_strategy: str = "mixture_of_agents"  # mixture_of_agents, single_pass
    num_samples_to_generate: int = 1000  # Number of samples to generate
    num_seed_samples: str | int = "all"  # Optional - number of examples to use as seed samples from the dataset
    task_context: str = ""  # Optional - provide a context for the task to help the model. If not provided, the model will infer context.
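To make the field types and defaults concrete, here is a minimal, self-contained sketch that mirrors the definition above as a Python dataclass. `AugmentationConfigSketch` and its validation checks are hypothetical stand-ins for illustration only; they are not the Predibase SDK class, which may validate its inputs differently.

```python
from dataclasses import dataclass
from typing import Union

# Values copied from the comments in the class definition above.
SUPPORTED_BASE_MODELS = (
    "gpt-4-turbo",
    "gpt-4-1106-preview",
    "gpt-4-0125-preview",
    "gpt-4o",
    "gpt-4o-2024-08-06",
    "gpt-4o-mini",
)
SUPPORTED_STRATEGIES = ("mixture_of_agents", "single_pass")


@dataclass
class AugmentationConfigSketch:
    """Hypothetical local mirror of AugmentationConfig (not the SDK class)."""

    base_model: str = "gpt-4o-mini"
    augmentation_strategy: str = "mixture_of_agents"
    num_samples_to_generate: int = 1000
    num_seed_samples: Union[str, int] = "all"
    task_context: str = ""

    def __post_init__(self) -> None:
        # Assumed checks; the real class may accept or reject other values.
        if self.base_model not in SUPPORTED_BASE_MODELS:
            raise ValueError(f"unsupported base_model: {self.base_model!r}")
        if self.augmentation_strategy not in SUPPORTED_STRATEGIES:
            raise ValueError(
                f"unsupported augmentation_strategy: {self.augmentation_strategy!r}"
            )


# Override two fields and keep the documented defaults for the rest.
config = AugmentationConfigSketch(base_model="gpt-4o", num_samples_to_generate=50)
print(config)
```

Rejecting unknown values at construction time is a design choice of this sketch: a typo in a model name surfaces before any API call is made.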

Base models

When creating your AugmentationConfig, base_model is a required parameter. Currently, we support the following options:

  • gpt-4-turbo
  • gpt-4-1106-preview
  • gpt-4-0125-preview
  • gpt-4o
  • gpt-4o-2024-08-06
  • gpt-4o-mini

OpenAI costs

Note that you will incur OpenAI API costs when calling the augment function. To limit them, we recommend starting with augmentation_strategy set to "single_pass", a small num_samples_to_generate, and a cheaper OpenAI model.
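As an illustration of that recommendation, a low-cost first run could use settings like the ones below. This is only a sketch: the dictionaries are a hypothetical way to hold the values (the keys correspond to AugmentationConfig fields), and the choice of 10 samples and gpt-4o-mini is an assumption about what "smaller" and "cheaper" mean for your task.

```python
# Hypothetical low-cost starting point; keys match AugmentationConfig fields.
starter_settings = {
    "base_model": "gpt-4o-mini",  # assumed here to be the cheaper option
    "augmentation_strategy": "single_pass",
    "num_samples_to_generate": 10,  # small first run
}

# Settings for a later, larger run once the small run looks good.
scaled_settings = {**starter_settings, "num_samples_to_generate": 100}

print(starter_settings)
print(scaled_settings)
```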

Defaults

Predibase sets the following default values for the rest of the attributes in the AugmentationConfig (subject to change):

  • augmentation_strategy (default: "mixture_of_agents"): Supported options are mixture_of_agents and single_pass. Check out how each strategy works here.
  • num_samples_to_generate (default: 1000): Recommended values are 10, 25, 50, 100, 200, 500, 1000.
  • num_seed_samples (default: "all"): The number of rows from your seed dataset to use as examples during synthetic data generation. Set this to "all" to use every row, or to an integer between 1 and the size of your dataset to use only a subset of the rows.
  • task_context (default: ""): Use this to give the synthetic data generation process more context about your task, such as a high-level task description, nuances in the dataset to look out for, and what the prompt and completion columns each represent. If not provided, Predibase uses the base_model to infer task context from the seed samples (see Task Context Inference below).
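The num_seed_samples rule described above ("all", or an integer between 1 and the dataset size) can be sketched as a small check. The helper `resolve_num_seed_samples` is hypothetical and written only to illustrate the rule; how Predibase actually validates or selects seed rows is not stated here.

```python
from typing import Union


def resolve_num_seed_samples(value: Union[str, int], dataset_size: int) -> int:
    """Return how many seed rows would be used (hypothetical helper)."""
    if value == "all":
        return dataset_size
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise TypeError('num_seed_samples must be "all" or an integer')
    if not 1 <= value <= dataset_size:
        raise ValueError(
            f"num_seed_samples must be between 1 and {dataset_size}, got {value}"
        )
    return value


print(resolve_num_seed_samples("all", dataset_size=40))
print(resolve_num_seed_samples(15, dataset_size=40))
```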

Task Context Inference

When task_context isn't provided, Predibase infers context about the task from your seed dataset. This inferred context is used in subsequent LLM calls to help guide the model toward generating useful, consistent, and accurate data. Some of the key aspects we try to infer include:

  • Identifying the general domain of the data based on terminology and context.
  • Describing the characteristics of the data, such as format, structure, and length.
  • Understanding the nature of the task, including the roles of the 'Prompt' and 'Completion' fields.
  • Recognizing repetitive patterns or unique instances in the data.
  • Assessing ambiguous cases and suggesting likely scenarios with confidence levels.
  • Ensuring the structure of the dataset is accurately reflected in synthetic examples.
  • Noting any additional attributes or metadata that could inform the understanding of the dataset.

and lots more!
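If you would rather supply task_context yourself than rely on inference, the aspects listed above suggest what a useful context might cover. The sketch below assembles one from a few labeled parts. The example task (support-ticket classification), the wording of each part, and the helper `build_task_context` are all made up for illustration; Predibase does not prescribe this layout.

```python
def build_task_context(
    domain: str, data_format: str, prompt_role: str, completion_role: str
) -> str:
    """Join short descriptions into one task_context string (hypothetical helper)."""
    parts = [
        f"Domain: {domain}",
        f"Data format: {data_format}",
        f"Prompt column: {prompt_role}",
        f"Completion column: {completion_role}",
    ]
    return "\n".join(parts)


task_context = build_task_context(
    domain="customer support tickets for a software product",
    data_format="one short paragraph per prompt, a single label per completion",
    prompt_role="the raw ticket text written by the customer",
    completion_role="the ticket category, e.g. billing or bug report",
)
print(task_context)
```

The resulting string could then be passed as the task_context field when building the config.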