Distributed Training
Fine-tuning with very large datasets in Predibase
Datasets greater than 1 GB in size can take advantage of efficient processing techniques to ensure faster and more scalable fine-tuning operations, including:
- Faster preprocessing by tokenizing data in advance
- Efficient data loading using streaming and out-of-core processing
- Multi-GPU, data-parallel distributed training
Multi-GPU distributed training
Predibase supports running your fine-tuning jobs on multiple GPUs in parallel. The model weights are replicated on each GPU, and each GPU processes a different partition of the dataset (reshuffled between epochs).
With GPUs that have fast interconnects (e.g., H100 SXM), distributed training provides near-linear scaling in throughput as you add GPUs, allowing you to fine-tune models on very large datasets in a fraction of the time.
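Conceptually, this is the standard data-parallel pattern. The sketch below uses plain PyTorch `DistributedDataParallel` with a `DistributedSampler` purely as an illustration of that pattern (it is not Predibase's internal code): every rank holds a full copy of the weights while seeing a different, per-epoch-reshuffled shard of the data.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

# Illustrative only (plain PyTorch, not Predibase internals); launch with torchrun.
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model = DDP(torch.nn.Linear(16, 1).cuda(rank))      # full weight replica on every GPU
dataset = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
sampler = DistributedSampler(dataset)               # each rank gets a disjoint shard
loader = DataLoader(dataset, batch_size=8, sampler=sampler)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(3):
    sampler.set_epoch(epoch)                        # reshuffle the partitioning every epoch
    for x, y in loader:
        x, y = x.cuda(rank), y.cuda(rank)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()                             # DDP all-reduces gradients across GPUs
        optimizer.step()
```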
Currently, distributed training is an enterprise-only feature and requires purchasing reserved capacity to access. Reach out to sales@predibase.com to learn more.
Currently, distributed training is only supported for the sft and continued_pretraining task types. It is not available for classification or GRPO task types at this time.
Pretokenizing your dataset
To take advantage of large dataset fine-tuning, Predibase requires that such datasets be tokenized in advance and saved in a partitioned Arrow format compatible with Hugging Face's Datasets library.
This step allows datasets to be streamed into memory efficiently and processed in parallel across multiple GPUs (if available to your tenant), and it also speeds up preprocessing by skipping tokenization during training.
This guide will walk you through the process of preparing your data for large training jobs in Predibase.
You can follow along with the accompanying notebook:
Load your dataset
You can load your dataset either in-memory or out-of-core, depending on its size and your available resources. Here’s an example using the Hugging Face datasets library:
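The snippet below is a minimal sketch; `train.jsonl`, along with the `prompt` and `completion` column names used throughout this guide, are placeholders for your own data.

```python
from datasets import load_dataset

# Load a local JSONL file with `prompt` and `completion` columns
# ("train.jsonl" is a placeholder for your own data).
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Hugging Face Datasets memory-maps Arrow files on disk, so load_dataset also
# works out-of-core for datasets that do not fit in memory. You can additionally
# pass streaming=True to iterate lazily, though the rest of this guide assumes
# a regular (map-style) Dataset.
```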
Initialize the tokenizer
Use the transformers library to load the appropriate tokenizer for the LLM that you want to use. Note that some LLMs, like Llama-3 and Mistral, are gated and require creating a Hugging Face account and requesting access.
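For example (a sketch; the model name is illustrative, and the access token is only needed for gated models):

```python
from transformers import AutoTokenizer

# Load the tokenizer matching the base model you plan to fine-tune.
# Gated models require a Hugging Face access token with approved access.
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    token="hf_...",  # placeholder; omit for non-gated models
)
```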
Batch tokenize your data
You will need to tokenize the data ahead of time, as shown in the sketch after this list.
- If you're doing instruction tuning, tokenize the prompt and completion columns independently.
- If you're doing completions-style training, you only need to tokenize your text column.
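A minimal sketch of batch tokenization with `Dataset.map`, continuing with the assumed `prompt`/`completion` (or `text`) columns:

```python
def tokenize_batch(batch):
    # Tokenize prompt and completion independently, without special tokens,
    # so they can be concatenated in the next step.
    return {
        "prompt_ids": tokenizer(batch["prompt"], add_special_tokens=False)["input_ids"],
        "completion_ids": tokenizer(batch["completion"], add_special_tokens=False)["input_ids"],
    }

# batched=True lets the fast tokenizer process many rows per call.
tokenized = dataset.map(tokenize_batch, batched=True, remove_columns=dataset.column_names)

# Completions-style training: tokenize the single text column instead.
# tokenized = dataset.map(
#     lambda batch: {"input_ids": tokenizer(batch["text"], add_special_tokens=False)["input_ids"]},
#     batched=True,
#     remove_columns=dataset.column_names,
# )
```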
Create input_ids and labels
Next, you need to concatenate the tokenized prompt and completion, then add an EOS token at the end. The process varies based on your training approach, as shown in the sketch after this list:
- For instruction tuning, concatenate your prompt tokens with your completion tokens.
- For completions-style training, you can simply reuse your input tokens as your labels.
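Continuing the sketch above. Masking the prompt tokens with -100 is a common instruction-tuning convention and an assumption here, not a documented Predibase requirement; adjust the label construction to whatever your training setup expects.

```python
def build_examples(batch):
    input_ids, labels = [], []
    for prompt_ids, completion_ids in zip(batch["prompt_ids"], batch["completion_ids"]):
        # Concatenate prompt + completion and append the EOS token.
        ids = prompt_ids + completion_ids + [tokenizer.eos_token_id]
        input_ids.append(ids)
        # Mask prompt tokens with -100 so the loss is computed on the completion
        # only (assumption; a common instruction-tuning convention).
        labels.append([-100] * len(prompt_ids) + completion_ids + [tokenizer.eos_token_id])
    return {"input_ids": input_ids, "labels": labels}

tokenized = tokenized.map(
    build_examples, batched=True, remove_columns=["prompt_ids", "completion_ids"]
)

# Completions-style training: labels are simply a copy of input_ids.
# tokenized = tokenized.map(
#     lambda batch: {"labels": [list(ids) for ids in batch["input_ids"]]},
#     batched=True,
# )
```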
Create a split column (optional)
Add a split column to your dataset for training and evaluation:
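For example (a sketch; the split labels "train" and "evaluation" and the 95/5 ratio are assumptions, so use whatever your fine-tuning configuration expects):

```python
import random

random.seed(42)

def assign_split(example):
    # Hold out roughly 5% of rows for evaluation; the rest go to training.
    return {"split": "evaluation" if random.random() < 0.05 else "train"}

tokenized = tokenized.map(assign_split)
```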
Save the dataset
Save your prepared dataset to disk or directly to S3 using the Hugging Face Datasets library. Note that it must be saved as a multi-partition Arrow dataset in the Hugging Face Datasets format.
If you decide to save locally, you will need to upload your dataset folder to S3.
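A sketch of both options (the paths, shard count, and the s3fs requirement for direct S3 writes are assumptions):

```python
# Save locally as a multi-partition Arrow dataset in Hugging Face Datasets format.
# num_shards controls how many Arrow partitions are written.
tokenized.save_to_disk("my_tokenized_dataset", num_shards=16)

# Or write directly to S3 (uses fsspec/s3fs; credentials come from your environment).
# tokenized.save_to_disk("s3://my-bucket/my_tokenized_dataset", num_shards=16)
```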
Upload the dataset to Predibase
Finally, use the Predibase S3 connector to upload your dataset through the product interface.