Merge Dataset
The merge_dataset step allows you to combine multiple datasets into a single unified dataset, which can be used for either evaluation or model training.
Overview
When working with multiple datasets, it’s often useful to combine them into a single dataset with consistent formatting. This step merges different evaluation datasets, preserving their original structure while keeping their formats compatible.
Key Features
- Multi-dataset Integration: Combine datasets from various sources
- Format Standardization: Ensure consistent formatting across merged data
- Training Data Preparation: Create instruction tuning datasets
- Random Shuffling: Optionally shuffle the combined dataset
When to Use
Use this step when you want to:
- Create a comprehensive evaluation across multiple benchmarks
- Prepare combined training datasets for fine-tuning
- Generate mixed-domain instruction sets
- Convert datasets into a consistent format
Implementation Details
Internally, this step:
- Loads multiple datasets according to the provided configurations
- Converts each dataset to a standardized format
- Merges the datasets into a unified structure
- Optionally shuffles the combined dataset
- Saves the result to the specified path
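The five stages above can be sketched in plain Python. This is only an illustration of the flow, not the step's actual implementation: the function names (`load_jsonl`, `standardize`, `merge_datasets`), the JSON Lines file format, the config keys (`name`, `path`), and the field names (`question`, `answer`, `input`, `output`) are all assumptions made for the example.

```python
import json
import random


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed input format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def standardize(record, source):
    """Map a raw record onto a shared schema (field names are hypothetical)."""
    return {
        "input": record.get("question", record.get("input", "")),
        "output": record.get("answer", record.get("output", "")),
        "source": source,
    }


def merge_datasets(configs, output_path, shuffle=False, seed=0):
    """Load, standardize, merge, optionally shuffle, then save."""
    merged = []
    for cfg in configs:
        # Load each dataset and convert its records to the shared schema.
        for record in load_jsonl(cfg["path"]):
            merged.append(standardize(record, cfg["name"]))
    if shuffle:
        # A seeded generator keeps the shuffle reproducible.
        random.Random(seed).shuffle(merged)
    # Save the merged dataset as JSON Lines.
    with open(output_path, "w", encoding="utf-8") as f:
        for record in merged:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return merged
```

Keeping a `source` field on every record is a design choice of this sketch: it lets you trace a merged record back to the dataset it came from after shuffling.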
Supported Modes
The merge_dataset step supports two primary modes:
- SFT Mode: Formats data for supervised fine-tuning
- PT Mode: Formats data for pretraining
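To make the difference between the two modes concrete, here is a minimal sketch of how one standardized record might be rendered in each. The output schemas are assumptions based on common conventions (an instruction/response pair for SFT, a single text field for pretraining); the source does not specify the exact format the step produces, and `format_record` is a hypothetical helper.

```python
def format_record(record, mode):
    """Render a record for the given mode (schemas are assumed, not official)."""
    if mode == "sft":
        # Supervised fine-tuning: keep prompt and target as separate fields.
        return {"instruction": record["input"], "response": record["output"]}
    if mode == "pt":
        # Pretraining: flatten everything into one block of plain text.
        return {"text": f'{record["input"]}\n{record["output"]}'}
    raise ValueError(f"unknown mode: {mode!r}")


example = {"input": "What is 2 + 2?", "output": "4"}
sft_row = format_record(example, "sft")
pt_row = format_record(example, "pt")
```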
Technical Considerations
When merging datasets with different structures, you may need to provide mapping configurations to ensure consistent formatting across the merged dataset.
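As an illustration of what such a mapping might look like, the sketch below renames dataset-specific fields to a shared set of names. The mapping layout (`{standard_name: source_name}`), the field names, and the `apply_mapping` helper are hypothetical; check the step's configuration reference for the actual keys it expects.

```python
def apply_mapping(record, mapping):
    """Rename fields using a {standard_name: source_name} mapping (assumed layout)."""
    missing = [src for src in mapping.values() if src not in record]
    if missing:
        # Fail loudly rather than silently writing empty fields.
        raise KeyError(f"record is missing mapped fields: {missing}")
    return {std: record[src] for std, src in mapping.items()}


# Two datasets with different structures, mapped onto the same schema.
qa_mapping = {"input": "question", "output": "answer"}
summ_mapping = {"input": "article", "output": "summary"}

qa_row = apply_mapping({"question": "Capital of France?", "answer": "Paris"}, qa_mapping)
summ_row = apply_mapping({"article": "Long text", "summary": "Short text"}, summ_mapping)
```

After mapping, both rows share the keys `input` and `output`, so they can be merged without further conversion.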