Merge Dataset

The merge_dataset step combines multiple datasets into a single unified dataset that can be used for evaluation or model training.

Overview

When working with multiple datasets, it’s often useful to combine them into a single dataset with consistent formatting. This step merges datasets from different sources, keeping the content of each record while converting it to a format compatible with the rest of the merged data.

Key Features

Multi-dataset Integration: Combine datasets from various sources
Format Standardization: Ensure consistent formatting across merged data
Training Data Preparation: Create instruction-tuning datasets
Random Shuffling: Optionally shuffle the combined dataset

When to Use

Use this step when you want to:

Create a comprehensive evaluation across multiple benchmarks
Prepare combined training datasets for fine-tuning
Generate mixed-domain instruction sets
Convert datasets into a consistent format

Implementation Details

Internally, this step:

Loads multiple datasets according to the provided configurations
Converts each dataset to a standardized format
Merges the datasets into a unified structure
Optionally shuffles the combined dataset
Saves the result to the specified path
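
The steps above can be sketched as follows. This is a minimal illustration rather than the step's actual implementation: it assumes JSON Lines input files, and the function names (load_jsonl, to_standard, merge_datasets), the config keys ("name", "path"), and the field names ("question", "answer", "instruction", "output", "source") are all hypothetical.

```python
import json
import random


def load_jsonl(path):
    """Read one JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def to_standard(record, source):
    """Convert a raw record to a hypothetical common schema."""
    return {
        # Accept either "question"/"answer" or "instruction"/"output" (assumed names).
        "instruction": record.get("question", record.get("instruction", "")),
        "output": record.get("answer", record.get("output", "")),
        # Remember which dataset the record came from.
        "source": source,
    }


def merge_datasets(configs, output_path, shuffle=False, seed=0):
    """Load, standardize, merge, optionally shuffle, and save datasets."""
    merged = []
    for cfg in configs:
        # Steps 1-3: load each dataset, convert its records, append to the merged list.
        for record in load_jsonl(cfg["path"]):
            merged.append(to_standard(record, cfg["name"]))
    if shuffle:
        # Step 4: a seeded generator keeps the shuffle reproducible.
        random.Random(seed).shuffle(merged)
    # Step 5: write the result as JSON Lines.
    with open(output_path, "w", encoding="utf-8") as f:
        for row in merged:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")
    return merged
```

Passing a fixed seed makes the shuffled order repeatable across runs, which helps when the merged file is used for evaluation.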

Supported Modes

The merge_dataset step supports two modes:

SFT Mode: Formats data for supervised fine-tuning
PT Mode: Formats data for pretraining
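
To illustrate the difference between the two modes, here is a small sketch under stated assumptions: it assumes SFT output keeps the instruction and response as separate fields, while PT output flattens each record into a single text field. The function name format_record, the mode strings "sft" and "pt", and the field names are hypothetical and may not match the step's real output.

```python
def format_record(record, mode):
    """Format a standardized record for the given mode ("sft" or "pt")."""
    if mode == "sft":
        # Supervised fine-tuning: keep the prompt and the target separate.
        return {"instruction": record["instruction"], "output": record["output"]}
    if mode == "pt":
        # Pretraining: join everything into one block of plain text.
        text = f'{record["instruction"]}\n{record["output"]}'.strip()
        return {"text": text}
    raise ValueError(f"unsupported mode: {mode!r}")


example = {"instruction": "Translate 'chat' to English.", "output": "cat"}
sft_row = format_record(example, "sft")
pt_row = format_record(example, "pt")
```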

Technical Considerations

When merging datasets with different structures, you may need to provide mapping configurations to ensure consistent formatting across the merged dataset.
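
As an illustration of what such a mapping might look like, the sketch below renames source fields to a shared set of target fields. The mapping shape ({target: source}), the function name apply_mapping, and all field names are assumptions made for this example, not the step's documented configuration format.

```python
def apply_mapping(record, mapping):
    """Rename fields of a record; mapping is {target_field: source_field}."""
    missing = [src for src in mapping.values() if src not in record]
    if missing:
        # Fail early so a mismatched mapping is noticed before merging.
        raise KeyError(f"record is missing fields: {missing}")
    return {target: record[source] for target, source in mapping.items()}


# Two datasets with different structures, mapped to the same target fields.
qa_mapping = {"instruction": "question", "output": "answer"}
summary_mapping = {"instruction": "document", "output": "summary"}

qa_row = apply_mapping({"question": "2 + 2?", "answer": "4"}, qa_mapping)
summary_row = apply_mapping(
    {"document": "A long text.", "summary": "Short.", "id": 7}, summary_mapping
)
```

Fields not named in the mapping (such as "id" here) are dropped, so every merged record ends up with the same keys.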