Merge Dataset

The merge_dataset step allows you to combine multiple datasets into a single unified dataset, which can be used for either evaluation or model training.

Overview

When working with multiple datasets, it’s often useful to combine them into a single dataset with consistent formatting. This step merges datasets from different sources, such as several evaluation benchmarks, preserving each dataset's content while converting it to a compatible format.

Key Features

  • Multi-dataset Integration: Combines datasets from various sources
  • Format Standardization: Ensures consistent formatting across merged data
  • Training Data Preparation: Creates instruction-tuning datasets
  • Random Shuffling: Optionally shuffles the combined dataset

When to Use

Use this step when you want to:

  • Create a comprehensive evaluation across multiple benchmarks
  • Prepare combined training datasets for fine-tuning
  • Generate mixed-domain instruction sets
  • Convert datasets into a consistent format

Implementation Details

Internally, this step:

  1. Loads multiple datasets according to the provided configurations
  2. Converts each dataset to a standardized format
  3. Merges the datasets into a unified structure
  4. Optionally shuffles the combined dataset
  5. Saves the result to the specified path
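
The five steps above can be sketched in plain Python. This is a minimal illustration, not the step's actual implementation: the function names, the `input`/`output`/`source` schema, the field aliases, and the JSONL output format are all assumptions made for the example, and datasets are passed in memory rather than loaded from configuration.

```python
import json
import random
from pathlib import Path


def standardize(record, source):
    """Map a raw record onto a shared schema (hypothetical field names)."""
    return {
        "input": record.get("question") or record.get("prompt") or record.get("input", ""),
        "output": record.get("answer") or record.get("response") or record.get("output", ""),
        "source": source,  # remember which dataset the record came from
    }


def merge_datasets(datasets, shuffle=False, seed=0, save_path=None):
    """Standardize, merge, optionally shuffle, and optionally save.

    `datasets` maps a dataset name to a list of dict records
    (a stand-in for step 1, loading from configuration).
    """
    merged = []
    for name, records in datasets.items():
        # Steps 2 and 3: convert each record, then append to the merged list.
        merged.extend(standardize(record, name) for record in records)

    if shuffle:
        # Step 4: a seeded shuffle keeps the result reproducible.
        random.Random(seed).shuffle(merged)

    if save_path is not None:
        # Step 5: write one JSON object per line (JSONL is assumed here).
        with Path(save_path).open("w", encoding="utf-8") as f:
            for row in merged:
                f.write(json.dumps(row, ensure_ascii=False) + "\n")

    return merged
```

Calling `merge_datasets({"qa": [...], "chat": [...]}, shuffle=True, seed=42)` would return one list in which every record shares the same keys, regardless of which dataset it came from.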

Supported Modes

The merge_dataset step supports two primary modes:

  • SFT Mode: Formats data for supervised fine-tuning
  • PT Mode: Formats data for pretraining
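
To make the difference between the two modes concrete, the sketch below formats the same record both ways. The exact output schema is not specified here, so this assumes SFT keeps the instruction and response as separate fields while PT flattens them into a single text field; `format_record` and the field names are hypothetical.

```python
def format_record(record, mode):
    """Format one standardized record for the given mode (assumed schemas)."""
    if mode == "sft":
        # Supervised fine-tuning: keep prompt and target as separate fields.
        return {"instruction": record["input"], "output": record["output"]}
    if mode == "pt":
        # Pretraining: a single block of raw text.
        return {"text": f"{record['input']}\n{record['output']}".strip()}
    raise ValueError(f"unsupported mode: {mode!r}")
```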

Technical Considerations

When merging datasets with different structures, you may need to provide mapping configurations (for example, which field in each source dataset corresponds to which field in the merged format) to ensure consistent formatting across the merged dataset.
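
One way to picture such a mapping is as a per-dataset dictionary from target field to source field. The sketch below is illustrative only: the dataset names, field names, and the `apply_mapping` helper are hypothetical and do not reflect the step's real configuration keys.

```python
# Hypothetical per-dataset mappings: {target_field: source_field}.
MAPPINGS = {
    "qa_set": {"input": "question", "output": "answer"},
    "chat_set": {"input": "prompt", "output": "completion"},
}


def apply_mapping(record, mapping):
    """Rename fields according to the mapping.

    Raises KeyError if a source field is missing, so schema mismatches
    surface early instead of producing silently empty fields.
    """
    return {target: record[source] for target, source in mapping.items()}


def merge_with_mappings(datasets, mappings):
    """Apply each dataset's mapping, then concatenate the results."""
    merged = []
    for name, records in datasets.items():
        merged.extend(apply_mapping(record, mappings[name]) for record in records)
    return merged
```

Failing loudly on a missing field is a design choice for this sketch; a real pipeline might instead skip or log such records.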