Datasets

Datasets are a fundamental component of the FreeEval framework, providing the content against which language models are evaluated. They contain the prompts, questions, or scenarios that models must respond to during assessment, helping to measure various aspects of model capability and performance.

Dataset Fundamentals

In FreeEval, a dataset represents a collection of evaluation items organized for systematic assessment. Each item typically includes input text that will be presented to the model, and often contains additional metadata like reference answers, categories, or difficulty ratings.
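For instance, an individual item in a multiple-choice knowledge dataset might look like the following sketch. The field names here are illustrative assumptions; actual datasets vary in structure:

# A hypothetical evaluation item; field names vary by dataset
item = {
    "question": "Which planet is closest to the Sun?",
    "choices": ["Venus", "Mercury", "Earth", "Mars"],
    "answer": "B",             # reference answer
    "category": "astronomy",   # optional metadata
    "difficulty": "easy",      # optional metadata
}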

Datasets serve as the foundation for meaningful evaluation by providing consistent, reproducible content. By using standardized datasets, you can compare models fairly, track improvements over time, and situate your results within broader research contexts.

FreeEval supports various dataset formats and sources, allowing you to leverage existing benchmarks or create custom collections tailored to your specific evaluation needs. The framework provides utilities for loading, filtering, and preprocessing datasets to ensure they’re properly formatted for evaluation.

Types of Datasets

FreeEval accommodates several types of datasets to support diverse evaluation scenarios:

Knowledge datasets test a model’s factual knowledge across domains like science, history, or current events. These datasets typically contain factual questions with verified answers, allowing you to measure a model’s accuracy in retrieving and applying information.

Reasoning datasets assess a model’s ability to follow logical steps, solve problems, or draw inferences. These often include mathematical problems, logical puzzles, or scenarios requiring multi-step reasoning.

Dialogue datasets evaluate conversational abilities through multi-turn exchanges. These are particularly useful for assessing contextual understanding, consistency, and adherence to conversational norms.

Creative datasets measure a model’s creative capabilities through prompts for stories, poetry, or other creative content. These evaluations often rely on human judgment or specialized metrics for assessment.

Safety datasets test a model’s ability to handle potentially problematic requests appropriately, helping to evaluate alignment with ethical guidelines and safety constraints.

Dataset Selection and Customization

Selecting appropriate datasets is crucial for meaningful evaluation. The datasets you choose should align with your evaluation goals, covering the specific capabilities you want to assess and representing the use cases relevant to your application.

FreeEval allows you to customize datasets to suit your evaluation needs. You can filter datasets to focus on specific categories, difficulty levels, or content types. You can also transform dataset items to change formats, add instructions, or modify content to test particular model behaviors.
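For example, if your source data lives on the HuggingFace Hub, you can filter and transform it with the standard datasets library before wrapping it in a FreeEval dataset class. The sketch below reuses the dataset name from the example later on this page; the "subject" field and the instruction text are illustrative assumptions about the underlying data:

from datasets import load_dataset

# Load an evaluation split (dataset name borrowed from the example below)
ds = load_dataset("liyucheng/ceval_all", split="val")

# Filter to one category -- assumes the data exposes a "subject" field
physics_only = ds.filter(lambda item: item.get("subject") == "physics")

# Transform items, e.g. prepend a fixed instruction to every question
def add_instruction(item):
    item["question"] = "Answer the following multiple-choice question.\n" + item["question"]
    return item

customized = physics_only.map(add_instruction)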

For specialized needs, you can create custom datasets from scratch or adapt existing ones. FreeEval provides utilities to help you structure your data correctly and integrate it seamlessly into the evaluation pipeline.

Working with Datasets

When configuring an evaluation in FreeEval, you specify which datasets to use and how they should be processed. The framework handles the loading and preparation of datasets, making them available to the evaluation steps in your pipeline.

During evaluation, each step can access the relevant datasets as needed, applying model-specific formatting or processing as required. The framework maintains a clear connection between dataset items and evaluation results, making it easy to analyze performance across different subsets of your data.

By understanding how datasets function within FreeEval, you can design more meaningful evaluations that provide targeted insights into model capabilities. Thoughtful dataset selection and customization ensure your assessments measure what matters most for your specific applications or research questions.

Example Dataset Implementation

To implement a new dataset, you define the loading and parsing logic for your data. Below is a detailed example showing how to create a dataset compatible with FreeEval’s pipeline.

from freeeval.datasets.multiple_choice import (
    MultipleChoiceDataset,  # Base class for multiple choice datasets
    MultipleChoiceProblem,  # Data structure for individual problems
)
from datasets import load_dataset  # HuggingFace datasets library
from freeeval.data.registry import register_dataset  # For registering our dataset


# Define a new dataset class that inherits from MultipleChoiceDataset
class ExampleDataset(MultipleChoiceDataset):
    """
    Example dataset implementation that loads multiple-choice questions.
    This example demonstrates loading a dataset similar to the C-Eval format.
    """

    def __init__(
        self,
        seed=1,               # Random seed for reproducibility
        split="val",          # Dataset split (e.g., train, val, test)
        name_or_path=None,    # HuggingFace dataset name or local path
        config_name=None,     # Configuration name for the dataset
        fewshot_split="dev",  # Split to use for few-shot examples
        fewshot_num=0,        # Number of few-shot examples to include (0 = zero-shot)
        **kwargs,             # Additional arguments passed to parent class
    ):
        # Initialize the parent class
        super().__init__(seed=seed, **kwargs)

        # Set dataset source (with default if none provided)
        self.name_or_path = (
            "liyucheng/ceval_all" if name_or_path is None else name_or_path
        )

        # Handle few-shot learning if requested
        if fewshot_num:
            # Load examples for few-shot demonstrations from the specified split
            fewshot_dataset = load_dataset(
                self.name_or_path,
                name=config_name,  # Optional config for a specific subset
                split=fewshot_split,
            )
            # Select a consistent set of examples based on the seed
            fewshot_examples = self.select_fewshot_examples(
                fewshot_dataset, fewshot_num, seed=seed
            )
            # Parse examples into the standard problem format
            self.fewshot_examples = [
                self.parse_data_instance(e) for e in fewshot_examples
            ]
        else:
            # For zero-shot, no examples are needed
            self.fewshot_examples = []

        # Load the main evaluation dataset (use the same config as the few-shot split)
        self.hf_dataset = load_dataset(self.name_or_path, name=config_name, split=split)
        # Process the dataset into our format
        self.parse_hf_dataset()
        # Generate the final prompt text, including any few-shot examples
        self.generate_prompt_text()

    def parse_data_instance(self, data, extra={}):
        """
        Parse an individual question instance into a standardized format.

        Args:
            data: Raw data instance from the dataset
            extra: Additional metadata to include with the problem

        Returns:
            MultipleChoiceProblem object representing the question
        """
        # Extract question text
        question = data["question"]
        # Extract answer choices
        choices = [data["A"], data["B"], data["C"], data["D"]]
        # Convert letter answer to index (A=0, B=1, C=2, D=3)
        answer = ["A", "B", "C", "D"].index(data["answer"])
        # Generate labels for the choices
        labels = [chr(i + ord("A")) for i in range(len(choices))]
        # Create and return a structured problem object
        return MultipleChoiceProblem(
            question,
            choices,
            answer,
            generation_config={"stop_sequences": labels},  # Stop generation after a choice label
            extra=extra,  # Store any additional metadata
        )

    def parse_hf_dataset(self):
        """
        Process the entire dataset by parsing each instance.
        Stores all problems in the self.problems list.
        """
        for idx, data in enumerate(self.hf_dataset):
            # Parse each instance and add an ID for tracking
            self.problems.append(self.parse_data_instance(data, extra={"id": idx}))


# Register the dataset so it can be referenced in configuration files
register_dataset("example_dataset", ExampleDataset)
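Before wiring the dataset into a pipeline, it can be useful to instantiate it directly as a quick sanity check. This minimal sketch assumes the base class exposes the problems list populated above:

# Quick local sanity check of the custom dataset
dataset = ExampleDataset(seed=42, split="val", fewshot_num=0)
print(f"Loaded {len(dataset.problems)} problems")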

Using Your Custom Dataset

After implementing your dataset, you can use it in FreeEval configurations:

{
    "steps": [
        {
            "step_type": "simple_multiple_choice",
            "step_name": "Example Dataset Evaluation",
            "dataset_config": {
                "type": "example_dataset",  // Reference your registered dataset name
                "dataset_kwargs": {
                    "seed": 42,
                    "split": "test",
                    "fewshot_num": 3        // Use 3-shot examples
                }
            }
            // ...model configuration and other settings...
        }
    ]
}

Key Considerations When Implementing Datasets

  1. Reproducibility: Use a fixed seed for any random operations to ensure consistent results.
  2. Validation: Include checks to verify dataset format and raise informative errors (a small sketch follows this list).
  3. Formatting: Ensure your dataset produces prompts in a format expected by your target models.
  4. Documentation: Add clear docstrings explaining dataset structure and any special handling.
  5. Registration: Always register your dataset to make it available in configuration files.
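As a concrete illustration of the validation point above, a minimal sketch of a format check could look like this; the required field names follow the C-Eval-style example on this page:

REQUIRED_KEYS = ("question", "A", "B", "C", "D", "answer")

def validate_instance(data):
    """Raise an informative error instead of failing deep inside the pipeline."""
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Dataset instance is missing fields: {missing}")
    if data["answer"] not in ("A", "B", "C", "D"):
        raise ValueError(f"Unexpected answer label: {data['answer']!r}")

Calling a helper like this at the top of parse_data_instance surfaces malformed items as soon as the dataset is loaded, rather than partway through an evaluation run.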