Creating Custom Evaluators

FreeEval’s modular architecture makes it easy to create custom evaluation methods. This guide will walk you through implementing your own evaluation step and integrating it into the framework.

Overview

Creating a custom evaluator in FreeEval involves:

  1. Implementing a custom evaluation step class
  2. Registering your step with the framework
  3. Using your custom step in evaluation configurations

This extensibility allows you to add specialized evaluation metrics, incorporate domain-specific knowledge, or implement novel evaluation techniques beyond what’s built into FreeEval.

Step 1: Implementing a Custom Step

Let’s create a simple custom evaluator that assesses a model’s ability to identify prime numbers. We’ll name our file custom_evaluator.py and place it in the freeeval/steps directory.

from typing import Dict, Any, List, Optional

from freeeval.steps.base_step import BaseEvaluationStep
from freeeval.data.dataset import Dataset
from freeeval.models.base_model import BaseModel
import sympy  # For prime number checking


class PrimeNumberIdentificationStep(BaseEvaluationStep):
    """Step for evaluating a model's ability to identify whether a number is prime."""

    def __init__(self,
                 name: str,
                 save_dataset: bool = False,
                 save_path: Optional[str] = None,
                 **kwargs):
        """Initialize the prime number identification step."""
        super().__init__(name, save_dataset, save_path, **kwargs)

    def prepare_dataset(self, dataset: Dataset) -> Dataset:
        """Prepare the dataset for evaluation."""
        # This step expects a dataset with numbers to check.
        # Add any preprocessing here if needed.
        return dataset

    def format_prompt(self, example: Dict[str, Any]) -> str:
        """Format the prompt for the model."""
        return f"Is {example['number']} a prime number? Answer with 'yes' or 'no'."

    def evaluate_example(self, example: Dict[str, Any],
                         model_response: str) -> Dict[str, Any]:
        """Evaluate the model's response for a single example."""
        number = example['number']
        is_prime = sympy.isprime(number)

        # Clean and normalize the model's response
        response_text = model_response.strip().lower()
        predicted_prime = ("yes" in response_text) and ("no" not in response_text)

        # Calculate correctness
        is_correct = (is_prime == predicted_prime)

        return {
            "number": number,
            "is_prime": is_prime,
            "model_prediction": predicted_prime,
            "is_correct": is_correct,
        }

    def aggregate_results(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Aggregate the results of all evaluated examples."""
        correct_count = sum(result["is_correct"] for result in results)
        total_count = len(results)
        accuracy = correct_count / total_count if total_count > 0 else 0

        # Separate performance on prime vs. non-prime numbers
        prime_results = [r for r in results if r["is_prime"]]
        nonprime_results = [r for r in results if not r["is_prime"]]
        prime_accuracy = sum(r["is_correct"] for r in prime_results) / len(prime_results) if prime_results else 0
        nonprime_accuracy = sum(r["is_correct"] for r in nonprime_results) / len(nonprime_results) if nonprime_results else 0

        return {
            "accuracy": accuracy,
            "prime_accuracy": prime_accuracy,
            "nonprime_accuracy": nonprime_accuracy,
            "num_examples": total_count,
        }

    def run(self, dataset: Dataset, model: BaseModel, context: Dict[str, Any]) -> Dict[str, Any]:
        """Run the evaluation step."""
        # Prepare the dataset
        prepared_dataset = self.prepare_dataset(dataset)

        # Store the results for each example
        results = []

        # Process each example
        for example in prepared_dataset:
            # Format the prompt
            prompt = self.format_prompt(example)
            # Get the model's response
            model_response = model.generate(prompt)
            # Evaluate the response
            evaluation_result = self.evaluate_example(example, model_response)
            # Store the result
            results.append(evaluation_result)

        # Aggregate the results
        aggregated_results = self.aggregate_results(results)

        # Add the results to the context
        context[self.name] = aggregated_results
        return context
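
Before wiring the step into a full run, you can sanity-check the scoring logic on its own. The snippet below is a quick sketch: it assumes the BaseEvaluationStep constructor accepts the arguments shown in __init__ above, so the step can be instantiated outside the pipeline.

from freeeval.steps.custom_evaluator import PrimeNumberIdentificationStep

# Instantiate the step directly (outside the normal pipeline) and score two
# hand-written responses with evaluate_example.
step = PrimeNumberIdentificationStep(name="prime_check")
print(step.evaluate_example({"number": 7}, "Yes, 7 is a prime number."))
# {'number': 7, 'is_prime': True, 'model_prediction': True, 'is_correct': True}
print(step.evaluate_example({"number": 9}, "No."))
# {'number': 9, 'is_prime': False, 'model_prediction': False, 'is_correct': True}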

Step 2: Registering Your Custom Step

Next, you need to register your custom evaluation step with FreeEval so that it can be referenced in configuration files. Add your step to the freeeval/steps/__init__.py file:

from freeeval.steps.simple_multiple_choice import SimpleMCPStep
from freeeval.steps.cloze_prompt import ClozePromptStep
# Import other built-in steps...

# Import your custom step
from freeeval.steps.custom_evaluator import PrimeNumberIdentificationStep

# Update the type-to-step mapping
TYPE_TO_STEP = {
    "simple_multiple_choice": SimpleMCPStep,
    "cloze_prompt": ClozePromptStep,
    # Other built-in steps...

    # Register your custom step
    "prime_number_identification": PrimeNumberIdentificationStep,
}
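
To verify the registration, you can resolve the class through the mapping the same way a configuration's step_type is resolved. This is only a quick check; FreeEval itself may pass additional arguments when it constructs steps from a config.

from freeeval.steps import TYPE_TO_STEP

# Look the class up by its registered key and instantiate it directly.
step_cls = TYPE_TO_STEP["prime_number_identification"]
step = step_cls(name="registration_check")
print(type(step).__name__)  # PrimeNumberIdentificationStep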

Step 3: Creating a Dataset

Your custom evaluator needs a dataset to work with. You can either:

  1. Use an existing dataset and adapt it for your needs
  2. Create a custom dataset specifically for your evaluator

For our prime number identification example, let’s create a simple dataset class in freeeval/datasets/prime_dataset.py:

from typing import Dict, Any, List
from freeeval.data.dataset import BaseDataset, DatasetKwargs
import random


class PrimeDatasetKwargs(DatasetKwargs):
    """Arguments for the Prime Number dataset."""
    min_number: int = 1
    max_number: int = 1000
    num_examples: int = 100
    seed: int = 42


class PrimeDataset(BaseDataset):
    """Dataset for prime number identification."""

    def __init__(self, kwargs: PrimeDatasetKwargs):
        """Initialize the Prime Number dataset."""
        super().__init__(kwargs)
        self.min = kwargs.min_number
        self.max = kwargs.max_number
        self.num_examples = kwargs.num_examples

        # Set random seed for reproducibility
        random.seed(kwargs.seed)

        # Generate the dataset
        self.data = self._generate_dataset()

    def _generate_dataset(self) -> List[Dict[str, Any]]:
        """Generate examples for prime number identification."""
        examples = []
        for _ in range(self.num_examples):
            number = random.randint(self.min, self.max)
            examples.append({"number": number})
        return examples

    def __len__(self) -> int:
        """Return the number of examples in the dataset."""
        return len(self.data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get an example by index."""
        return self.data[idx]

Then register your dataset in freeeval/datasets/__init__.py:

from freeeval.datasets.mmlu import MMLUDataset
from freeeval.datasets.arc import ARCDataset
# Other built-in datasets...

# Import your custom dataset
from freeeval.datasets.prime_dataset import PrimeDataset

# Update the type-to-dataset mapping
TYPE_TO_DATASET = {
    "mmlu": MMLUDataset,
    "arc_easy": lambda kwargs: ARCDataset(kwargs, "ARC-Easy"),
    "arc_challenge": lambda kwargs: ARCDataset(kwargs, "ARC-Challenge"),
    # Other built-in datasets...

    # Register your custom dataset
    "prime": PrimeDataset,
}
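
You can also instantiate the dataset directly to confirm it generates what you expect. This sketch assumes PrimeDatasetKwargs can be constructed with keyword arguments matching its fields (as the class definition above suggests); the construction path FreeEval uses when loading a config may differ.

from freeeval.datasets.prime_dataset import PrimeDataset, PrimeDatasetKwargs

# Assumes the kwargs class accepts its fields as keyword arguments.
kwargs = PrimeDatasetKwargs(min_number=1, max_number=100, num_examples=5, seed=0)
dataset = PrimeDataset(kwargs)

print(len(dataset))  # 5
print(dataset[0])    # a dict like {"number": 42}; the value depends on the seed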

Step 4: Using Your Custom Evaluator

Now that you’ve implemented and registered your custom evaluator and dataset, you can use them in a configuration file:

{
  "results_output_path": "./result/prime_number_test.json",
  "steps": [
    {
      "step_type": "prime_number_identification",
      "step_name": "prime_test",
      "save_dataset": true,
      "dataset_config": {
        "type": "prime",
        "dataset_kwargs": {
          "min_number": 1,
          "max_number": 1000,
          "num_examples": 50,
          "seed": 42
        }
      },
      "inference_config": {
        "type": "remote_hf",
        "output_path": "./result",
        "inference_kwargs": {
          "model_name": "your-model-name",
          "base_url": ["http://your-model-endpoint:port"],
          "timeout": 60,
          "num_workers": 4
        }
      }
    }
  ]
}

Save this configuration as prime_evaluation.json and run it:

python run.py -c prime_evaluation.json

Advanced: Using Model Inference in Custom Steps

If your custom evaluator needs more complex interaction with the model, you can use the model’s methods directly in your run method. Here’s an example that demonstrates more advanced usage:

def run(self, dataset: Dataset, model: BaseModel, context: Dict[str, Any]) -> Dict[str, Any]:
    """Run a more complex evaluation with multiple model calls per example."""
    prepared_dataset = self.prepare_dataset(dataset)
    results = []

    for example in prepared_dataset:
        # First model call: ask if the number is prime
        initial_prompt = self.format_prompt(example)
        initial_response = model.generate(initial_prompt)

        # Second model call: ask for an explanation
        explanation_prompt = f"Explain why {example['number']} is{' not' if 'no' in initial_response.lower() else ''} a prime number."
        explanation = model.generate(explanation_prompt)

        # Evaluate both responses
        evaluation_result = self.evaluate_with_explanation(
            example, initial_response, explanation
        )
        results.append(evaluation_result)

    aggregated_results = self.aggregate_results(results)
    context[self.name] = aggregated_results
    return context
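
The example above delegates to an evaluate_with_explanation helper, which is not defined earlier and is not part of FreeEval; the name and behavior below are illustrative. A minimal version, added to the same step class, could score the yes/no answer with the existing evaluate_example method and simply attach the explanation for later review:

from typing import Dict, Any

def evaluate_with_explanation(self, example: Dict[str, Any],
                              initial_response: str,
                              explanation: str) -> Dict[str, Any]:
    """Hypothetical helper: score the yes/no answer and keep the explanation."""
    # Reuse the single-response scoring defined earlier in this guide.
    result = self.evaluate_example(example, initial_response)
    # Attach the raw explanation so it can be inspected or scored separately.
    result["explanation"] = explanation
    return result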

Best Practices for Custom Evaluators

When creating custom evaluators, consider these best practices:

  1. Ensure reproducibility: Always use seeds for randomness and document any non-deterministic behavior.
  2. Handle edge cases: Anticipate and gracefully handle unexpected model responses or dataset entries (see the sketch after this list).
  3. Document your step: Include clear docstrings explaining what your evaluator measures and how it works.
  4. Separate logic: Keep prompt formatting, response evaluation, and result aggregation in separate methods.
  5. Add logging: Include informative logging for debugging and monitoring.
  6. Consider efficiency: For large datasets, implement batching or sampling if appropriate.
  7. Validate inputs: Check that dataset examples have the expected format and fields.
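
As a concrete illustration of points 2 and 7, the evaluate_example method from earlier could be hardened along these lines. This is a sketch, not part of FreeEval's API; it keeps the same fields as the example step above.

from typing import Dict, Any
import sympy

def evaluate_example(self, example: Dict[str, Any],
                     model_response: str) -> Dict[str, Any]:
    """Hardened scoring: validate the input and tolerate messy model output."""
    # Validate the dataset entry before using it (best practice 7).
    if "number" not in example or not isinstance(example["number"], int):
        raise ValueError(f"Expected an integer 'number' field, got: {example!r}")

    number = example["number"]
    is_prime = sympy.isprime(number)

    # Be defensive about the response format (best practice 2): look only at the
    # first word, so "No, because 9 = 3 * 3." is parsed as "no".
    tokens = model_response.strip().lower().split()
    first_token = tokens[0].strip(".,!?") if tokens else ""
    if first_token.startswith("yes"):
        predicted_prime = True
    elif first_token.startswith("no"):
        predicted_prime = False
    else:
        predicted_prime = None  # unparseable answer, counted as incorrect

    return {
        "number": number,
        "is_prime": is_prime,
        "model_prediction": predicted_prime,
        "is_correct": predicted_prime == is_prime,
    }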

Conclusion

Creating custom evaluators in FreeEval allows you to extend the framework to meet your specific evaluation needs. By following the steps outlined in this guide, you can implement, register, and use custom evaluation methods that integrate seamlessly with FreeEval’s evaluation pipeline.

For more complex evaluators, you might need to implement custom scoring metrics, integrate external tools, or create more sophisticated dataset preprocessing. The extensible architecture of FreeEval makes it possible to adapt the framework to a wide range of evaluation scenarios while maintaining the benefits of its standardized infrastructure.