Creating Custom Evaluators
FreeEval’s modular architecture makes it easy to create custom evaluation methods. This guide will walk you through implementing your own evaluation step and integrating it into the framework.
Overview
Creating a custom evaluator in FreeEval involves:
- Implementing a custom evaluation step class
- Registering your step with the framework
- Using your custom step in evaluation configurations
This extensibility allows you to add specialized evaluation metrics, incorporate domain-specific knowledge, or implement novel evaluation techniques beyond what’s built into FreeEval.
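As a quick preview, the skeleton below shows the general shape of a custom step. The method names mirror the full example developed in Step 1, so treat it as a sketch rather than the exact base-class contract.

```python
from typing import Any, Dict, List

from freeeval.steps.base_step import BaseEvaluationStep


class MyCustomStep(BaseEvaluationStep):
    """Skeleton of a custom evaluation step (methods mirror the example in Step 1)."""

    def format_prompt(self, example: Dict[str, Any]) -> str:
        ...  # Turn a dataset example into a prompt string

    def evaluate_example(self, example: Dict[str, Any], model_response: str) -> Dict[str, Any]:
        ...  # Score a single model response

    def aggregate_results(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
        ...  # Combine per-example results into summary metrics

    def run(self, dataset, model, context: Dict[str, Any]) -> Dict[str, Any]:
        ...  # Orchestrate the evaluation and write results into the context
```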
Step 1: Implementing a Custom Step
Let’s create a simple custom evaluator that assesses a model’s ability to identify prime numbers. We’ll name our file custom_evaluator.py and place it in the freeeval/steps directory.
```python
from typing import Dict, Any, List, Optional

from freeeval.steps.base_step import BaseEvaluationStep
from freeeval.data.dataset import Dataset
from freeeval.models.base_model import BaseModel
import sympy  # For prime number checking


class PrimeNumberIdentificationStep(BaseEvaluationStep):
    """Step for evaluating a model's ability to identify whether a number is prime."""

    def __init__(self, name: str, save_dataset: bool = False, save_path: Optional[str] = None, **kwargs):
        """Initialize the prime number identification step."""
        super().__init__(name, save_dataset, save_path, **kwargs)

    def prepare_dataset(self, dataset: Dataset) -> Dataset:
        """Prepare the dataset for evaluation."""
        # This step expects a dataset with numbers to check.
        # Add any preprocessing here if needed.
        return dataset

    def format_prompt(self, example: Dict[str, Any]) -> str:
        """Format the prompt for the model."""
        return f"Is {example['number']} a prime number? Answer with 'yes' or 'no'."

    def evaluate_example(self, example: Dict[str, Any], model_response: str) -> Dict[str, Any]:
        """Evaluate the model's response for a single example."""
        number = example['number']
        is_prime = sympy.isprime(number)

        # Clean and normalize the model's response.
        # Simple heuristic: count the answer as "yes" only if the text
        # contains "yes" and does not contain "no".
        response_text = model_response.strip().lower()
        predicted_prime = ("yes" in response_text) and ("no" not in response_text)

        # Calculate correctness
        is_correct = (is_prime == predicted_prime)

        return {
            "number": number,
            "is_prime": is_prime,
            "model_prediction": predicted_prime,
            "is_correct": is_correct,
        }

    def aggregate_results(self, results: List[Dict[str, Any]]) -> Dict[str, Any]:
        """Aggregate the results of all evaluated examples."""
        correct_count = sum(result["is_correct"] for result in results)
        total_count = len(results)
        accuracy = correct_count / total_count if total_count > 0 else 0

        # Separate performance on prime vs. non-prime numbers
        prime_results = [r for r in results if r["is_prime"]]
        nonprime_results = [r for r in results if not r["is_prime"]]

        prime_accuracy = sum(r["is_correct"] for r in prime_results) / len(prime_results) if prime_results else 0
        nonprime_accuracy = sum(r["is_correct"] for r in nonprime_results) / len(nonprime_results) if nonprime_results else 0

        return {
            "accuracy": accuracy,
            "prime_accuracy": prime_accuracy,
            "nonprime_accuracy": nonprime_accuracy,
            "num_examples": total_count,
        }

    def run(self, dataset: Dataset, model: BaseModel, context: Dict[str, Any]) -> Dict[str, Any]:
        """Run the evaluation step."""
        # Prepare the dataset
        prepared_dataset = self.prepare_dataset(dataset)

        # Store the results for each example
        results = []

        # Process each example
        for example in prepared_dataset:
            # Format the prompt
            prompt = self.format_prompt(example)

            # Get the model's response
            model_response = model.generate(prompt)

            # Evaluate the response
            evaluation_result = self.evaluate_example(example, model_response)

            # Store the result
            results.append(evaluation_result)

        # Aggregate the results
        aggregated_results = self.aggregate_results(results)

        # Add the results to the context
        context[self.name] = aggregated_results

        return context
```
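Before wiring the step into the framework, you can sanity-check the scoring logic by calling evaluate_example directly. The snippet below is a rough sketch; it assumes the step can be constructed outside the pipeline with just the arguments shown in the example above.

```python
# Hypothetical quick check of the scoring logic.
from freeeval.steps.custom_evaluator import PrimeNumberIdentificationStep

step = PrimeNumberIdentificationStep(name="prime_test")

# A correct "yes" on a prime number should be scored as correct.
result = step.evaluate_example({"number": 7}, "Yes, 7 is a prime number.")
assert result["is_correct"]

# A correct "no" on a composite number should also be scored as correct.
result = step.evaluate_example({"number": 8}, "No, 8 is not prime.")
assert result["is_correct"]
```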
Step 2: Registering Your Custom Step
Next, you need to register your custom evaluation step with FreeEval so that it can be referenced in configuration files. Add your step to the freeeval/steps/__init__.py file:
```python
from freeeval.steps.simple_multiple_choice import SimpleMCPStep
from freeeval.steps.cloze_prompt import ClozePromptStep
# Import other built-in steps...

# Import your custom step
from freeeval.steps.custom_evaluator import PrimeNumberIdentificationStep

# Update the type-to-step mapping
TYPE_TO_STEP = {
    "simple_multiple_choice": SimpleMCPStep,
    "cloze_prompt": ClozePromptStep,
    # Other built-in steps...

    # Register your custom step
    "prime_number_identification": PrimeNumberIdentificationStep,
}
```
Step 3: Creating a Dataset
Your custom evaluator needs a dataset to work with. You can either:
- Use an existing dataset and adapt it for your needs (a brief adapter sketch follows this list)
- Create a custom dataset specifically for your evaluator
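For the first option, one possible approach is a thin adapter that exposes the same __len__/__getitem__ interface and maps whatever field the source dataset uses onto the "number" key that the prime evaluator reads. The wrapped dataset and its field names below are hypothetical.

```python
from typing import Any, Dict


class NumberFieldAdapter:
    """Hypothetical adapter that maps an existing dataset's field onto 'number'.

    `source_dataset` is assumed to be any indexable dataset whose examples are
    dicts; `source_field` names the field that holds the integer to test.
    """

    def __init__(self, source_dataset, source_field: str = "value"):
        self.source = source_dataset
        self.source_field = source_field

    def __len__(self) -> int:
        return len(self.source)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        example = self.source[idx]
        # Re-key the example so the prime evaluator finds the 'number' field.
        return {"number": int(example[self.source_field])}
```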
For our prime number identification example, let’s create a simple dataset class in freeeval/datasets/prime_dataset.py:
```python
from typing import Dict, Any, List
from freeeval.data.dataset import BaseDataset, DatasetKwargs
import random


class PrimeDatasetKwargs(DatasetKwargs):
    """Arguments for the Prime Number dataset."""
    min_number: int = 1
    max_number: int = 1000
    num_examples: int = 100
    seed: int = 42


class PrimeDataset(BaseDataset):
    """Dataset for prime number identification."""

    def __init__(self, kwargs: PrimeDatasetKwargs):
        """Initialize the Prime Number dataset."""
        super().__init__(kwargs)
        self.min = kwargs.min_number
        self.max = kwargs.max_number
        self.num_examples = kwargs.num_examples

        # Set random seed for reproducibility
        random.seed(kwargs.seed)

        # Generate the dataset
        self.data = self._generate_dataset()

    def _generate_dataset(self) -> List[Dict[str, Any]]:
        """Generate examples for prime number identification."""
        examples = []
        for _ in range(self.num_examples):
            number = random.randint(self.min, self.max)
            examples.append({"number": number})
        return examples

    def __len__(self) -> int:
        """Return the number of examples in the dataset."""
        return len(self.data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get an example by index."""
        return self.data[idx]
```
Then register your dataset in freeeval/datasets/__init__.py:
```python
from freeeval.datasets.mmlu import MMLUDataset
from freeeval.datasets.arc import ARCDataset
# Other built-in datasets...

# Import your custom dataset
from freeeval.datasets.prime_dataset import PrimeDataset

# Update the type-to-dataset mapping
TYPE_TO_DATASET = {
    "mmlu": MMLUDataset,
    "arc_easy": lambda kwargs: ARCDataset(kwargs, "ARC-Easy"),
    "arc_challenge": lambda kwargs: ARCDataset(kwargs, "ARC-Challenge"),
    # Other built-in datasets...

    # Register your custom dataset
    "prime": PrimeDataset,
}
```
Step 4: Using Your Custom Evaluator
Now that you’ve implemented and registered your custom evaluator and dataset, you can use them in a configuration file:
{ "results_output_path": "./result/prime_number_test.json", "steps": [ { "step_type": "prime_number_identification", "step_name": "prime_test", "save_dataset": true, "dataset_config": { "type": "prime", "dataset_kwargs": { "min_number": 1, "max_number": 1000, "num_examples": 50, "seed": 42 } }, "inference_config": { "type": "remote_hf", "output_path": "./result", "inference_kwargs": { "model_name": "your-model-name", "base_url": ["http://your-model-endpoint:port"], "timeout": 60, "num_workers": 4 } } } ]}Save this configuration as prime_evaluation.json and run it:
```bash
python run.py -c prime_evaluation.json
```
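After the run completes, the metrics returned by aggregate_results are stored in the context under the step name and written to results_output_path. Assuming the results file mirrors that structure, the entry for this step would look roughly like the following (values are illustrative):

```json
{
  "prime_test": {
    "accuracy": 0.84,
    "prime_accuracy": 0.75,
    "nonprime_accuracy": 0.86,
    "num_examples": 50
  }
}
```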
Advanced: Using Model Inference in Custom Steps
If your custom evaluator needs more complex interaction with the model, you can use the model's methods directly in your run method. Here's an example that demonstrates more advanced usage:
```python
def run(self, dataset: Dataset, model: BaseModel, context: Dict[str, Any]) -> Dict[str, Any]:
    """Run a more complex evaluation with multiple model calls per example."""
    prepared_dataset = self.prepare_dataset(dataset)
    results = []

    for example in prepared_dataset:
        # First model call: ask if the number is prime
        initial_prompt = self.format_prompt(example)
        initial_response = model.generate(initial_prompt)

        # Second model call: ask for an explanation
        explanation_prompt = f"Explain why {example['number']} is{' not' if 'no' in initial_response.lower() else ''} a prime number."
        explanation = model.generate(explanation_prompt)

        # Evaluate both responses
        evaluation_result = self.evaluate_with_explanation(
            example, initial_response, explanation
        )
        results.append(evaluation_result)

    aggregated_results = self.aggregate_results(results)
    context[self.name] = aggregated_results
    return context
```
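The evaluate_with_explanation helper used above is not a framework method; you define it on your step class yourself. One possible sketch reuses the scoring logic from evaluate_example and simply records the explanation:

```python
def evaluate_with_explanation(
    self,
    example: Dict[str, Any],
    initial_response: str,
    explanation: str,
) -> Dict[str, Any]:
    """Hypothetical helper: score the yes/no answer and keep the explanation."""
    # Reuse the single-response scoring defined earlier.
    result = self.evaluate_example(example, initial_response)

    # Attach the explanation (and a simple length signal) for later analysis.
    result["explanation"] = explanation
    result["explanation_length"] = len(explanation.split())
    return result
```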
Best Practices for Custom Evaluators
When creating custom evaluators, consider these best practices:
- Ensure reproducibility: Always use seeds for randomness and document any non-deterministic behavior.
- Handle edge cases: Anticipate and gracefully handle unexpected model responses or dataset entries (see the sketch after this list).
- Document your step: Include clear docstrings explaining what your evaluator measures and how it works.
- Separate logic: Keep prompt formatting, response evaluation, and result aggregation in separate methods.
- Add logging: Include informative logging for debugging and monitoring.
- Consider efficiency: For large datasets, implement batching or sampling if appropriate.
- Validate inputs: Check that dataset examples have the expected format and fields.
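The sketch below illustrates the edge-case and input-validation points: a word-level yes/no parser that tolerates ambiguous answers, and a field check for dataset examples. It is a hedged example of how the scoring in evaluate_example could be hardened, not framework-mandated code.

```python
import re
from typing import Any, Dict, Optional


def parse_yes_no(response: str) -> Optional[bool]:
    """Return True for 'yes', False for 'no', None if the answer is ambiguous."""
    # Match whole words so substrings like 'note' or 'nothing' don't count as 'no'.
    words = set(re.findall(r"[a-z']+", response.lower()))
    has_yes, has_no = "yes" in words, "no" in words
    if has_yes == has_no:  # both or neither present -> ambiguous
        return None
    return has_yes


def validate_example(example: Dict[str, Any]) -> int:
    """Check that an example has an integer 'number' field before evaluating it."""
    if "number" not in example:
        raise ValueError(f"Example is missing the 'number' field: {example!r}")
    number = example["number"]
    if not isinstance(number, int) or number < 0:
        raise ValueError(f"Expected a non-negative integer, got {number!r}")
    return number
```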
Conclusion
Creating custom evaluators in FreeEval allows you to extend the framework to meet your specific evaluation needs. By following the steps outlined in this guide, you can implement, register, and use custom evaluation methods that integrate seamlessly with FreeEval’s evaluation pipeline.
For more complex evaluators, you might need to implement custom scoring metrics, integrate external tools, or create more sophisticated dataset preprocessing. The extensible architecture of FreeEval makes it possible to adapt the framework to a wide range of evaluation scenarios while maintaining the benefits of its standardized infrastructure.