Models
Models are the language models being evaluated in FreeEval. The framework supports different types of models through a unified interface, allowing you to compare various model architectures and implementations using the same evaluation methodology.
Model Types
FreeEval supports three primary types of models to accommodate different deployment scenarios and resource constraints:
Remote Hugging Face Models (remote_hf)
Remote Hugging Face models in FreeEval can be accessed in two ways:
- Via Hugging Face Inference API: Access hosted models directly from Hugging Face’s API service
- Via Text Generation Inference (TGI): Deploy models as HTTP services on your own infrastructure
The second approach uses Hugging Face’s Text Generation Inference package, a high-performance serving solution specifically optimized for large language models. This gives you several advantages:
- Industrial-grade performance with optimized GPU utilization
- Support for model sharding across multiple GPUs
- Advanced features like continuous batching and token streaming
- Consistent API interface compatible with Hugging Face’s hosted services
FreeEval provides a convenient script (deploy_model.py) to help you deploy models using TGI with a single command. For example:
```bash
python deploy_model.py --model meta-llama/Llama-2-7b-chat-hf --data-path /path/to/models --gpus 0,1 --port 8080
```
This command deploys the Llama-2-7b-chat model, sharded across GPUs 0 and 1, and makes it available as an HTTP service on port 8080.
FreeEval also supports load balancing across multiple model instances, allowing you to distribute evaluation workloads across multiple machines for better performance and scalability.
Local Hugging Face Models (local_hf)
Local Hugging Face models run directly in your Python process, giving you complete control over the execution environment. This approach is beneficial when:
- You need direct integration without HTTP overhead
- You’re working with proprietary or modified models not available via API
- You require offline evaluation capabilities
- You want to fine-tune evaluation parameters like precision or batch size
Local models require appropriate hardware (typically GPUs) and may need additional setup to ensure optimal performance. FreeEval provides configuration options to customize how models are loaded and executed, allowing you to balance performance and resource usage according to your hardware capabilities.
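As a rough illustration, the hardware-related fields of a local_hf configuration might look like the snippet below. This is a hedged sketch: the key names mirror the full example in the Configuration Examples section later on this page, but the values are illustrative rather than defaults, and the reading of num_gpus_per_model and num_gpus_total given in the comments is our interpretation, not a documented guarantee.

```json
"inference_kwargs": {
  "model_path": "./models/llama2-7b-chat",  // local path to model weights (illustrative)
  "device": "cuda",                         // run inference on GPU rather than CPU
  "num_gpus_per_model": 2,                  // shard one model copy across two GPUs
  "num_gpus_total": 4,                      // GPUs available to the evaluation run
  "max_gpu_memory": null,                   // per-GPU memory limit (null = no limit)
  "trial_run": true                         // validate the setup on a small sample first
}
```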
OpenAI Models (openai)
OpenAI models are accessed through the OpenAI API, which provides models such as GPT-3.5, GPT-4, and other OpenAI offerings. This option allows you to:
- Evaluate state-of-the-art commercial models
- Establish performance benchmarks for comparison with other models
- Access models that may not be publicly available for local deployment
- Integrate with existing OpenAI-based applications
Using OpenAI models requires an API key and follows OpenAI’s usage policies and pricing structure. FreeEval manages the API communication, formatting requests appropriately and parsing responses to fit within the evaluation framework.
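Because the API is rate-limited and billed per token, the request-handling fields of an openai configuration are worth tuning before a large run. The snippet below is an excerpt, not a complete configuration; the key names come from the full example in the Configuration Examples section, while the values are illustrative and should be matched to your own rate limits and budget.

```json
"inference_kwargs": {
  "openai_model": "gpt-4-1106-preview",  // model version to evaluate
  "openai_key": "your-openai-key",       // API key (keep it out of version control)
  "num_workers": 16,                     // parallel request workers
  "request_limit": 100000,               // maximum requests per rate-limit period
  "request_limit_period": 60             // rate-limit window in seconds
}
```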
Model Configuration
Each model type has specific configuration parameters that control its behavior during evaluation:
For remote Hugging Face models, you can specify:
- Model endpoint URL
- Authentication details
- Generation parameters (temperature, max tokens, etc.)
- Load balancing configuration for multiple endpoints
For local Hugging Face models, you can specify:
- Model identifier or path
- Device mapping (which GPUs to use)
- Precision settings (float16, int8, etc.)
- Custom prompting templates
- Caching behavior
For OpenAI models, you can configure:
- Model version (e.g., gpt-3.5-turbo, gpt-4)
- API parameters (temperature, presence_penalty, etc.)
- Response formatting
- Authentication details
The configuration system allows you to fine-tune model behavior to match your evaluation requirements, ensuring fair comparisons between different models while accounting for their unique characteristics.
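One practical way to keep comparisons fair is to pin the decoding settings to the same values for every model under test, for example greedy decoding with a shared length limit. The sketch below reuses the generation_config fields from the examples later on this page; note that the OpenAI backend names its length parameter max_tokens rather than max_new_tokens, and the exact values are illustrative.

```json
// local_hf / remote_hf models
"generation_config": {
  "max_new_tokens": 2048,
  "temperature": 0.0          // greedy, deterministic outputs
}

// openai models
"generation_config": {
  "max_tokens": 2048,
  "temperature": 0.0,         // greedy, deterministic outputs
  "seed": 0                   // fixed seed for reproducible sampling
}
```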
Deploying Models with Text Generation Inference
To facilitate high-performance model deployment, FreeEval includes a utility script for deploying models with Hugging Face’s Text Generation Inference (TGI). This approach is recommended for most evaluation scenarios involving open-source models, as it provides excellent performance and flexibility.
The deploy_model.py script handles the Docker setup and configuration for you:
```bash
python deploy_model.py --model /data --data-path /path/to/your/models --gpus 0,1,2,3 --port 8080 --follow-logs
```
Key parameters include:
- --model: The model ID or path (use /data for models in your data path)
- --data-path: The directory where your model weights are stored
- --gpus: Comma-separated list of GPU devices to use
- --port: The port to expose the HTTP service on
- --follow-logs: (Optional) Show logs after deployment
Once deployed, you can configure FreeEval to use this endpoint by specifying it in your model configuration:
"models": { "llama2-7b": { "type": "remote_hf", "url": "http://localhost:8080" }}For production environments or large-scale evaluations, you can deploy multiple instances across different machines and use FreeEval’s load balancing capabilities to distribute the workload efficiently.
Selecting Models for Evaluation
When designing an evaluation in FreeEval, consider which models will provide the most meaningful insights for your use case. You might include:
- State-of-the-art models to establish performance ceilings
- Open-source alternatives to proprietary models
- Different versions of the same model architecture to measure improvements
- Models with varying parameter counts to assess scaling effects
- Specialized models trained for your domain of interest
By evaluating multiple models using the same datasets and steps, you can gain comparative insights that help inform model selection decisions for your applications or research questions.
Model Outputs and Analysis
FreeEval standardizes outputs from different model types, allowing unified analysis regardless of the underlying implementation. The framework captures both the generated responses and additional metadata like timing information, token usage, and model-specific metrics.
This standardization enables apples-to-apples comparisons between different model types, helping you understand their relative strengths and weaknesses across various evaluation dimensions. The results can guide decisions about which models to use in production or which research directions to prioritize for future development.
Configuration Examples
Below are example configurations for the different model types supported by FreeEval. These examples show the inference_config section of a FreeEval configuration file.
Local Hugging Face Model
"inference_config": { "type": "local_hf", // Specify local Hugging Face model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "model_path": "./models/llama2-7b-chat", // Local path to model weights "generation_config": { // Configuration for text generation "stop_sequences": ["A", "B", "C", "D", "E"], "max_new_tokens": 2048, "temperature": 0.0 // 0.0 for deterministic outputs }, "device": "cuda", // Device to run inference on "num_gpus_per_model": 1, // Number of GPUs for model parallelism "num_gpus_total": 4, // Total available GPUs "max_gpu_memory": null, // GPU memory limit (null = no limit) "trial_run": false, // Set to true for testing with small sample "dump_individual_rsp": true // Save individual model responses }}Remote Hugging Face Model
"inference_config": { "type": "remote_hf", // Specify remote Hugging Face model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "model_name": "llama-2-7b-chat-hf", // Model identifier "base_url": ["http://your-tgi-url"], // Text Generation Inference API URL(s) "timeout": 10, // Request timeout in seconds "generation_config": { // Configuration for text generation "stop_sequences": ["A", "B", "C", "D", "E"], "max_new_tokens": 2048, "temperature": 0.0 // 0.0 for deterministic outputs }, "num_workers": 32, // Number of parallel workers "request_limit": 100000, // Rate limit for API requests "request_limit_period": 60, // Period for rate limit (seconds) "trial_run": false, // Set to true for testing with small sample "dump_individual_rsp": true // Save individual model responses }}OpenAI API Model
"inference_config": { "type": "openai", // Specify OpenAI model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "openai_model": "gpt-4-1106-preview", // OpenAI model identifier "openai_key": "your-openai-key", // API key for authentication "openai_api_base": "https://api.openai.com/v1", // API endpoint URL "openai_proxy": "", // Optional proxy URL (empty for none) "openai_timeout": 120.0, // Request timeout in seconds "generation_config": { // Configuration for text generation "max_tokens": 400, // Maximum tokens to generate "n": 1, // Number of completions to generate "temperature": 0.0, // 0.0 for deterministic outputs "seed": 0 // Seed for deterministic sampling }, "num_workers": 16, // Number of parallel workers "request_limit": 100000, // Rate limit for API requests "request_limit_period": 60, // Period for rate limit (seconds) "dump_individual_rsp": true // Save individual model responses }}The configuration you choose depends on your evaluation needs and available resources. Local models offer more control and can be more cost-effective but require sufficient hardware. Remote and API models provide flexibility without the need for local hardware resources but may incur usage costs.