Models
Models are the language models being evaluated in FreeEval. The framework supports different types of models through a unified interface, allowing you to compare various model architectures and implementations using the same evaluation methodology.
Model Types
FreeEval supports three primary types of models to accommodate different deployment scenarios and resource constraints:
Remote Hugging Face Models (remote_hf)
Remote Hugging Face models in FreeEval can be accessed in two ways:
- Via Hugging Face Inference API: Access hosted models directly from Hugging Face’s API service
- Via Text Generation Inference (TGI): Deploy models as HTTP services on your own infrastructure
The second approach uses Hugging Face’s Text Generation Inference package, a high-performance serving solution specifically optimized for large language models. This gives you several advantages:
- Industrial-grade performance with optimized GPU utilization
- Support for model sharding across multiple GPUs
- Advanced features like continuous batching and token streaming
- Consistent API interface compatible with Hugging Face’s hosted services
FreeEval provides a convenient script (deploy_model.py) to help you deploy models using TGI with a single command. For example:
```bash
python deploy_model.py --model meta-llama/Llama-2-7b-chat-hf --data-path /path/to/models --gpus 0,1 --port 8080
```
This command deploys the Llama-2-7b-chat model, sharded across GPUs 0 and 1, and makes it available as an HTTP service on port 8080.
FreeEval also supports load balancing across multiple model instances, allowing you to distribute evaluation workloads across multiple machines for better performance and scalability.
Local Hugging Face Models (local_hf)
Local Hugging Face models run directly in your Python process, giving you complete control over the execution environment. This approach is beneficial when:
- You need direct integration without HTTP overhead
- You’re working with proprietary or modified models not available via API
- You require offline evaluation capabilities
- You want to fine-tune evaluation parameters like precision or batch size
Local models require appropriate hardware (typically GPUs) and may need additional setup to ensure optimal performance. FreeEval provides configuration options to customize how models are loaded and executed, allowing you to balance performance and resource usage according to your hardware capabilities.
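As a rough illustration, the hardware-related fields of a local_hf configuration might look like the snippet below. This is a hedged sketch: the key names mirror the full example in the Configuration Examples section later on this page, but the values are illustrative rather than defaults, and the reading of num_gpus_per_model and num_gpus_total given in the comments is our interpretation, not a documented guarantee.

```json
"inference_kwargs": {
  "model_path": "./models/llama2-7b-chat",  // local path to model weights (illustrative)
  "device": "cuda",                         // run inference on GPU rather than CPU
  "num_gpus_per_model": 2,                  // shard one model copy across two GPUs
  "num_gpus_total": 4,                      // GPUs available to the evaluation run
  "max_gpu_memory": null,                   // per-GPU memory limit (null = no limit)
  "trial_run": true                         // validate the setup on a small sample first
}
```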
OpenAI Models (openai)
OpenAI models are accessed through the OpenAI API, which provides models such as GPT-3.5, GPT-4, and other OpenAI offerings. This option allows you to:
- Evaluate state-of-the-art commercial models
- Establish performance benchmarks for comparison with other models
- Access models that may not be publicly available for local deployment
- Integrate with existing OpenAI-based applications
Using OpenAI models requires an API key and follows OpenAI’s usage policies and pricing structure. FreeEval manages the API communication, formatting requests appropriately and parsing responses to fit within the evaluation framework.
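Because the API is rate-limited and billed per token, the request-handling fields of an openai configuration are worth tuning before a large run. The snippet below is an excerpt, not a complete configuration; the key names come from the full example in the Configuration Examples section, while the values are illustrative and should be matched to your own rate limits and budget.

```json
"inference_kwargs": {
  "openai_model": "gpt-4-1106-preview",  // model version to evaluate
  "openai_key": "your-openai-key",       // API key (keep it out of version control)
  "num_workers": 16,                     // parallel request workers
  "request_limit": 100000,               // maximum requests per rate-limit period
  "request_limit_period": 60             // rate-limit window in seconds
}
```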
Model Configuration
Each model type has specific configuration parameters that control its behavior during evaluation:
For remote Hugging Face models, you can specify:
- Model endpoint URL
- Authentication details
- Generation parameters (temperature, max tokens, etc.)
- Load balancing configuration for multiple endpoints
For local Hugging Face models, you can specify:
- Model identifier or path
- Device mapping (which GPUs to use)
- Precision settings (float16, int8, etc.)
- Custom prompting templates
- Caching behavior
For OpenAI models, you can configure:
- Model version (e.g., gpt-3.5-turbo, gpt-4)
- API parameters (temperature, presence_penalty, etc.)
- Response formatting
- Authentication details
The configuration system allows you to fine-tune model behavior to match your evaluation requirements, ensuring fair comparisons between different models while accounting for their unique characteristics.
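One practical way to keep comparisons fair is to pin the decoding settings to the same values for every model under test, for example greedy decoding with a shared length limit. The sketch below reuses the generation_config fields from the examples later on this page; note that the OpenAI backend names its length parameter max_tokens rather than max_new_tokens, and the exact values are illustrative.

```json
// local_hf / remote_hf models
"generation_config": {
  "max_new_tokens": 2048,
  "temperature": 0.0          // greedy, deterministic outputs
}

// openai models
"generation_config": {
  "max_tokens": 2048,
  "temperature": 0.0,         // greedy, deterministic outputs
  "seed": 0                   // fixed seed for reproducible sampling
}
```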
Deploying Models with Text Generation Inference
To facilitate high-performance model deployment, FreeEval includes a utility script for deploying models with Hugging Face’s Text Generation Inference (TGI). This approach is recommended for most evaluation scenarios involving open-source models, as it provides excellent performance and flexibility.
The deploy_model.py script handles the Docker setup and configuration for you:
```bash
python deploy_model.py --model /data --data-path /path/to/your/models --gpus 0,1,2,3 --port 8080 --follow-logs
```
Key parameters include:
- --model: The model ID or path (use /data for models in your data path)
- --data-path: The directory where your model weights are stored
- --gpus: Comma-separated list of GPU devices to use
- --port: The port to expose the HTTP service on
- --follow-logs: (Optional) Show logs after deployment
Once deployed, you can configure FreeEval to use this endpoint by specifying it in your model configuration:
"models": { "llama2-7b": { "type": "remote_hf", "url": "http://localhost:8080" }}For production environments or large-scale evaluations, you can deploy multiple instances across different machines and use FreeEval’s load balancing capabilities to distribute the workload efficiently.
Selecting Models for Evaluation
When designing an evaluation in FreeEval, consider which models will provide the most meaningful insights for your use case. You might include:
- State-of-the-art models to establish performance ceilings
- Open-source alternatives to proprietary models
- Different versions of the same model architecture to measure improvements
- Models with varying parameter counts to assess scaling effects
- Specialized models trained for your domain of interest
By evaluating multiple models using the same datasets and steps, you can gain comparative insights that help inform model selection decisions for your applications or research questions.
Model Outputs and Analysis
FreeEval standardizes outputs from different model types, allowing unified analysis regardless of the underlying implementation. The framework captures both the generated responses and additional metadata like timing information, token usage, and model-specific metrics.
This standardization enables apples-to-apples comparisons between different model types, helping you understand their relative strengths and weaknesses across various evaluation dimensions. The results can guide decisions about which models to use in production or which research directions to prioritize for future development.
Configuration Examples
Below are example configurations for the different model types supported by FreeEval. These examples show the inference_config section of a FreeEval configuration file.
Local Hugging Face Model
"inference_config": { "type": "local_hf", // Specify local Hugging Face model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "model_path": "./models/llama2-7b-chat", // Local path to model weights "generation_config": { // Configuration for text generation "stop_sequences": ["A", "B", "C", "D", "E"], "max_new_tokens": 2048, "temperature": 0.0 // 0.0 for deterministic outputs }, "device": "cuda", // Device to run inference on "num_gpus_per_model": 1, // Number of GPUs for model parallelism "num_gpus_total": 4, // Total available GPUs "max_gpu_memory": null, // GPU memory limit (null = no limit) "trial_run": false, // Set to true for testing with small sample "dump_individual_rsp": true // Save individual model responses }}Remote Hugging Face Model
"inference_config": { "type": "remote_hf", // Specify remote Hugging Face model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "model_name": "llama-2-7b-chat-hf", // Model identifier "base_url": ["http://your-tgi-url"], // Text Generation Inference API URL(s) "timeout": 10, // Request timeout in seconds "generation_config": { // Configuration for text generation "stop_sequences": ["A", "B", "C", "D", "E"], "max_new_tokens": 2048, "temperature": 0.0 // 0.0 for deterministic outputs }, "num_workers": 32, // Number of parallel workers "request_limit": 100000, // Rate limit for API requests "request_limit_period": 60, // Period for rate limit (seconds) "trial_run": false, // Set to true for testing with small sample "dump_individual_rsp": true // Save individual model responses }}OpenAI API Model
"inference_config": { "type": "openai", // Specify OpenAI model "output_path": "./outputs", // Path to save evaluation results "inference_kwargs": { "openai_model": "gpt-4-1106-preview", // OpenAI model identifier "openai_key": "your-openai-key", // API key for authentication "openai_api_base": "https://api.openai.com/v1", // API endpoint URL "openai_proxy": "", // Optional proxy URL (empty for none) "openai_timeout": 120.0, // Request timeout in seconds "generation_config": { // Configuration for text generation "max_tokens": 400, // Maximum tokens to generate "n": 1, // Number of completions to generate "temperature": 0.0, // 0.0 for deterministic outputs "seed": 0 // Seed for deterministic sampling }, "num_workers": 16, // Number of parallel workers "request_limit": 100000, // Rate limit for API requests "request_limit_period": 60, // Period for rate limit (seconds) "dump_individual_rsp": true // Save individual model responses }}The configuration you choose depends on your evaluation needs and available resources. Local models offer more control and can be more cost-effective but require sufficient hardware. Remote and API models provide flexibility without the need for local hardware resources but may incur usage costs.