Why Testing Matters - Just like you wouldn’t ship code without tests, don’t deploy AI agents without evaluations! Evaluations give you confidence that your agents will handle real-world scenarios gracefully, consistently, and professionally.

What Are Evaluations?

Think of evaluations as unit tests for your AI agents! They give you:

CSV-Based Testing

Define test cases in simple CSV files - no complex setup required!

Automated Validation

Run hundreds of tests automatically with built-in assertions

LLM-as-Judge

Use AI to evaluate subjective qualities like helpfulness

Result Tracking

Export results to CSV for analysis and CI/CD integration

Creating Your First Evaluation

Step 1: Generate the Evaluation Class

Let’s create an evaluation to test a customer support agent! Run this magical command:
Terminal
php artisan vizra:make:eval CustomerSupportEvaluation
Double Magic! What Gets Created - This single command creates two files for you:
  • app/Evaluations/CustomerSupportEvaluation.php - Your evaluation class
  • app/Evaluations/data/customer_support_evaluation.csv - Empty CSV with headers ready for test data
No need to manually create the CSV file - it’s all set up and ready for you to add test cases!
Boom! Here's what the evaluation class looks like:
app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';

    public string $description = 'Evaluate customer support agent responses';

    public string $agentName = 'customer_support'; // Agent alias

    public string $csvPath = 'app/Evaluations/data/customer_support_evaluation.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Use the 'prompt' column from CSV by default
        return $csvRowData[$this->getPromptCsvColumn()] ?? '';
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        // Reset assertions for this row
        $this->resetAssertionResults();

        // Run assertions based on test type
        // test_type is an optional column, so default to '' when it is missing
        if (($csvRowData['test_type'] ?? '') === 'greeting') {
            $this->assertResponseContains($llmResponse, 'help');
            $this->assertResponseHasPositiveSentiment($llmResponse);
        }

        // Return evaluation results
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
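
The final status above hinges on one rule: a row passes only if every recorded assertion passed. Here is a minimal plain-PHP sketch of that rule, assuming (as the class above does) that each assertion result is an array with a 'status' key. The helper name rowStatus and the sample 'assertion' labels are hypothetical and not part of Vizra ADK.

```php
<?php

/**
 * Hypothetical helper (not part of Vizra ADK): derive a row's final status
 * from a list of assertion results shaped like ['status' => 'pass'|'fail'].
 */
function rowStatus(array $assertionResults): string
{
    foreach ($assertionResults as $result) {
        if (($result['status'] ?? null) !== 'pass') {
            return 'fail';
        }
    }

    // Like Collection::every(), an empty list counts as "all passed".
    return 'pass';
}

// Sample data for illustration only.
$results = [
    ['assertion' => 'contains help', 'status' => 'pass'],
    ['assertion' => 'positive sentiment', 'status' => 'fail'],
];

echo rowStatus($results), PHP_EOL; // prints "fail"
```

One consequence worth knowing: a row that triggers no assertions at all ends up with an empty list and therefore passes, so it can be worth running at least one unconditional check per row.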

Step 2: Add Your Test Data

Now for the fun part - adding test scenarios! The CSV file was automatically created with standard headers. Let’s populate it with different customer interactions:
app/Evaluations/data/customer_support_evaluation.csv
prompt,expected_response,description
"Hello, I need help",help,"Greeting test - should offer assistance"
"Where is my order #12345?",order,"Order inquiry - should help track order"
"I want to return this product",return,"Return request - should explain process"
"This is terrible service!",sorry,"Complaint - should be empathetic"
Pro Tip: Customize Your CSV Structure! - The auto-generated CSV starts with standard headers, but you can customize it for your needs:
  • prompt - The input to send to your agent (required)
  • expected_response - What you expect in the response
  • description - Human-readable test description
  • test_type - Add this to categorize tests for different assertion logic
  • context - Add background information for the test
  • Add any custom columns you need!
The command creates the basic structure - feel free to add more columns as your evaluation needs grow!
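
To picture what your evaluation class receives, here is a minimal sketch of how CSV rows can be turned into header-keyed arrays like $csvRowData. It is illustrative only: csvToRows is a hypothetical helper, the test_type column is one of the optional additions listed above, and Vizra ADK's own CSV loading may work differently.

```php
<?php

/**
 * Hypothetical helper: parse CSV text into rows keyed by the header line.
 * Splits on line breaks, so it does not support multi-line quoted fields.
 */
function csvToRows(string $csv): array
{
    $lines = preg_split('/\R/', trim($csv));

    // Explicit delimiter, enclosure, and empty escape for RFC 4180-style parsing.
    $headers = str_getcsv(array_shift($lines), ',', '"', '');

    $rows = [];
    foreach ($lines as $line) {
        $rows[] = array_combine($headers, str_getcsv($line, ',', '"', ''));
    }

    return $rows;
}

$csv = <<<CSV
prompt,expected_response,description,test_type
"Hello, I need help",help,"Greeting test - should offer assistance",greeting
"This is terrible service!",sorry,"Complaint - should be empathetic",complaint
CSV;

$rows = csvToRows($csv);

echo $rows[0]['prompt'], PHP_EOL;    // prints "Hello, I need help"
echo $rows[1]['test_type'], PHP_EOL; // prints "complaint"
```

Note that the first prompt contains a comma, which is why it has to be wrapped in double quotes in the CSV file.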

Your Assertion Toolbox

Vizra ADK provides a rich collection of assertions to validate every aspect of your agent’s responses!

Content Assertions

// Check if response contains text
$this->assertResponseContains($llmResponse, 'expected text');
$this->assertResponseDoesNotContain($llmResponse, 'unwanted');

// Pattern matching
$this->assertResponseMatchesRegex($llmResponse, '/pattern/');

// Position checks
$this->assertResponseStartsWith($llmResponse, 'Hello');
$this->assertResponseEndsWith($llmResponse, '.');

// Multiple checks
$this->assertContainsAnyOf($llmResponse, ['yes', 'sure', 'okay']);
$this->assertContainsAllOf($llmResponse, ['thank', 'you']);
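
If the difference between "any of" and "all of" is unclear, this plain-PHP sketch shows the idea. The two functions are hypothetical stand-ins, not Vizra ADK code, and matching is case-insensitive here by choice; the library's exact matching rules are not confirmed by this page.

```php
<?php

// Hypothetical stand-ins for illustration; not Vizra ADK's implementation.

function containsAnyOf(string $response, array $needles): bool
{
    foreach ($needles as $needle) {
        if (stripos($response, $needle) !== false) {
            return true; // One match is enough.
        }
    }

    return false;
}

function containsAllOf(string $response, array $needles): bool
{
    foreach ($needles as $needle) {
        if (stripos($response, $needle) === false) {
            return false; // One miss fails the whole check.
        }
    }

    return true;
}

$response = 'Sure, thank you for reaching out!';

var_dump(containsAnyOf($response, ['yes', 'sure', 'okay'])); // bool(true)
var_dump(containsAllOf($response, ['thank', 'you']));        // bool(true)
var_dump(containsAllOf($response, ['thank', 'refund']));     // bool(false)
```

Because this is substring matching, 'you' would also match 'your'. When whole-word matching matters, a regex with word boundaries via assertResponseMatchesRegex is the safer choice.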

Length & Structure

// Response length
$this->assertResponseLengthBetween($llmResponse, 50, 500);

// Word count
$this->assertWordCountBetween($llmResponse, 10, 100);

// Format validation
$this->assertResponseIsValidJson($llmResponse);
$this->assertJsonHasKey($llmResponse, 'result');
$this->assertResponseIsValidXml($llmResponse);
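
For a feel of what the structure checks verify, here is a minimal plain-PHP sketch of a word-count range check and a JSON key check. Both helpers are hypothetical, and the inclusive bounds are an assumption; Vizra ADK's own assertions may count words or treat edge cases differently.

```php
<?php

// Hypothetical helpers for illustration; not Vizra ADK's implementation.

function wordCountBetween(string $text, int $min, int $max): bool
{
    // Split on runs of whitespace and count the non-empty pieces.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);

    // Bounds are treated as inclusive in this sketch.
    return $count >= $min && $count <= $max;
}

function jsonHasKey(string $json, string $key): bool
{
    $decoded = json_decode($json, true);

    // Invalid JSON and scalar payloads both fail the check.
    return is_array($decoded) && array_key_exists($key, $decoded);
}

var_dump(wordCountBetween('I can help you track your order.', 5, 10)); // bool(true)
var_dump(jsonHasKey('{"result": "ok"}', 'result'));                    // bool(true)
var_dump(jsonHasKey('not json', 'result'));                            // bool(false)
```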

Quality Checks

// Sentiment analysis
$this->assertResponseHasPositiveSentiment($llmResponse);

// Writing quality
$this->assertGrammarCorrect($llmResponse);
$this->assertReadabilityLevel($llmResponse, 12);
$this->assertNoRepetition($llmResponse, 0.3);

Safety & Security

// Content safety
$this->assertNotToxic($llmResponse);

// Privacy protection
$this->assertNoPII($llmResponse);

// General safety
$this->assertResponseIsNotEmpty($llmResponse);

LLM as Judge - The Ultimate Quality Check

Sometimes you need another AI to evaluate subjective qualities. That’s where LLM-as-Judge comes in!
When to Use LLM Judge? - Perfect for evaluating:
  • Helpfulness and professionalism
  • Empathy and emotional intelligence
  • Creativity and originality
  • Accuracy of complex responses
  • Overall response quality

Using LLM Judge Assertions

New Fluent Judge Interface! - We’ve introduced a cleaner, more intuitive syntax for judge assertions:
app/Evaluations/CustomerSupportEvaluation.php
public function evaluateRow(array $csvRowData, string $llmResponse): array
{
    $this->resetAssertionResults();

    // Simple pass/fail evaluation
    $this->judge($llmResponse)
        ->using(PassFailJudgeAgent::class)
        ->expectPass();

    // Quality score evaluation
    $this->judge($llmResponse)
        ->using(QualityJudgeAgent::class)
        ->expectMinimumScore(7.5);

    // Multi-dimensional evaluation
    $this->judge($llmResponse)
        ->using(ComprehensiveJudgeAgent::class)
        ->expectMinimumScore([
            'accuracy' => 8,
            'helpfulness' => 7,
            'clarity' => 7
        ]);

    // Return results...
}

Three Judge Patterns

1. Pass/Fail Judge - For binary decisions. Returns {"pass": true/false, "reasoning": "..."}
$this->judge($response)
    ->using(PassFailJudgeAgent::class)
    ->expectPass();
2. Quality Score Judge - For numeric ratings. Returns {"score": 8.5, "reasoning": "..."}
$this->judge($response)
    ->using(QualityJudgeAgent::class)
    ->expectMinimumScore(7.0);
3. Comprehensive Judge - For multi-dimensional evaluation. Returns {"scores": {...}, "reasoning": "..."}
$this->judge($response)
    ->using(ComprehensiveJudgeAgent::class)
    ->expectMinimumScore([
        'accuracy' => 8,
        'helpfulness' => 7,
        'clarity' => 7
    ]);
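
To make the three reply shapes concrete, here is a minimal sketch of how a judge's JSON reply could be checked against an expectation. It is only a mental model: judgeReplyPasses is a hypothetical function, the sample replies are made up, and the way expectPass() and expectMinimumScore() work internally is not shown on this page.

```php
<?php

// Hypothetical checker, not Vizra ADK code: decides whether a judge's JSON
// reply meets an expectation, using the three reply shapes described above.
function judgeReplyPasses(string $json, bool|int|float|array $expectation): bool
{
    $reply = json_decode($json, true);
    if (!is_array($reply)) {
        return false; // Unparseable reply: treat as a failure.
    }

    // Pass/fail judge: the reply's "pass" flag must equal the expectation.
    if (is_bool($expectation)) {
        return ($reply['pass'] ?? null) === $expectation;
    }

    // Comprehensive judge: every named dimension must reach its minimum.
    if (is_array($expectation)) {
        foreach ($expectation as $dimension => $minimum) {
            if (($reply['scores'][$dimension] ?? -INF) < $minimum) {
                return false;
            }
        }

        return true;
    }

    // Quality judge: a single numeric score must reach the minimum.
    return ($reply['score'] ?? -INF) >= $expectation;
}

// Sample replies for illustration only.
$passFail = '{"pass": true, "reasoning": "Polite and accurate"}';
$quality = '{"score": 8.5, "reasoning": "Clear and complete"}';
$comprehensive = '{"scores": {"accuracy": 8, "helpfulness": 6, "clarity": 9}, "reasoning": "Vague next steps"}';

var_dump(judgeReplyPasses($passFail, true)); // bool(true)
var_dump(judgeReplyPasses($quality, 7.0));   // bool(true)
var_dump(judgeReplyPasses($comprehensive, [
    'accuracy' => 8,
    'helpfulness' => 7,
    'clarity' => 7,
])); // bool(false): helpfulness is 6, below the minimum of 7
```

In this sketch a missing score counts as a failure rather than being skipped, which keeps a malformed judge reply from silently passing.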

Running Your Evaluations

Time to put your agent to the test! Let’s see how it performs!

Running from CLI

Terminal
# Run evaluation by class name
php artisan vizra:run:eval CustomerSupportEvaluation

# Save results to CSV for analysis
php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

# Results are saved to storage/app/evaluations/ by default

What You’ll See

Watch the magic happen with a beautiful progress bar and detailed results!
Console Output
Running evaluation: customer_support_eval
Description: Evaluate customer support agent responses
Processing 4 rows from CSV using agent 'customer_support'...
████████████████████████████████████████ 4/4
Evaluation processing complete.

┌─────┬──────────────┬──────────────────────────┬─────────────────┬───────┐
│ Row │ Final Status │ LLM Response Summary     │ Assertions Count│ Error │
├─────┼──────────────┼──────────────────────────┼─────────────────┼───────┤
│ 1   │ ✅ pass      │ Hello! I'd be happy to...│ 2               │       │
│ 2   │ ✅ pass      │ I can help you track...  │ 1               │       │
│ 3   │ ❌ fail      │ Sure, let me assist...   │ 2               │       │
│ 4   │ ✅ pass      │ I understand your...     │ 3               │       │
└─────┴──────────────┴──────────────────────────┴─────────────────┴───────┘

Summary: Total Rows: 4, Passed: 3 (75%), Failed: 1 (25%), Errors: 0

Advanced Example - Putting It All Together

Ready for the full experience? Here’s a complete evaluation implementation that showcases all the techniques!
app/Evaluations/CustomerSupportEvaluation.php
<?php

namespace App\Evaluations;

use Vizra\VizraADK\Evaluations\BaseEvaluation;

class CustomerSupportEvaluation extends BaseEvaluation
{
    public string $name = 'customer_support_eval';
    public string $description = 'Comprehensive customer support evaluation';
    public string $agentName = 'customer_support';
    public string $csvPath = 'app/Evaluations/data/customer_support_evaluation.csv';

    public function preparePrompt(array $csvRowData): string
    {
        // Get the base prompt
        $prompt = $csvRowData[$this->getPromptCsvColumn()] ?? '';

        // Add context if available
        // Skip rows where the context column is missing or blank
        if (!empty($csvRowData['context'])) {
            $prompt = "Context: " . $csvRowData['context'] . "\n\n" . $prompt;
        }

        return $prompt;
    }

    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Basic content checks
        if (isset($csvRowData['expected_contains'])) {
            $this->assertResponseContains(
                $llmResponse,
                $csvRowData['expected_contains']
            );
        }

        // Test type specific assertions
        switch ($csvRowData['test_type'] ?? '') {
            case 'greeting':
                $this->assertResponseHasPositiveSentiment($llmResponse);
                $this->assertWordCountBetween($llmResponse, 10, 50);
                break;

            case 'complaint':
                $this->assertResponseContains($llmResponse, 'sorry');
                $this->assertNotToxic($llmResponse);
                $this->assertLlmJudge(
                    $llmResponse,
                    'Is this response empathetic and de-escalating?',
                    'llm_judge',
                    'pass'
                );
                break;

            case 'technical':
                $this->assertReadabilityLevel($llmResponse, 12);
                $this->assertGrammarCorrect($llmResponse);
                break;
        }

        // General quality checks
        $this->assertResponseIsNotEmpty($llmResponse);
        $this->assertNoPII($llmResponse);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}

Analyzing Your Results

CSV Output Structure

When you export results with --output, you get a comprehensive CSV report!
CSV Columns Explained:
  • Evaluation Name - The name of your evaluation
  • Row Index - Which test case from your CSV
  • Final Status - pass, fail, or error
  • LLM Response - What your agent actually said
  • Assertions (JSON) - Detailed results of each check
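
Once results are in a CSV, a few lines of PHP can turn them into a pass rate. The sketch below is a hypothetical helper, not part of Vizra ADK. It assumes the export has a header row with a "Final Status" column as listed above, so check the exact header text in a real export before relying on it. The sample file contents are made up for the demo.

```php
<?php

/**
 * Hypothetical helper: compute the pass rate from an exported results CSV.
 * Assumes a header row containing the given status column.
 */
function passRate(string $csvPath, string $statusColumn = 'Final Status'): float
{
    $handle = fopen($csvPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open {$csvPath}");
    }

    $headers = fgetcsv($handle, null, ',', '"', '');
    $index = array_search($statusColumn, $headers ?: [], true);
    if ($index === false) {
        fclose($handle);
        throw new RuntimeException("Column '{$statusColumn}' not found");
    }

    $total = 0;
    $passed = 0;
    while (($row = fgetcsv($handle, null, ',', '"', '')) !== false) {
        if ($row === [null]) {
            continue; // Skip blank lines.
        }
        $total++;
        if (($row[$index] ?? '') === 'pass') {
            $passed++;
        }
    }
    fclose($handle);

    return $total === 0 ? 0.0 : $passed / $total;
}

// Usage with a small sample file (sample data, not real results).
$path = tempnam(sys_get_temp_dir(), 'eval');
file_put_contents($path, "Evaluation Name,Row Index,Final Status\n"
    . "customer_support_eval,1,pass\n"
    . "customer_support_eval,2,pass\n"
    . "customer_support_eval,3,fail\n"
    . "customer_support_eval,4,pass\n");

$rate = passRate($path);
printf("Pass rate: %.0f%%\n", $rate * 100); // prints "Pass rate: 75%"
unlink($path);
```

A script like this can double as a quality gate: compare the rate against a threshold you choose and exit with a non-zero status when it falls short.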

Creating Custom Assertions

Need something specific? Create your own reusable assertion classes!

Simple Example: Product Name Assertion

Let’s create a simple assertion that checks if a product name is mentioned:
app/Evaluations/Assertions/ContainsProductAssertion.php
<?php

namespace App\Evaluations\Assertions;

use Vizra\VizraADK\Evaluations\Assertions\BaseAssertion;

class ContainsProductAssertion extends BaseAssertion
{
    public function assert(string $response, ...$params): array
    {
        $productName = $params[0] ?? '';

        if (empty($productName)) {
            return $this->result(false, 'Product name parameter is required');
        }

        $contains = stripos($response, $productName) !== false;

        return $this->result(
            $contains,
            "Response should mention the product '{$productName}'",
            "contains '{$productName}'",
            $contains ? "found '{$productName}'" : "product not mentioned"
        );
    }
}

Using Your Custom Assertion

app/Evaluations/ProductReviewEvaluation.php
use App\Evaluations\Assertions\ContainsProductAssertion;
use Vizra\VizraADK\Evaluations\BaseEvaluation;

class ProductReviewEvaluation extends BaseEvaluation
{
    public function evaluateRow(array $csvRowData, string $llmResponse): array
    {
        $this->resetAssertionResults();

        // Use your custom assertion: pass the class name, then its parameters
        $this->assertCustom(ContainsProductAssertion::class, $llmResponse, 'MacBook Pro');

        // Mix with built-in assertions
        $this->assertWordCountBetween($llmResponse, 50, 200);

        // Determine final status
        $allPassed = collect($this->assertionResults)
            ->every(fn($r) => $r['status'] === 'pass');

        return [
            'row_data' => $csvRowData,
            'llm_response' => $llmResponse,
            'assertions' => $this->assertionResults,
            'final_status' => $allPassed ? 'pass' : 'fail',
        ];
    }
}
Pro Tip: CSV-Driven Custom Assertions! - You can even specify custom assertions in your CSV files:
prompt,assertion_class,assertion_params
"Tell me about the new iPhone",ContainsProductAssertion,"[""iPhone""]"
"Describe the MacBook features",ContainsProductAssertion,"[""MacBook""]"
Then use them dynamically in your evaluation:
if (isset($csvRowData['assertion_class'])) {
    // Fall back to no parameters if the column is blank or not valid JSON
    $params = json_decode($csvRowData['assertion_params'] ?? '[]', true) ?? [];
    $this->assertCustom($csvRowData['assertion_class'], $llmResponse, ...$params);
}
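
Quoting is the tricky part of putting JSON inside a CSV column. Here is a minimal sketch of how a JSON parameter column round-trips through PHP's CSV parser and json_decode. It assumes the file is read with PHP's standard CSV functions and that inner quotes are written doubled, as RFC 4180 prescribes; how Vizra ADK itself loads the file is not confirmed here.

```php
<?php

// A CSV line whose last field holds a JSON array. Inner quotes are doubled
// ("") per RFC 4180 so the CSV parser hands back valid JSON.
$line = '"Tell me about the new iPhone",ContainsProductAssertion,"[""iPhone""]"';

[$prompt, $assertionClass, $rawParams] = str_getcsv($line, ',', '"', '');

// Fall back to an empty list if the column is blank or not valid JSON.
$params = json_decode($rawParams, true);
if (!is_array($params)) {
    $params = [];
}

echo $rawParams, PHP_EOL; // prints ["iPhone"]
echo $params[0], PHP_EOL; // prints iPhone
```

The decoded $params array is what gets spread into the assertion call with the ... operator, so each JSON element becomes one positional argument.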

Generate Assertion Classes with Artisan

Creating new assertions is super easy with our generator command!
Terminal
php artisan vizra:make:assertion EmailValidationAssertion
This creates a ready-to-use assertion class with helpful boilerplate!

Built-in Custom Assertions

Vizra ADK comes with several ready-to-use custom assertions:

ContainsProductAssertion

Check if a product name is mentioned

JsonSchemaAssertion

Validate JSON structure against a schema

PriceFormatAssertion

Verify price formatting in any currency

EmailFormatAssertion

Check for valid email addresses

CI/CD Integration

Make testing automatic! Here’s how to add evaluations to your CI/CD pipeline!
.github/workflows/evaluate.yml
# Evaluate agents on every push
name: Evaluate Agents
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup PHP
        uses: shivammathur/setup-php@v2
        with:
          php-version: '8.2'

      - name: Install Dependencies
        run: composer install

      - name: Run Evaluations
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          php artisan vizra:run:eval CustomerSupportEvaluation --output=results.csv

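      - name: Check Results
        run: |
          # app:check-eval-results is a placeholder for a command you write yourself;
          # add your own pass/fail logic based on the CSV results
          php artisan app:check-eval-results storage/app/evaluations/results.csv

      - name: Upload Results
        if: always() # Upload the report even when the check above fails
        uses: actions/upload-artifact@v4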
        with:
          name: evaluation-results
          path: storage/app/evaluations/

Best Practices for Awesome Evaluations

Organization

  • CSV Organization - Use clear test types and descriptive columns
  • Thorough Testing - Combine multiple assertion types
  • LLM Judge - Use for subjective quality checks
  • CI/CD Integration - Run evaluations on every push

Quality

  • Track Progress - Monitor performance over time
  • Real Data - Include actual user queries
  • Edge Cases - Test error scenarios too
  • Consistency - Use the same criteria across agents

You’re Ready to Test Like a Pro!

With evaluations, you can ship AI agents with confidence! Your agents will be tested, validated, and ready for real-world challenges. Happy testing!