Introduction
The Aegis Evaluation Service allows you to systematically measure the performance of multi-agent teams on various tasks. It supports both real-time evaluation during development and large-scale batch evaluation for regression testing and monitoring.
This tutorial will guide you through the Automarking use case as an example of how to:
- Set up a Ground Truth Dataset
- Create an Evaluation Job
- Match agent outputs to ground truth labels
- Retrieve Evaluation Results
This pattern is applicable to any task in the Aegis stack, as the evaluation system is fully task-agnostic and declarative.
Key Concepts
- Ground Truth Dataset: A collection of labeled data points (e.g., correct answers, expected outcomes) used for evaluation.
- Evaluation Job: A task that computes metrics comparing agent outputs against ground truth.
- Evaluation Result: An individual metric score for a single team session or run.
- Evaluation Component: A declarative configuration defining how to extract the relevant fields and which metrics to compute.
Automarking Example
Data Structure
In this use case, we have:
- question_model_answer dataset → Contains questions and model answers.
- student_answer dataset → Contains student responses linked to questions.
Each student answer is evaluated by an Autogen Team, and the results are compared against human-graded ground truth labels.
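For concreteness, records in the two datasets might look roughly like the following. Only question_id and student_answer_id matter for the matching step later; the other field names are illustrative assumptions, not a documented schema.
# Hypothetical record shapes for illustration only.
question_model_answer_record = {
    "question_id": "q_001",
    "question": "Explain why the sky appears blue.",
    "model_answer": "Shorter wavelengths are scattered more strongly (Rayleigh scattering).",
}
student_answer_record = {
    "student_answer_id": "sa_042",
    "question_id": "q_001",  # links the response back to its question
    "answer_text": "Blue light is scattered more by the atmosphere than red light.",
}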
Step 1: Create a Ground Truth Dataset
POST /api/v1/eval/groundtruth/dataset
{
  "task_id": "<task_id>",
  "name": "Automarking Golden Set",
  "description": "High quality labeled student answers.",
  "label_uri": "s3://bucket/automarking_labels.jsonl"
}
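If you are calling the API from Python, a minimal client sketch might look like this. BASE_URL, the auth header, and the assumption that the response returns the new dataset's id are placeholders, not part of the documented contract.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
payload = {
    "task_id": "<task_id>",
    "name": "Automarking Golden Set",
    "description": "High quality labeled student answers.",
    "label_uri": "s3://bucket/automarking_labels.jsonl",
}
resp = requests.post(f"{BASE_URL}/api/v1/eval/groundtruth/dataset",
                     json=payload, headers=HEADERS, timeout=30)
resp.raise_for_status()
dataset_id = resp.json()["id"]  # assumption: the response includes the dataset id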
You can also upload individual labeled items one at a time:
POST /api/v1/eval/groundtruth/item
{
  "dataset_id": "<dataset_id>",
  "tenant_id": "<tenant_id>",
  "input_uri": {
    "question_id": "<question_id>",
    "student_answer_id": "<student_answer_id>"
  },
  "ground_truth_label": {
    "pass": true,
    "score": 0.9
  }
}
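A corresponding sketch for uploading several items in a loop, using the same hypothetical BASE_URL and HEADERS; the example items are illustrative values only.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
labeled_items = [  # illustrative values; in practice these come from your grading data
    {"question_id": "q_001", "student_answer_id": "sa_042",
     "label": {"pass": True, "score": 0.9}},
]
for item in labeled_items:
    resp = requests.post(
        f"{BASE_URL}/api/v1/eval/groundtruth/item",
        json={
            "dataset_id": "<dataset_id>",
            "tenant_id": "<tenant_id>",
            "input_uri": {
                "question_id": item["question_id"],
                "student_answer_id": item["student_answer_id"],
            },
            "ground_truth_label": item["label"],
        },
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()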
Step 2: Configure Evaluation Component
The evaluation config specifies:
- How to extract the relevant fields from both ground truth and agent outputs.
- Which metric to compute (e.g., F1 score, accuracy, RAGAS metrics).
Example:
{
  "name": "automarking_accuracy",
  "description": "Measures agreement between the agent's pass/fail decision and the human grade.",
  "provider": "sklearn",
  "extraction": {
    "ground_truth_field": "pass",
    "prediction_field": "-1.data.pass"
  },
  "metric": {
    "namespace": "sklearn.metrics",
    "name": "accuracy_score",
    "params": {},
    "ci_method": "bootstrap"
  }
}
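To make the extraction and metric fields concrete, here is a rough Python sketch of what the service does with this config conceptually: extract matched ground-truth and predicted values, call the named sklearn metric, and bootstrap a confidence interval. It is an illustration under assumptions, not the service's actual implementation; in particular, "-1.data.pass" is read here as the data.pass field of the run's last message, which is an interpretation rather than documented behavior.
import random
from sklearn.metrics import accuracy_score
def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05):
    # Percentile bootstrap over matched (ground truth, prediction) pairs.
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(n_boot * alpha / 2)]
    upper = scores[int(n_boot * (1 - alpha / 2)) - 1]
    return metric(y_true, y_pred), (lower, upper)
# Illustrative extracted values: ground-truth "pass" labels vs. the agent's
# predicted pass/fail decisions, matched per (question_id, student_answer_id).
y_true = [True, True, False, True]
y_pred = [True, False, False, True]
score, ci = bootstrap_ci(y_true, y_pred, accuracy_score)
print(score, ci)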
Step 3: Create an Evaluation Job
Option 1: Evaluate Existing Agent Runs
POST /api/v1/eval/job
{
  "task_id": "<task_id>",
  "dataset_id": "<dataset_id>",
  "team_ids": ["<team_id>"],
  "metrics": ["automarking_accuracy"],
  "evaluation_component": {
    "matching_keys": ["question_id", "student_answer_id"],
    "evaluation_strategy": "match_existing"
  }
}
Option 2: Run Agents Fresh and Evaluate
This request goes to the same POST /api/v1/eval/job endpoint; only the evaluation_strategy changes:
{
  "task_id": "<task_id>",
  "dataset_id": "<dataset_id>",
  "team_ids": ["<team_id>"],
  "metrics": ["automarking_accuracy"],
  "evaluation_component": {
    "matching_keys": ["question_id", "student_answer_id"],
    "evaluation_strategy": "run_and_evaluate"
  }
}
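Either variant can be submitted from Python along these lines, using the same hypothetical client constants as before; the assumption that the create response returns the job id is not from the reference.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
job_request = {
    "task_id": "<task_id>",
    "dataset_id": "<dataset_id>",
    "team_ids": ["<team_id>"],
    "metrics": ["automarking_accuracy"],
    "evaluation_component": {
        "matching_keys": ["question_id", "student_answer_id"],
        # "match_existing" scores agent runs you already have;
        # "run_and_evaluate" executes the teams against the dataset first.
        "evaluation_strategy": "match_existing",
    },
}
resp = requests.post(f"{BASE_URL}/api/v1/eval/job",
                     json=job_request, headers=HEADERS, timeout=30)
resp.raise_for_status()
job_id = resp.json()["id"]  # assumption: the response includes the new job's id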
Step 4: Retrieve Evaluation Results
GET /api/v1/eval/job/<job_id>
Example Response:
{
  "id": "<job_id>",
  "status": "COMPLETED",
  "results": [
    {
      "session_id": "<session_id>",
      "team_id": "<team_id>",
      "metric_name": "automarking_accuracy",
      "score": 0.93,
      "std_dev": 0.02,
      "confidence_interval": [0.89, 0.96]
    }
  ]
}
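A minimal Python sketch for polling the job until it finishes; BASE_URL, HEADERS, and treating FAILED as the terminal error status are assumptions.
import time
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
def wait_for_job(job_id, poll_seconds=10):
    # Poll the job endpoint until it reaches a terminal status, then return it.
    while True:
        resp = requests.get(f"{BASE_URL}/api/v1/eval/job/{job_id}",
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_seconds)
job = wait_for_job("<job_id>")
for result in job.get("results", []):
    print(result["team_id"], result["metric_name"],
          result["score"], result["confidence_interval"])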
Step 5: Advanced Monitoring & Leaderboards
You can use evaluation results to:
- Create leaderboards comparing teams across tenants and tasks (see the sketch after this list).
- Track regression over time.
- Provide confidence intervals and standard deviations.
- Estimate sample size requirements for reliable evaluation.
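For example, a simple leaderboard can be built directly from the results payload shown in Step 4. The aggregation below (mean score per team for a single metric) is just one reasonable choice, not a prescribed one, and the result values are illustrative.
from collections import defaultdict
from statistics import mean
# results: evaluation results as returned in Step 4, possibly collected
# across several jobs, teams, tenants, and tasks.
results = [
    {"team_id": "team_a", "metric_name": "automarking_accuracy", "score": 0.93},
    {"team_id": "team_b", "metric_name": "automarking_accuracy", "score": 0.88},
    {"team_id": "team_a", "metric_name": "automarking_accuracy", "score": 0.91},
]
by_team = defaultdict(list)
for r in results:
    if r["metric_name"] == "automarking_accuracy":
        by_team[r["team_id"]].append(r["score"])
leaderboard = sorted(((mean(scores), team) for team, scores in by_team.items()),
                     reverse=True)
for avg_score, team in leaderboard:
    print(f"{team}: {avg_score:.3f}")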
Next Steps
Refer to the Evaluation API Reference and Evaluation Component Configuration Guide to:
- Add custom metrics.
- Configure human feedback integration.
- Schedule recurring evaluation jobs.
This modular evaluation stack allows you to continuously measure and improve the performance of your AI agents across any task.