Aegis Enterprise

Introduction

The Aegis Evaluation Service allows you to systematically measure the performance of multi-agent teams on various tasks. It supports both real-time evaluation during development and large-scale batch evaluation for regression testing and monitoring.

This tutorial will guide you through the Automarking use case as an example of how to:

  • Set up a Ground Truth Dataset
  • Create an Evaluation Job
  • Match agent outputs to ground truth labels
  • Retrieve Evaluation Results

This pattern is applicable to any task in the Aegis stack, as the evaluation system is fully task-agnostic and declarative.


Key Concepts

  • Ground Truth Dataset: A collection of labeled data points (e.g., correct answers, expected outcomes) used for evaluation.
  • Evaluation Job: A task that computes metrics comparing agent outputs against ground truth.
  • Evaluation Result: Individual score for a team session/run.
  • Evaluation Component: Declarative config defining how to extract relevant fields and compute metrics.
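
To keep the relationships straight, here is a rough sketch of how these objects nest. It is illustrative only; the field names echo the API calls used later in this tutorial rather than defining a schema.

# Illustrative only: how the four concepts relate to one another.
ground_truth_dataset = {"id": "<dataset_id>", "items": ["<labeled item>", "..."]}

evaluation_component = {"name": "automarking_accuracy", "extraction": "...", "metric": "..."}

evaluation_job = {
    "dataset_id": ground_truth_dataset["id"],    # what to compare against
    "metrics": [evaluation_component["name"]],   # how to score
    "results": [                                 # one Evaluation Result per session/run
        {"session_id": "<session_id>", "metric_name": "automarking_accuracy", "score": 0.93},
    ],
}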

Automarking Example

Data Structure

In this use case, we have:

  • question_model_answer dataset → Contains questions and model answers.
  • student_answer dataset → Contains student responses linked to questions.

Each student answer is evaluated by an Autogen Team, and the results are compared against human-graded ground truth labels.
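
As a rough illustration, records in the two datasets might look like the sketch below. Only question_id, student_answer_id, pass, and score appear in the API calls later in this tutorial; the other field names are assumptions made for this example.

# Illustrative records only; field names other than question_id,
# student_answer_id, pass, and score are assumptions, not part of the API.
question_model_answer_record = {
    "question_id": "q-101",
    "question_text": "Explain the difference between precision and recall.",
    "model_answer": "Precision is the fraction of predicted positives that are correct; "
                    "recall is the fraction of actual positives that are found.",
}

student_answer_record = {
    "student_answer_id": "sa-2048",
    "question_id": "q-101",  # links the response back to its question
    "answer_text": "Precision measures how many of the predictions were right...",
    "human_grade": {"pass": True, "score": 0.9},  # ground-truth label
}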


Step 1: Create a Ground Truth Dataset

POST /api/v1/eval/groundtruth/dataset
{ "task_id": "<task_id>", "name": "Automarking Golden Set", "description": "High quality labeled student answers.", "label_uri": "s3://bucket/automarking_labels.jsonl" }

You can optionally upload individual labeled pairs:

POST /api/v1/eval/groundtruth/item
{ "dataset_id": "<dataset_id>", "tenant_id": "<tenant_id>", "input_uri": { "question_id": "<question_id>", "student_answer_id": "<student_answer_id>" }, "ground_truth_label": { "pass": true, "score": 0.9 } }

Step 2: Configure Evaluation Component

The evaluation config specifies:

  • How to extract the relevant fields from both ground truth and agent outputs.
  • Which metric to compute (e.g., F1 score, accuracy, RAGAS metrics).

Example:

{ "name": "automarking_accuracy", "description": "Checks if the agent's answer passes grading.", "provider": "sklearn", "extraction": { "ground_truth_field": "pass", "prediction_field": "-1.data.pass" }, "metric": { "namespace": "sklearn.metrics", "name": "accuracy_score", "params": {}, "ci_method": "bootstrap" } }

Step 3: Create an Evaluation Job

Option 1: Evaluate Existing Agent Runs

POST /api/v1/eval/job
{ "task_id": "<task_id>", "dataset_id": "<dataset_id>", "team_ids": ["<team_id>"], "metrics": ["automarking_accuracy"], "evaluation_component": { "matching_keys": ["question_id", "student_answer_id"], "evaluation_strategy": "match_existing" } }

Option 2: Run the Agents Fresh and Evaluate

{ "task_id": "<task_id>", "dataset_id": "<dataset_id>", "team_ids": ["<team_id>"], "metrics": ["automarking_accuracy"], "evaluation_component": { "matching_keys": ["question_id", "student_answer_id"], "evaluation_strategy": "run_and_evaluate" } }

Step 4: Retrieve Evaluation Results

GET /api/v1/eval/job/<job_id>

Example Response:

{ "id": "<job_id>", "status": "COMPLETED", "results": [ { "session_id": "<session_id>", "team_id": "<team_id>", "metric_name": "automarking_accuracy", "score": 0.93, "std_dev": 0.02, "confidence_interval": [0.89, 0.96] } ] }

Step 5: Advanced Monitoring & Leaderboards

You can use evaluation results to:

  • Create leaderboards comparing teams across tenants and tasks.
  • Track regression over time.
  • Provide confidence intervals and standard deviations.
  • Estimate sample size requirements for reliable evaluation.
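
For example, a minimal leaderboard can be built by collecting the results of several completed jobs and ranking teams by mean score. This is a local post-processing sketch over the response format shown in Step 4, not a built-in endpoint:

from collections import defaultdict
from statistics import mean

def leaderboard(jobs):
    # jobs: a list of completed job payloads shaped like the Step 4 response.
    scores = defaultdict(list)
    for job in jobs:
        for result in job["results"]:
            scores[result["team_id"]].append(result["score"])
    ranked = sorted(scores.items(), key=lambda kv: mean(kv[1]), reverse=True)
    for rank, (team_id, team_scores) in enumerate(ranked, start=1):
        print(f"{rank}. {team_id}: mean score {mean(team_scores):.3f} "
              f"over {len(team_scores)} sessions")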

Next Steps

Refer to the Evaluation API Reference and Evaluation Component Configuration Guide to:

  • Add custom metrics.
  • Configure human feedback integration.
  • Schedule recurring evaluation jobs.

This modular evaluation stack allows you to continuously measure and improve the performance of your AI agents across any task.
