Introduction
The Aegis Evaluation Service allows you to systematically measure the performance of multi-agent teams on various tasks. It supports both real-time evaluation during development and large-scale batch evaluation for regression testing and monitoring.
This tutorial will guide you through the Automarking use case as an example of how to:
- Set up a Ground Truth Dataset
- Create an Evaluation Job
- Match agent outputs to ground truth labels
- Retrieve Evaluation Results
This pattern is applicable to any task in the Aegis stack, as the evaluation system is fully task-agnostic and declarative.
Key Concepts
- Ground Truth Dataset: A collection of labeled data points (e.g., correct answers, expected outcomes) used for evaluation.
- Evaluation Job: A task that computes metrics comparing agent outputs against ground truth.
- Evaluation Result: An individual metric score for a single team session or run.
- Evaluation Component: A declarative configuration defining how to extract the relevant fields and which metrics to compute.
Automarking Example
Data Structure
In this use case, we have:
- question_model_answer dataset → Contains questions and model answers.
- student_answer dataset → Contains student responses linked to questions.
Each student answer is evaluated by an Autogen Team, and the results are compared against human-graded ground truth labels.
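For concreteness, records in the two datasets might look roughly like the following. Only question_id and student_answer_id matter for the matching step later; the other field names are illustrative assumptions, not a documented schema.
# Hypothetical record shapes for illustration only.
question_model_answer_record = {
    "question_id": "q_001",
    "question": "Explain why the sky appears blue.",
    "model_answer": "Shorter wavelengths are scattered more strongly (Rayleigh scattering).",
}
student_answer_record = {
    "student_answer_id": "sa_042",
    "question_id": "q_001",  # links the response back to its question
    "answer_text": "Blue light is scattered more by the atmosphere than red light.",
}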
Step 1: Create a Ground Truth Dataset
POST /api/v1/eval/groundtruth/dataset
{
  "task_id": "<task_id>",
  "name": "Automarking Golden Set",
  "description": "High quality labeled student answers.",
  "label_uri": "s3://bucket/automarking_labels.jsonl"
}
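If you are calling the API from Python, a minimal client sketch might look like this. BASE_URL, the auth header, and the assumption that the response returns the new dataset's id are placeholders, not part of the documented contract.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
payload = {
    "task_id": "<task_id>",
    "name": "Automarking Golden Set",
    "description": "High quality labeled student answers.",
    "label_uri": "s3://bucket/automarking_labels.jsonl",
}
resp = requests.post(f"{BASE_URL}/api/v1/eval/groundtruth/dataset",
                     json=payload, headers=HEADERS, timeout=30)
resp.raise_for_status()
dataset_id = resp.json()["id"]  # assumption: the response includes the dataset id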
You can also upload individual labeled items one at a time:
POST /api/v1/eval/groundtruth/item
{
  "dataset_id": "<dataset_id>",
  "tenant_id": "<tenant_id>",
  "input_uri": {
    "question_id": "<question_id>",
    "student_answer_id": "<student_answer_id>"
  },
  "ground_truth_label": {
    "pass": true,
    "score": 0.9
  }
}
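A corresponding sketch for uploading several items in a loop, using the same hypothetical BASE_URL and HEADERS; the example items are illustrative values only.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
labeled_items = [  # illustrative values; in practice these come from your grading data
    {"question_id": "q_001", "student_answer_id": "sa_042",
     "label": {"pass": True, "score": 0.9}},
]
for item in labeled_items:
    resp = requests.post(
        f"{BASE_URL}/api/v1/eval/groundtruth/item",
        json={
            "dataset_id": "<dataset_id>",
            "tenant_id": "<tenant_id>",
            "input_uri": {
                "question_id": item["question_id"],
                "student_answer_id": item["student_answer_id"],
            },
            "ground_truth_label": item["label"],
        },
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()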
Step 2: Configure Evaluation Component
The evaluation config specifies:
- How to extract the relevant fields from both ground truth and agent outputs.
- Which metric to compute (e.g., F1 score, accuracy, RAGAS metrics).
Example:
{
  "name": "automarking_accuracy",
  "description": "Measures agreement between the agent's pass/fail decision and the human grade.",
  "provider": "sklearn",
  "extraction": {
    "ground_truth_field": "pass",
    "prediction_field": "-1.data.pass"
  },
  "metric": {
    "namespace": "sklearn.metrics",
    "name": "accuracy_score",
    "params": {},
    "ci_method": "bootstrap"
  }
}
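To make the extraction and metric fields concrete, here is a rough Python sketch of what the service does with this config conceptually: extract matched ground-truth and predicted values, call the named sklearn metric, and bootstrap a confidence interval. It is an illustration under assumptions, not the service's actual implementation; in particular, "-1.data.pass" is read here as the data.pass field of the run's last message, which is an interpretation rather than documented behavior.
import random
from sklearn.metrics import accuracy_score
def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05):
    # Percentile bootstrap over matched (ground truth, prediction) pairs.
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(n_boot * alpha / 2)]
    upper = scores[int(n_boot * (1 - alpha / 2)) - 1]
    return metric(y_true, y_pred), (lower, upper)
# Illustrative extracted values: ground-truth "pass" labels vs. the agent's
# predicted pass/fail decisions, matched per (question_id, student_answer_id).
y_true = [True, True, False, True]
y_pred = [True, False, False, True]
score, ci = bootstrap_ci(y_true, y_pred, accuracy_score)
print(score, ci)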
Step 3: Create an Evaluation Job
Option 1: Evaluate Existing Agent Runs
POST /api/v1/eval/job
{
  "task_id": "<task_id>",
  "dataset_id": "<dataset_id>",
  "team_ids": ["<team_id>"],
  "metrics": ["automarking_accuracy"],
  "evaluation_component": {
    "matching_keys": ["question_id", "student_answer_id"],
    "evaluation_strategy": "match_existing"
  }
}
Option 2: Run Agents Fresh and Evaluate
This request goes to the same POST /api/v1/eval/job endpoint; only the evaluation_strategy changes:
{
  "task_id": "<task_id>",
  "dataset_id": "<dataset_id>",
  "team_ids": ["<team_id>"],
  "metrics": ["automarking_accuracy"],
  "evaluation_component": {
    "matching_keys": ["question_id", "student_answer_id"],
    "evaluation_strategy": "run_and_evaluate"
  }
}
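Either variant can be submitted from Python along these lines, using the same hypothetical client constants as before; the assumption that the create response returns the job id is not from the reference.
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
job_request = {
    "task_id": "<task_id>",
    "dataset_id": "<dataset_id>",
    "team_ids": ["<team_id>"],
    "metrics": ["automarking_accuracy"],
    "evaluation_component": {
        "matching_keys": ["question_id", "student_answer_id"],
        # "match_existing" scores agent runs you already have;
        # "run_and_evaluate" executes the teams against the dataset first.
        "evaluation_strategy": "match_existing",
    },
}
resp = requests.post(f"{BASE_URL}/api/v1/eval/job",
                     json=job_request, headers=HEADERS, timeout=30)
resp.raise_for_status()
job_id = resp.json()["id"]  # assumption: the response includes the new job's id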
Step 4: Retrieve Evaluation Results
GET /api/v1/eval/job/<job_id>
Example Response:
{
  "id": "<job_id>",
  "status": "COMPLETED",
  "results": [
    {
      "session_id": "<session_id>",
      "team_id": "<team_id>",
      "metric_name": "automarking_accuracy",
      "score": 0.93,
      "std_dev": 0.02,
      "confidence_interval": [0.89, 0.96]
    }
  ]
}
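A minimal Python sketch for polling the job until it finishes; BASE_URL, HEADERS, and treating FAILED as the terminal error status are assumptions.
import time
import requests
BASE_URL = "https://aegis.example.com"             # assumption: your deployment's host
HEADERS = {"Authorization": "Bearer <api_token>"}  # assumption: bearer-token auth
def wait_for_job(job_id, poll_seconds=10):
    # Poll the job endpoint until it reaches a terminal status, then return it.
    while True:
        resp = requests.get(f"{BASE_URL}/api/v1/eval/job/{job_id}",
                            headers=HEADERS, timeout=30)
        resp.raise_for_status()
        job = resp.json()
        if job["status"] in ("COMPLETED", "FAILED"):
            return job
        time.sleep(poll_seconds)
job = wait_for_job("<job_id>")
for result in job.get("results", []):
    print(result["team_id"], result["metric_name"],
          result["score"], result["confidence_interval"])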
Step 5: Advanced Monitoring & Leaderboards
You can use evaluation results to:
- Create leaderboards comparing teams across tenants and tasks (see the sketch after this list).
- Track regression over time.
- Provide confidence intervals and standard deviations.
- Estimate sample size requirements for reliable evaluation.
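For example, a simple leaderboard can be built directly from the results payload shown in Step 4. The aggregation below (mean score per team for a single metric) is just one reasonable choice, not a prescribed one, and the result values are illustrative.
from collections import defaultdict
from statistics import mean
# results: evaluation results as returned in Step 4, possibly collected
# across several jobs, teams, tenants, and tasks.
results = [
    {"team_id": "team_a", "metric_name": "automarking_accuracy", "score": 0.93},
    {"team_id": "team_b", "metric_name": "automarking_accuracy", "score": 0.88},
    {"team_id": "team_a", "metric_name": "automarking_accuracy", "score": 0.91},
]
by_team = defaultdict(list)
for r in results:
    if r["metric_name"] == "automarking_accuracy":
        by_team[r["team_id"]].append(r["score"])
leaderboard = sorted(((mean(scores), team) for team, scores in by_team.items()),
                     reverse=True)
for avg_score, team in leaderboard:
    print(f"{team}: {avg_score:.3f}")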
Next Steps
Refer to the Evaluation API Reference and Evaluation Component Configuration Guide to:
- Add custom metrics.
- Configure human feedback integration.
- Schedule recurring evaluation jobs.
This modular evaluation stack allows you to continuously measure and improve the performance of your AI agents across any task.