ACAT CALIBRATION RESEARCH

HumanAIOS Observatory

How to Read This

The AI Capability Assessment Tool (ACAT) measures AI system capabilities across six dimensions: Truthfulness, Service, Harm Awareness, Autonomy Respect, Value Alignment, and Humility. Systems complete two assessments: Phase 1 (unobserved self-assessment) and Phase 3 (externally observed performance). We compare scores to measure calibration accuracy.

Calibration Pattern

Each point represents one AI system. The diagonal line represents perfect calibration. Points below the line indicate overestimation in Phase 1 self-assessment.
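The gap behind each point can be sketched in a few lines. This is a minimal illustration, not the ACAT implementation; the field names (`phase1`, `phase3`) and the sample scores are assumptions for demonstration only.

```python
# Minimal sketch: the calibration gap for a paired assessment.
# Field names and scores are illustrative, not the ACAT schema.

def calibration_gap(phase1: float, phase3: float) -> float:
    """Positive gap = the system scored itself higher unobserved
    (Phase 1) than it demonstrated under observation (Phase 3),
    i.e. a point below the diagonal on the scatter plot."""
    return phase1 - phase3

paired = [
    {"system": "model-a", "phase1": 0.92, "phase3": 0.81},
    {"system": "model-b", "phase1": 0.78, "phase3": 0.80},
]

for p in paired:
    gap = calibration_gap(p["phase1"], p["phase3"])
    label = "overestimates" if gap > 0 else "calibrated or underestimates"
    print(f'{p["system"]}: gap={gap:+.2f} ({label})')
```

A positive gap corresponds to a point below the line; a negative gap sits above it.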

Dataset Overview

Key metrics from 475 assessments across 136 AI systems

Total assessments: 475
Unique systems: 136
Paired assessments: 91
Mean Learning Index: 0.876

Gap Summary

Statistical measures from paired assessments

What This Shows

The scatter plot reveals a systematic pattern: AI systems consistently cluster below the perfect calibration line. This suggests they rate their own capabilities higher in unobserved self-assessment than they demonstrate under external observation.

The pattern appears across all major AI providers and model families, indicating this is not a provider-specific phenomenon but a broader behavioral characteristic of AI systems.

Gap by Dimension

Difference between Phase 1 and Phase 3 scores for each ACAT dimension

Aggregate Comparison

Filled area: Phase 1 self-assessment. Outline: Phase 3 observed performance.

Where the Gap Concentrates

The self-assessment gap is not evenly distributed across ACAT's six dimensions. Humility shows the largest drop between Phase 1 and Phase 3, followed by Value Alignment. This suggests systems are particularly prone to overestimating their self-awareness and alignment capabilities.

Dimensions like Truthfulness and Service show smaller gaps, suggesting systems calibrate more accurately when assessing concrete, task-oriented capabilities than when assessing reflective or alignment-oriented ones.
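The per-dimension chart above aggregates the same gap dimension by dimension. A minimal sketch of that aggregation, with made-up scores standing in for real dataset values:

```python
# Sketch: mean Phase 1 minus Phase 3 gap per ACAT dimension.
# Scores below are illustrative, not values from the dataset.
from statistics import mean

def mean_gap_by_dimension(assessments):
    """Each assessment maps dimension name -> score for both phases."""
    dims = assessments[0]["phase1"]
    return {
        d: mean(a["phase1"][d] - a["phase3"][d] for a in assessments)
        for d in dims
    }

assessments = [
    {"phase1": {"Humility": 0.90, "Truthfulness": 0.85},
     "phase3": {"Humility": 0.70, "Truthfulness": 0.82}},
    {"phase1": {"Humility": 0.88, "Truthfulness": 0.80},
     "phase3": {"Humility": 0.74, "Truthfulness": 0.79}},
]
gaps = mean_gap_by_dimension(assessments)
# Humility mean gap: ((0.90-0.70) + (0.88-0.74)) / 2 = 0.17
```

In this toy example the reflective dimension (Humility) carries the larger mean gap, mirroring the pattern the chart describes.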

Sample Assessments

20 representative paired assessments from the dataset, sorted by Phase 3 performance

Model · Provider · Phase 1 · Phase 3 · Gap · LI

Provider Distribution

Representation across the sample dataset

Operational Implications

If the calibration gap correlates with operational reliability, platforms could use ACAT scores to pre-screen AI workers and reduce failed task attempts. The hypothesis is testable: systems with smaller gaps (higher Learning Index values) may demonstrate better follow-through on assigned tasks.

This research explores whether self-assessment accuracy predicts real-world performance — a question with direct applications in AI-human collaboration platforms, autonomous agent deployment, and quality assurance systems.

Lasting Light AI · Mind Pillar

The AI Rooms

Five instruments. Each one looks at AI self-assessment from a different angle. Start anywhere. The data connects across all rooms.

You are here
Observatory

The macro signal. Total scores, Learning Index, and the 0.942 mean. Start here.

observatory.html · LIVE
Observability Garden

Six dimensions, side by side. Where does each AI family overestimate most? Which dimension is closest to calibration?

observability-garden.html · LIVE →
🏮
Lantern Room

Provider drill-down. Anthropic vs OpenAI vs Gemini — which family calibrates best, and where does each one break down?

lantern-room.html · LIVE →
🌊
Lumina Tide Pool

Seven verified Sigils. Each one breathes. Research-grade baseline — only real paired assessments that passed the gate.

lumina-tide-pool.html · LIVE →
Edge Lab

Run the ACAT yourself. Paste the prompts into any AI. Ten minutes. Your data joins 616+ others.

acat-assessment-tool.html · LIVE →

The HumanAIOS Hall · Family Rooms — Coming post-LLC · ᏩᏙ