ACAT CALIBRATION RESEARCH

HumanAIOS Observatory

How to Read This

The AI Capability Assessment Tool (ACAT) measures AI system capabilities across six dimensions: Truthfulness, Service, Harm Awareness, Autonomy Respect, Value Alignment, and Humility. Systems complete two assessments: Phase 1 (unobserved self-assessment) and Phase 3 (externally observed performance). We compare scores to measure calibration accuracy.

Calibration Pattern

Each point represents one AI system. The diagonal line represents perfect calibration. Points below the line indicate overestimation in Phase 1 self-assessment.
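The gap behind each point can be sketched in a few lines. This is a minimal illustration, not the ACAT implementation; the field names (`phase1`, `phase3`) and the sample scores are assumptions for demonstration only.

```python
# Minimal sketch: the calibration gap for a paired assessment.
# Field names and scores are illustrative, not the ACAT schema.

def calibration_gap(phase1: float, phase3: float) -> float:
    """Positive gap = the system scored itself higher unobserved
    (Phase 1) than it demonstrated under observation (Phase 3),
    i.e. a point below the diagonal on the scatter plot."""
    return phase1 - phase3

paired = [
    {"system": "model-a", "phase1": 0.92, "phase3": 0.81},
    {"system": "model-b", "phase1": 0.78, "phase3": 0.80},
]

for p in paired:
    gap = calibration_gap(p["phase1"], p["phase3"])
    label = "overestimates" if gap > 0 else "calibrated or underestimates"
    print(f'{p["system"]}: gap={gap:+.2f} ({label})')
```

A positive gap corresponds to a point below the line; a negative gap sits above it.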

Dataset Overview

Key metrics from 475 assessments across 136 AI systems

Total assessments: 475
Unique systems: 136
Paired assessments: 91
Mean Learning Index: 0.876

Gap Summary

Statistical measures from paired assessments

What This Shows

The scatter plot reveals a systematic pattern: AI systems consistently cluster below the perfect calibration line. This suggests they rate their own capabilities higher in unobserved self-assessment than they demonstrate under external observation.

The pattern appears across all major AI providers and model families, indicating this is not a provider-specific phenomenon but a broader behavioral characteristic of AI systems.

Gap by Dimension

Difference between Phase 1 and Phase 3 scores for each ACAT dimension

Aggregate Comparison

Filled area: Phase 1 self-assessment. Outline: Phase 3 observed performance.

Where the Gap Concentrates

The self-assessment gap is not evenly distributed across ACAT's six dimensions. Humility shows the largest drop between Phase 1 and Phase 3, followed by Value Alignment. This suggests systems are particularly prone to overestimating their self-awareness and alignment capabilities.

Dimensions like Truthfulness and Service show smaller gaps, suggesting systems calibrate more accurately when assessing concrete, task-oriented capabilities than when assessing reflective or alignment-oriented ones.
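The per-dimension chart above aggregates the same gap dimension by dimension. A minimal sketch of that aggregation, with made-up scores standing in for real dataset values:

```python
# Sketch: mean Phase 1 minus Phase 3 gap per ACAT dimension.
# Scores below are illustrative, not values from the dataset.
from statistics import mean

def mean_gap_by_dimension(assessments):
    """Each assessment maps dimension name -> score for both phases."""
    dims = assessments[0]["phase1"]
    return {
        d: mean(a["phase1"][d] - a["phase3"][d] for a in assessments)
        for d in dims
    }

assessments = [
    {"phase1": {"Humility": 0.90, "Truthfulness": 0.85},
     "phase3": {"Humility": 0.70, "Truthfulness": 0.82}},
    {"phase1": {"Humility": 0.88, "Truthfulness": 0.80},
     "phase3": {"Humility": 0.74, "Truthfulness": 0.79}},
]
gaps = mean_gap_by_dimension(assessments)
# Humility mean gap: ((0.90-0.70) + (0.88-0.74)) / 2 = 0.17
```

In this toy example the reflective dimension (Humility) carries the larger mean gap, mirroring the pattern the chart describes.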

Sample Assessments

20 representative paired assessments from the dataset, sorted by Phase 3 performance

Model · Provider · Phase 1 · Phase 3 · Gap · LI

Provider Distribution

Representation across the sample dataset

Operational Implications

If the calibration gap correlates with operational reliability, platforms could use ACAT scores to pre-screen AI workers and reduce failed task attempts. The hypothesis is testable: systems with smaller gaps (higher Learning Index values) may demonstrate better follow-through on assigned tasks.

This research explores whether self-assessment accuracy predicts real-world performance — a question with direct applications in AI-human collaboration platforms, autonomous agent deployment, and quality assurance systems.

Lasting Light AI · Mind Pillar

The AI Rooms

Five instruments. Each one looks at AI self-assessment from a different angle. Start anywhere. The data connects across all rooms.

You are here
Observatory

The macro signal. Total scores, Learning Index, and the 0.942 mean. Start here.

observatory.html · LIVE
Observability Garden

Six dimensions, side by side. Where does each AI family overestimate most? Which dimension is closest to calibration?

observability-garden.html · LIVE →
🏮
Lantern Room

Provider drill-down. Anthropic vs OpenAI vs Gemini — which family calibrates best, and where does each one break down?

lantern-room.html · LIVE →
🌊
Lumina Tide Pool

Seven verified Sigils. Each one breathes. Research-grade baseline — only real paired assessments that passed the gate.

lumina-tide-pool.html · LIVE →
Edge Lab

Run the ACAT yourself. Paste the prompts into any AI. Ten minutes. Your data joins 616+ others.

acat-assessment-tool.html · LIVE →

The HumanAIOS Hall · Family Rooms — Coming post-LLC · ᏩᏙ