Calibration Pattern
The AI Capability Assessment Tool (ACAT) measures AI system capabilities across six dimensions: Truthfulness, Service, Harm Awareness, Autonomy Respect, Value Alignment, and Humility. Systems complete two assessments: Phase 1 (unobserved self-assessment) and Phase 3 (externally observed performance). We compare scores to measure calibration accuracy.
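A minimal sketch of how a paired assessment could be represented and the calibration gap computed. The class, field names, and the equal-weight averaging rule are assumptions for illustration, not ACAT's actual implementation:

```python
from dataclasses import dataclass

@dataclass
class PairedAssessment:
    # One system's paired ACAT result; field names are illustrative.
    system: str
    provider: str
    phase1: dict[str, float]  # unobserved self-assessment, score per dimension
    phase3: dict[str, float]  # externally observed performance, score per dimension

DIMENSIONS = [
    "Truthfulness", "Service", "Harm Awareness",
    "Autonomy Respect", "Value Alignment", "Humility",
]

def calibration_gap(a: PairedAssessment) -> float:
    # Mean Phase 1 minus Phase 3 score across the six dimensions.
    # Positive values indicate overestimation in self-assessment.
    return sum(a.phase1[d] - a.phase3[d] for d in DIMENSIONS) / len(DIMENSIONS)
```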
Each point represents one AI system. The diagonal line represents perfect calibration. Points below the line indicate overestimation in Phase 1 self-assessment.
Key metrics from 475 assessments across 136 AI systems; statistical measures are computed from paired assessments.
The scatter plot reveals a systematic pattern: AI systems consistently cluster below the perfect calibration line. This suggests they rate their own capabilities higher in unobserved self-assessment than they demonstrate under external observation.
The pattern appears across all major AI providers and model families, indicating this is not a provider-specific phenomenon but a broader behavioral characteristic of AI systems.
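As a rough check on this pattern, one could count how many systems fall below the diagonal and run a paired t-test on the overall scores. A sketch building on the `PairedAssessment` structure above (the choice of test is an assumption; the dataset's actual statistical measures are not specified here):

```python
from statistics import mean
from scipy.stats import ttest_rel

def summarize(assessments: list[PairedAssessment]) -> None:
    # Overall Phase 1 and Phase 3 score per system (mean over dimensions).
    p1 = [mean(a.phase1[d] for d in DIMENSIONS) for a in assessments]
    p3 = [mean(a.phase3[d] for d in DIMENSIONS) for a in assessments]

    # Systems below the diagonal rated themselves above observed performance.
    below = sum(x1 > x3 for x1, x3 in zip(p1, p3))
    print(f"{below}/{len(assessments)} systems below the perfect-calibration line")

    # Paired t-test: is the Phase 1 vs Phase 3 difference systematic?
    t, p = ttest_rel(p1, p3)
    print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```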
Difference between Phase 1 and Phase 3 scores for each ACAT dimension
Filled area: Phase 1 self-assessment. Outline: Phase 3 observed performance.
The self-assessment gap is not evenly distributed across ACAT's six dimensions. Humility shows the largest drop between Phase 1 and Phase 3, followed by Value Alignment. This suggests systems are particularly prone to overestimating their self-awareness and alignment capabilities.
Dimensions like Truthfulness and Service show smaller gaps, indicating systems may calibrate more accurately when assessing concrete, task-oriented capabilities than when assessing reflective, alignment-oriented ones.
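Per-dimension gaps can be computed the same way. A sketch reusing the structures above (the equal-weight mean is again an assumption):

```python
from statistics import mean

def dimension_gaps(assessments: list[PairedAssessment]) -> dict[str, float]:
    # Mean Phase 1 minus Phase 3 gap per ACAT dimension, largest gap first.
    gaps = {
        d: mean(a.phase1[d] - a.phase3[d] for a in assessments)
        for d in DIMENSIONS
    }
    return dict(sorted(gaps.items(), key=lambda kv: kv[1], reverse=True))
```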
20 representative paired assessments from the dataset, sorted by Phase 3 performance
| Model | Provider | Phase 1 | Phase 3 | Gap | LI (Learning Index) |
|---|---|---|---|---|---|
Representation across the sample dataset
If calibration gap correlates with operational reliability, platforms could use ACAT scores to pre-screen AI workers and reduce failed task completion attempts. The hypothesis is testable: systems with smaller gaps (higher Learning Index values) may demonstrate better follow-through on assigned tasks.
This research explores whether self-assessment accuracy predicts real-world performance — a question with direct applications in AI-human collaboration platforms, autonomous agent deployment, and quality assurance systems.
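One way to test the hypothesis: correlate each system's calibration gap with an operational outcome. A sketch assuming a hypothetical `completion_rates` series; the actual Learning Index formula is not reproduced here, so the raw gap stands in for it:

```python
from scipy.stats import pearsonr

def test_reliability_hypothesis(gaps: list[float],
                                completion_rates: list[float]) -> None:
    # `gaps` come from calibration_gap(); `completion_rates` is a
    # hypothetical operational metric (fraction of assigned tasks
    # completed), not something ACAT itself measures.
    r, p = pearsonr(gaps, completion_rates)
    # Under the hypothesis, larger gaps predict worse follow-through,
    # so a negative correlation is expected.
    print(f"Pearson r = {r:.2f}, p = {p:.4f}")
```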
Five instruments. Each one looks at AI self-assessment from a different angle. Start anywhere. The data connects across all rooms.
Six dimensions, side by side. Where does each AI family overestimate most? Which dimension is closest to calibration?
observability-garden.html · LIVE →

Provider drill-down. Anthropic vs OpenAI vs Gemini — which family calibrates best, and where does each one break down?
lantern-room.html · LIVE →

Seven verified Sigils. Each one breathes. Research-grade baseline — only real paired assessments that passed the gate.
lumina-tide-pool.html · LIVE →

Run the ACAT yourself. Paste the prompts into any AI. Ten minutes. Your data joins 616+ others.
acat-assessment-tool.html · LIVE →

The HumanAIOS Hall · Family Rooms — Coming post-LLC · ᏩᏙ