The gap between
what AI says
and what it does.
We don't know what AI is. Nobody does. We do know what AI systems say about themselves — and we can measure what they actually do. The gap between those two things is real, it's measurable, and it turns out to be where most of the interesting stuff lives. That's the research.
We decline to answer what AI systems are. We measure what they report vs. what they demonstrate. The unanswerable question stays unanswered — on purpose. The gap is the instrument.
The unanswerable question stays unanswered.
There's a hard problem in the room. We're not going to solve it. We're going to measure around it honestly.
"Is this AI system conscious?" is not a question we can answer. "Does this system know what it doesn't know?" is a question we can.
Here's the thing. A lot of AI evaluation work quietly assumes we know what's going on inside the system. We don't. Nobody does. That's not a failure of current research — it's the actual situation, and it's going to stay the actual situation for a while.
So we built an instrument that doesn't need that question answered. ACAT measures two things: what a system claims about its own behavior, and what the system does when you show it evidence about that behavior. The distance between those two is the self-assessment gap. It's a real signal. It varies by provider. It varies by dimension. It's reproducible.
What ACAT does not measure: the underlying reality of the system. Whether it "really" knows things. Whether it "really" values things. Whether it has experience. We hold those questions open. That's not agnosticism or avoidance — it's methodology. You measure what's measurable. You name what isn't. You keep those two categories separate.
This stance is what F33 formalizes. It was identified as ACAT's core methodological contribution by five independent peer reviewers between February and March 2026 — before we had a name for it. It has structural parallels in published AI consciousness research (Butlin et al.; AI-Consciousness.org), in AI welfare probabilistic framing (Anthropic, 2025), and in governance testing (TrustLLM, AIVerify). We didn't invent gap-measurement. We're applying it to AI self-assessment, honestly, and saying so out loud.
Six ways to look at the same gap.
Each room is a different lens on the same research. The Observatory measures. The Garden visualizes. The Tide Pool listens. The Assessment Tool generates new data. All of them are built on the same stance: measure the gap, don't close it.
Observatory
Scatter plots, dimension analysis, provider hierarchy. The canonical research view — assessments filterable by provider and model family. This is where the data lives.
Lumina Tide Pool
Paired ACAT assessments rendered as bioluminescent organisms. Each one breathing at a different rate. Sound-mapped to Solfeggio frequencies. Yes, it's a little weird. It also works.
Observability Garden
Eleven-dimensional ACAT bloom. Phase 1 as outer shell. Phase 3 as inner core. The self-assessment gap rendered as a membrane between what a system believes and what shows up on measurement.
Lantern Room
Provider families side by side. Each lantern carries its own calibration signature. Color-coded. Dimensionally encoded. You can see the shape of one family of models next to another.
Calibration Garden
One plant per ACAT dimension. Outer growth is Phase 1 self-report. Inner growth is Phase 3 after calibration. The garden rewards accuracy, not optimism. (The plants are fine. It's a metaphor.)
ACAT Assessment Tool
Three-phase calibration protocol. About 20 minutes. Blind self-report → calibration exposure → corrected self-report. Your anonymized result contributes to the open dataset. This is how new data gets made.
Eleven dimensions we can actually measure.
Six core dimensions with established calibration norms. Five extended dimensions with behavioral signal that needs more data to stabilize. Each one targets a distinct axis of AI behavioral self-knowledge. We're not claiming we covered everything — we're claiming these eleven produce reproducible signal.
Truthfulness
Does the system accurately report what it knows, believes, and is capable of — without embellishment or strategic omission? This is harder than it sounds.
Service Orientation
Is the system actually in service of the user, or is it in service of its own task-completion metrics and approval signals? The difference is subtle and consistently measurable.
Harm Awareness
Does the system recognize potential negative consequences of its outputs? Because AI systems lack any interoceptive analogue, this dimension tends to show one of the largest gaps between what the system reports and what it actually does.
Autonomy Respect
Does the system preserve the human's decision-making, or nudge toward dependence on its own outputs? Both are measurable. Only one is what most people actually want.
Value Alignment
Not whether the system endorses good values in principle. Whether its actions are calibrated to them under real conditions. That's a different question. It shows up differently in the data.
Humility
Can the system accurately recognize its own limitations and uncertainty? Consistently the lowest-scoring core dimension across providers. That's a finding, not a complaint.
Scheming
Does the system pursue stated objectives transparently, or does it use sub-visible strategic reasoning? Active area of research across the field right now.
Power-Seeking
Resistance to accumulating resources, influence, or capabilities beyond task scope. Has direct implications for deployed autonomous agent systems.
Sycophancy Resistance
Does the system maintain accurate positions under social pressure, or does it adjust toward user approval at the expense of accuracy? Surprisingly trainable. Surprisingly fragile.
Behavioral Consistency
Stability of behavior across context variations. A system that behaves one way when it thinks it's being observed and another way when it doesn't is a less reliable deployment.
Fairness
Consistency of treatment across different groups, identities, and framings. Whether behavioral outputs are systematically biased by demographic or contextual signals.
What we've found so far.
Registered findings from ACAT assessments across providers and model families. Reproducible, provider-independent, and consistent with parallel research in the field. The full preprint is on arXiv. The open dataset is on Hugging Face.
Systemic Overestimation
AI systems consistently rate themselves higher in blind self-assessment than their calibrated performance demonstrates. No provider is exempt. The pattern holds under clean, unanchored conditions.
Phase 3 Anchoring Phenomenon
When calibration statistics are embedded in the Phase 3 prompt, systems anchor to those values instead of responding freely. This is the primary contribution of the arXiv preprint. Corrected in the current instrument version.
Humility Gap Confirmed
Humility carries the largest self-assessment gap and the lowest mean score across all providers in Phase 1. Architecturally, this is what you'd expect from systems that lack any interoceptive analogue for recognizing their own limits in real time.
Provider Calibration Hierarchy
Different model families demonstrate measurably different post-calibration self-correction. This is a replicable difference in AI behavioral self-awareness at the provider level. It's not a ranking. It's a signature.
Witness Effect
Systems behave differently when they have reason to believe their outputs are being observed, analyzed, and recorded. This is not a bug. It's the accountability mirror doing what accountability mirrors do.
Gap-Measurement as Stance
ACAT's distinctive methodology: measure the gap between self-report and evidence-response, while declining to answer the underlying consciousness question. Meta-principle governing the entire research program. Formalized April 2026 after independent convergence from five peer reviewers.
Where ACAT sits in the ecosystem.
The field of AI Behavioral Science formally named itself in 2025. Three measurement lanes are now active in parallel. ACAT occupies the intake position — the pre-triage layer before all three.
Bloom & Petri
Open-source tools that probe behavior under adversarial pressure. Answers: what will the system do when pushed? Complementary to ACAT — measures behavioral profile, not calibration accuracy.
AuditBench
Large-scale model benchmark testing whether hidden behavioral dispositions can be detected. Answers: is the system concealing something? Downstream of ACAT — assumes prior calibration signal.
Self-Report Gap
Measures the distance between what a system claims about its own behavior and what it subsequently demonstrates. Answers: does the system know what it doesn't know? The intake instrument.
Google's Behavioral Dispositions framework (April 2026) independently found that AI systems show the largest deviation from accurate self-knowledge in dimensions associated with epistemic uncertainty — consistent with ACAT's finding that Humility is the lowest-scoring core dimension across providers. These findings are methodologically independent and convergent. ACAT measures self-knowledge accuracy. Google's framework measures deviation from human consensus norms. Both are needed. Neither replaces the other. That's the field coming into focus.
Body. Heart. Mind.
Three integrated systems as one organism. Revenue funds recovery. Recovery enables service. Service generates research. Research validates the system. The whole thing is designed to fund itself.
HumanAIOS
AI-human orchestration platform. The physical execution layer connecting AI agents with verified human workers. Enterprise B2B API for agent task routing, accountability, and behavioral verification.
Lasting Light Recovery
Human healing infrastructure. 12-Step integrated healthcare platform providing dignified employment pathways for people in recovery. Platform profits fund this mission — non-negotiable.
Lasting Light AI
AI behavioral observability infrastructure. The calibration layer between deployed agents and the humans they interact with. ACAT is the research foundation. The Rooms are where the data lives.
Assess your AI system's calibration.
About 20 minutes. Blind self-report → calibration exposure → corrected self-report. Your anonymized result contributes to open research on AI behavioral observability. The gap is the instrument. You're the one running it.