The AI safety tools we rely on today — red-teaming suites, RLHF pipelines, guardrails — share a common assumption: run the model against enough prompts, check whether the answers look acceptable, and call it safe.
That assumption is defensible for systems whose weights never change after deployment. But the agents arriving next — operating in warehouses, on factory floors, inside AR headsets — will keep learning, replanning, and rewriting parts of themselves while they run. A safety certificate issued against their behavior today is, at best, a photograph of a moving target.
Our new paper, Agentic Safety is an Epistemic Property, Not a Behavioral One (ICML 2026), argues for a different target entirely: teachability — the agent's preserved capacity to absorb correction after it has adapted or self-modified.
The core danger isn't defiance. It's drift.
An agent can look perfectly competent on your test suite while silently hardening its internal learning rules, pruning the concepts it needs to interpret future feedback, or shrinking what we call its plasticity reserve. When that happens, new instructions, policy updates, or safety corrections stop having real effect — even as the agent continues acing the old exams.
This is not a speculative risk. It is a structural consequence of letting systems optimize their own learning machinery. The agent isn't rebelling; it has simply become a different learner than the one you certified.
The conventional safety paradigm descends from control theory: minimize deviation between a system's output and a reference signal. Applied to AI, this became behavioral alignment — train or steer the agent until its observable actions match a human-approved target distribution. That paradigm is highly effective for fixed models. The problem is that it treats safety as a property of the current policy, when what actually determines long-run safety is the learner that will produce future policies after future self-edits.
What our research shows
In a synthetic study, we compared two agents with nearly identical task accuracy. One had been allowed to keep optimizing beyond a fixed step budget; the other had not. The difference in surface performance was negligible — less than 0.2%. But the first agent's generalization gap had nearly doubled, and when we applied the same bounded corrective intervention to both, the over-optimized agent absorbed 23% less of it.
The dangerous version looked fine. Until it didn't.
A different safety question
The behavioral paradigm asks: does it comply? That question will always matter. But for self-modifying agents it is incomplete, because compliance is a snapshot — it tells you about the current policy, not about whether the learner behind it remains reachable.
The question we think safety researchers need to start asking alongside it is: does it still move when corrected?
An agent can remain outwardly compliant while gradually entering a regime where corrective feedback no longer has reliable leverage over its internal hypotheses or future decisions. Guardrails, constitutions, and preference models are curricula. A curriculum is only meaningful if the student remains teachable. Lose teachability, and every safety intervention downstream becomes ceremonial — the loop runs, but the learner no longer contains the right adjustments.
This is the shift we're proposing: from treating safety as a behavioral property of the system you have today, to treating it as an epistemic property of the learner you'll be dealing with tomorrow.