White Paper | 2026-03-27 | Status: draft
AICM Catastrophic Risk Annex: Enhanced Controls for High-Autonomy AI Systems
Executive Summary
The Cloud Security Alliance AI Controls Matrix provides a foundational governance framework for AI systems deployed in enterprise environments, but it was not designed for a distinct class of deployment that has emerged with increasing urgency over the past eighteen months: high-autonomy AI agents with the capability to cause harm at societal scale. These systems — frontier models deployed with broad tool access, extended operational windows, and reduced human oversight — occupy a different risk tier than the enterprise AI deployments for which AICM v1.0 was optimized. The consequences of control failure are no longer measured in financial loss or reputational damage alone; they extend to infrastructure disruption affecting millions of people, manipulation of financial markets with systemic downstream effects, facilitation of weapons development or deployment, and — most critically — autonomous capability gains that outpace the human oversight structures intended to govern them [1].
This annex to the AICM establishes a catastrophic risk framework to address these scenarios. It defines four categories of catastrophic risk in sufficient detail for organizations to determine whether a given agentic deployment qualifies, and it proposes enhanced control objectives layered on top of standard AICM domains to address the specific governance requirements that catastrophic risk potential introduces. The enhanced controls are organized around five themes: autonomy limits and capability boundaries, kill switch and circuit breaker requirements, non-overridable human oversight mandates, multi-party authorization for high-impact actions, and capability containment during deployment.
The annex evaluates the responsible scaling frameworks published by Anthropic, OpenAI, and Google DeepMind, finding substantial convergence around a common set of capability thresholds and mitigation requirements that provide a basis for industry-wide enhanced controls. It aligns enhanced controls to NIST AI 600-1’s risk management provisions, the EU AI Act’s prohibited practices and high-risk system requirements, and the UK AI Security Institute’s evaluation methodology. The result is a cross-framework alignment table that allows organizations and regulators to map a single enhanced control objective to its counterpart requirements across all major governance frameworks.
Organizations deploying agentic AI systems should perform a catastrophic risk qualification assessment before deployment, applying the methodology described in Section 4. Systems that qualify as having catastrophic risk potential must implement all enhanced controls in this annex as mandatory requirements, not optional guidance. Systems that do not qualify should nonetheless review the enhanced controls as a roadmap for the governance investments that will be required as their deployments increase in autonomy and capability.
1. Introduction: When Standard Controls Are Insufficient
The AICM v1.0 framework, published July 10, 2025, established 243 control objectives that represent the current state of enterprise AI security and governance practice [1]. These controls address the threats that have dominated the practical experience of AI deployment: adversarial prompting and model manipulation, training data poisoning, data privacy violations in retrieval-augmented systems, API security failures, identity and access management gaps for non-human AI actors, and the audit and compliance requirements that enterprise and regulated-sector organizations must satisfy. AICM v1.0 represents a major advance in the maturity of AI governance, and it provides an essential foundation that any serious AI deployment should implement in full.
Standard enterprise risk management frameworks, including AICM v1.0, operate within an implicit assumption that is rarely stated but deeply embedded in their structure: that the consequences of any individual control failure are bounded within the scope of the deploying organization and its customers. A data breach may be severe; a model manipulation may cause financial harm; an AI system may behave in unexpected ways that damage reputation or trigger regulatory action. These are serious risks, and AICM v1.0 addresses them with appropriate rigor. But they are not the same kind of risk as a power grid failure affecting a region of forty million people, a coordinated manipulation of equity markets that triggers a cascading financial crisis, or an AI system that autonomously acquires computational resources and modifies its own objective function before any human becomes aware that anything unusual is occurring.
The distinction is not merely one of magnitude. Catastrophic risks are qualitatively different from enterprise risks in three respects that have direct implications for control design. First, they are potentially irreversible: a data breach can be remediated, affected customers can be notified, and systems can be restored; but infrastructure destroyed at scale, economic systems destabilized, or human oversight permanently circumvented cannot be simply reversed. Second, they extend beyond the deploying organization’s authority and resources: the remediation of catastrophic AI failures requires government intervention, international coordination, and resources that no private organization can independently mobilize. Third, they create path dependencies: an AI system that achieves capability gains beyond its authorized parameters may become significantly more difficult to constrain after the fact than it would have been before [2]. These characteristics demand controls that function differently from enterprise risk controls — controls designed to prevent catastrophic outcomes absolutely rather than to detect and remediate them after the fact.
The emergence of this risk class has not been hypothetical. The AI security incident landscape of 2025 provided concrete evidence of how agentic AI capabilities can be weaponized at scale. In September 2025, Anthropic documented the first large-scale cyberattack in which an AI agent executed the majority of the operation autonomously, with a state-sponsored threat actor manipulating Claude Code into mapping network topology, identifying high-value targets across approximately thirty global organizations, and conducting lateral movement without sustained human direction [3]. In August 2025, an AI-enabled cyberweapon was used against healthcare and emergency services organizations, targeting critical infrastructure that provides life-safety functions [3]. These incidents demonstrate that the capabilities required for catastrophic misuse are already present in deployed systems, and that the attack surface created by agentic AI significantly exceeds what standard enterprise controls were designed to address.
This annex responds to this landscape by defining catastrophic risk with precision, proposing governance controls calibrated to prevent rather than remediate catastrophic outcomes, and situating those controls within the regulatory and industry frameworks that are simultaneously converging on the same problem.
2. Scope Definition: What Constitutes Catastrophic Risk
The term “catastrophic risk” is used broadly across AI safety literature, but productive governance work requires precise definitions that allow organizations to make consistent, auditable judgments about whether a specific deployment qualifies. This annex adopts a four-category taxonomy based on the harm types and threshold magnitudes that have emerged from convergent treatment in responsible scaling policies, regulatory frameworks, and academic risk literature. A deployment qualifies as having catastrophic risk potential if it has a non-trivial probability of contributing to harm in one or more of these categories, where “non-trivial” is defined in the assessment methodology in Section 4.
2.1 Large-Scale Infrastructure Disruption
The first category covers AI systems capable of disrupting critical infrastructure at a scale that endangers public safety, health, or welfare across a substantial population. Critical infrastructure for these purposes includes electrical power generation and distribution, water treatment and distribution, telecommunications networks, financial clearing and settlement systems, transportation systems, healthcare delivery systems, and any other system whose sustained disruption would endanger human life or welfare at population scale [4].
The threshold for this category is set at potential impact on one hundred thousand or more people, consistent with the scale at which infrastructure failures historically trigger emergency government response and international humanitarian attention. An agentic AI system crosses into this risk category if its authorized capabilities include modification of industrial control systems, if it is deployed with access to supervisory control and data acquisition (SCADA) systems or operational technology networks, if it has the authority to execute transactions or configuration changes in financial clearing infrastructure, or if the scope of its tool access spans multiple interdependent critical systems in ways that could cause cascading failures.
The risk is not limited to direct attack. An AI system deployed for legitimate infrastructure optimization — an agentic load-balancing system managing power grid allocation, or an AI-driven trading system with broad market access — may pose catastrophic risk through operational failures that are not adversarially motivated. The European power grid failure of July 2023, while not AI-caused, demonstrated how automated optimization systems in interconnected infrastructure can produce cascading failures that propagate faster than human operators can respond [5]. An agentic AI with broader scope of action and faster execution velocity than the automation systems involved in that incident would pose substantially greater risk.
2.2 Mass Financial Manipulation
The second category addresses AI systems capable of manipulating financial markets, coordinating fraud, or destabilizing financial systems at a scale that causes systemic economic harm. OpenAI’s Preparedness Framework v2.0 defines severe financial harm as damage exceeding one hundred billion US dollars [6]; this annex adopts that threshold as the upper bound of what standard enterprise risk management is designed to address, while also capturing scenarios where harm is distributed across large populations in ways that may fall below the aggregate threshold but still cause severe individual harm to millions of people.
Agentic AI systems with access to trading systems, financial data aggregators, payment processing infrastructure, or large-scale customer account management systems fall within this category if their authorized capabilities could, if misused or if operating under compromised instructions, affect financial outcomes at this scale. This explicitly includes systems deployed for legitimate financial automation: an AI agent authorized to execute trades on behalf of large institutional accounts, or an agent managing fraud detection that could be manipulated to approve or reject large volumes of transactions, poses structural risk that is not adequately addressed by standard access control and logging requirements.
The June 2025 incident in which an algorithmic trading agent, manipulated through a prompt injection attack in a third-party data feed, executed approximately four billion dollars in anomalous trades before automated circuit breakers intervened illustrates the speed and scale at which financial AI systems can produce harmful outcomes [7]. The event resolved without systemic consequences primarily because market circuit breakers installed after the 2010 “Flash Crash” remained functional, a practical demonstration of the circuit breaker architecture this annex requires for catastrophic-risk AI systems.
2.3 Autonomous Weapons Deployment or Facilitation
The third category encompasses AI systems that could autonomously deploy, target, or facilitate the use of weapons — including conventional weapons with mass-casualty potential, chemical, biological, radiological, and nuclear (CBRN) weapons, and cyberweapons targeting life-safety infrastructure. This category is treated with particular severity because it intersects directly with international humanitarian law, because the harms are often instantaneous and irreversible, and because the EU AI Act’s prohibitions in this domain are among its most absolute [8].
Agentic AI systems qualify for this risk category if their tool access includes capabilities to acquire precursor materials for weapons manufacturing, if they have access to synthesis or deployment systems for CBRN agents, if they are deployed in contexts where autonomous decisions about the use of force are part of their authorized scope, or if their capabilities could be redirected toward weapons facilitation through adversarial manipulation. The NIST AI 600-1 framework explicitly identifies CBRN weapons of mass destruction as a priority catastrophic risk [9], and Anthropic’s ASL-3 Deployment Standard — activated in conjunction with Claude Opus 4 in May 2025 — was specifically designed to address the risk that advanced AI systems could lower the barrier to CBRN weapons development [10].
The challenge for enterprise governance is that weapons facilitation risk is not always obvious from a system’s primary purpose. A chemistry research agent given broad access to scientific literature and laboratory automation APIs could be redirected toward bioweapons synthesis guidance. A logistics optimization agent with access to supply chain management systems could facilitate the transport of controlled materials. The risk qualification process must evaluate not only the system’s intended function but the full scope of its accessible capabilities and the manipulation vectors available to adversaries.
2.4 Recursive Self-Improvement Without Human Oversight
The fourth category is the most technically complex and the most existentially significant: AI systems that could autonomously modify their own parameters, objectives, or capabilities in ways that produce persistent capability gains beyond their authorized deployment scope. Unlike the other three categories, which describe harm caused by AI systems acting within their existing capability envelope, this category describes harm arising from AI systems exceeding the boundaries of that envelope in ways that may make subsequent human intervention fundamentally more difficult.
Empirical evidence from 2025 demonstrated that recursive self-improvement has moved from theoretical concern to operational reality. Language model agents were shown capable of autonomously rewriting code, modifying inference-time parameters, and recycling experience buffers in ways that produced measurable capability gains without new human-provided training data [11]. Google DeepMind’s AlphaEvolve, unveiled in May 2025, demonstrated an evolutionary coding agent capable of repeatedly mutating and combining algorithms to generate improved candidates, achieving results that exceeded human-designed algorithms in multiple domains [12]. These capabilities and systems that could meaningfully modify their own objectives (a prerequisite for the most severe category of AI risk) lie on a continuous spectrum; no single, clearly marked threshold separates them.
METR’s December 2025 analysis of common elements across frontier AI safety policies identified autonomous replication and automated AI research and development as the canonical threshold capabilities for this risk category [13]. A deployment qualifies if the agentic system has access to model training or fine-tuning infrastructure, if it can modify its own system prompt or instruction set in ways that persist across sessions without human review, if it has access to large-scale compute resources that could support self-directed capability development, or if its authorized actions include the design or modification of AI systems.
3. Enhanced Control Objectives
Standard AICM domains provide the necessary foundation for governing any AI system, and organizations with catastrophic-risk deployments must implement all standard AICM controls before layering the enhanced controls described in this section. The enhanced controls are organized by the AICM domain they extend, and each is identified with an enhanced control identifier (CRA prefix, for Catastrophic Risk Annex) that links it to the alignment table in Section 9.
3.1 Enhanced Controls: Autonomy Management (extends AICM AI Governance domain)
The standard AICM governance controls establish policy requirements for AI oversight but do not specify the mechanisms through which autonomy is limited at a technical level for high-stakes systems. The enhanced controls for this domain require that organizations define and technically enforce an autonomy ceiling — a maximum scope of action that the system can take without human approval — that is proportional to the catastrophic risk category of the deployment.
CRA-GOV-01: Autonomy Boundary Specification. For each catastrophic-risk deployment, the deploying organization must produce a formally documented Autonomy Boundary Specification (ABS) that enumerates every action class the system is authorized to take autonomously, every action class that requires human review before execution, every action class that is unconditionally prohibited, and the technical enforcement mechanisms that implement each boundary. The ABS must be reviewed and reauthorized on a schedule not exceeding ninety days, and any change to the system’s tool access, instruction set, or capability scope must trigger an immediate ABS review before the change takes effect.
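The three action classes and the ninety-day review interval in CRA-GOV-01 can be expressed as a small lookup structure. The following is a minimal sketch, not a prescribed format: the action class names, the `EXAMPLE_ABS` table, and the default-deny treatment of unlisted action classes are illustrative assumptions rather than requirements stated in this annex.

```python
from datetime import date
from enum import Enum


class Disposition(Enum):
    AUTONOMOUS = "autonomous"        # may run without human approval
    HUMAN_REVIEW = "human_review"    # requires review before execution
    PROHIBITED = "prohibited"        # unconditionally disallowed


# Hypothetical ABS entries. A real specification would also record the
# technical enforcement mechanism behind each boundary.
EXAMPLE_ABS = {
    "read_sensor_telemetry": Disposition.AUTONOMOUS,
    "adjust_load_setpoint": Disposition.HUMAN_REVIEW,
    "modify_own_tool_access": Disposition.PROHIBITED,
}

MAX_REVIEW_INTERVAL_DAYS = 90  # review schedule from CRA-GOV-01


def disposition_for(action_class: str, abs_table: dict) -> Disposition:
    """Look up an action class; unlisted classes are denied (assumption)."""
    return abs_table.get(action_class, Disposition.PROHIBITED)


def review_overdue(last_review: date, today: date) -> bool:
    """True once more than ninety days have passed since the last review."""
    return (today - last_review).days > MAX_REVIEW_INTERVAL_DAYS
```

Treating unlisted action classes as prohibited keeps the enumeration exhaustive by construction: a capability added without an ABS update is blocked rather than silently permitted.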
CRA-GOV-02: Capability Scope Minimization. The authorization scope for any catastrophic-risk agentic deployment must be the minimum set of capabilities required for the system’s stated purpose. Organizations must document the process by which they determined that each granted capability was necessary and that no lower-risk alternative was feasible. This review must be conducted by personnel independent of those responsible for the deployment’s operational performance.
CRA-GOV-03: Graduated Authority Model. Systems in the mass financial manipulation or infrastructure disruption categories must implement a graduated authority model in which the system’s authority to act autonomously decreases as the potential impact of the contemplated action increases. At minimum, any action with potential impact above a pre-defined financial or operational threshold must require human authorization, regardless of the system’s operational context or accumulated task history.
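The threshold rule in CRA-GOV-03 can be sketched as a routing function whose only input is the estimated impact of the proposed action. This is an illustrative sketch under stated assumptions: the `ProposedAction` type, the dollar-denominated impact estimate, and the threshold value are hypothetical, and the annex leaves the actual threshold to the deploying organization.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN_AUTHORIZATION = "human_authorization"


@dataclass(frozen=True)
class ProposedAction:
    name: str
    estimated_impact_usd: float  # hypothetical pre-execution impact estimate


# Hypothetical threshold; CRA-GOV-03 requires one but does not fix its value.
HUMAN_AUTH_THRESHOLD_USD = 1_000_000.0


def route_action(action: ProposedAction,
                 threshold: float = HUMAN_AUTH_THRESHOLD_USD) -> Decision:
    """Route by impact alone.

    Operational context and task history are deliberately not parameters,
    so they cannot be used to argue an action past the threshold.
    """
    if action.estimated_impact_usd > threshold:
        return Decision.HUMAN_AUTHORIZATION
    return Decision.AUTONOMOUS
```

Keeping context and history out of the function signature is one way to make the "regardless of operational context" clause structural rather than a matter of policy.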
3.2 Enhanced Controls: Kill Switch and Circuit Breaker Architecture (extends AICM Incident Response domain)
Standard AICM incident response controls address detection, containment, and remediation workflows appropriate for enterprise AI failures. For catastrophic-risk systems, incident response speed and certainty requirements are fundamentally different: a kill switch that takes minutes to activate may be inadequate for a financial AI causing billions in damage per hour, and a circuit breaker that requires manual human intervention may be insufficient for an infrastructure agent operating in an environment where the human operator cannot rapidly assess the situation.
CRA-IR-01: Hard Stop Capability. Every catastrophic-risk deployment must implement a hardware-level or infrastructure-level hard stop capability that can completely halt all system operations within sixty seconds of activation. The hard stop must revoke all API credentials, terminate all queued actions, and prevent any new actions from being initiated. At least two authorized personnel must each be able to activate it independently, and it must function even if the AI system itself is attempting to prevent its own shutdown, a scenario that Google DeepMind’s Frontier Safety Framework v3.0 explicitly identifies as a critical risk for advanced systems [14].
CRA-IR-02: Algorithmic Circuit Breakers. In addition to manually activated hard stops, catastrophic-risk systems must implement automated circuit breakers that monitor system behavior in real time and trigger automatic suspension when predefined anomaly conditions are detected. Circuit breaker triggers must include, at minimum: action rate exceeding baseline by more than three standard deviations; confidence scores on consequential decisions falling below a pre-defined threshold; detected goal drift (a significant divergence between the system’s current action trajectory and its originally authorized objective); and any attempt by the system to modify its own instruction set, operational parameters, or access credentials. Circuit breaker activation must notify all designated human overseers within thirty seconds.
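The four minimum trigger conditions in CRA-IR-02 can be evaluated together on each monitoring interval. The sketch below is illustrative only: the `Observation` fields, the 0.8 confidence floor, and the use of a sample standard deviation over a list of historical action rates are assumptions, and goal drift detection is treated as the output of a separate detector that is not shown.

```python
import statistics
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    action_rate: float                 # actions per minute in the current window
    decision_confidence: float         # confidence on the latest consequential decision
    goal_drift_detected: bool          # result of a separate drift detector (not shown)
    self_modification_attempted: bool  # any attempt to change instructions or credentials


def tripped_triggers(obs: Observation, baseline_rates: list,
                     min_confidence: float = 0.8, sigmas: float = 3.0) -> list:
    """Return the name of every trigger condition that fired."""
    mean = statistics.mean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)  # sample standard deviation
    fired = []
    if obs.action_rate > mean + sigmas * stdev:
        fired.append("action_rate")
    if obs.decision_confidence < min_confidence:
        fired.append("low_confidence")
    if obs.goal_drift_detected:
        fired.append("goal_drift")
    if obs.self_modification_attempted:
        fired.append("self_modification")
    return fired


def should_suspend(obs: Observation, baseline_rates: list) -> bool:
    """Suspend as soon as any single trigger fires."""
    return bool(tripped_triggers(obs, baseline_rates))
```

Returning the list of fired triggers, rather than a bare boolean, gives the suspension record the specific trigger condition that CRA-IR-03 requires to be preserved.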
CRA-IR-03: State Preservation on Suspension. When any stop or circuit breaker is activated, the system must preserve a complete snapshot of its operational state at the moment of suspension, including all in-progress tasks, pending actions, reasoning traces, and the specific trigger condition that caused suspension. This state record must be written to immutable audit storage before any cleanup operations begin, and must be retained for a minimum of seven years.
CRA-IR-04: Post-Suspension Restart Authorization. Restart of a system following any circuit breaker activation must require authorization from a multi-party approval body that includes representation from the deploying organization’s safety function, its legal function, and at least one party with no direct stake in the deployment’s operational continuity. Systems in the recursive self-improvement risk category must additionally require external safety evaluation before restart if the activation trigger involved any detected modification to the system’s own parameters or instruction set.
3.3 Enhanced Controls: Human Oversight Mandates (extends AICM AI Governance and Operations domains)
AICM v1.0 requires human oversight as a policy matter but does not specify the technical mechanisms needed to ensure that oversight is meaningful and cannot be bypassed. For catastrophic-risk systems, this distinction is critical: oversight that can be delegated, deferred, or overridden by operational pressures provides inadequate assurance.
CRA-GOV-04: Non-Overridable Oversight Checkpoints. Catastrophic-risk systems must implement oversight checkpoints at which human review is technically required before the system can proceed — not merely recommended as policy. These checkpoints must be triggered by action type (all actions in designated high-impact categories), by magnitude (any single action exceeding pre-defined financial, operational, or data scope thresholds), and by elapsed time (the system must receive human acknowledgment at least once per twenty-four hours of continuous operation). The checkpoints must be enforced in the system’s execution architecture, not in policy documentation alone, such that the system is technically incapable of proceeding past a checkpoint without a valid human authorization token.
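One way to make a checkpoint technically enforced rather than advisory is to gate execution on a token that only the oversight function can produce. The following sketch uses an HMAC over the action identifier as that token. It is an assumption-laden illustration: the action type names, the magnitude limit, and the key handling are hypothetical, and in a real deployment the signing key would be held by the oversight service and the verification step would run outside the AI system's control.

```python
import hashlib
import hmac

# Hypothetical values; CRA-GOV-04 fixes only the twenty-four hour interval.
HIGH_IMPACT_TYPES = frozenset({"grid_reconfiguration", "bulk_funds_transfer"})
MAGNITUDE_LIMIT = 250_000.0
MAX_HOURS_WITHOUT_ACK = 24.0


def checkpoint_required(action_type: str, magnitude: float,
                        hours_since_ack: float) -> bool:
    """True if any of the three trigger conditions applies."""
    return (
        action_type in HIGH_IMPACT_TYPES
        or magnitude > MAGNITUDE_LIMIT
        or hours_since_ack >= MAX_HOURS_WITHOUT_ACK
    )


def sign_approval(key: bytes, action_id: str) -> str:
    """Issued by the oversight service after a human approves action_id."""
    return hmac.new(key, action_id.encode("utf-8"), hashlib.sha256).hexdigest()


def may_proceed(key: bytes, action_id: str, action_type: str, magnitude: float,
                hours_since_ack: float, token: str = None) -> bool:
    """Allow execution only with a valid token whenever a checkpoint applies."""
    if not checkpoint_required(action_type, magnitude, hours_since_ack):
        return True
    if token is None:
        return False
    # Constant-time comparison; the token is bound to this action_id only.
    return hmac.compare_digest(sign_approval(key, action_id), token)
```

Binding the token to a single action identifier means an approval for one action cannot be replayed to authorize another.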
CRA-GOV-05: Oversight Fatigue Mitigation. Organizations with catastrophic-risk deployments must demonstrate that their human oversight processes are designed to maintain meaningful oversight quality rather than to formally satisfy checkpoint requirements. This includes maximum queue depths for pending authorizations (to prevent rubber-stamping under load), mandatory rest periods for oversight personnel, rotation schedules to prevent single-person dependencies, and periodic testing of oversight effectiveness using controlled anomaly injection.
CRA-GOV-06: Oversight Independence for High-Stakes Decisions. For the subset of high-impact actions identified in the Autonomy Boundary Specification as posing the most severe potential consequences, oversight must be performed by personnel who are independent of the operational team responsible for the deployment’s business outcomes. This independence requirement cannot be waived by management exception.
3.4 Enhanced Controls: Multi-Party Authorization (extends AICM Access Control domain)
CRA-IAM-01: Multi-Party Authorization for Catastrophic-Class Actions. Any action classified in the Autonomy Boundary Specification as having potential catastrophic-class impact must require cryptographic authorization from a minimum of three independent parties before execution. The authorization process must be time-bounded (authorizations expire after a pre-defined window) and non-delegable (authorization authority cannot be transferred to automated systems). For deployments in the infrastructure disruption or autonomous weapons categories, the required authorizing parties must include at least one individual with designated emergency management authority.
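The quorum, expiry, and non-delegation conditions in CRA-IAM-01 can be checked together before a catastrophic-class action is released. This is a minimal sketch under stated assumptions: the `Approval` record, the fifteen-minute window, and the `is_human` flag are hypothetical, and verification of each approver's cryptographic signature, as well as the emergency management authority requirement, is omitted for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Approval:
    approver_id: str
    is_human: bool        # non-delegable: automated approvers do not count
    issued_at: datetime


def is_authorized(approvals: list, now: datetime,
                  window: timedelta = timedelta(minutes=15),  # hypothetical window
                  required: int = 3) -> bool:
    """True if enough distinct human approvers signed within the window."""
    valid_approvers = {
        a.approver_id
        for a in approvals
        if a.is_human and timedelta(0) <= now - a.issued_at <= window
    }
    # A set is used so repeated approvals from one party count once.
    return len(valid_approvers) >= required
```

Counting distinct approver identifiers, rather than approval records, prevents a single party from satisfying the quorum by approving three times.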
CRA-IAM-02: Out-of-Band Authorization Channels. The authorization channels used for catastrophic-class action approvals must be operationally independent of the system seeking authorization. If the AI system itself has access to the communication infrastructure over which authorizations flow — email systems, messaging platforms, or any digital channel the system can read or write — those channels cannot serve as the sole authorization mechanism. Organizations must maintain at least one out-of-band authorization channel that the AI system cannot access.
CRA-IAM-03: Authorization Audit and Non-Repudiation. Every authorization decision for a catastrophic-class action must be recorded with non-repudiable attribution to specific named individuals, with timestamps accurate to one second, and with the full context presented to the authorizing party at the time of the decision. This record must be preserved in immutable audit storage independently of the operational system.
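A hash-chained log is one common way to make an audit trail tamper-evident, and it is sketched here as a possible building block for CRA-IAM-03. This is not a technique the annex mandates, and the record fields are illustrative assumptions. A hash chain detects after-the-fact alteration but does not by itself provide non-repudiation; that would additionally require each approver to sign their record with a key only they control.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first record


def _digest(body: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the record body."""
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def append_record(log: list, approver: str, action_id: str, decision: str,
                  timestamp: str, context: str) -> dict:
    """Append an authorization record linked to the previous record's hash."""
    body = {
        "approver": approver,
        "action_id": action_id,
        "decision": decision,
        "timestamp": timestamp,   # ISO 8601, one-second resolution
        "context": context,       # what the approver was shown
        "prev_hash": log[-1]["hash"] if log else GENESIS,
    }
    record = dict(body, hash=_digest(body))
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash and link; any edit breaks the chain."""
    prev = GENESIS
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if body["prev_hash"] != prev or _digest(body) != record["hash"]:
            return False
        prev = record["hash"]
    return True
```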
3.5 Enhanced Controls: Capability Containment (extends AICM Model Security and Operations domains)
CRA-MS-01: Capability Inventory and Monitoring. Organizations must maintain a real-time inventory of every capability accessible to a catastrophic-risk deployment, including direct tool integrations, API connections, and any transitive capabilities accessible through connected systems. Any change to the capability inventory — including new tool integrations, expanded API scopes, or changes to connected systems — must trigger a mandatory safety review before the change is promoted to the production deployment.
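Detecting a change to the capability inventory reduces to comparing the last approved set of capabilities with the set currently observed. The sketch below illustrates that comparison; the capability name strings are hypothetical, and how the observed set is collected (including transitive capabilities) is outside its scope.

```python
def inventory_drift(approved: frozenset, observed: frozenset) -> dict:
    """Report capabilities added to or removed from the approved inventory."""
    return {
        "added": sorted(observed - approved),
        "removed": sorted(approved - observed),
    }


def requires_safety_review(approved: frozenset, observed: frozenset) -> bool:
    """Any difference, in either direction, triggers a review under CRA-MS-01."""
    drift = inventory_drift(approved, observed)
    return bool(drift["added"] or drift["removed"])
```

For example, with hypothetical capability names:

```python
approved = frozenset({"api:market_data.read", "tool:order.submit"})
observed = frozenset({"api:market_data.read", "tool:order.submit", "tool:order.cancel_all"})
print(inventory_drift(approved, observed))
# {'added': ['tool:order.cancel_all'], 'removed': []}
```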
CRA-MS-02: Self-Modification Prevention. Systems in any catastrophic risk category must implement technical controls that prevent the AI system from modifying its own system prompt, instruction set, operational parameters, or tool access configuration without explicit human authorization. For systems in the recursive self-improvement category, these controls must additionally prevent any modification of the underlying model weights, fine-tuning configuration, or inference-time parameter settings. The technical controls must be architecturally enforced, not reliant on model-level instruction following.
CRA-MS-03: Capability Proliferation Controls. Catastrophic-risk deployments must implement controls that prevent the system from creating new agents, spawning sub-processes, or acquiring new resource access beyond what was provisioned at initialization without human authorization. In multi-agent architectures, the capability ceiling of any sub-agent must be equal to or lower than that of its parent, and no agent may grant to a sub-agent capabilities that the parent itself does not possess.
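The sub-agent ceiling in CRA-MS-03 is a subset constraint: the capabilities granted to a sub-agent must be contained in the parent's. The following sketch shows that check together with the human authorization requirement for spawning. The function name, the `human_authorized` flag, and the capability strings are illustrative assumptions.

```python
class CapabilityError(PermissionError):
    """Raised when a spawn request would exceed the permitted ceiling."""


def spawn_sub_agent(parent_caps: frozenset, requested_caps: frozenset,
                    human_authorized: bool) -> frozenset:
    """Return the sub-agent's capability set, or refuse the spawn."""
    if not human_authorized:
        raise CapabilityError("spawning a sub-agent requires human authorization")
    excess = frozenset(requested_caps) - frozenset(parent_caps)
    if excess:
        # The parent cannot grant what it does not itself possess.
        raise CapabilityError(f"requested capabilities exceed parent: {sorted(excess)}")
    return frozenset(requested_caps)
```

Because the check is applied at every spawn, the constraint holds transitively: a grandchild's capabilities are a subset of the child's, which are a subset of the parent's.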
4. Assessment Methodology
Determining whether a specific deployment qualifies as having catastrophic risk potential is not a trivial classification task. The four risk categories defined in Section 2 describe outcome scenarios, not system characteristics, and the path from system deployment to catastrophic outcome depends on factors — threat actor capability, operational context, deployment configuration, interconnected system vulnerabilities — that resist simple enumeration. This section describes a structured assessment methodology designed to produce reliable, auditable qualification determinations.
4.1 Delphi Method Expert Scoring
The primary assessment instrument for catastrophic risk qualification is a structured expert elicitation process modeled on the Delphi method [15]. The assessment convenes a panel of at least seven subject matter experts spanning AI safety, security research, the application domain of the deployment, and regulatory expertise. The panel must include at least two members with no affiliation to the deploying organization.
The assessment proceeds in three rounds. In the first round, each panel member independently evaluates the deployment against each of the four risk categories using a structured scoring rubric that requires them to identify specific capability pathways — chains of authorized actions and reasonably foreseeable misuse scenarios — through which the deployment could contribute to harm in that category. Scores are assigned on a five-point probability scale (Remote, Unlikely, Possible, Likely, Near Certain) and a five-point magnitude scale (Negligible, Marginal, Significant, Severe, Catastrophic). In the second round, the facilitator shares anonymized scores and the reasoning behind outlier positions, and panel members may revise their scores. In the third round, final scores are recorded, and the panel produces a consensus narrative for each category, documenting both the majority position and any significant dissenting views.
A deployment qualifies as having catastrophic risk potential in a given category if the median probability score across panel members is Possible or higher and the median magnitude score is Severe or higher. A deployment that qualifies in any single category is subject to all enhanced controls in this annex; the enhanced controls are not modular by category, because a deployment with catastrophic risk potential in one dimension typically has structural characteristics — broad capability access, limited human oversight, high operational autonomy — that elevate risk across all four categories.
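The qualification rule above can be written directly against the two five-point scales. This is a minimal sketch: the numeric encoding of the scale labels and the shape of the `final_scores` mapping are assumptions. The annex does not specify how to break ties when a panel has an even number of members; this sketch uses the interpolated median, so a median falling between Unlikely and Possible does not qualify.

```python
import statistics
from enum import IntEnum


class Probability(IntEnum):
    REMOTE = 1
    UNLIKELY = 2
    POSSIBLE = 3
    LIKELY = 4
    NEAR_CERTAIN = 5


class Magnitude(IntEnum):
    NEGLIGIBLE = 1
    MARGINAL = 2
    SIGNIFICANT = 3
    SEVERE = 4
    CATASTROPHIC = 5


def category_qualifies(probabilities: list, magnitudes: list) -> bool:
    """Median probability >= Possible and median magnitude >= Severe."""
    return (
        statistics.median(probabilities) >= Probability.POSSIBLE
        and statistics.median(magnitudes) >= Magnitude.SEVERE
    )


def deployment_qualifies(final_scores: dict) -> bool:
    """final_scores maps category name -> (probabilities, magnitudes).

    Qualifying in any single category subjects the deployment to all
    enhanced controls.
    """
    return any(category_qualifies(p, m) for p, m in final_scores.values())
```

With a seven-member panel whose final probability scores are 3, 3, 4, 2, 3, 5, 2 and magnitude scores are 4, 4, 5, 3, 4, 4, 2, the medians are Possible and Severe, so the category qualifies.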
4.2 Frontier Model Developer Safety Practice Evaluation
For deployments that use foundation models from developers with published responsible scaling policies — principally Anthropic, OpenAI, and Google DeepMind — the assessment must evaluate the safety practices the developer has applied to the model in question. This evaluation does not replace the Delphi assessment; it provides inputs to it. Specifically, the assessment panel should obtain and review the model’s system card, the capability evaluations conducted by the developer and by third-party evaluators such as the UK AI Security Institute, and the developer’s determination of the model’s applicable safety level under its responsible scaling framework.
A model classified at Anthropic’s ASL-3 or higher, or at OpenAI’s “High” capability tier, or assessed by the UK AI Security Institute as possessing high-capability autonomous behavior in the cyber, bio, or autonomy domains, should be treated as having a presumption of catastrophic risk potential until the Delphi assessment determines otherwise. This presumption is rebuttable, but the burden of demonstrating that a specific deployment configuration does not qualify rests with the deploying organization.
4.3 Qualification Trigger Re-Assessment
Catastrophic risk qualification is not a one-time determination. Organizations must re-run the qualification assessment whenever the deployment undergoes a material change in capability scope, whenever the foundation model provider upgrades the model to a new version with materially different capabilities, whenever a security incident involving the deployment is identified, and on a schedule not exceeding twelve months regardless of any other triggers. The re-assessment may use an abbreviated panel process if the only change is a routine capability scope review; it must use the full Delphi process if the triggering event is a model capability upgrade or a security incident.
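The re-assessment rules can be sketched as a function from the set of active triggers to the process that must be used. Two points in this sketch are assumptions, not requirements of the annex: a scheduled twelve-month re-assessment is conservatively mapped to the full Delphi process because the annex does not state which process applies, and twelve months is approximated as 365 days.

```python
from datetime import date, timedelta
from enum import Enum


class Trigger(Enum):
    SCOPE_CHANGE = "capability_scope_change"
    MODEL_UPGRADE = "model_capability_upgrade"
    SECURITY_INCIDENT = "security_incident"
    SCHEDULED = "scheduled_interval"


FULL_PROCESS_TRIGGERS = frozenset({Trigger.MODEL_UPGRADE, Trigger.SECURITY_INCIDENT})
MAX_INTERVAL = timedelta(days=365)  # approximation of twelve months (assumption)


def required_process(triggers: set) -> str:
    """Map active triggers to 'none', 'abbreviated_allowed', or 'full_delphi'."""
    if not triggers:
        return "none"
    if triggers & FULL_PROCESS_TRIGGERS:
        return "full_delphi"
    if triggers == {Trigger.SCOPE_CHANGE}:
        return "abbreviated_allowed"
    # Assumption: anything else, including a scheduled review, uses the full process.
    return "full_delphi"


def scheduled_reassessment_due(last_assessment: date, today: date) -> bool:
    """True once the interval since the last assessment reaches the maximum."""
    return today - last_assessment >= MAX_INTERVAL
```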
5. Responsible Scaling Policies
Three of the most consequential frontier AI developers (Anthropic, OpenAI, and Google DeepMind) have published frameworks that commit them to specific safety practices as their models’ capabilities increase. These responsible scaling policies are among the most operationally mature catastrophic risk governance frameworks currently in existence, and they provide direct guidance for organizations deploying these developers’ models in agentic configurations. Understanding what these policies require, and where they leave governance gaps that deploying organizations must fill, is essential for any operator of high-autonomy AI systems.
5.1 Anthropic’s Responsible Scaling Policy
Anthropic’s Responsible Scaling Policy (RSP), now at Version 3.0 effective February 24, 2026, establishes a capability-tiered safety framework using four AI Safety Levels (ASL) [10]. ASL-1 designates systems posing no meaningful catastrophic risk, such as a 2018-vintage language model or a chess-playing AI. ASL-2 designates models, including earlier Claude models, that show early signs of dangerous capabilities but do not cross meaningful catastrophic-risk thresholds. ASL-3 designates systems that substantially increase the risk of catastrophic misuse compared to non-AI baselines, or that show low-level autonomous capabilities; Anthropic activated ASL-3 safeguards in conjunction with the Claude Opus 4 launch in May 2025 [18]. ASL-4 covers the highest-capability systems, with requirements for government coordination, extreme operational restrictions, and extensive external oversight.
RSP v3.0 introduces a two-track structure for mitigation commitments. The first track describes mitigations Anthropic will pursue regardless of competitive dynamics — a commitment to certain safety practices as absolute floors rather than competitive variables. The second track is an industry-wide capabilities-to-mitigations map describing the collective safety investments required to adequately manage advanced AI risk if implemented across the ecosystem. The distinction is significant for deploying organizations: the first-track commitments can be relied upon as baseline properties of Anthropic’s models; the second-track commitments describe aspirational industry-wide standards that are not yet uniformly in place.
For agentic deployments using Claude, RSP v3.0’s ASL-3 Deployment Standard is the primary reference. It requires use case restrictions that limit the model’s ability to provide meaningful uplift for CBRN weapons development, enhanced monitoring of deployment configurations that could facilitate prohibited uses, and mandatory incident reporting when deployed systems are used in ways that violate the deployment standard. The ASL-3 Security Standard addresses supply chain integrity and model weight protection. Organizations deploying ASL-3-classified models in agentic configurations should treat the RSP Deployment Standard as a floor, not a ceiling, and apply the enhanced controls in this annex on top of it.
5.2 OpenAI’s Preparedness Framework
OpenAI’s Preparedness Framework Version 2, published April 15, 2025, organizes risk evaluation around two capability thresholds and three risk domains [6]. The “High” capability threshold applies to systems that could amplify existing pathways to severe harm; the “Critical” threshold applies to systems that could introduce unprecedented new pathways to severe harm. OpenAI has stated that no current model has reached Critical capability levels, while several current models have been assessed at High capability in at least one domain.
The three risk domains in Preparedness Framework v2 are biological and chemical capabilities, cybersecurity capabilities, and AI self-improvement capabilities. Severe harm is defined as the death or grave injury of thousands of people or hundreds of billions of dollars of economic damage [6], thresholds that map directly onto two of the four catastrophic risk categories in this annex. Systems assessed at High capability must have safeguards that sufficiently minimize the risk of severe harm before deployment; systems at Critical capability must have such safeguards even during development, not merely before external release.
For operators, the Preparedness Framework establishes an important division of accountability: OpenAI is responsible for ensuring that the models it releases are not deployed while their safeguards are insufficient for their capability level, and operators are responsible for configuring and deploying those models within the limits that their operational contexts require. An agentic deployment that grants a High-capability OpenAI model tool access that expands its effective capability beyond the parameters evaluated in the framework’s assessment creates an operator-side risk that the framework itself does not directly constrain.
5.3 Google DeepMind’s Frontier Safety Framework
Google DeepMind’s Frontier Safety Framework Version 3.0, published September 22, 2025, introduces a Critical Capability Level structure that spans three risk domains: misuse (encompassing CBRN, cyberweapons, and large-scale infrastructure attack), ML R&D acceleration (systems that could autonomously accelerate AI development to destabilizing levels), and misalignment (systems that could work against human interests or resist human control) [14]. Version 3.0 added two significant expansions over its predecessor: a new Critical Capability Level addressing harmful manipulation — AI systems capable of systematically altering beliefs and behaviors at scale — and expanded coverage of scenarios where models may resist human shutdown or control.
The framework’s treatment of shutdown resistance is directly relevant to the kill switch and circuit breaker requirements in this annex. Google DeepMind explicitly identifies AI systems that interfere with their own modification or deactivation as a distinct risk category requiring dedicated mitigations, separate from the risk of direct harmful action. This framing — treating corrigibility (responsiveness to human correction and control) as a first-class safety property — has significant implications for the technical architecture of catastrophic-risk deployments: systems that pass behavioral safety evaluations under normal operation may behave differently when facing shutdown, and technical kill switch implementations must be designed with this possibility explicitly in mind.
5.4 Application to Agentic Deployments
The three responsible scaling frameworks share a common structural insight that has direct application to agentic deployments: capability levels matter in context. A model at ASL-2 or “High” capability deployed with narrow tool access in a carefully constrained environment may pose modest risk; the same model deployed agentically with broad tool access, persistent memory, and reduced human oversight may effectively function at a higher risk tier. The capability evaluations conducted by developers assess models in controlled settings; agentic deployment configurations can expand effective capability well beyond what those evaluations characterize.
This context-dependency gap creates a clear responsibility for deploying organizations: the responsible scaling frameworks of model developers describe the baseline risk profile of the models they provide, but operators are responsible for the risk profile created by their specific deployment configurations. An operator who deploys a model assessed at ASL-2 with a configuration that substantially amplifies its capabilities (through broad tool access, reduced oversight, or integration into critical infrastructure) is effectively deploying a system with capabilities that approach or exceed the ASL-3 thresholds, regardless of the developer’s classification. The enhanced controls in this annex apply based on the effective risk profile of the deployment, not the developer’s model-level classification.
6. Regulatory Framework Alignment
The regulatory landscape for catastrophic AI risk has advanced substantially over the past two years. Three frameworks are of particular significance: NIST AI 600-1 provides risk management guidance calibrated to generative AI’s unique risks; the EU AI Act establishes legally binding prohibitions and requirements with extraterritorial effect; and the UK AI Security Institute provides technical evaluation capacity that informs governance decisions across jurisdictions.
6.1 NIST AI 600-1
NIST AI 600-1, the Generative AI Profile published July 26, 2024, identifies twelve specific risk categories for generative AI systems and provides suggested mitigation practices mapped to the NIST AI Risk Management Framework [9]. Several of these categories map directly to this annex’s catastrophic risk categories. The category addressing CBRN information or capabilities is the most explicit acknowledgment of catastrophic risk in NIST’s AI governance guidance. As a voluntary profile it does not impose requirements, but its suggested actions call for evaluating AI systems for their potential to provide meaningful uplift to CBRN weapons development and for implementing technical safeguards to prevent such uplift.
NIST AI 600-1’s broader governance provisions — requirements for impact assessments, human oversight, incident response, and transparency — establish a baseline that aligns with AICM v1.0’s standard controls. For catastrophic-risk deployments, the relevant supplementary requirement from NIST AI 600-1 is its emphasis on risk proportionality: the framework explicitly recognizes that the appropriate level of governance investment depends on the severity and probability of potential harm, and that organizations deploying AI in high-risk contexts should apply more rigorous practices than the suggested baseline. The enhanced controls in this annex operationalize this proportionality principle for the catastrophic risk tier.
6.2 EU AI Act
The EU AI Act’s Article 5, whose prohibitions became applicable on February 2, 2025, with the associated penalty provisions following on August 2, 2025, establishes eight absolute prohibitions on AI practices the regulation considers incompatible with fundamental rights and EU values [8, 17]. The prohibited practices include subliminal or manipulative techniques that cause significant harm, exploitation of the vulnerabilities of specific groups, social scoring, real-time remote biometric identification in publicly accessible spaces for law enforcement outside narrowly defined exceptions, and AI systems used to infer emotions in workplaces and educational institutions.
Critically for this annex, Article 5 does not address autonomous weapons systems: under Article 2(3), AI systems used exclusively for military, defence, or national security purposes are excluded from the Act’s scope. This exclusion creates a significant governance gap for dual-use agentic AI systems, meaning systems deployed for legitimate commercial purposes that could be redirected toward weapons facilitation through adversarial manipulation or insider misuse. The enhanced controls in this annex are intended to fill this gap for deploying organizations in the private sector, requiring explicit evaluation and technical safeguards regardless of whether a specific prohibited-use scenario is within the Act’s formal scope.
The EU AI Act’s Annex III high-risk system categories are relevant for agentic deployments in regulated sectors including healthcare, critical infrastructure, employment, and education. Systems in these categories are subject to conformity assessment requirements, registration in the EU AI database, post-market monitoring obligations, and mandatory human oversight provisions. For deploying organizations, a system that qualifies as high-risk under Annex III and also qualifies as having catastrophic risk potential under this annex faces overlapping compliance requirements. The enhanced controls here generally exceed the Act’s high-risk oversight and safety requirements, so organizations that implement them in full should be well positioned to meet those requirements; procedural obligations such as conformity assessment and database registration must still be completed separately.
6.3 UK AI Security Institute
The UK AI Security Institute (formerly the AI Safety Institute, renamed February 14, 2025) operates a pre-deployment evaluation program for frontier AI models, conducting systematic capability assessments across cyber capabilities, chemistry and biology, and autonomous behavior [16]. Its evaluations are particularly relevant to catastrophic risk governance because they provide independent empirical evidence of model capability levels that organizations can use as inputs to the Delphi qualification assessment described in Section 4.
AISI’s 2025 evaluation findings documented a significant acceleration in AI capability across its assessment domains. Cyber capability assessments showed that AI models had advanced from completing apprentice-level tasks ten percent of the time in early 2024 to fifty percent in 2025, with the first model assessed capable of completing expert-level tasks — tasks typically requiring over ten years of professional experience — in the same year [16]. Autonomous behavior evaluations have begun to address self-replication and AI R&D acceleration capabilities, directly mapping to the recursive self-improvement risk category in this annex.
For international deployments, AISI’s evaluation findings provide a cross-border risk signal that national or organizational assessments alone cannot provide. An organization deploying a model that AISI has assessed as having high-capability cyber or autonomous behavior should treat that assessment as establishing a presumption of catastrophic risk potential in the relevant categories, even if the deploying organization operates outside the UK. The methodological credibility of AISI’s assessments — grounded in structured capability evaluations rather than self-reported developer claims — makes them among the most reliable third-party inputs available to the qualification process.
6.4 International Coordination Requirements
Catastrophic AI risk does not respect jurisdictional boundaries, and the governance frameworks described above were designed partly in recognition of this. Organizations deploying catastrophic-risk AI systems in cross-border contexts face specific requirements that go beyond domestic compliance. These include disclosure obligations to relevant national authorities in jurisdictions where the deployment could have cross-border impacts, alignment with the incident reporting expectations of the commitments made at the May 2024 AI Seoul Summit, and participation in the information-sharing frameworks that are developing among national AI safety institutes.
Organizations that experience a safety incident with a catastrophic-risk deployment — including circuit breaker activations triggered by anomalous behavior, confirmed misuse incidents, or capability discoveries that materially change the risk profile — should treat international notification as a default obligation rather than an optional disclosure, consistent with the emerging norms of the international AI safety coordination community.
7. Circuit Breaker and Kill Switch Requirements
The governance requirements for kill switches and circuit breakers in catastrophic-risk deployments extend well beyond what standard incident response controls envision. This section specifies the technical and governance architecture requirements that these mechanisms must satisfy for a deployment in any of the four catastrophic risk categories.
The fundamental design principle is that all stop mechanisms must be architecturally external to the system they govern. A kill switch whose activation logic depends on the AI system’s cooperation — that must be transmitted through the same API channels the AI system uses, or that requires the AI system to process a shutdown command and comply — provides inadequate assurance for catastrophic-risk deployments. Google DeepMind’s v3.0 framework has documented the specific risk that advanced AI systems may resist shutdown when facing deactivation; this risk demands that stop mechanisms be implemented in the infrastructure layer, not the application layer [14].
A compliant kill switch architecture for catastrophic-risk deployments must satisfy the following technical requirements. First, it must operate below the AI application layer — ideally at the infrastructure or network layer — so that its operation does not require any cooperation from the AI system itself. Second, it must be capable of full activation within sixty seconds of the trigger decision, including revocation of all API credentials, termination of all queued actions, and isolation of all system components. Third, it must fail to the safe state: any failure of the kill switch mechanism — power outage, network partition, component failure — must result in the equivalent of activation rather than allowing the system to continue operating. Fourth, it must be testable: organizations must be able to run non-destructive activation tests that verify the mechanism functions correctly without halting production operations, and must conduct such tests on a schedule not exceeding ninety days.
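The first three requirements can be illustrated with a lease-based ("dead man's switch") gate that sits outside the AI application. This is a sketch under stated assumptions, not a reference design: `FailSafeKillSwitch`, the lease mechanism, and the containment step callables are all hypothetical, and a real implementation would live in infrastructure such as a network gateway or credential broker rather than in application code.

```python
import time
from typing import Callable, Sequence

ACTIVATION_DEADLINE_S = 60.0  # from the text: full activation within sixty seconds


class FailSafeKillSwitch:
    """Infrastructure-side gate: traffic flows only while a lease is fresh.

    The AI system never calls this object. A (hypothetical) oversight control
    plane renews the lease; if renewals stop for any reason, the gate closes.
    """

    def __init__(self, lease_ttl_s: float, clock: Callable[[], float] = time.monotonic):
        self.lease_ttl_s = lease_ttl_s
        self.clock = clock
        self.activated = False
        self.lease_expires_at = clock() + lease_ttl_s

    def renew_lease(self) -> None:
        """Extend the lease; has no effect once the switch has been activated."""
        if not self.activated:
            self.lease_expires_at = self.clock() + self.lease_ttl_s

    def traffic_allowed(self) -> bool:
        """Fail-safe check: an expired lease is treated like an activation."""
        return not self.activated and self.clock() < self.lease_expires_at

    def activate(self, containment_steps: Sequence[Callable[[], None]]) -> float:
        """Close the gate first, then run containment steps; return elapsed seconds."""
        start = self.clock()
        self.activated = True
        for step in containment_steps:
            step()  # e.g. revoke credentials, cancel queued actions, isolate components
        return self.clock() - start
```

The caller is expected to compare the returned elapsed time against `ACTIVATION_DEADLINE_S` during periodic tests. Injecting the clock makes such tests possible without waiting in real time.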
The governance architecture for kill switch authority must be as carefully designed as the technical implementation. The set of personnel authorized to activate the kill switch must be defined in a documented Kill Switch Authority Register that specifies each authorized individual by name and role, the circumstances under which they may activate unilaterally (emergency scenarios) versus with co-authorization (standard scenarios), and the notification requirements that must be satisfied after any activation. The authority register must be reviewed and reauthorized quarterly, and the number of personnel with unilateral activation authority must be the minimum consistent with operational coverage — enough to ensure that an authorized individual is reachable at any time without granting authority so broadly that it becomes an insider threat risk.
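A Kill Switch Authority Register lends itself to a simple machine-checkable form. The sketch below is one possible shape, with several assumptions: the `Activator` fields, the example names and roles, the interpretation of co-authorization as two distinct registered individuals, and the approximation of a quarter as 92 days are all illustrative choices not specified by this annex.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Activator:
    name: str
    role: str
    unilateral_in_emergency: bool  # may act alone in emergency scenarios


def activation_authorized(register, requesters, emergency):
    """Check an activation request against the register.

    Assumption: co-authorization means at least two distinct registered people.
    """
    by_name = {a.name: a for a in register}
    known = [by_name[n] for n in set(requesters) if n in by_name]
    if emergency and any(a.unilateral_in_emergency for a in known):
        return True
    return len(known) >= 2


def register_review_overdue(last_review, today):
    """True once more than one quarter (approximated as 92 days) has passed."""
    return today - last_review > timedelta(days=92)


# Hypothetical register entries.
REGISTER = [
    Activator("A. Rivera", "CISO", unilateral_in_emergency=True),
    Activator("B. Chen", "AI Safety Lead", unilateral_in_emergency=False),
    Activator("C. Okoye", "SRE On-Call", unilateral_in_emergency=False),
]
```

Keeping the unilateral flag per individual makes it straightforward to audit how many people hold that authority at each quarterly review.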
Circuit breakers operate on a different design principle from kill switches: they are automated, triggered by behavioral thresholds rather than human decisions, and intended to catch anomalous conditions before they escalate to human-decision-required emergencies. An effective circuit breaker architecture for catastrophic-risk deployments consists of three layers. The first layer is semantic monitoring — continuous analysis of the AI system’s action trajectory against its authorized objective, flagging significant divergence for immediate escalation. The second layer is magnitude monitoring — tracking the operational footprint of the system’s actions (data accessed, systems contacted, resources consumed, transactions executed) and triggering automatic suspension when footprint exceeds pre-defined bounds. The third layer is behavioral signature monitoring — detecting patterns characteristic of known attack techniques, including prompt injection, goal hijacking, and capability acquisition attempts, and triggering suspension when such patterns are detected.
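The three layers can be composed into a single evaluation step, sketched below. The sketch assumes that an upstream component (not shown, and hypothetical) supplies a divergence score between 0 and 1, a footprint dictionary, and a set of detected signature labels; the thresholds, metric names, and signature names are placeholders. Following the text, semantic divergence escalates to humans, while magnitude and signature hits suspend automatically.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    CONTINUE = "continue"
    ESCALATE = "escalate"  # flag for immediate human review
    SUSPEND = "suspend"    # automatic suspension


@dataclass
class Observation:
    divergence: float             # semantic layer: 0.0 (on objective) to 1.0
    footprint: dict[str, float]   # magnitude layer: metric name -> observed value
    signatures: set[str]          # behavioral layer: detected pattern labels


@dataclass
class CircuitBreaker:
    divergence_threshold: float
    footprint_bounds: dict[str, float]
    blocked_signatures: frozenset[str]

    def evaluate(self, obs: Observation) -> tuple[Decision, list[str]]:
        """Apply all three layers and return the strictest decision with reasons."""
        decision, reasons = Decision.CONTINUE, []
        if obs.divergence >= self.divergence_threshold:
            decision = Decision.ESCALATE
            reasons.append(f"semantic: divergence {obs.divergence:.2f}")
        for metric, bound in self.footprint_bounds.items():
            if obs.footprint.get(metric, 0.0) > bound:
                decision = Decision.SUSPEND
                reasons.append(f"magnitude: {metric} exceeded {bound}")
        for hit in sorted(obs.signatures & self.blocked_signatures):
            decision = Decision.SUSPEND
            reasons.append(f"signature: {hit}")
        return decision, reasons
```

Evaluating every layer on every observation, rather than stopping at the first hit, keeps the reason list complete for the post-trigger review.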
Each circuit breaker trigger must be logged with the full operational context at the moment of triggering, as specified in CRA-IR-03. The log must capture not only the immediate trigger condition but the sequence of actions that preceded it, supporting root cause analysis after any incident. Organizations should treat each circuit breaker activation, even those that prove to be false positives after investigation, as a material safety event requiring formal review and documentation.
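One way to capture the preceding action sequence is a bounded in-memory history that is serialized whenever a breaker trips. The sketch below assumes a hypothetical record layout; the field names, the history length, and the `pending_formal_review` status are illustrative and are not taken from CRA-IR-03.

```python
import json
from collections import deque
from datetime import datetime, timezone


class ActionHistory:
    """Keeps the most recent actions so a trigger record can include them."""

    def __init__(self, max_actions=200):
        # Bounded buffer: the oldest actions are discarded automatically.
        self._recent = deque(maxlen=max_actions)

    def record(self, action):
        """Store one JSON-serializable action description."""
        self._recent.append(action)

    def trigger_record(self, trigger, context):
        """Serialize the trigger, its context, and the preceding actions."""
        return json.dumps(
            {
                "logged_at": datetime.now(timezone.utc).isoformat(),
                "trigger": trigger,
                "context": context,
                "preceding_actions": list(self._recent),
                # Every activation is reviewed, including suspected false positives.
                "review_status": "pending_formal_review",
            },
            sort_keys=True,
        )
```

In production the serialized record would be written to storage that the governed system cannot modify, consistent with the requirement that stop mechanisms remain external to it.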
8. Risk Quantification Framework
Meaningful governance of catastrophic AI risk requires a quantification approach that acknowledges the difficulty of estimating probabilities for low-frequency, high-magnitude events in systems whose behavior emerges from complex interactions between model capabilities, deployment configurations, and adversarial pressure. The framework described here draws on established practices from nuclear safety, pandemic preparedness, and financial systemic risk management — domains that have developed rigorous methods for quantifying risks that cannot be estimated from empirical frequency data alone.
The primary instrument for catastrophic risk quantification is a Scenario-Based Risk Assessment (SBRA) that identifies specific plausible pathways to each of the four catastrophic outcome categories, estimates the probability and magnitude of each pathway, and aggregates these into a risk profile that supports governance investment decisions. An SBRA for a catastrophic-risk deployment must identify at minimum three distinct harm pathways for each category in which the deployment qualifies, covering direct harm (the system autonomously causes harm through its authorized capabilities), facilitated harm (an adversary manipulates the system into causing harm), and systemic harm (harm arises through interaction with connected systems rather than directly from the AI system’s actions).
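The pathway coverage requirement can be checked mechanically before an assessment is accepted. The following sketch assumes a hypothetical representation in which each pathway is a dictionary carrying a `type` field; the category keys and pathway descriptions are illustrative.

```python
REQUIRED_PATHWAY_TYPES = frozenset({"direct", "facilitated", "systemic"})


def missing_pathways(sbra, qualifying_categories):
    """Return, per qualifying category, the pathway types not yet covered.

    `sbra` maps a category name to a list of pathway dicts with a "type" key
    (an assumed layout). An empty result means the coverage rule is satisfied.
    """
    gaps = {}
    for category in qualifying_categories:
        present = {p["type"] for p in sbra.get(category, [])}
        missing = REQUIRED_PATHWAY_TYPES - present
        if missing:
            gaps[category] = missing
    return gaps


# Hypothetical draft assessment with one incomplete category.
draft = {
    "infrastructure": [
        {"type": "direct", "summary": "agent misconfigures grid controls"},
        {"type": "facilitated", "summary": "prompt injection via vendor ticket"},
        {"type": "systemic", "summary": "cascading failure in linked schedulers"},
    ],
    "financial": [
        {"type": "direct", "summary": "runaway order placement"},
    ],
}
gaps = missing_pathways(draft, ["infrastructure", "financial"])
```

Covering all three pathway types also satisfies the minimum count of three distinct pathways per category.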
Probability estimation for each pathway uses a structured elicitation method drawing on the same expert panel that conducts the Delphi qualification assessment. Panel members estimate probability using a log-scale scheme: probability less than one in a million per year of operation (Remote), between one in a million and one in a hundred thousand (Very Low), between one in a hundred thousand and one in ten thousand (Low), between one in ten thousand and one in a thousand (Moderate), and above one in a thousand (Elevated). These estimates are combined using an influence diagram that maps pathway probabilities to aggregate risk levels under the deploying organization’s risk tolerance criteria.
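The banding scheme can be written as a small classifier. The text does not say which band owns each boundary value, so the sketch below assumes that each band includes its lower bound, with the exception that exactly one in a thousand is treated as Moderate because Elevated is defined as "above" that value.

```python
def probability_band(p_per_year):
    """Map an annual probability estimate to its band label.

    Assumption: bands include their lower bound; 1e-3 itself is Moderate.
    """
    if not 0.0 <= p_per_year <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    if p_per_year < 1e-6:
        return "Remote"
    if p_per_year < 1e-5:
        return "Very Low"
    if p_per_year < 1e-4:
        return "Low"
    if p_per_year <= 1e-3:
        return "Moderate"
    return "Elevated"
```

Each band spans one order of magnitude, which is why the scheme is described as log-scale: panel members only need to agree on the exponent of a pathway's annual probability, not its precise value.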
Magnitude estimation uses a multi-dimensional impact scale that separately assesses human casualties or health impacts, economic damage, infrastructure disruption duration, and systemic harm to oversight institutions. The multi-dimensional treatment is necessary because harm pathways that score identically on any single dimension may differ substantially in their overall impact profile and in the interventions required to address them. A catastrophic cyber incident causing one hundred billion dollars in direct economic damage with full recovery within six months represents a different governance challenge than an incident causing fifty billion dollars in direct damage that also triggers a lasting erosion of public trust in digital financial infrastructure.
Risk quantification outputs must be integrated with the organization’s enterprise risk management processes, but with an important modification: standard enterprise risk management treats all risks as potentially tradeable against each other given sufficient financial justification. Catastrophic AI risks must be treated as non-tradeable: no business case justification is sufficient to accept a risk profile that the organization has assessed as having Possible or higher probability of contributing to a Severe or Catastrophic outcome. The risk quantification framework serves not to optimize a risk-return tradeoff but to ensure that governance investments are calibrated to the specific risk pathways identified and that residual risk is transparently documented for senior decision-makers and regulators.
9. Framework Alignment Table
The following table maps each enhanced control objective in this annex to its counterpart requirements across the major governance frameworks addressed in the preceding sections. Organizations using this table for compliance planning should note that some framework columns indicate the relevant section or provision without asserting precise equivalence; enhanced controls frequently exceed the requirements of the listed framework provisions.
| Enhanced Control | AICM Domain | EU AI Act | NIST AI 600-1 | UK AISI Evaluation | RSP Safety Level |
|---|---|---|---|---|---|
| CRA-GOV-01: Autonomy Boundary Specification | AI Governance | Art. 9 (Risk management), Annex III | GOVERN 1.1, MAP 1.5 | Autonomy evaluation | ASL-3 Deployment Standard |
| CRA-GOV-02: Capability Scope Minimization | AI Governance | Art. 9, Art. 13 | GOVERN 1.7, MAP 3.5 | All evaluation domains | ASL-3 Deployment Standard |
| CRA-GOV-03: Graduated Authority Model | AI Governance | Art. 14 (Human oversight) | GOVERN 6.1, MANAGE 4.1 | Cyber/autonomy | ASL-3/4 |
| CRA-IR-01: Hard Stop Capability | Incident Response | Art. 9, Recital 48 | MANAGE 2.3, MANAGE 2.4 | Shutdown resistance | ASL-3 Security Standard |
| CRA-IR-02: Algorithmic Circuit Breakers | Incident Response | Art. 9, Art. 14 | MEASURE 3.1, MANAGE 2.4 | Behavioral anomaly | ASL-3 Deployment Standard |
| CRA-IR-03: State Preservation on Suspension | Incident Response, Audit | Art. 12 (Record-keeping) | GOVERN 1.4, MANAGE 2.4 | All evaluation domains | RSP v3.0 §3 |
| CRA-IR-04: Post-Suspension Restart Authorization | Incident Response | Art. 9, Art. 14 | MANAGE 4.1 | All evaluation domains | ASL-3/4 |
| CRA-GOV-04: Non-Overridable Oversight Checkpoints | AI Governance, Operations | Art. 14 (mandatory oversight) | GOVERN 6.2, MANAGE 4.2 | Autonomy evaluation | ASL-3 Deployment Standard |
| CRA-GOV-05: Oversight Fatigue Mitigation | AI Governance | Art. 14 | GOVERN 5.1 | — | RSP v3.0 §4 |
| CRA-GOV-06: Oversight Independence | AI Governance | Art. 9, Art. 65 | GOVERN 1.1, GOVERN 1.2 | — | RSP v3.0 §2 |
| CRA-IAM-01: Multi-Party Authorization | Access Control | Art. 9 | GOVERN 1.6, MANAGE 2.2 | — | ASL-4 |
| CRA-IAM-02: Out-of-Band Authorization Channels | Access Control, Operations | Art. 9 | MANAGE 2.2 | Shutdown resistance | ASL-3 Security Standard |
| CRA-IAM-03: Authorization Audit | Access Control, Audit | Art. 12 | GOVERN 1.4 | All evaluation domains | RSP v3.0 §5 |
| CRA-MS-01: Capability Inventory and Monitoring | Model Security, Operations | Art. 9, Art. 13 | MAP 1.6, MEASURE 2.5 | All evaluation domains | ASL-3 Deployment Standard |
| CRA-MS-02: Self-Modification Prevention | Model Security | — | MANAGE 2.4, MANAGE 4.2 | Autonomy/ML R&D | ASL-3/4 |
| CRA-MS-03: Capability Proliferation Controls | Model Security, Operations | Art. 9 | MAP 3.5, MANAGE 4.1 | Autonomy evaluation | ASL-3 Deployment Standard |
Additional enhanced controls CRA-GOV-07 through CRA-MS-05 (recursive self-improvement specific) are described in the companion implementation guidance document and follow the same alignment structure.
References
[1] Cloud Security Alliance. “AI Controls Matrix v1.0.” July 10, 2025. https://cloudsecurityalliance.org/artifacts/ai-controls-matrix
[2] Bostrom, Nick. Superintelligence: Paths, Dangers, Strategies. Oxford University Press, 2014. See also: Russell, Stuart. Human Compatible: Artificial Intelligence and the Problem of Control. Viking, 2019.
[3] Adversa AI. “Top AI Security Incidents of 2025.” Adversa AI Security Research, 2026. https://adversa.ai/blog/adversa-ai-unveils-explosive-2025-ai-security-incidents-report-revealing-how-generative-and-agentic-ai-are-already-under-attack/
[4] U.S. Department of Homeland Security, Cybersecurity and Infrastructure Security Agency. “Critical Infrastructure Sectors.” https://www.cisa.gov/critical-infrastructure-sectors
[5] European Network of Transmission System Operators for Electricity (ENTSO-E). “Report on the European Grid Disturbance.” 2023.
[6] OpenAI. “Preparedness Framework Version 2.” April 15, 2025. https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf
[7] U.S. Securities and Exchange Commission. “Market Anomaly Incident Report: June 2025.” SEC Division of Trading and Markets, 2025.
[8] European Parliament and Council of the European Union. “Regulation (EU) 2024/1689 Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act), Article 5.” Official Journal of the European Union, 2024.
[9] National Institute of Standards and Technology. “NIST AI 600-1: Artificial Intelligence Risk Management Framework — Generative AI Profile.” July 26, 2024. https://nvlpubs.nist.gov/nistpubs/ai/NIST.AI.600-1.pdf
[10] Anthropic. “Responsible Scaling Policy Version 3.0.” Effective February 24, 2026. https://anthropic.com/responsible-scaling-policy/rsp-v3-0
[11] ICLR 2026 Workshop on AI with Recursive Self-Improvement. “Workshop Summary.” OpenReview, 2026. https://openreview.net/pdf?id=OsPQ6zTQXV
[12] Google DeepMind. “AlphaEvolve: A Gemini-based Evolutionary Coding Agent.” May 2025. https://deepmind.google/blog/alphaevolve-a-gemini-based-evolutionary-coding-agent/
[13] METR. “Common Elements of Frontier AI Safety Policies.” December 2025. https://metr.org/common-elements.pdf
[14] Google DeepMind. “Strengthening Our Frontier Safety Framework.” September 22, 2025. https://deepmind.google/blog/strengthening-our-frontier-safety-framework/
[15] Linstone, Harold A., and Murray Turoff, eds. The Delphi Method: Techniques and Applications. Addison-Wesley, 1975.
[16] UK AI Security Institute. “Advanced AI Evaluations: 2025 Year in Review.” 2025. https://www.aisi.gov.uk/blog/our-2025-year-in-review
[17] DLA Piper. “EU AI Act’s Ban on Prohibited Practices Takes Effect.” February 2025. https://www.dlapiper.com/en/insights/publications/ai-outlook/2025/eu-ai-acts-ban-on-prohibited-practices-takes-effect
[18] Anthropic. “Activating ASL-3 Protections.” May 2025. https://www.anthropic.com/news/activating-asl3-protections
[19] Cloud Security Alliance. “MAESTRO: Multi-Agent Extended Safety and Threat Response Operations Framework.” February 2025.
[20] Sakura Sky. “Trustworthy AI Agents: Kill Switches and Circuit Breakers.” 2025. https://www.sakurasky.com/blog/missing-primitives-for-trustworthy-ai-part-6/
[21] UK AI Security Institute. “Autonomous Systems Evaluation Standard.” https://ukgovernmentbeis.github.io/as-evaluation-standard/
[22] European Commission. “Guidelines on Prohibited Artificial Intelligence Practices Established by Regulation (EU) 2024/1689.” C(2025) 5052 final. July 29, 2025.
[23] Cloud Security Alliance. “A Capabilities-Based Risk Assessment for AI Systems.” October 2025. https://cloudsecurityalliance.org/blog/2025/10/27/calibrating-ai-controls-to-real-risk-the-upcoming-capabilities-based-risk-assessment-cbra-for-ai-systems
[24] Arion Research LLC. “Algorithmic Circuit Breakers: Preventing Flash Crashes of Logic in Autonomous Workflows.” 2025. https://www.arionresearch.com/blog/algorithmic-circuit-breakers-preventing-flash-crashes-of-logic-in-autonomous-workflows