AI systems occupy a different kind of risk position in your organisation. They don't just fail to work: they work incorrectly, often at scale, often undetected, and often in ways that cannot simply be rolled back. Three characteristics make AI incidents qualitatively different from traditional technology failures:
| Characteristic | Traditional IT Failure | AI System Failure |
|---|---|---|
| Detection | Usually immediate: the system goes down and an alert fires | Often delayed: the system keeps running, producing wrong outputs that nobody notices |
| Blast radius | Bounded: the system stops, so the impact halts | Accumulating: an AI making 10,000 decisions a day creates 10,000 harms before anyone detects the problem |
| Accountability | Clear: the system failed | Ambiguous: was it the model, the data, the deployment, the humans, or the vendor? |
| Reversibility | High: restore from backup or restart the service | Low: retrospective decisions can't be un-made, and affected parties can't be un-harmed |
| Regulation | General IT standards and GDPR | EU AI Act Article 73, sector-specific AI regulations, GDPR, FCA, ICO |
AI incidents fall into three categories under the R3AI Standard: Reliability failures (the system doesn't do what it was designed to do), Resilience failures (the system can't maintain performance under real-world conditions), and Responsibility failures (the system causes harm, bias, or ethical damage). Understanding which category an incident falls into shapes how you contain, govern, and learn from it.
This playbook gives leadership teams the V-AIM command framework to lead their response. It covers the six stages of V-AIM — Prepare, Detect, Contain, Govern, Recover, Learn — and the First 24 Hours timeline, regulatory obligations, communication sequencing, the AI TRACE post-incident review, and the Leadership Metrics Guide. Six editable templates in the appendices can be adapted for your organisation immediately.
A note on use: this playbook is written for executive leadership teams, the people who need to govern an AI incident, not just technically resolve it. It is deliberately jargon-light and action-oriented. Technical teams will need their own supplementary runbooks; this document governs the leadership layer that sits above them.
The single most important thing you can do for AI incident response is prepare before anything goes wrong. Organisations that define their severity classifications, response team roles, and communication templates in advance will respond in hours. Those that start from scratch during an incident will respond in days, by which time the damage has compounded.
Not every AI problem is a crisis. The V-SEV scale provides a consistent severity classification language across technical and non-technical teams. Use this matrix to classify incidents consistently so everyone on the command team is working from the same understanding of urgency and expected response times.
| V-SEV Level | Definition | Examples | Response SLA | Escalation |
|---|---|---|---|---|
| V5 — SYSTEMIC TRUST EVENT | Active harm occurring, regulatory breach, or material public exposure. Widespread impact on trust, operations, or reputation. | Deepfake fraud in progress; AI producing dangerous medical or financial advice; PII breach via AI system; AI weaponised against customers; systemic bias at scale | Immediate response. Incident Lead on-call within 1 hour. Board notification within 4 hours. | Incident Lead → Executive Sponsor → CEO → Legal → Board |
| V4 — CRITICAL | Significant business or reputational impact; regulatory exposure likely; material customer harm. | Biased model discovered affecting hiring or lending decisions; financial miscalculations in AI-assisted reporting; identity verification bypass at scale; governance failure with external consequences | Response team assembled within 4 hours | Incident Lead → Executive Sponsor → Legal |
| V3 — SIGNIFICANT | Operational disruption with limited external impact; quality degradation affecting business processes; regulatory interest possible. | AI system goes offline; automation pipeline failures causing delays; model drift reducing output quality below threshold; localised bias detected | Response within 24 hours | Technical Containment Lead → Incident Lead |
| V2 — MODERATE | Repeated errors or anomalies with limited impact; internal review required; no external regulatory trigger. | Gradual accuracy decline affecting outputs; unusual patterns flagged by monitoring; minor data quality issues in production; edge-case model failures | Response within 48 hours | Technical Containment Lead |
| V1 — IRREGULARITY | Minor deviation; no external or customer impact; detected through routine monitoring. Informational only. | Single anomalous output; testing environment issues; minor configuration drift; low-confidence flag from monitoring system | Review within 72 hours. Log and monitor. | Technical Containment Lead (log only) |
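For the technical runbooks that sit beneath this playbook, the V-SEV matrix can be encoded once so that monitoring alerts, ticket templates, and escalation tooling all quote the same severity language. A minimal Python sketch; the field names and structure are illustrative assumptions, and only the levels and SLAs come from the matrix above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    label: str                 # level name from the V-SEV matrix
    response_sla_hours: int    # time to first response, per the matrix
    board_notification: bool   # V5 requires board notification within 4 hours

# Illustrative encoding of the matrix above; adapt to your own scale.
V_SEV = {
    5: Severity("SYSTEMIC TRUST EVENT", 1, True),
    4: Severity("CRITICAL", 4, False),
    3: Severity("SIGNIFICANT", 24, False),
    2: Severity("MODERATE", 48, False),
    1: Severity("IRREGULARITY", 72, False),
}

def sla_for(level: int) -> int:
    """Response SLA in hours for a given V-SEV level."""
    return V_SEV[level].response_sla_hours
```

Encoding the matrix in one place keeps technical and non-technical teams working from the same definitions, which is the point of the scale.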
Define and assign these six V-AIM command roles before any incident occurs. Each role must have a named primary and a named backup. These are decision-making roles with clear authority — not just communication functions.
| V-AIM Role | Responsibilities | Typically Held By |
|---|---|---|
| Executive Sponsor | Ultimate accountable executive. Approves significant decisions on containment, disclosure, and recovery. Single board-to-response-team interface. Signs off regulatory notifications. | CEO, COO, or Board-designated executive |
| Incident Lead | Operational command of the incident. Coordinates all six V-AIM command roles. Makes the call on containment actions. Single escalation point for the response team. Produces status updates for the Executive Sponsor. | CAIO, CTO, or senior executive designated in advance |
| Technical Containment Lead | Owns the technical investigation. Responsible for root cause analysis, containment actions, model rollback, evidence preservation, and technical recovery validation. | Head of AI/ML, Principal Engineer, or Lead Data Scientist |
| Legal & Compliance Lead | Assesses regulatory notification obligations (EU AI Act Article 73, GDPR, FCA, sector-specific). Reviews all external communications before release. Manages evidence chain and liability considerations. | General Counsel, Chief Compliance Officer, or external AI law firm on retainer |
| Communications Lead | Owns all internal and external messaging. Controls communication timing and sequencing. Prepares and adapts templates. Approves all final language before distribution. Manages media if required. | Head of Communications, CMO, or external PR lead |
| Business Owner | Represents the business function(s) affected. Quantifies customer and operational impact. Manages customer service escalations. Tracks affected accounts, decisions, or transactions. Confirms when operational recovery is complete. | COO, Customer Success Director, or Operations Head |
Before your organisation is ever in an AI incident, the following 12 conditions must already be satisfied. These are the Non-Negotiables — the minimum viable governance foundation for effective V-AIM command. If any of these are missing, your incident response capability is incomplete, regardless of the quality of your technical infrastructure.
The first 24 hours of an AI incident are the most consequential. Decisions made in this window determine whether the situation is contained or whether it compounds. Speed matters, but not more than the quality of decisions. Acting too fast without understanding the blast radius can make things worse.
Not every anomaly is an incident. Before triggering the full response, confirm:
Use the Incident Declaration Template (Appendix A). The act of formal declaration:
The first and most important question: Should we switch this AI system off?
If active harm is currently occurring → Shut down the AI system immediately. The cost of a false positive (unnecessary downtime) is almost always less than the cost of continued harm; the rare exception is addressed below.
If harm is retrospective (already happened) → Preserve all evidence before any remediation. Do not restart the system until root cause is understood.
If the situation is ambiguous → Shut down and investigate. Continuing to run a broken system creates liability; prudent precaution does not.
If shutting down would cause greater harm → For example, a medical or safety-critical AI where shutdown creates patient risk. In this case, escalate immediately and engage Legal Counsel before any action.
Evidence preservation is non-negotiable and must precede any technical fix. Legal and regulatory proceedings will depend on evidence integrity.
Work with Legal Counsel to determine notification obligations. Do not wait for this assessment to conclude before taking containment action, but begin it in parallel.
| Regulation | Trigger | Deadline | Action Required |
|---|---|---|---|
| EU AI Act (Art. 73) | Serious incident involving a high-risk AI system, including near-misses | Immediately upon becoming aware; follow-up report within timeline set by authority | Notify relevant market surveillance authority; preserve incident records |
| GDPR / UK GDPR | Personal data involved in breach caused or exacerbated by AI system | 72 hours from awareness to regulator; without undue delay to data subjects if high risk | Notify ICO (UK) / supervisory authority (EU); notify affected individuals if required |
| FCA (Financial Services) | Material operational incident affecting regulated services; AI-related consumer harm | As soon as practicable | Notify FCA; document incident and response; consider consumer redress obligations |
| Sector-specific | Healthcare, critical infrastructure, defence, etc. | Varies by sector and jurisdiction | Review sector-specific AI guidance; engage sector regulator |
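The 72-hour GDPR window is the one hard deadline most organisations will face, so it is worth computing automatically the moment awareness is logged. A minimal sketch; the function and constant names are illustrative:

```python
from datetime import datetime, timedelta, timezone

GDPR_WINDOW = timedelta(hours=72)  # UK/EU GDPR: 72 hours from awareness to regulator

def notification_deadline(awareness_utc: datetime) -> datetime:
    """Latest time to notify the supervisory authority, counted from awareness."""
    return awareness_utc + GDPR_WINDOW

# Example: awareness logged at noon UTC on 1 Jan gives a deadline of noon UTC on 4 Jan
deadline = notification_deadline(datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc))
```

The clock runs from awareness, not from the incident's start, which is one more reason the timeline scribe described later must timestamp the moment of detection.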
By this point, the immediate crisis is stabilised. The focus shifts from containment to understanding. Root cause investigation must be structured, not anecdotal.
Every AI incident investigation must answer these four questions before remediation can be authorised:
| # | Question | Why it matters |
|---|---|---|
| 1 | What failed? (The model, the data, the deployment, the process, or the humans?) | Determines the remediation approach: a model fix is different from a process fix |
| 2 | When did it start? (Point of failure vs. point of discovery) | Defines the blast radius: how many decisions were affected before detection |
| 3 | What was the blast radius? (How many decisions, transactions, or people were affected?) | Drives notification obligations, customer redress, and reputational risk assessment |
| 4 | Why did our controls not catch this? (What monitoring, testing, or governance mechanism failed?) | The answer drives post-incident governance improvement; without it, you will face the same incident again |
Create and maintain an incident timeline throughout: every action, every decision, and every communication, each timestamped. This serves three purposes: legal defensibility, post-incident learning, and board/regulator reporting. Assign a named scribe to maintain the timeline in real time.
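Where the scribe's timeline lives in tooling rather than a document, an append-only log with UTC timestamps is sufficient. A minimal sketch, assuming a simple in-memory structure; real deployments should write to durable, tamper-evident storage:

```python
from datetime import datetime, timezone

class IncidentTimeline:
    """Append-only, timestamped record of actions, decisions, and communications."""

    def __init__(self):
        self._entries = []

    def record(self, category: str, detail: str, actor: str) -> dict:
        entry = {
            "utc": datetime.now(timezone.utc).isoformat(),
            "category": category,   # e.g. "action", "decision", "communication"
            "detail": detail,
            "actor": actor,         # named person, per the scribe requirement
        }
        self._entries.append(entry)
        return entry

    def export(self) -> list:
        """Chronological copy for board or regulator reporting."""
        return list(self._entries)
```

Whatever the implementation, the discipline is the same: entries are added, never edited, so the record remains legally defensible.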
How an organisation communicates during an AI incident often matters as much as what it actually does technically. Three rules govern effective crisis communication:
CEO, COO, CFO, and relevant department heads. Focus on classification, what is known, immediate actions taken, and the response plan.
Brief those whose teams may be affected or who will be involved in response. Give clear instructions on what to do and what not to say.
Only if the incident becomes public or if staff will encounter customer questions. Must be coordinated with external communications to ensure consistent messaging.
No customer should learn about an incident affecting them from the media. Personalised communication (using Appendix C template) to affected parties takes priority over all other external communications.
File notifications per regulatory requirements (see Part 2). Early, incomplete notification is preferred over late, complete notification.
Do not proactively publish a public statement unless the incident is already public or likely to become so. Unnecessary disclosure creates additional reputational risk without corresponding benefit.
Prepare a media holding statement from the moment the incident is classified V5 or V4. Do not wait for a journalist to call. Use the template in Appendix D.
One of the most common errors in AI incident response is restoring the system before the failure is truly understood. The pressure to restore service is real, but the cost of a second failure, or of restoring a system that continues to cause harm, is far higher than continued downtime.
The post-incident phase is where most organisations fail. The immediate crisis passes, pressure eases, and the hard work of systemic improvement gets deprioritised. Organisations that treat post-incident reviews as genuine learning exercises, rather than box-ticking, are the ones that do not face the same incident twice.
Conduct the AI TRACE review within 5–10 business days of incident resolution, while details are fresh. The full review template is in Appendix E.
Use the Five Whys technique: for each proposed cause, ask "Why did this happen?" until you reach the systemic root, not just the proximate trigger.
| Root cause category | Diagnostic questions |
|---|---|
| Model failure | Did the model perform outside its design envelope? Was there evidence of drift? Was the model trained on representative data? |
| Data quality | Was the training data representative, current, and unbiased? Were there data pipeline failures? Was there a distribution shift between training and production data? |
| Deployment error | Was there a misconfiguration, version mismatch, or integration failure at deployment? Were deployment checks adequate? |
| Process failure | Did humans rely on AI outputs without appropriate verification? Were escalation procedures followed? Were governance controls bypassed? |
| Adversarial attack | Was the AI system deliberately manipulated? Are there indicators of prompt injection, data poisoning, or model inversion? |
| Governance gap | Did adequate pre-deployment testing, monitoring, or human oversight exist? Was accountability clearly assigned? |
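Several of the diagnostic questions above, notably drift and distribution shift, can be screened numerically before the qualitative review. One common heuristic is the Population Stability Index; a minimal sketch, noting that the thresholds are conventional rules of thumb rather than standards:

```python
import math

def psi(expected, actual):
    """Population Stability Index between two binned distributions.
    Inputs are bin fractions summing to 1 (e.g. training vs. production data).
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 significant shift."""
    return sum(
        (a - e) * math.log(a / e)
        for e, a in zip(expected, actual)
        if e > 0 and a > 0   # skip empty bins to avoid log(0)
    )
```

A PSI computed per input feature at the monitoring cadence gives the Technical Containment Lead concrete evidence for or against the distribution-shift hypothesis, rather than relying on anecdote.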
The board needs to know about significant AI incidents, but they do not need to understand them technically. A board report on an AI incident should inform, not overwhelm.
| Criterion | Good | Common error |
|---|---|---|
| Length | 1–2 pages maximum | 10-page technical briefing with model performance charts |
| Language | Plain English; no AI jargon | Technical terminology that directors cannot assess |
| Focus | Impact, response, and prevention | Detailed technical root cause analysis |
| Accountability | Clear owner for each remediation action | Diffuse accountability: "the team is working on it" |
| Tone | Factual; appropriate seriousness | Defensive; minimising the incident or over-reassuring |
The full editable Board Incident Report template is in Appendix F.
The following three case studies apply the framework in this playbook to real-world AI security incidents. Case 1 and Case 2 draw on publicly reported incidents analysed by the author; Case 3 is a composite drawn from common deployment patterns. In each case, the focus is on what the response should have looked like, not just what went wrong.
A threat actor used AI-generated synthetic media, a real-time deepfake video call, to impersonate senior officials in a live video conference. Finance staff, believing they were receiving authorised instructions from legitimate leadership, transferred $25 million to fraudulent accounts. The deception was not discovered until after the transfers had been executed.
The proximate cause was the deepfake technology. The systemic root cause was a governance gap: there was no out-of-band verification protocol for large financial transfers. The AI-mediated communication channel (video conferencing) was being used as an authentication mechanism, which it was never designed to be. No secondary confirmation was required; no call-back protocol existed; no threshold-based approval workflow applied.
Upon discovery that the transfer was fraudulent: immediate escalation to the Incident Lead. Financial systems flagged to prevent further transactions pending investigation. Evidence preserved: all call recordings, transfer authorisation records, and communications. Law enforcement contacted immediately; the window for financial recovery is narrow.
Internal investigation: was this a targeted attack, or part of a wider campaign? Are any other transfers at risk? Which authentication controls failed? What was the chain of authorisation? Legal Counsel engaged: does this trigger regulatory notification?
Notify relevant financial regulators. Brief the board with a concise incident report. Prepare a communication plan for any third parties whose data or interests are affected.
Implement out-of-band verification for all financial transfers above a defined threshold. Deploy deepfake detection tooling for high-stakes video communications. Conduct tabletop exercises simulating synthetic media attacks across all departments with financial authority.
McKinsey's internal AI assistant "Lilli" produced inaccurate client-facing outputs, raising questions about governance oversight and validation processes. The incident illustrates how a high-profile enterprise AI tool — built by one of the world's leading advisory firms — can suffer governance failures that generate significant reputational exposure. The V-AIM response framework applied to this case demonstrates how AI reliability failures require the same command structure as security incidents.
This is a Reliability failure under the R3AI Standard: the system did not consistently do what it was designed to do when deployed in high-stakes client-facing contexts. The underlying governance gap was insufficient validation of AI outputs before client use, combined with a deployment environment that gave users insufficient signal about when outputs required independent verification. The reputational exposure was amplified by the prominence of the organisation and the trust clients place in advisory outputs.
Upon identifying that AI outputs have reached clients and may be inaccurate: Incident Lead activated. Classification at V4 (Critical) with potential V5 escalation if client harm is confirmed at scale. Technical Containment Lead begins immediate audit of affected outputs.
Suspension of the specific AI workflow pending investigation. Legal & Compliance Lead assesses professional liability exposure. Business Owner maps all client engagements where affected outputs may have been used. Communications Lead prepares holding statement for internal use.
AI TRACE review initiated. Clients notified directly before any public disclosure. Accountability mapped: who approved the output for client use, and what validation steps were in place? Executive Sponsor signs off on client communication approach.
Output validation protocols strengthened. Human-in-the-loop review requirements clarified for client-facing AI use. Monitoring enhanced. The incident is incorporated into the firm's AI governance framework as a standing reference case.
A retail financial institution deploys an AI-based credit approval system. Six months post-deployment, an internal analyst notices that approval rates differ significantly across demographic groups, in a way that cannot be explained by creditworthiness factors alone. The finding is escalated internally. The legal and regulatory implications are immediately significant: this may constitute unlawful discrimination under consumer credit regulation.
The system was technically functioning correctly. It was producing outputs consistent with its training data. The training data reflected historical lending patterns, and historical lending patterns reflected decades of structural bias in credit markets. The AI had learned and replicated historical bias. This is a governance failure that cannot be resolved by a software fix: it requires a fundamental reconsideration of what the training data represents.
There is no active harm requiring immediate shutdown (approvals are not creating a safety risk). But the situation is V4 — Critical: significant business and regulatory impact, with potential for legal liability and regulatory action. The Legal & Compliance Lead is engaged immediately. Does this constitute unlawful discrimination? Does it require FCA notification?
Technical team confirms the statistical disparity. The decision is made to suspend the AI model and revert to manual review for new applications. This is the right call: the cost of continued discriminatory approvals exceeds the cost of manual review overhead.
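Confirming a disparity of this kind usually starts with a selection-rate comparison. A minimal sketch of the adverse impact ratio; note that the "four-fifths rule" threshold is a US screening convention rather than a UK legal test, and the function and variable names here are illustrative:

```python
def adverse_impact_ratio(approved_a, total_a, approved_b, total_b):
    """Ratio of approval rates: protected group (a) vs. reference group (b).
    Values below ~0.8 are a common screening red flag, not proof of discrimination."""
    return (approved_a / total_a) / (approved_b / total_b)

# Example: 30% vs. 50% approval rates gives a ratio of 0.6, below the 0.8 threshold
ratio = adverse_impact_ratio(300, 1000, 500, 1000)
```

A ratio below threshold is a trigger for the legal and statistical review described above, not a conclusion in itself: the question of whether creditworthiness factors explain the gap remains for the investigation.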
All approved and declined decisions from the AI system are reviewed for affected patterns. The FCA is engaged proactively rather than waiting to be discovered; early, voluntary disclosure is typically treated as a significant mitigating factor in regulatory proceedings.
The model is retrained with fairness constraints. Bias metrics are added to the model monitoring dashboard alongside accuracy metrics. A quarterly fairness audit is established as a permanent governance control.
| Metric | What It Measures | Target | Frequency |
|---|---|---|---|
| Mean Time to Detect (MTTD) | Average time between incident start and organisational awareness | <4 hours for V4/V5; <24 hours for V2/V3 | Per incident; quarterly average |
| Mean Time to Contain (MTTC) | Average time from detection to confirmed containment | <2 hours for V5; <8 hours for V4 | Per incident; quarterly average |
| Regulatory Notification Compliance | % of notifiable incidents notified within the required window | 100% | Per incident |
| AI TRACE Completion Rate | % of V2+ incidents receiving a completed AI TRACE review | 100% for V3+; >80% for V2 | Per incident; quarterly audit |
| 12 Non-Negotiables Compliance | % of 12 non-negotiable readiness prerequisites satisfied | 100% | Quarterly review |
| Incident Recurrence Rate | % of incidents where the same root cause category appears more than once | 0% for V4/V5 root causes | Quarterly trend |
| Response Team Readiness | % of V-AIM command roles with named primary and backup contacts in place | 100% | Quarterly check; after any team change |
| Simulation Frequency | Number of tabletop incident simulations conducted per year | Minimum 2 per year | Annually reported |
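MTTD and MTTC reduce to averaging time intervals pulled from the incident timeline. A minimal sketch of the quarterly calculation; the record layout is an assumption to adapt to your own incident data:

```python
from datetime import datetime
from statistics import mean

def mean_hours(intervals):
    """Average duration of (start, end) datetime pairs, in hours."""
    return mean((end - start).total_seconds() / 3600 for start, end in intervals)

# MTTD inputs: (incident start, organisational awareness) per incident
detections = [
    (datetime(2025, 1, 3, 9, 0), datetime(2025, 1, 3, 11, 0)),    # detected in 2h
    (datetime(2025, 2, 10, 14, 0), datetime(2025, 2, 10, 20, 0)), # detected in 6h
]
mttd = mean_hours(detections)  # 4.0 hours, within the <24h V2/V3 target
```

The same function computes MTTC by feeding it (awareness, confirmed containment) pairs instead, which keeps the two metrics directly comparable.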
Boards should receive a standardised AI incident summary at each board meeting, not just when a significant incident occurs. The absence of incidents is itself a governance signal that should be reported alongside incident data.
The following six templates are designed to be adapted for your organisation. Adapt the language to your organisation's tone and governance structure. Pre-approve templates B, C, and D with Legal Counsel before any incident occurs.
We are writing to inform you of an incident affecting [AI system name], which [brief, plain-language description of what the system does, one sentence].
Our next update to this group will be at [time/date]. Questions should be directed to [named contact] only; please do not discuss this incident externally or with customers until further notice.
[Name], Incident Lead
Dear [Customer name / Valued customer],
We are writing to let you know about an issue that affected [plain-language description of the AI-powered service, avoid the word "AI" unless legally required or already public].
If you have any questions or concerns, please contact [dedicated contact / channel, not a generic support address for a serious incident].
We take [privacy / safety / accuracy] seriously and we are sorry for any distress or inconvenience this has caused.
[Name], [Title]
[Date]
STATEMENT FROM [ORGANISATION NAME]
[Date]
[Organisation name] is aware of [brief, factual description of the incident, one sentence. Do not speculate. Do not use language that implies certainty about cause unless it has been confirmed].
[We have taken the following immediate action: (specific action taken, system suspension, investigation launched, regulators notified).]
[Affected parties, if applicable: "We have notified / are notifying affected customers directly."]
We are committed to [the relevant value, the safety of our customers / the integrity of our systems / transparent operation] and will provide further updates as our investigation develops.
Note: Prepare this statement from the moment of V5 or V4 declaration. Do not wait for a journalist to call. The holding statement is used when a journalist contacts you before you have a full statement ready: "We are aware of the situation and are investigating. We will have a full statement within [timeframe]." Never say "no comment".
List all events chronologically from first detection to resolution, with timestamps.
| Gap Identified | Owner | Remediation Action | Target Date |
|---|---|---|---|
| Action | Owner | Priority | Target Date | Status |
|---|---|---|---|---|
3–4 sentences. What happened, when, what the impact was. Plain English, no technical language.
What was done in the first 24 hours. Focus on decisions, not technical detail.
One or two plain-English sentences. The board does not need the technical root cause; they need to understand whether this was a technology failure, a process failure, a governance gap, or an external attack.
| Category | Detail |
|---|---|
| Customers / users affected | |
| Financial impact (direct) | |
| Financial impact (remediation estimate) | |
| Regulatory notifications filed | |
| Regulatory investigation status | |
| Reputational / media exposure | |
What is being done to ensure this does not recur. Specific actions with named owners, not general commitments.
| Item | Owner | Due Date | Board Action Required? |
|---|---|---|---|
The board is asked to note this report and [any specific board action required, e.g., approve remediation budget / note regulatory status / confirm escalation threshold has been met].