AI systems occupy a different kind of risk position in your organisation. They don't just fail to work — they work incorrectly, often at scale, often undetected, and often in ways that can't simply be rolled back. Five characteristics make AI incidents qualitatively different from other technology failures:
| Characteristic | Traditional IT Failure | AI System Failure |
|---|---|---|
| Detection | Usually immediate — system goes down, alert fires | Often delayed — system keeps running, producing wrong outputs nobody notices |
| Blast radius | Bounded — the system stops, impact halts | Accumulating — an AI making 10,000 decisions a day can compound thousands of flawed decisions before anyone notices |
| Accountability | Clear — the system failed | Ambiguous — the model, the data, the deployment, the humans, the vendor? |
| Reversibility | High — restore from backup, restart the service | Low — retrospective decisions can't be un-made; affected parties can't be un-harmed |
| Regulation | General IT standards and GDPR | EU AI Act Article 73, sector-specific AI regulations, GDPR, FCA, ICO |
This playbook gives leadership teams a clear, phase-based response framework. It covers what to do in the first 24, 48, and 72 hours — and the critical post-incident phases of root cause review and board reporting. Six editable templates in the appendices can be adapted for your organisation immediately.
A note on use: this playbook is written for executive leadership teams — the people who need to govern an AI incident, not just technically resolve it. It is deliberately jargon-light and action-oriented. Technical teams will need their own supplementary runbooks; this document governs the leadership layer that sits above them.
The single most important thing you can do for AI incident response is prepare before anything goes wrong. Organisations that define their severity classifications, response team roles, and communication templates in advance will respond in hours. Those that start from scratch during an incident will respond in days — by which time the damage has compounded.
Not every AI problem is a crisis. Use this matrix to classify incidents consistently — so everyone on the response team is working from the same understanding of urgency and expected response times.
| Severity | Definition | Examples | Response SLA | Escalation |
|---|---|---|---|---|
| SEV-1 CRITICAL | Active harm occurring, regulatory breach, or material public exposure | Deepfake fraud in progress; AI producing dangerous medical or financial advice; PII breach via AI system; AI weaponised against customers | Immediate response. Incident Commander on-call within 1 hour. | CAIO/CTO → CEO → Legal → Board Liaison |
| SEV-2 HIGH | Significant business or reputational impact; regulatory exposure; material customer harm | Biased model discovered affecting hiring or lending decisions; financial miscalculations in AI-assisted reporting; identity verification bypass at scale | Response team assembled within 4 hours | CAIO/CTO → CEO → Legal |
| SEV-3 MEDIUM | Operational disruption with limited external impact; quality degradation affecting business processes | AI system goes offline; automation pipeline failures causing delays; model drift reducing output quality below threshold | Response within 24 hours | Technical Lead → CAIO/CTO |
| SEV-4 LOW | Minor deviation; no external or customer impact; detected through routine monitoring | Gradual accuracy decline; unusual output patterns flagged in testing; data quality issues in non-production environment | Response within 72 hours | Technical Lead |
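If the severity matrix is to drive alerting and on-call tooling rather than live only in this document, technical teams can encode it directly. The sketch below simply mirrors the table above; the structure, field names, and escalation tuples are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Severity:
    """One row of the severity matrix, in a form incident tooling can consume."""
    level: str                        # "SEV-1" .. "SEV-4"
    label: str                        # "CRITICAL", "HIGH", "MEDIUM", "LOW"
    response_sla_hours: int           # maximum time to initial response
    escalation_path: tuple[str, ...]  # roles to notify, in order

# Illustrative encoding of the matrix above; adapt levels, SLAs and
# escalation paths to your own governance structure.
SEVERITY_MATRIX = {
    "SEV-1": Severity("SEV-1", "CRITICAL", 1,
                      ("CAIO/CTO", "CEO", "Legal", "Board Liaison")),
    "SEV-2": Severity("SEV-2", "HIGH", 4, ("CAIO/CTO", "CEO", "Legal")),
    "SEV-3": Severity("SEV-3", "MEDIUM", 24, ("Technical Lead", "CAIO/CTO")),
    "SEV-4": Severity("SEV-4", "LOW", 72, ("Technical Lead",)),
}

def escalation_for(level: str) -> tuple[str, ...]:
    """Return the roles to notify for a given severity level."""
    return SEVERITY_MATRIX[level].escalation_path
```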
Define and assign these roles before any incident occurs. Each role should have a named primary and a backup. These are decision-making roles, not just communication roles.
| Role | Responsibilities | Typically Held By |
|---|---|---|
| Incident Commander | Overall accountable lead. Makes the final call on containment, communication, and restoration. Single point of escalation for the response team. | CAIO, CTO, or senior executive designated in advance |
| Technical Lead | Owns the technical investigation. Responsible for root cause analysis, containment actions, model rollback, and evidence preservation. | Head of AI/ML, Principal Engineer, or Lead Data Scientist |
| Legal Counsel | Assesses regulatory notification obligations (EU AI Act, GDPR, sector-specific). Reviews external communications. Manages liability considerations. | General Counsel, external AI law firm on retainer |
| Communications Lead | Owns all internal and external messaging. Controls the communication schedule. Prepares templates and approves final language before distribution. | Head of Communications, CMO, or external PR lead |
| Business Impact Lead | Quantifies the customer and operational effect of the incident. Manages customer service escalations. Tracks affected accounts or transactions. | COO, Customer Success Director, or Operations Head |
| Board Liaison | Single point of contact between the incident response team and the board. Responsible for the board incident report and managing board-level questions. | CAIO, CFO, or Company Secretary |
Before your organisation is ever in an AI incident, the following should already be in place. This is not an exhaustive AI governance checklist — it is the minimum viable foundation for effective incident response.
The first 24 hours of an AI incident are the most consequential. Decisions made in this window determine whether the situation is contained or whether it compounds. Speed matters — but not more than the quality of decisions. Acting too fast without understanding the blast radius can make things worse.
Not every anomaly is an incident. Before triggering the full response, confirm:
Use the Incident Declaration Template (Appendix A). The act of formal declaration:
The first and most important question: Should we switch this AI system off?
If active harm is currently occurring → Shut down the AI system immediately. The cost of a false positive (unnecessary downtime) is almost always less than the cost of continued harm; the one exception is covered below.
If harm is retrospective (already happened) → Preserve all evidence before any remediation. Do not restart the system until root cause is understood.
If the situation is ambiguous → Shut down and investigate. Continuing to run a potentially faulty system creates liability; prudent precaution does not.
If shutting down would cause greater harm → For example, a medical or safety-critical AI where shutdown creates patient risk. In this case, escalate immediately and engage Legal Counsel before any action.
This step is non-negotiable and must precede any technical fix. Legal and regulatory proceedings will depend on evidence integrity.
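What "preserve all evidence" means in practice varies by stack, but the pattern does not: copy the relevant artefacts (decision logs, model versions, prompts, configuration) to separate storage and record a timestamped, hashed manifest before any remediation work begins. A minimal sketch follows; the paths, file names, and manifest fields are illustrative.

```python
import csv
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path

def sha256(path: Path) -> str:
    """Hash a file so later copies can be proven identical to the original."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def preserve_evidence(artefacts: list[Path], vault: Path) -> Path:
    """Copy artefacts to an evidence 'vault' and write a timestamped manifest."""
    vault.mkdir(parents=True, exist_ok=True)
    manifest = vault / "manifest.csv"
    with manifest.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["captured_at_utc", "source_path", "sha256"])
        for src in artefacts:
            shutil.copy2(src, vault / src.name)
            writer.writerow([datetime.now(timezone.utc).isoformat(),
                             str(src), sha256(src)])
    return manifest

# Example (placeholder paths): capture model artefacts and decision logs
# before any remediation work starts.
# preserve_evidence([Path("model_v3.pkl"), Path("decisions_jan.log")],
#                   Path("/evidence/incident-001"))
```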
Work with Legal Counsel to determine notification obligations. Do not wait for this assessment to conclude before taking containment action, but begin it in parallel.
| Regulation | Trigger | Deadline | Action Required |
|---|---|---|---|
| EU AI Act (Art. 73) | Serious incident involving a high-risk AI system, including near-misses | Immediately upon becoming aware; follow-up report within timeline set by authority | Notify relevant market surveillance authority; preserve incident records |
| GDPR / UK GDPR | Personal data involved in breach caused or exacerbated by AI system | 72 hours from awareness to regulator; without undue delay to data subjects if high risk | Notify ICO (UK) / supervisory authority (EU); notify affected individuals if required |
| FCA (Financial Services) | Material operational incident affecting regulated services; AI-related consumer harm | As soon as practicable | Notify FCA; document incident and response; consider consumer redress obligations |
| Sector-specific | Healthcare, critical infrastructure, defence, etc. | Varies by sector and jurisdiction | Review sector-specific AI guidance; engage sector regulator |
By this point, the immediate crisis is stabilised. The focus shifts from containment to understanding. Root cause investigation must be structured — not anecdotal.
Every AI incident investigation must answer these four questions before remediation can be authorised:
| # | Question | Why it matters |
|---|---|---|
| 1 | What failed? (The model, the data, the deployment, the process, or the humans?) | Determines the remediation approach — a model fix is different from a process fix |
| 2 | When did it start? (Point of failure vs. point of discovery) | Defines the blast radius — how many decisions were affected before detection |
| 3 | What was the blast radius? (How many decisions, transactions, or people were affected?) | Drives notification obligations, customer redress, and reputational risk assessment |
| 4 | Why did our controls not catch this? (What monitoring, testing, or governance mechanism failed?) | The answer drives the post-incident governance improvement — without this, you will face the same incident again |
Create and maintain an incident timeline throughout — every action, every decision, every communication, timestamped. This serves three purposes: legal defensibility, post-incident learning, and board/regulator reporting. Assign a named scribe to maintain the timeline in real time.
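A shared spreadsheet is sufficient, provided every entry is timestamped, attributed, and never edited after the fact. For teams that prefer tooling, a minimal sketch of an append-only timeline log is below; the file location and field names are illustrative.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

TIMELINE = Path("incident_timeline.csv")  # illustrative location
FIELDS = ["timestamp_utc", "recorded_by", "entry_type", "description"]

def record(recorded_by: str, entry_type: str, description: str) -> None:
    """Append one timestamped entry; existing entries are never edited."""
    new_file = not TIMELINE.exists()
    with TIMELINE.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({
            "timestamp_utc": datetime.now(timezone.utc).isoformat(),
            "recorded_by": recorded_by,
            "entry_type": entry_type,  # "action", "decision", "communication"
            "description": description,
        })

# record("J. Smith (scribe)", "decision",
#        "Incident Commander approved suspension of the model")
```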
How an organisation communicates during an AI incident often matters as much as what it actually does technically. Brief internal audiences first, in the order below, then manage external communications deliberately.
Brief the executive team first: CEO, COO, CFO, and relevant department heads. Focus on the severity classification, what is known, the immediate actions taken, and the response plan.
Next, brief department heads whose teams may be affected or who will be involved in the response. Give clear instructions on what to do and what not to say.
Brief all staff only if the incident becomes public or if staff will encounter customer questions, and coordinate that briefing with external communications to ensure consistent messaging.
No customer should learn about an incident affecting them from the media. Personalised communication (using Appendix C template) to affected parties takes priority over all other external communications.
File notifications per regulatory requirements (see Part 2). Early, incomplete notification is preferred over late, complete notification.
Do not proactively publish a public statement unless the incident is already public or likely to become so. Unnecessary disclosure creates additional reputational risk without corresponding benefit.
Prepare a media holding statement from the moment the incident is classified SEV-1 or SEV-2. Do not wait for a journalist to call. Use the template in Appendix D.
One of the most common errors in AI incident response is restoring the system before the failure is truly understood. The pressure to restore service is real — but the cost of a second failure, or of restoring a system that continues to cause harm, is far higher than continued downtime.
The post-incident phase is where most organisations fail. The immediate crisis passes, pressure eases, and the hard work of systemic improvement gets deprioritised. Organisations that treat post-incident reviews as genuine learning exercises — rather than box-ticking — are the ones that do not face the same incident twice.
Conduct within 5–10 business days of incident resolution, while details are fresh. The full review template is in Appendix E.
Use the Five Whys technique: for each proposed cause, ask "Why did this happen?" until you reach the systemic root, not just the proximate trigger.
| Root cause category | Diagnostic questions |
|---|---|
| Model failure | Did the model perform outside its design envelope? Was there evidence of drift? Was the model trained on representative data? |
| Data quality | Was the training data representative, current, and unbiased? Were there data pipeline failures? Was there a distribution shift between training and production data? |
| Deployment error | Was there a misconfiguration, version mismatch, or integration failure at deployment? Were deployment checks adequate? |
| Process failure | Did humans rely on AI outputs without appropriate verification? Were escalation procedures followed? Were governance controls bypassed? |
| Adversarial attack | Was the AI system deliberately manipulated? Are there indicators of prompt injection, data poisoning, or model inversion? |
| Governance gap | Did adequate pre-deployment testing, monitoring, or human oversight exist? Was accountability clearly assigned? |
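Several of the diagnostic questions above (model drift, distribution shift between training and production data) have standard quantitative checks behind them. One common choice is the population stability index, which compares a baseline distribution of scores or features with the production distribution. A sketch using only NumPy is below, with the conventional rule-of-thumb thresholds noted in the comments; it assumes a continuous score.

```python
import numpy as np

def population_stability_index(expected: np.ndarray,
                               actual: np.ndarray,
                               bins: int = 10) -> float:
    """PSI between a baseline sample (e.g. training scores) and production scores.

    Conventional rule of thumb: < 0.1 stable, 0.1-0.2 moderate shift,
    > 0.2 material shift worth investigating.
    """
    # Bin edges come from the baseline distribution (assumes a continuous score).
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Avoid division by zero and log(0) in sparse bins.
    expected_pct = np.clip(expected_pct, 1e-6, None)
    actual_pct = np.clip(actual_pct, 1e-6, None)
    return float(np.sum((actual_pct - expected_pct)
                        * np.log(actual_pct / expected_pct)))

# Example: compare scores captured at training time with last week's
# production scores.
# psi = population_stability_index(training_scores, production_scores)
```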
The board needs to know about significant AI incidents — but they do not need to understand them technically. A board report on an AI incident should inform, not overwhelm.
| Criterion | Good | Common error |
|---|---|---|
| Length | 1–2 pages maximum | 10-page technical briefing with model performance charts |
| Language | Plain English; no AI jargon | Technical terminology that directors cannot assess |
| Focus | Impact, response, and prevention | Detailed technical root cause analysis |
| Accountability | Clear owner for each remediation action | Diffuse accountability — "the team is working on it" |
| Tone | Factual; appropriate seriousness | Defensive; minimising the incident or over-reassuring |
The full editable Board Incident Report template is in Appendix F.
The following three case studies apply the framework in this playbook to AI incidents. Cases 1 and 2 draw on publicly reported incidents analysed by the author; Case 3 is a composite drawn from common deployment patterns. In each case, the focus is on what the response should have looked like — not just what went wrong.
A threat actor used AI-generated synthetic media — a real-time deepfake video call — to impersonate senior officials in a live video conference. Finance staff, believing they were receiving authorised instructions from legitimate leadership, transferred $77 million to fraudulent accounts. The deception was not discovered until after the transfers had been executed.
The proximate cause was the deepfake technology. The systemic root cause was a governance gap: there was no out-of-band verification protocol for large financial transfers. The AI-mediated communication channel (video conferencing) was being used as an authentication mechanism — which it was never designed to be. No secondary confirmation was required; no call-back protocol existed; no threshold-based approval workflow applied.
Upon discovery that the transfer was fraudulent: immediate escalation to Incident Commander. Financial systems flagged to prevent further transactions pending investigation. Evidence preserved: all call recordings, transfer authorisation records, communications. Law enforcement contacted immediately — the window for financial recovery is narrow.
Internal investigation: was this a targeted attack, or part of a wider campaign? Are any other transfers at risk? Which authentication controls failed? What was the chain of authorisation? Legal Counsel engaged: does this trigger regulatory notification?
Notify relevant financial regulators. Brief the board with a concise incident report. Prepare a communication plan for any third parties whose data or interests are affected.
Implement out-of-band verification for all financial transfers above a defined threshold. Deploy deepfake detection tooling for high-stakes video communications. Conduct tabletop exercises simulating synthetic media attacks across all departments with financial authority.
A single individual exploited weaknesses in an AI-based identity verification system to file 180 fraudulent unemployment claims, obtaining over $3.4 million in payments. The AI verification system passed each claim individually — because each, in isolation, met the verification criteria. The fraud was detectable as a pattern across claims; it was invisible claim by claim.
The AI system was designed to verify identity. It did this competently. But it was not designed to detect fraud at scale — to identify patterns across multiple claims that shared device fingerprints, network characteristics, or behavioural patterns. The system had no cross-claim pattern detection. It evaluated each application in isolation, in a context where the meaningful risk was a coordinated pattern of fraud.
Pattern anomaly monitoring should have flagged: multiple claims with shared device fingerprint or network origin; velocity anomalies (an unusually high number of successful claims from correlated sources); claim characteristics clustering around identical profiles. Absence of cross-claim monitoring was the detection failure that allowed 180 fraudulent claims to succeed.
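The missing control was not exotic: it was the absence of any check that aggregates across claims. A minimal sketch of a cross-claim velocity check, grouping approved claims by device fingerprint and flagging any source with an unusual number of approvals inside a short window, is below. The field names, window, and threshold are illustrative; in practice the records would come from the verification system's decision log.

```python
from collections import defaultdict
from datetime import timedelta

def flag_correlated_claims(claims: list[dict],
                           window: timedelta = timedelta(days=7),
                           threshold: int = 3) -> dict[str, list[str]]:
    """Flag device fingerprints tied to an unusual number of approvals in a window.

    Each claim passed verification individually; this check looks across claims.
    Expected record fields: "claim_id", "approved_at" (datetime),
    "device_fingerprint".
    """
    by_device = defaultdict(list)
    for claim in claims:
        by_device[claim["device_fingerprint"]].append(claim)

    flagged = {}
    for device, group in by_device.items():
        group.sort(key=lambda c: c["approved_at"])
        for i, claim in enumerate(group):
            # Count approvals from the same device inside the rolling window.
            in_window = [c for c in group[i:]
                         if c["approved_at"] - claim["approved_at"] <= window]
            if len(in_window) >= threshold:
                flagged[device] = [c["claim_id"] for c in group]
                break
    return flagged
```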
A retail financial institution deploys an AI-based credit approval system. Six months post-deployment, an internal analyst notices that approval rates differ significantly across demographic groups — in a way that cannot be explained by creditworthiness factors alone. The finding is escalated internally. The legal and regulatory implications are immediately significant: this may constitute unlawful discrimination under consumer credit regulation.
The system was technically functioning correctly. It was producing outputs consistent with its training data. The training data reflected historical lending patterns — and historical lending patterns reflected decades of structural bias in credit markets. The AI had learned and replicated historical bias. This is a governance failure that cannot be resolved by a software fix: it requires a fundamental reconsideration of what the training data represents.
There is no active harm requiring immediate shutdown (approvals are not causing safety risk). But the situation is SEV-2: significant business and regulatory impact, with potential for legal liability and regulatory action. Legal Counsel is engaged immediately. Does this constitute unlawful discrimination? Does it require FCA notification?
Technical team confirms the statistical disparity. The decision is made to suspend the AI model and revert to manual review for new applications. This is the right call: the cost of continued discriminatory approvals exceeds the cost of manual review overhead.
All approved and declined decisions from the AI system are reviewed for affected patterns. Proactive engagement with the FCA — not waiting to be discovered. Proactive disclosure is treated as a significant mitigating factor in regulatory proceedings.
The model is retrained with fairness constraints. Bias metrics are added to the model monitoring dashboard alongside accuracy metrics. A quarterly fairness audit is established as a permanent governance control.
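"Bias metrics on the dashboard" can start very simply. One widely used measure is the approval-rate disparity between demographic groups; the sketch below computes per-group approval rates and flags the model when the lowest rate falls below 80% of the highest (the four-fifths rule of thumb). The field names and threshold are illustrative, and any threshold used in production should be agreed with Legal Counsel.

```python
from collections import defaultdict

def approval_rate_disparity(decisions: list[dict]) -> dict:
    """Approval rate per group and the ratio of the lowest rate to the highest.

    Each decision record needs a "group" label and an "approved" boolean.
    """
    counts = defaultdict(lambda: {"approved": 0, "total": 0})
    for d in decisions:
        counts[d["group"]]["total"] += 1
        counts[d["group"]]["approved"] += int(d["approved"])

    rates = {g: c["approved"] / c["total"] for g, c in counts.items()}
    ratio = min(rates.values()) / max(rates.values())
    return {"rates": rates, "min_max_ratio": ratio, "flag": ratio < 0.8}

# Example: run monthly over production decisions, alert when "flag" is True,
# then investigate whether creditworthiness factors explain the gap.
```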
The following six templates can be adapted for your organisation immediately; adjust the language to match your tone and governance structure. Pre-approve templates B, C, and D with Legal Counsel before any incident occurs.
We are writing to inform you of an incident affecting [AI system name], which [brief, plain-language description of what the system does — one sentence].
Our next update to this group will be at [time/date]. Questions should be directed to [named contact] only — please do not discuss this incident externally or with customers until further notice.
[Name] | Incident Commander
Dear [Customer name / Valued customer],
We are writing to let you know about an issue that affected [plain-language description of the AI-powered service — avoid the word "AI" unless legally required or already public].
If you have any questions or concerns, please contact [dedicated contact / channel — not a generic support address for a serious incident].
We take [privacy / safety / accuracy] seriously and we are sorry for any distress or inconvenience this has caused.
[Name] | [Title]
[Date]
STATEMENT FROM [ORGANISATION NAME]
[Date]
[Organisation name] is aware of [brief, factual description of the incident — one sentence. Do not speculate. Do not use language that implies certainty about cause unless it has been confirmed].
[We have taken the following immediate action: (specific action taken — system suspension, investigation launched, regulators notified).]
[Affected parties — if applicable: "We have notified / are notifying affected customers directly."]
We are committed to [the relevant value — the safety of our customers / the integrity of our systems / transparent operation] and will provide further updates as our investigation develops.
Note: Prepare this statement from the moment of SEV-1 or SEV-2 declaration. Do not wait for a journalist to call. The holding statement is used when a journalist contacts you before you have a full statement ready — "We are aware of the situation and are investigating. We will have a full statement within [timeframe]." Never say "no comment".
List all events chronologically from first detection to resolution, with timestamps.
| Gap Identified | Owner | Remediation Action | Target Date |
|---|---|---|---|
| Action | Owner | Priority | Target Date | Status |
|---|---|---|---|---|
3–4 sentences. What happened, when, what the impact was. Plain English — no technical language.
What was done in the first 24 hours. Focus on decisions, not technical detail.
One or two plain-English sentences. The board does not need the technical root cause — they need to understand whether this was a technology failure, a process failure, a governance gap, or an external attack.
| Category | Detail |
|---|---|
| Customers / users affected | |
| Financial impact (direct) | |
| Financial impact (remediation estimate) | |
| Regulatory notifications filed | |
| Regulatory investigation status | |
| Reputational / media exposure | |
What is being done to ensure this does not recur. Specific actions with named owners, not general commitments.
| Item | Owner | Due Date | Board Action Required? |
|---|---|---|---|
The board is asked to note this report and [any specific board action required — e.g., approve remediation budget / note regulatory status / confirm escalation threshold has been met].