Incident Response Basics
A security incident is not just a bad day — it is the moment that reveals whether the rest of your security investments were real or performative. Good detection surfaces the incident in minutes, not months. Good containment limits blast radius. Good communication preserves user trust. A good postmortem ensures you do not repeat the same failure.
Most engineering teams have an on-call runbook for reliability incidents. Far fewer have a runbook for security incidents — and the two are different in important ways. A reliability incident is usually a systems failure. A security incident may involve an active adversary who is watching your response and adapting.
This post covers the core phases of incident response as they apply to application engineering teams.
The Phases
The standard loop is Detect → Contain → Eradicate → Recover → Learn. Most teams collapse Eradicate and Recover into a single phase in practice. The critical distinction is between Contain and Eradicate: containment stops ongoing harm without necessarily removing the root cause; eradication removes that root cause so the same access cannot simply be regained.
Detection: What Should Alert
The hardest part of detection is managing the signal-to-noise ratio. Alert on too much and everything becomes noise; alert on too little and incidents go undetected.
High-value detection signals for application security:
| Signal | Implementation | Why |
|---|---|---|
| Authentication failures spike | Rate-based alert on auth failure logs | Credential stuffing, brute force |
| Successful login from new country | Geo-diff on user login history | Account takeover |
| Privilege escalation | Alert on role assignment events | Insider threat, compromised account |
| Bulk data export | Alert when single user exports >N records | Data exfiltration |
| API key usage from new IP | Alert on first-use from unknown IP | Leaked API key |
| Admin actions outside business hours | Alert on elevated actions 2am–6am | Compromised admin account |
```python
# Example: structlog-based detection hook for bulk export
import structlog

log = structlog.get_logger()

EXPORT_THRESHOLD = 1000  # records per request

def export_records(user_id: str, tenant_id: str, filters: dict, db) -> list:
    records = query_records(filters, db)
    if len(records) > EXPORT_THRESHOLD:
        log.warning(
            "security.bulk_export",
            user_id=user_id,
            tenant_id=tenant_id,
            record_count=len(records),
            filters=filters,
            # This event is routed to the security alert channel
            # via log aggregator rules
        )
    return records
```

Your log aggregator (Datadog, Splunk, Grafana) routes `security.*` events to a dedicated alert channel with lower-latency SLAs than operational alerts.
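The same hook pattern covers the other signals in the table. Below is a minimal sketch of the "API key usage from new IP" check, assuming per-key sets of previously seen source IPs in Redis; the `record_api_key_use` function, the key naming scheme, and the `redis_client` handle are illustrative, not an existing API.

```python
# Sketch: flag the first use of an API key from a previously unseen source IP
import structlog
import redis

log = structlog.get_logger()
redis_client = redis.Redis()  # assumed shared client; connection config omitted

def record_api_key_use(api_key_id: str, source_ip: str) -> None:
    seen_key = f"apikey:seen_ips:{api_key_id}"  # illustrative key scheme
    # SADD returns 1 when the IP was not already in the per-key set
    if redis_client.sadd(seen_key, source_ip) == 1:
        log.warning(
            "security.api_key_new_ip",
            api_key_id=api_key_id,
            source_ip=source_ip,
            # Routed to the security alert channel like security.bulk_export
        )
```

Because SADD reports whether the member was new, the check and the bookkeeping happen in a single round trip.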
Contain First, Investigate Second
When an active incident is confirmed, containment takes priority over understanding the full scope. You cannot investigate effectively while an attacker still has access.
Containment actions, roughly in order:
1. Rotate compromised credentials (API keys, session tokens, DB passwords)
2. Revoke user sessions (all sessions for affected accounts)
3. Block attacker's IP/ASN at the WAF or load balancer
4. Disable affected features if they are the attack vector
5. Preserve evidence before making changes (copy logs to cold storage)
6. Notify affected users (after containment, not before)

```python
# Emergency: revoke all sessions for a user across all devices
def emergency_revoke_all_sessions(user_id: str, reason: str):
    # Redis: delete all session keys for this user
    session_keys = redis.keys(f"session:user:{user_id}:*")
    if session_keys:
        redis.delete(*session_keys)

    # Refresh tokens: mark entire user's token family as revoked
    db.query(RefreshToken).filter(
        RefreshToken.user_id == user_id,
        RefreshToken.revoked == False
    ).update({"revoked": True, "revoke_reason": reason})
    db.commit()

    # Audit log
    write_audit_log(
        actor_id="system",
        action="emergency_session_revocation",
        resource_type="user",
        resource_id=user_id,
        outcome="success",
        metadata={"reason": reason},
    )
```
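Rotating a compromised API key (containment step 1) can follow the same shape. A minimal sketch, assuming a SQLAlchemy-style `ApiKey` model and a `hash_api_key` helper; both names are illustrative, not part of any library:

```python
import secrets
from datetime import datetime, timezone

def emergency_rotate_api_key(key_id: str, reason: str) -> str:
    # Hypothetical ApiKey model; adapt column names to your schema
    old_key = db.query(ApiKey).filter(ApiKey.id == key_id).one()
    old_key.revoked = True
    old_key.revoked_at = datetime.now(timezone.utc)
    old_key.revoke_reason = reason

    # Issue the replacement with a high-entropy secret; store only its hash
    new_secret = secrets.token_urlsafe(32)
    db.add(ApiKey(owner_id=old_key.owner_id, key_hash=hash_api_key(new_secret)))
    db.commit()

    write_audit_log(
        actor_id="system",
        action="emergency_api_key_rotation",
        resource_type="api_key",
        resource_id=key_id,
        outcome="success",
        metadata={"reason": reason},
    )
    # Return the plaintext once so it can be delivered to the owner out of band
    return new_secret
```

Revocation and reissue happen in one transaction, so the rotation is atomic and the plaintext secret is never stored.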
Evidence Preservation
Before you patch, rotate, or restore, preserve the state that the attacker left. Forensic evidence is used to understand scope, satisfy legal or regulatory obligations, and support law enforcement if needed.
```bash
# Snapshot application logs to cold storage before rotation
aws s3 sync s3://prod-logs/application/ \
  s3://incident-evidence-2026-04-05/application-snapshot/ \
  --storage-class GLACIER

# Snapshot database query logs
aws rds download-db-log-file-portion \
  --db-instance-identifier prod-db \
  --log-file-name general/mysql-general.log \
  --output text > /evidence/db-general-2026-04-05.log

# Create a forensic snapshot of the affected EC2 instance
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "Forensic snapshot - incident 2026-04-05"
```

Tag all evidence artifacts with the incident ID. Write-protect them immediately. Do not modify the original evidence; work from copies.
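If the evidence bucket has S3 Object Lock enabled, the tagging and write-protection steps can be scripted rather than done by hand. A sketch using boto3; the bucket name, incident ID, and one-year retention window are placeholders, not recommendations:

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

EVIDENCE_BUCKET = "incident-evidence-2026-04-05"  # placeholder bucket name
INCIDENT_ID = "INC-2026-04-05-api-key"            # placeholder incident ID

def protect_evidence_object(key: str) -> None:
    # Tag the artifact with the incident ID so it is discoverable later
    s3.put_object_tagging(
        Bucket=EVIDENCE_BUCKET,
        Key=key,
        Tagging={"TagSet": [{"Key": "incident-id", "Value": INCIDENT_ID}]},
    )
    # COMPLIANCE mode blocks deletion and overwrite until the retention date,
    # even for the root account; requires Object Lock enabled on the bucket
    s3.put_object_retention(
        Bucket=EVIDENCE_BUCKET,
        Key=key,
        Retention={
            "Mode": "COMPLIANCE",
            "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=365),
        },
    )
```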
Communication: The Stakeholder Matrix
Security incidents require communication to multiple audiences with different needs, different levels of technical context, and different timing requirements.
| Audience | What they need | When |
|---|---|---|
| Internal leadership | Business impact, scope, ETA to resolution | Within 1 hour of declaration |
| Legal / compliance | Regulatory obligations (GDPR 72h breach notification) | Immediately if PII involved |
| Affected users | What happened, what data was affected, what they should do | After containment, within 24–72h |
| All users | If service was unavailable, brief acknowledgment | As soon as service restored |
| Regulators | Formal breach notification (jurisdiction-dependent) | Per regulation (GDPR: 72h to DPA) |
Internal communication during an active incident should happen in a dedicated, documented channel (for example, a private Slack channel named `#inc-YYYY-MM-DD-shortname`). This creates an automatic timeline artifact.
Incident channel template: the first message sets context.

```markdown
## Incident: Unauthorized API key usage — 2026-04-05
**Severity:** P1 — potential data exfiltration
**IC (Incident Commander):** @alice
**Status:** Investigating
**Started:** 2026-04-05 03:42 UTC
**Affected:** Production API, tenant IDs TBD
**Timeline:**
- 03:42 — Alert fired: API key used from 5 previously unseen IPs
- 03:45 — IC declared, containment started
- 04:01 — Compromised API key rotated
- [updates continue...]
```
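Automated containment actions can post to the same channel, so the timeline does not depend on someone remembering to type updates. A minimal sketch, assuming a Slack incoming webhook whose URL lives in an environment variable (the variable name is a placeholder):

```python
import os
from datetime import datetime, timezone
import requests

SLACK_WEBHOOK_URL = os.environ["INCIDENT_WEBHOOK_URL"]  # placeholder

def post_timeline_update(message: str) -> None:
    # Incoming webhooks accept a simple JSON payload with a "text" field
    timestamp = datetime.now(timezone.utc).strftime("%H:%M")
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"{timestamp} — {message}"},
        timeout=5,
    ).raise_for_status()
```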
The Blameless Postmortem
A blameless postmortem's purpose is systemic improvement, not individual accountability. "Alice forgot to validate the token" is not a finding — it is a symptom. The finding is "our token validation has no automated test coverage and no code review checklist item."
Structure:
```markdown
## Postmortem: [Incident Name] — [Date]
### Impact
- Duration: X hours
- Users affected: N
- Data affected: [description or "none confirmed"]
### Timeline (UTC)
- 03:42 — Alert fired
- 03:45 — Incident declared
- ...
### Root Cause
One sentence. Focus on the systemic gap, not the human error.
### Contributing Factors
- [What made this possible]
- [What slowed detection]
- [What limited containment options]
### What Went Well
- [Genuine wins only; no filler]
### Corrective Actions
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add automated test for token validation | @bob | 2026-04-12 | P1 |
| Add bulk-export alert to detection suite | @carol | 2026-04-19 | P2 |
| Update incident runbook with API key rotation steps | @alice | 2026-04-15 | P2 |
```

Track corrective actions in your engineering backlog with the same rigor as feature work. A postmortem without follow-through is documentation theater.
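One way to keep that follow-through honest is to file each corrective action as a backlog issue the moment the postmortem is accepted. A sketch against the GitHub REST API; the repository name, label, and token environment variable are placeholders:

```python
import os
import requests

GITHUB_REPO = "example-org/app-backend"  # placeholder repository
TOKEN = os.environ["GITHUB_TOKEN"]

def file_corrective_action(title: str, owner: str, due: str, priority: str) -> str:
    # Creates one issue per corrective action so it lands in the normal backlog
    resp = requests.post(
        f"https://api.github.com/repos/{GITHUB_REPO}/issues",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"[postmortem] {title}",
            "body": f"Owner: @{owner}\nDue: {due}\nPriority: {priority}",
            "labels": ["postmortem-action"],  # placeholder label
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]
```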
Runbook: The 2am Decision Tree
A security runbook removes the cognitive load of making decisions at 2am under pressure. It answers: "Given this alert, what do I do first?"
```markdown
## Runbook: Credential Compromise
### Trigger
- Alert: "API key used from new IP" fires
- Alert: "Authentication spike" fires with >500% above baseline
### Step 1: Assess (< 5 min)
- Is this a test environment? → Monitor and escalate if pattern continues.
- Is this production? → Declare incident, proceed.
### Step 2: Contain (< 15 min)
- [ ] Rotate the compromised credential: `make rotate-api-key KEY_ID=<id>`
- [ ] Check audit logs for actions taken with the key: `make query-audit KEY_ID=<id>`
- [ ] Block originating IPs at WAF: `make block-ip IP=<ip>`
### Step 3: Scope
- [ ] Pull list of resources accessed by the key in the last 24h
- [ ] Identify affected tenants
- [ ] Notify legal if PII was in scope
### Step 4: Escalate
- Slack @security-oncall
- Page @engineering-lead if severity is P1
```
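Step 3's scoping query is worth writing before the incident so nobody composes SQL at 2am. A sketch, assuming the audit events shown earlier land in an `AuditLog` table; the model and its columns are illustrative:

```python
from datetime import datetime, timedelta, timezone

def resources_accessed_by_key(api_key_id: str, db, hours: int = 24) -> list:
    # Hypothetical AuditLog model; assumes actor_id records the API key that
    # made each request and that tenant_id is captured per row
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    return (
        db.query(AuditLog.resource_type, AuditLog.resource_id, AuditLog.tenant_id)
        .filter(AuditLog.actor_id == api_key_id, AuditLog.created_at >= since)
        .distinct()
        .all()
    )
```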
Key Takeaways
- Detection quality determines whether you find out about an incident in minutes or months — instrument high-value signals like auth spikes, bulk exports, and privilege escalations explicitly.
- Contain before you investigate: stop ongoing harm first, then understand scope; an active attacker can observe your investigation and respond.
- Preserve forensic evidence before making changes — log snapshots and disk images taken after the fact may be too late or incomplete.
- Communication needs vary by audience: affected users need plain-language impact statements, legal needs PII scope within hours, regulators have statutory deadlines.
- Blameless postmortems focus on systemic gaps, not human errors — "the developer forgot" is never the root cause finding, only a symptom of a missing control.
- A runbook with a decision tree removes 2am cognitive load; every P1 alert should map to a runbook that tells the on-call engineer exactly what to do first.