Incident Response Basics
A security incident is not just a bad day — it is the moment that reveals whether the rest of your security investments were real or performative. Good detection surfaces the incident in minutes, not months. Good containment limits blast radius. Good communication preserves user trust. A good postmortem ensures you do not repeat the same failure.
Most engineering teams have an on-call runbook for reliability incidents. Far fewer have a runbook for security incidents — and the two are different in important ways. A reliability incident is usually a systems failure. A security incident may involve an active adversary who is watching your response and adapting.
This post covers the core phases of incident response as they apply to application engineering teams.
The Phases
The standard loop is Detect → Contain → Eradicate → Recover → Learn. Most teams collapse Eradicate and Recover into a single phase in practice. The critical distinction is between Contain and Eradicate: containment stops ongoing harm without necessarily removing the root cause; eradication removes that root cause so the same access cannot simply be regained.
Detection: What Should Alert
The hardest part of detection is managing the signal-to-noise ratio. Alert on too much and everything becomes noise; alert on too little and incidents go undetected.
High-value detection signals for application security:
| Signal | Implementation | Why |
|---|---|---|
| Authentication failures spike | Rate-based alert on auth failure logs | Credential stuffing, brute force |
| Successful login from new country | Geo-diff on user login history | Account takeover |
| Privilege escalation | Alert on role assignment events | Insider threat, compromised account |
| Bulk data export | Alert when single user exports >N records | Data exfiltration |
| API key usage from new IP | Alert on first-use from unknown IP | Leaked API key |
| Admin actions outside business hours | Alert on elevated actions 2am–6am | Compromised admin account |
```python
# Example: structlog-based detection hook for bulk export
import structlog

log = structlog.get_logger()

EXPORT_THRESHOLD = 1000  # records per request

def export_records(user_id: str, tenant_id: str, filters: dict, db) -> list:
    records = query_records(filters, db)
    if len(records) > EXPORT_THRESHOLD:
        log.warning(
            "security.bulk_export",
            user_id=user_id,
            tenant_id=tenant_id,
            record_count=len(records),
            filters=filters,
            # This event is routed to the security alert channel
            # via log aggregator rules
        )
    return records
```

Your log aggregator (Datadog, Splunk, Grafana) routes `security.*` events to a dedicated alert channel with lower-latency SLAs than operational alerts.
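The same hook pattern covers the other signals in the table. Below is a minimal sketch of the "API key usage from new IP" check, assuming per-key sets of previously seen source IPs in Redis; the `record_api_key_use` function, the key naming scheme, and the `redis_client` handle are illustrative, not an existing API.

```python
# Sketch: flag the first use of an API key from a previously unseen source IP
import structlog
import redis

log = structlog.get_logger()
redis_client = redis.Redis()  # assumed shared client; connection config omitted

def record_api_key_use(api_key_id: str, source_ip: str) -> None:
    seen_key = f"apikey:seen_ips:{api_key_id}"  # illustrative key scheme
    # SADD returns 1 when the IP was not already in the per-key set
    if redis_client.sadd(seen_key, source_ip) == 1:
        log.warning(
            "security.api_key_new_ip",
            api_key_id=api_key_id,
            source_ip=source_ip,
            # Routed to the security alert channel like security.bulk_export
        )
```

Because SADD reports whether the member was new, the check and the bookkeeping happen in a single round trip.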
Contain First, Investigate Second
When an active incident is confirmed, containment takes priority over understanding the full scope. You cannot investigate effectively while an attacker still has access.
Containment actions, roughly in order:
1. Rotate compromised credentials (API keys, session tokens, DB passwords)
2. Revoke user sessions (all sessions for affected accounts)
3. Block attacker's IP/ASN at the WAF or load balancer
4. Disable affected features if they are the attack vector
5. Preserve evidence before making changes (copy logs to cold storage)
6. Notify affected users (after containment, not before)

```python
# Emergency: revoke all sessions for a user across all devices
def emergency_revoke_all_sessions(user_id: str, reason: str):
    # Redis: delete all session keys for this user
    session_keys = redis.keys(f"session:user:{user_id}:*")
    if session_keys:
        redis.delete(*session_keys)

    # Refresh tokens: mark entire user's token family as revoked
    db.query(RefreshToken).filter(
        RefreshToken.user_id == user_id,
        RefreshToken.revoked == False
    ).update({"revoked": True, "revoke_reason": reason})
    db.commit()

    # Audit log
    write_audit_log(
        actor_id="system",
        action="emergency_session_revocation",
        resource_type="user",
        resource_id=user_id,
        outcome="success",
        metadata={"reason": reason},
    )
```
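Rotating a compromised API key (containment step 1) can follow the same shape. A minimal sketch, assuming a SQLAlchemy-style `ApiKey` model and a `hash_api_key` helper; both names are illustrative, not part of any library:

```python
import secrets
from datetime import datetime, timezone

def emergency_rotate_api_key(key_id: str, reason: str) -> str:
    # Hypothetical ApiKey model; adapt column names to your schema
    old_key = db.query(ApiKey).filter(ApiKey.id == key_id).one()
    old_key.revoked = True
    old_key.revoked_at = datetime.now(timezone.utc)
    old_key.revoke_reason = reason

    # Issue the replacement with a high-entropy secret; store only its hash
    new_secret = secrets.token_urlsafe(32)
    db.add(ApiKey(owner_id=old_key.owner_id, key_hash=hash_api_key(new_secret)))
    db.commit()

    write_audit_log(
        actor_id="system",
        action="emergency_api_key_rotation",
        resource_type="api_key",
        resource_id=key_id,
        outcome="success",
        metadata={"reason": reason},
    )
    # Return the plaintext once so it can be delivered to the owner out of band
    return new_secret
```

Revocation and reissue happen in one transaction, so the rotation is atomic and the plaintext secret is never stored.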
Evidence Preservation
Before you patch, rotate, or restore, preserve the state that the attacker left. Forensic evidence is used to understand scope, satisfy legal or regulatory obligations, and support law enforcement if needed.
```bash
# Snapshot application logs to cold storage before rotation
aws s3 sync s3://prod-logs/application/ \
  s3://incident-evidence-2026-04-05/application-snapshot/ \
  --storage-class GLACIER

# Snapshot database query logs
aws rds download-db-log-file-portion \
  --db-instance-identifier prod-db \
  --log-file-name general/mysql-general.log \
  --output text > /evidence/db-general-2026-04-05.log

# Create a forensic snapshot of the affected EC2 instance
aws ec2 create-snapshot \
  --volume-id vol-0abc123 \
  --description "Forensic snapshot - incident 2026-04-05"
```

Tag all evidence artifacts with the incident ID. Write-protect them immediately. Do not modify the original evidence; work from copies.
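If the evidence bucket has S3 Object Lock enabled, the tagging and write-protection steps can be scripted rather than done by hand. A sketch using boto3; the bucket name, incident ID, and one-year retention window are placeholders, not recommendations:

```python
from datetime import datetime, timedelta, timezone
import boto3

s3 = boto3.client("s3")

EVIDENCE_BUCKET = "incident-evidence-2026-04-05"  # placeholder bucket name
INCIDENT_ID = "INC-2026-04-05-api-key"            # placeholder incident ID

def protect_evidence_object(key: str) -> None:
    # Tag the artifact with the incident ID so it is discoverable later
    s3.put_object_tagging(
        Bucket=EVIDENCE_BUCKET,
        Key=key,
        Tagging={"TagSet": [{"Key": "incident-id", "Value": INCIDENT_ID}]},
    )
    # COMPLIANCE mode blocks deletion and overwrite until the retention date,
    # even for the root account; requires Object Lock enabled on the bucket
    s3.put_object_retention(
        Bucket=EVIDENCE_BUCKET,
        Key=key,
        Retention={
            "Mode": "COMPLIANCE",
            "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=365),
        },
    )
```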
Communication: The Stakeholder Matrix
Security incidents require communication to multiple audiences with different needs, different levels of technical context, and different timing requirements.
| Audience | What they need | When |
|---|---|---|
| Internal leadership | Business impact, scope, ETA to resolution | Within 1 hour of declaration |
| Legal / compliance | Regulatory obligations (GDPR 72h breach notification) | Immediately if PII involved |
| Affected users | What happened, what data was affected, what they should do | After containment, within 24–72h |
| All users | If service was unavailable, brief acknowledgment | As soon as service restored |
| Regulators | Formal breach notification (jurisdiction-dependent) | Per regulation (GDPR: 72h to DPA) |
Internal communication during an active incident should happen in a dedicated, documented channel (for example, a private Slack channel named `#inc-YYYY-MM-DD-shortname`). This creates an automatic timeline artifact.
Incident channel template: the first message sets context.

```markdown
## Incident: Unauthorized API key usage — 2026-04-05
**Severity:** P1 — potential data exfiltration
**IC (Incident Commander):** @alice
**Status:** Investigating
**Started:** 2026-04-05 03:42 UTC
**Affected:** Production API, tenant IDs TBD
**Timeline:**
- 03:42 — Alert fired: API key used from 5 previously unseen IPs
- 03:45 — IC declared, containment started
- 04:01 — Compromised API key rotated
- [updates continue...]
```
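Automated containment actions can post to the same channel, so the timeline does not depend on someone remembering to type updates. A minimal sketch, assuming a Slack incoming webhook whose URL lives in an environment variable (the variable name is a placeholder):

```python
import os
from datetime import datetime, timezone
import requests

SLACK_WEBHOOK_URL = os.environ["INCIDENT_WEBHOOK_URL"]  # placeholder

def post_timeline_update(message: str) -> None:
    # Incoming webhooks accept a simple JSON payload with a "text" field
    timestamp = datetime.now(timezone.utc).strftime("%H:%M")
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"{timestamp} — {message}"},
        timeout=5,
    ).raise_for_status()
```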
The Blameless Postmortem
A blameless postmortem's purpose is systemic improvement, not individual accountability. "Alice forgot to validate the token" is not a finding — it is a symptom. The finding is "our token validation has no automated test coverage and no code review checklist item."
Structure:
```markdown
## Postmortem: [Incident Name] — [Date]
### Impact
- Duration: X hours
- Users affected: N
- Data affected: [description or "none confirmed"]
### Timeline (UTC)
- 03:42 — Alert fired
- 03:45 — Incident declared
- ...
### Root Cause
One sentence. Focus on the systemic gap, not the human error.
### Contributing Factors
- [What made this possible]
- [What slowed detection]
- [What limited containment options]
### What Went Well
- [Genuine wins only; no filler]
### Corrective Actions
| Action | Owner | Due Date | Priority |
|--------|-------|----------|----------|
| Add automated test for token validation | @bob | 2026-04-12 | P1 |
| Add bulk-export alert to detection suite | @carol | 2026-04-19 | P2 |
| Update incident runbook with API key rotation steps | @alice | 2026-04-15 | P2 |
```

Track corrective actions in your engineering backlog with the same rigor as feature work. A postmortem without follow-through is documentation theater.
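One way to keep that follow-through honest is to file each corrective action as a backlog issue the moment the postmortem is accepted. A sketch against the GitHub REST API; the repository name, label, and token environment variable are placeholders:

```python
import os
import requests

GITHUB_REPO = "example-org/app-backend"  # placeholder repository
TOKEN = os.environ["GITHUB_TOKEN"]

def file_corrective_action(title: str, owner: str, due: str, priority: str) -> str:
    # Creates one issue per corrective action so it lands in the normal backlog
    resp = requests.post(
        f"https://api.github.com/repos/{GITHUB_REPO}/issues",
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Accept": "application/vnd.github+json",
        },
        json={
            "title": f"[postmortem] {title}",
            "body": f"Owner: @{owner}\nDue: {due}\nPriority: {priority}",
            "labels": ["postmortem-action"],  # placeholder label
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["html_url"]
```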
Runbook: The 2am Decision Tree
A security runbook removes the cognitive load of making decisions at 2am under pressure. It answers: "Given this alert, what do I do first?"
```markdown
## Runbook: Credential Compromise
### Trigger
- Alert: "API key used from new IP" fires
- Alert: "Authentication spike" fires with >500% above baseline
### Step 1: Assess (< 5 min)
- Is this a test environment? → Monitor and escalate if pattern continues.
- Is this production? → Declare incident, proceed.
### Step 2: Contain (< 15 min)
- [ ] Rotate the compromised credential: `make rotate-api-key KEY_ID=<id>`
- [ ] Check audit logs for actions taken with the key: `make query-audit KEY_ID=<id>`
- [ ] Block originating IPs at WAF: `make block-ip IP=<ip>`
### Step 3: Scope
- [ ] Pull list of resources accessed by the key in the last 24h
- [ ] Identify affected tenants
- [ ] Notify legal if PII was in scope
### Step 4: Escalate
- Slack @security-oncall
- Page @engineering-lead if severity is P1
```
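Step 3's scoping query is worth writing before the incident so nobody composes SQL at 2am. A sketch, assuming the audit events shown earlier land in an `AuditLog` table; the model and its columns are illustrative:

```python
from datetime import datetime, timedelta, timezone

def resources_accessed_by_key(api_key_id: str, db, hours: int = 24) -> list:
    # Hypothetical AuditLog model; assumes actor_id records the API key that
    # made each request and that tenant_id is captured per row
    since = datetime.now(timezone.utc) - timedelta(hours=hours)
    return (
        db.query(AuditLog.resource_type, AuditLog.resource_id, AuditLog.tenant_id)
        .filter(AuditLog.actor_id == api_key_id, AuditLog.created_at >= since)
        .distinct()
        .all()
    )
```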
Key Takeaways
- Detection quality determines whether you find out about an incident in minutes or months — instrument high-value signals like auth spikes, bulk exports, and privilege escalations explicitly.
- Contain before you investigate: stop ongoing harm first, then understand scope; an active attacker can observe your investigation and respond.
- Preserve forensic evidence before making changes — log snapshots and disk images taken after the fact may be too late or incomplete.
- Communication needs vary by audience: affected users need plain-language impact statements, legal needs PII scope within hours, regulators have statutory deadlines.
- Blameless postmortems focus on systemic gaps, not human errors — "the developer forgot" is never the root cause finding, only a symptom of a missing control.
- A runbook with a decision tree removes 2am cognitive load; every P1 alert should map to a runbook that tells the on-call engineer exactly what to do first.