Playbooks and Runbooks for Incident Response

Most cloud service providers align their incident response process with the life cycle popularised by NIST.

NIST Incident Response Life Cycle

The life cycle has four main parts:

  1. Preparation
  2. Detection and Analysis
  3. Containment, Eradication and Recovery
  4. Post-incident Activity

AWS Security Incident Response Whitepaper

For AWS, the process looks slightly different if we follow the official Security Incident Response Whitepaper.

For AWS, the main life cycle events are:

  1. Prepare People and Technology
  2. Detect and Analyse
  3. Contain, Remove and Recover
  4. Iterate by automating response using runbooks and playbooks

Fundamentals of responding to security incidents within a customer’s AWS Cloud environment

Num | Stage | Tools and Aids
1. | Preparation - People | Roles and Responsibilities known and informed
2. | Preparation - People | Owners for assets by appropriate tags
3. | Preparation - Technology | Best practices, standards benchmarks as checklists
4. | Preparation - Technology | Secure by default at the time of creation and continuous config audits
5. | Detect and Analyse - Compute | Monitoring for events and logs supported by AWS Lambda serverless
6. | Detect and Analyse - Compute | If required additional processing by using Fargate tasks (Containers)
7. | Detect and Analyse - Compute | Store raw logs and data and analysis in secure S3 buckets
8. | Contain, Remove and Recover - Network Layer | Using security groups and network ACLs contain the EC2
9. | Contain, Remove and Recover - Platform Layer | Remove any backdoor users and revoke STS tokens
10. | Contain, Remove and Recover - Application Layer | Attach compromised disks to another secure host for forensics
11. | Runbooks and Playbooks - Runbook | To ensure that all standard operating procedures are documented
12. | Runbooks and Playbooks - Playbook | A series of steps to be used in case something fails
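
Item 9 in the table above mentions revoking STS tokens. Temporary credentials cannot be deleted once issued, so a common approach is to attach a deny policy to the affected role that blocks every session issued before a cutoff time, using the `aws:TokenIssueTime` condition key. The sketch below only builds that policy document; the function name `build_revoke_sessions_policy` is hypothetical, and attaching the policy to a role (for example with boto3's `put_role_policy`) is assumed to happen elsewhere.

```python
import json
from datetime import datetime, timezone


def build_revoke_sessions_policy(cutoff: datetime) -> dict:
    """Return an IAM policy document that denies all actions for
    sessions whose token was issued before `cutoff`.

    `cutoff` is assumed to be a timezone-aware datetime.
    """
    # IAM date conditions take ISO 8601 timestamps; normalise to UTC.
    issued_before = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Deny",
                "Action": "*",
                "Resource": "*",
                "Condition": {
                    "DateLessThan": {"aws:TokenIssueTime": issued_before}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Revoke everything issued up to this moment.
    policy = build_revoke_sessions_policy(datetime.now(timezone.utc))
    print(json.dumps(policy, indent=2))
```

New sessions obtained after the cutoff are not affected by this policy, so the credential or trust relationship that was abused still has to be fixed separately.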

A few resources that map to the above stages:

Stage - 1 - Who you gonna call

Roles and Responsibilities in an incident response scenario

Stage - 2 - Maintaining Assets using Tags

Owners for assets by appropriate tags
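
As a rough illustration of the tagging stage, the sketch below flags resources that have no usable owner tag. The inventory shape (an `Id` plus a `Tags` list of `Key`/`Value` pairs, similar to what EC2 describe calls return), the tag key `Owner`, and the function name `missing_owner` are assumptions made for this example, not something the linked resource prescribes.

```python
from typing import Iterable, List


def missing_owner(resources: Iterable[dict], tag_key: str = "Owner") -> List[str]:
    """Return the IDs of resources that lack a non-empty owner tag."""
    untagged = []
    for resource in resources:
        # Flatten [{"Key": ..., "Value": ...}] into a plain dict for lookup.
        tags = {t["Key"]: t["Value"] for t in resource.get("Tags", [])}
        if not tags.get(tag_key, "").strip():
            untagged.append(resource["Id"])
    return untagged


if __name__ == "__main__":
    # Hypothetical inventory; in practice this would come from an AWS API call.
    inventory = [
        {"Id": "i-0aaa", "Tags": [{"Key": "Owner", "Value": "payments-team"}]},
        {"Id": "i-0bbb", "Tags": [{"Key": "Env", "Value": "prod"}]},
        {"Id": "i-0ccc"},
    ]
    print(missing_owner(inventory))
```

Running a check like this on a schedule means that, during an incident, every asset already has someone to call.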

Stage - 3 & 4 - Compliance Checks like CIS Benchmark

Blog Post - Continuous benchmark audits

Stage - 5 - CloudWatch Metric and Alarms

Monitoring for events and logs
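
To make the metric-and-alarm idea concrete, here is a minimal sketch that assembles the parameters for a CloudWatch Logs metric filter and a matching alarm on unauthorized API calls. The filter pattern follows the style used in the CIS AWS benchmark for CloudTrail logs; the namespace, metric name, filter name, threshold and the helper `build_alarm_config` are all illustrative assumptions. The code only builds dictionaries and makes no AWS calls.

```python
from typing import Dict, Tuple

NAMESPACE = "Security/IncidentResponse"  # hypothetical namespace
METRIC = "UnauthorizedApiCalls"          # hypothetical metric name


def build_alarm_config(log_group: str, sns_topic_arn: str) -> Tuple[Dict, Dict]:
    """Return (metric_filter, alarm) parameter dicts for CloudWatch."""
    metric_filter = {
        "logGroupName": log_group,
        "filterName": "unauthorized-api-calls",
        # Match CloudTrail events whose errorCode signals a denied call.
        "filterPattern": (
            '{ ($.errorCode = "*UnauthorizedOperation") '
            '|| ($.errorCode = "AccessDenied*") }'
        ),
        "metricTransformations": [
            {"metricName": METRIC, "metricNamespace": NAMESPACE, "metricValue": "1"}
        ],
    }
    alarm = {
        "AlarmName": "unauthorized-api-calls",
        "MetricName": METRIC,
        "Namespace": NAMESPACE,
        "Statistic": "Sum",
        "Period": 300,  # seconds
        "EvaluationPeriods": 1,
        "Threshold": 1,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],  # notify responders via SNS
    }
    return metric_filter, alarm
```

With boto3 available, the two dictionaries could be passed as keyword arguments to the `put_metric_filter` call on a `logs` client and the `put_metric_alarm` call on a `cloudwatch` client respectively.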

Stage - 6 - Using Prowler as Fargate Task to continuously check for CIS benchmark compliance

Additional processing by using Fargate tasks

Runbooks and Playbooks

They seem similar, but there are a few key differences.

Num | Similarity or Difference | Runbook | Playbook
1. | Difference | Document known procedures | Document how to investigate/troubleshoot when a known thing fails
2. | Difference | Ensures that, when required, the SOP is applied consistently | Ensures that, when needed, the response is consistent
3. | Similarity | Well-documented manual procedures should be automated | Well-documented manual troubleshooting steps should be automated
4. | Difference | Useful post-incident to recover and resume normal operations | Useful when investigating what could be causing a failure
5. | Difference | After every successful post-incident recovery, runbooks should be reviewed and updated as per learnings | After every failure, playbooks should be reviewed and updated as per learnings
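
One way to picture the distinction is as two small data structures: a runbook is a fixed, ordered procedure, while a playbook is a set of investigative checks that point to the runbook to apply. The sketch below is only a model of that idea; the class names, the evidence keys and the example steps are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Runbook:
    """A known procedure: the same ordered steps every time."""
    name: str
    steps: List[Callable[[], str]]

    def run(self) -> List[str]:
        # Execute every step in order and collect what each one reports.
        return [step() for step in self.steps]


@dataclass
class Playbook:
    """An investigation guide: checks that lead to a runbook."""
    name: str
    checks: List[Tuple[Callable[[dict], bool], Runbook]]

    def investigate(self, evidence: dict) -> Optional[Runbook]:
        # Return the runbook for the first check the evidence confirms.
        for check, runbook in self.checks:
            if check(evidence):
                return runbook
        return None


# Hypothetical example: contain a suspicious EC2 instance.
isolate_instance = Runbook(
    name="isolate-ec2-instance",
    steps=[
        lambda: "moved instance to isolation security group",
        lambda: "snapshotted attached volumes for forensics",
    ],
)

suspicious_traffic = Playbook(
    name="unusual-outbound-traffic",
    checks=[(lambda ev: ev.get("talks_to_known_bad_ip", False), isolate_instance)],
)

if __name__ == "__main__":
    chosen = suspicious_traffic.investigate({"talks_to_known_bad_ip": True})
    if chosen is not None:
        for line in chosen.run():
            print(line)
```

Keeping both as code rather than wiki pages makes it easier to review and update them after each incident, as items 3 and 5 in the table suggest.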

Real world example

Real world usage and example