In this post, I summarize the book Crafting the InfoSec Playbook by Jeff Bollinger et al. The book provides a methodical approach and best practices for creating a successful incident response plan. The authors work in Cisco's CSIRT, where most of the book's ideas come from. I think this book is a must-read, especially for anybody working in incident response or in a SOC team in general. I extracted the essential points and added commentary and resources.
(Cover picture credit: O'Reilly)
Incident response is an important part of the overall information ecosystem. Many people realize that intrusions are inevitable, and having a rigorous plan for their detection and successful remediation can be seen as insurance. Money invested in security often translates into faster recovery and lower costs of individual incidents.
Incident response has become a lot more complex in recent years. However, it still boils down to a couple of essential parts:
- Post-mortem / Lessons Learned
A successful incident response team should stand on a solid foundation. The foundation in this context means that the team knows what is valuable and therefore needs protection. This idea comes from risk management, but it is a step that newly established teams often forget. These are the core questions every incident response team should be able to answer:
- What are we trying to protect?
- What are the threats?
- How do we detect them?
- How do we respond?
Note that this assessment is a continuous process, and assets are likely to change due to company acquisitions and other events.
The motivation of the adversary is often money, and their methods reflect what is currently most profitable. At the time of this writing (mid-2018), that is often ransomware or cryptocurrency mining. Mainstream methods such as credential stealing are still happening as well. Members of the security team should be well aware of these threats and plan accordingly. Some threat actors are not profit-driven but politically motivated, for example state-sponsored threat actors.
To have successful detection and analysis capabilities, logging of security events is the first thing you should focus on. A well-prepared logging mechanism is essential to the whole process. When an incident happens, you don't want to log in manually to tens or hundreds of machines, grep the logs, and later correlate them by hand. The logging should be centralized: logs from all critical points should be normalized (compressed IPv6 addresses expanded to the regular format, time formats unified) and transferred in real time to your SIEM. Again, there are a couple of questions you should ask yourself:
- How to prepare and store data?
- What is the retention policy?
- What exactly to log?
- Do we need to install additional agents?
- What about other teams' data?
- Do we care about network data only? Host data only?
- Do we need to back up the logs?
Log preparation should consist of normalization and parsing into fields. You don't want to run a full-text search over hundreds of millions of raw log entries. Instead, a key-value format is often preferred. Some log formats make it harder to parse out the fields than others.
Patrik's note: Logstash is an excellent tool for that. If you are using the ELK stack, this is a no-brainer.
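To make the normalization and field parsing more concrete, here is a minimal Python sketch. It is not from the book: the log line, its layout, and the field names (`action`, `src`, `dst`, `dport`) are made up for illustration, and a real pipeline would more likely do this in Logstash or the SIEM's own parsers.

```python
import ipaddress
from datetime import datetime, timezone

# Hypothetical raw log line: a timestamp followed by key=value pairs.
RAW = "2018-06-14T09:15:02+02:00 action=deny src=2001:db8::1 dst=10.0.0.5 dport=443"


def normalize_ip(value: str) -> str:
    """Expand compressed IPv6 addresses; IPv4 addresses come back unchanged."""
    return ipaddress.ip_address(value).exploded


def normalize_time(value: str) -> str:
    """Convert an ISO 8601 timestamp with a UTC offset to UTC."""
    return datetime.fromisoformat(value).astimezone(timezone.utc).isoformat()


def parse_line(line: str) -> dict:
    """Split a 'timestamp key=value ...' line into normalized fields."""
    timestamp, _, rest = line.partition(" ")
    fields = dict(pair.split("=", 1) for pair in rest.split())
    fields["timestamp"] = normalize_time(timestamp)
    for key in ("src", "dst"):
        if key in fields:
            fields[key] = normalize_ip(fields[key])
    return fields


event = parse_line(RAW)
```

Once every source ends up in the same shape, a query can filter on `event["src"]` or `event["timestamp"]` directly instead of running a full-text search over raw lines.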
The book deals with incident response complexity by introducing the notion of a playbook and plays. Plays are "self-contained, fully documented, prescriptive procedures for finding and responding to undesired activity". The finding part can be seen as a report of events generated by a SIEM query. All plays contain the following sections:
- Report identification — Basically a title of the report.
- Objective statement — Describes what the play does and why it exists.
- Result Analysis — Written mainly for junior-level security analysts who need additional details. The details usually include the steps to cross-validate the finding or how to interpret some sections.
- Data Query / Code — Actual query used to find specific events in SIEM.
- Analyst Comments / Notes — Additional comments such as improvements and tracking of changes.
There are two types of reports:
- High fidelity — 100% assurance that the malicious activity happened. They often include highly specific indicators.
- Investigative — Cannot give 100% assurance; the report needs to be investigated to fully confirm the infection.
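One way to picture a play is as a small record with the five sections plus the report type. The sketch below is my own illustration, not a format from the book: the class and field names are assumptions, and the query string is invented pseudo-syntax rather than any real SIEM language.

```python
from dataclasses import dataclass
from enum import Enum


class ReportType(Enum):
    HIGH_FIDELITY = "high fidelity"
    INVESTIGATIVE = "investigative"


@dataclass
class Play:
    report_id: str        # Report identification (the title)
    objective: str        # Objective statement
    result_analysis: str  # Guidance for the analyst reading the results
    data_query: str       # Data Query / Code
    analyst_notes: str = ""  # Analyst Comments / Notes
    report_type: ReportType = ReportType.INVESTIGATIVE

    def needs_investigation(self) -> bool:
        """Investigative reports must be confirmed by an analyst."""
        return self.report_type is ReportType.INVESTIGATIVE


# Hypothetical example play; every value here is made up.
example_play = Play(
    report_id="Outbound connections to known mining pools",
    objective="Find hosts that may be running cryptocurrency miners.",
    result_analysis="Cross-check the host against antivirus detections.",
    data_query="dst_domain IN mining_pool_list AND action=allow",
    analyst_notes="2018-06: initial version.",
)
```

Keeping plays in a structured form like this also makes it easier to track changes to them over time.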
The main point of the playbook is to have a standardized workflow through all the incident response phases. The playbooks need to be created in advance, so once the incident occurs, the play can be used immediately.
Patrik's note: I have to say that I was pretty confused by the wording the authors used. There are several sections where reports and plays are used, and I had a hard time distinguishing between them. From my point of view, a report can be seen as a query result, whereas a play is the overall documentation of the query and its other parts.
The framework described above is only a theoretical foundation, so we also need an operational approach to it. This includes several systems working together:
- People — Main component of the whole equation. The headcount should be derived from an expected number of incidents.
- SIEM — As log store + query engine.
- Playbook Tracking System — There should be an independent system which tracks changes to playbooks. Bugzilla is one option for that.
- Case Tracking System — Once an incident is confirmed, a new case should be created to keep track of forensic analysis and analyst comments.
Similarly, there need to be systems that produce the logs forwarded to the SIEM:
- IDS — NIDS or HIDS, can be in a preventive mode as well, depending on its deployment.
- NetFlow — Can be seen as lightweight packet capture
- Firewall — Policy violation logs
- Antivirus — Host detections
- Web proxy — For getting valuable data about HTTP connections
- VPN logs — Can be used to detect incidents such as impossible traveler.
- DHCP logs — To know the local address of the host
- DNS logs — Which domains were resolved. Also acts as a backup for HTTPS where the proxy doesn't allow TLS decryption.
Patrik's note: Although the book doesn't explicitly mention it, there are also EDR systems which act as agents and might be used to confirm additional indicators.
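The impossible traveler detection mentioned for VPN logs is easy to sketch. The example below is my own illustration, not the book's method: it assumes each login has already been enriched with coordinates (for example by a GeoIP lookup of the source address), and the 900 km/h threshold is an arbitrary assumption, roughly the speed of an airliner.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 900.0  # assumed threshold, roughly airliner speed


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_impossible_travel(first, second, max_speed_kmh: float = MAX_SPEED_KMH) -> bool:
    """first and second are (datetime, lat, lon) tuples for the same user."""
    (t1, lat1, lon1), (t2, lat2, lon2) = first, second
    hours = abs((t2 - t1).total_seconds()) / 3600
    distance = haversine_km(lat1, lon1, lat2, lon2)
    if hours == 0:
        # Two simultaneous logins are only suspicious if the places differ.
        return distance > 0
    return distance / hours > max_speed_kmh


# Hypothetical logins: Prague, then New York one hour later.
prague = (datetime(2018, 6, 14, 9, 0), 50.08, 14.43)
new_york = (datetime(2018, 6, 14, 10, 0), 40.71, -74.01)
suspicious = is_impossible_travel(prague, new_york)
```

In practice GeoIP data is imprecise and users hop between VPN exit points, so a report like this belongs in the investigative category rather than the high-fidelity one.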
In addition to these systems, there needs to be an inventory of hosts. No host should be allowed to connect to the network without having an entry in the host inventory (Nagios, IBM Tivoli, HP OpenView). Why? Imagine having an incident spread across multiple geographical locations of your company. You have identified a host which needs to be manually examined (e.g., by taking a disk image), but you have no clue who owns the system or even how to locate it. That's why you need an inventory.
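As a rough sketch of how DHCP logs and the host inventory work together, the following Python snippet resolves an IP address seen in an alert to an owner and a location. The record layouts, names, and sample data are all hypothetical; real inventories and DHCP logs will look different.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Lease:
    """One DHCP lease: which MAC address held which IP, and when."""
    ip: str
    mac: str
    start: datetime
    end: datetime


@dataclass
class Host:
    """One inventory entry, keyed by MAC address in this sketch."""
    mac: str
    hostname: str
    owner: str
    location: str


def find_host(ip: str, when: datetime, leases: List[Lease],
              inventory: Dict[str, Host]) -> Optional[Host]:
    """Return the inventory entry for the host that held `ip` at `when`."""
    for lease in leases:
        if lease.ip == ip and lease.start <= when < lease.end:
            return inventory.get(lease.mac)
    return None


# Made-up sample data.
leases = [
    Lease("10.0.0.5", "aa:bb:cc:00:00:01", datetime(2018, 6, 14, 8), datetime(2018, 6, 14, 16)),
    Lease("10.0.0.5", "aa:bb:cc:00:00:02", datetime(2018, 6, 14, 16), datetime(2018, 6, 15, 0)),
]
inventory = {
    "aa:bb:cc:00:00:01": Host("aa:bb:cc:00:00:01", "laptop-01", "alice", "Prague office"),
}

host = find_host("10.0.0.5", datetime(2018, 6, 14, 9, 15), leases, inventory)
```

The second lease in the sample belongs to a MAC address with no inventory entry, so a lookup in that time window returns `None`, which is exactly the situation the inventory requirement is meant to prevent.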
With these logs in place, the query can be built. The main goal of your query building should be to minimize false positives. Remember, the more general the query is, the more false positives there are. I include one great chart from the book:
Patrik's note: After the book was released, several companies (such as Phantom) started offering playbooks for incident response. The difference is that the book treats a playbook as a fully manual process, whereas such systems let you create playbooks where the steps of the analysis are semi-automated or fully automated.
For inspiration, here are several resources with sample playbooks:
Until next time!