How Microsoft narrows the threat funnel on over 600 billion monthly security events


Kristin Burke sits smiling at the camera with her PC in front of her in a Microsoft office open space.
Kristin Burke is a principal service engineer on Microsoft’s Digital Security and Resilience Security Incident Response team. (Photo by Aleenah Ansari | Inside Track)

At Microsoft, we typically see around 600 billion security events each month.

Security events are innocuous actions that happen all the time, ranging from adding a user to a Microsoft SharePoint site, to creating a file, to deleting a folder, to opening an email. On their own, events are typically not notable, but in conjunction with other events, they can signal a threat, so we’ve always got to be on the lookout.

Across a large enterprise like ours, that works out to roughly 20 billion security events each day, which is a lot of data to collect and manage. We see them through Microsoft Azure Sentinel, our cloud-native security information and event management (SIEM) solution.

Instead of sending our Security Operations Center (SOC) analysts on a wild hunt after every event in the SIEM, we use our own tools and solutions to funnel that big 12-digit number down to a more reasonable one. This focuses our efforts on the events that matter and on how they were created, narrowing our attention to the ones that pose a real threat. This is important because we can’t, and shouldn’t, respond to every event. Like anyone who has a security team, we want our security professionals spending their time on true threats to our company.

Diagram of four consecutively smaller rings forming a funnel, decreasing with the number of security events.
Combining solutions like machine learning and automation along with human expertise allows Microsoft to focus on the security events that require attention.

But how can we be certain that the right events are being flagged for investigation?

It all starts with Zero Trust

Zero Trust means verifying everything you can, including identity and device health. It means giving your users enough access to stay productive, but not enough to create unnecessary risk. Segment networks, encrypt your data from end to end, take advantage of telemetry, and assume there will be a breach.

More specifically, Zero Trust architecture reduces risk across all environments by establishing strong identity verification, validating device compliance prior to granting access, and ensuring least privilege access to only explicitly authorized resources. These are the core principles of Microsoft’s Zero Trust architecture.

Adhering to Zero Trust is the first step in protecting Microsoft, and it serves as a framework that encapsulates all our security controls. It’s also how we empower self-service and let users access the resources they need to stay productive from anywhere.

Zero Trust keeps a lot of threats outside of our ecosystem: around 98 percent of attacks are stopped by these basic principles. Most data breaches, roughly 70 percent, come from phishing. In adhering to the values of Zero Trust, we assume that any device or user can be breached, which means carefully scrutinizing security events across our organization for concerning patterns.

Going from several billion to several thousand

You can find threats through use cases, scenarios that describe how unusual security events might be high risk.

When you collect a lot of events data, you either create your own use cases or let the products do it for you. We rely heavily on the latter.

Each of our extended detection and response (XDR) tools has a team dedicated to knowing and identifying when something is out of the ordinary. Inside Microsoft 365 Defender lives Threat Analytics, which consolidates findings, like active threats and critical vulnerabilities, from product groups and security researchers. It then tells us what the finding means, what the associated activity is, what the problem is, and which entities (a device, a mailbox, or something else) are involved.

The Threat Analytics feature, along with the rich research-driven behavioral alerts in Microsoft Defender for Endpoint, has a large impact on reducing the number of events we see. By using security tools that leverage machine learning, threat intelligence, data science, and more, we are able to filter an estimated 600 billion monthly events down to around 10,000 alerts. Whittling those 12 digits down to 5 gets us closer to real threats.

But we still have to determine which items require further investigation from our SOC, and we only want to give our analysts cases that lead to an actionable response.

Cutting that in half, and then some

An estimated 10,000 alerts are still a lot to handle. We utilize automation to help us understand what’s going on and to reduce the number of cases we dig into.

Microsoft Defender for Office 365’s Automated Investigation and Response (AIR) helps our SOC prioritize security events that pose the greatest risk while reducing manual steps in the process. In part, it does this by triaging low-threat cases that have already been identified by our system.

AIR generates an investigation whenever a user flags an email as a phishing attempt. It doesn’t matter whether the email is actually malicious, because our tools, like Microsoft Defender for Office 365, tell us if we need to investigate further. The tool also takes action, like combining emails for deletion, blocking URLs, or providing additional steps to our SOC so that we can protect the company.

And because AIR has taken care of the lower risk items, our SOC can worry about higher risk phishing.

We also use Microsoft Defender for Endpoint Automated Investigation and Response (MDE AIR), which finds and fixes low-level malware instances, stuff we regularly see. MDE AIR can clean up a device, remove any scheduled tasks, delete the service, erase the file, and simply tell us that the problem has been remediated. We can make choices about whether to even look at it, which reduces a lot of noise for our SOC.

These are some of the ways we go from 10,000 monthly alerts down to a manageable 3,500 cases for investigation.

But there are other ways to reduce the number of security alerts an organization must respond to.

Sometimes several events are related and can be correlated into an incident. Microsoft 365 Defender uses threat intelligence to find these relationships. Knowing the association between events, a SOC can be more efficient in its efforts, reducing the number of events along with duplicated efforts performed by multiple analysts.
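To make the idea of correlation concrete, here is a minimal sketch in Python. It is not how Microsoft 365 Defender works internally; it assumes only that each alert carries a set of entities and treats alerts that share an entity, directly or through a chain of other alerts, as one incident. The alert IDs, entity names, and the `correlate` function are all hypothetical.

```python
from collections import defaultdict


def correlate(alerts):
    """Group alerts that share at least one entity, directly or transitively.

    `alerts` maps a hypothetical alert ID to the set of entities it involves.
    Returns a sorted list of incidents, each a sorted list of alert IDs.
    """
    # Union-find: every alert starts out as its own incident.
    parent = {alert_id: alert_id for alert_id in alerts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # entity -> first alert ID that mentioned it
    for alert_id, entities in alerts.items():
        for entity in entities:
            if entity in first_seen:
                union(first_seen[entity], alert_id)
            else:
                first_seen[entity] = alert_id

    incidents = defaultdict(list)
    for alert_id in alerts:
        incidents[find(alert_id)].append(alert_id)
    return sorted(sorted(group) for group in incidents.values())


# Made-up sample data, for illustration only.
alerts = {
    "A1": {"user:alice", "device:laptop-7"},
    "A2": {"device:laptop-7", "file:invoice.docm"},
    "A3": {"mailbox:bob"},
    "A4": {"file:invoice.docm"},
}
incidents = correlate(alerts)
```

In this sample, four alerts collapse into two incidents, so one analyst can pick up the three related alerts together instead of three analysts each investigating a fragment.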

Machine learning and data science can show how impactful data can be. As event data grows, divergent datasets can be merged together to assist threat hunters. There is a function in Microsoft Azure Sentinel called Fusion that can spot ransomware in the background. The solution can also automatically detect multistage attacks and recognize patterns in data that would otherwise be too complex to see.

Making it easy for the SOC

This is how we get from 3,500 cases to the actual security threats that need our attention. But it’s not only about the volume of events—it’s also about how we spend our time.

We want to be able to dig through the data, but when our analysts need to act, we want all that data to be curated and ready to go. Some of this is done through additional layers of automation, like bots that use the Microsoft Defender APIs to pull important data for an analyst. Developers within the SOC team utilized the Microsoft Bot Framework to make it easier for our analysts to get the data they need quickly and efficiently by connecting directly to these robust APIs. Here’s an example:

Sherlock Bot screenshot displaying security event information such as timestamps, frequency, affected systems, risk, exposure level.
Sherlock Bot is one of the many automation tools created and used by SOC analysts to quickly investigate a security event.

This saves the analyst from having to go back and forth between screens to get data or choose remediation actions, while helping the analyst determine whether the alert is a true positive that needs cleanup or a benign or false positive to add to the exclusion list.

A SOC can also leverage playbooks with automation rules to further improve analysts’ efficiency by assigning tasks to the right team, giving further resolution to incidents, and clearing out known false positives. Automation makes things easier, more efficient, and more accurate. It empowers our SOC to make strategic decisions instead of focusing on manual steps.
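As a rough illustration of what such a playbook can look like, the sketch below models automation rules as an ordered list of small Python functions. It does not use the Azure Sentinel or Microsoft Defender APIs; the `Incident` fields, the exclusion list, the team names, and every function name are assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Incident:
    # Hypothetical fields; real incident schemas are richer than this.
    title: str
    category: str
    indicator: str
    status: str = "open"
    owner: str = "unassigned"
    notes: list = field(default_factory=list)


# Illustrative values only.
KNOWN_FALSE_POSITIVES = {"scanner.internal.example"}
TEAM_BY_CATEGORY = {"phishing": "email-team", "malware": "endpoint-team"}


def close_known_false_positive(incident):
    """Close the incident if its indicator is on the exclusion list."""
    if incident.indicator in KNOWN_FALSE_POSITIVES:
        incident.status = "closed"
        incident.notes.append("Closed automatically: indicator is on the exclusion list.")
        return True
    return False


def assign_to_team(incident):
    """Route the incident to the team that owns its category, if any."""
    team = TEAM_BY_CATEGORY.get(incident.category)
    if team is not None:
        incident.owner = team
        incident.notes.append(f"Assigned automatically to {team}.")
        return True
    return False


# Order matters: clear out known false positives before routing anything.
RULES = [close_known_false_positive, assign_to_team]


def run_playbook(incident, rules=RULES):
    """Apply rules in order and return the name of the first rule that acted."""
    for rule in rules:
        if rule(incident):
            return rule.__name__
    return None  # nothing matched; the incident stays in the analyst queue
```

Running `run_playbook(Incident("Reported email", "phishing", "mail.example.org"))` would route that incident to the hypothetical `email-team`, while an incident that matches no rule is left open for an analyst to decide.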

The winnowing we do dramatically reduces the number of events that we need to respond to. Of the roughly 3,500 cases we investigate each month, only about 500 require remediation from Microsoft. This system allows us to quickly identify and respond to those real threats.

Why quality, not quantity, matters

We collect a lot of events each month, but we don’t have time to investigate everything. And not everything is worthy of our SOC’s time. Most of our threats are dealt with by Zero Trust and good security hygiene, but we’re still going to be cautious and perform due diligence.

The solutions and practices we have in place help funnel these security events into a manageable number by eliminating innocuous events, resolving low-risk items, and remediating common problems on behalf of our SOC analysts. When the distilled batch of cases arrives for investigation, our SOC can leverage automation and other tools to work efficiently, quickly responding to items that really matter.

That’s how we get from 600 billion monthly events, a 12-digit number, down to 500 remediations, or 3 digits’ worth of action items.
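The arithmetic behind the funnel can be worked through in a few lines of Python. The stage counts below are the approximate monthly figures quoted in this article; the stage labels and the `reduction_factors` helper are just names chosen for this sketch.

```python
# Approximate monthly figures quoted in the article.
FUNNEL = [
    ("security events", 600_000_000_000),
    ("alerts", 10_000),
    ("cases investigated", 3_500),
    ("remediations", 500),
]


def reduction_factors(stages):
    """Return (from_stage, to_stage, factor) for each consecutive pair of stages."""
    return [
        (prev_name, name, prev_count / count)
        for (prev_name, prev_count), (name, count) in zip(stages, stages[1:])
    ]


if __name__ == "__main__":
    for src, dst, factor in reduction_factors(FUNNEL):
        print(f"{src} -> {dst}: reduced by a factor of about {factor:,.1f}")
```

By far the largest cut happens at the first stage, where product-level detection turns events into alerts. The later stages remove far less in absolute terms, but they are the ones that decide how analysts spend their time.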

Key Takeaways

Here are three things you can do to get started at your company:

  • Use a Zero Trust security model to ensure you have a healthy and protected environment that reinforces strong identity verification, device health enforcement and management, and least privilege access.
  • Ensure that your SOC team is using enterprise security tools that leverage research and machine learning to produce actionable alerts. This includes making sure to use tools that provide alert reduction in the form of correlated pending actions for the SOC, or features such as Fusion or incident correlation.
  • Utilize the APIs provided by your security tools to build automation that ensures analysts can be efficient and get to true positives as quickly as possible.
