This post also appeared on LinkedIn.
Finding the needle in the haystack is difficult enough as it is. Finding the unknown threat in 160 billion transactions a day is like trying to find a single needle in a million haystacks...and you don’t know what a needle looks like. I am glad to blog about the intriguing work my Machine Learning and AI teammates have done along with our amazing colleagues from the Security Research, Engineering, and PM teams. Thanks to everyone whose effort made this article possible!
Most security professionals are quite capable of stopping known threats: Is the signal recognized? Block it. But how do you block something you haven’t seen before?
Identifying threats in inbound and outbound traffic is like trying to find the proverbial needle in a haystack. It’s challenging, but it’s something Zscaler has been doing for more than a decade. Every day, Zscaler enables and inspects more than 160 billion transactions. With more than 175,000 new security updates applied daily, about 100 million threats are detected every day for our customers. Some of them are known threats and some are unknown.
Unknown threats are threats that haven’t been documented or detected before. They arrive in a system without a known identifiable signal such as an IoC (Indicator of Compromise) or signature. Sometimes they are variations of known threats — say, a variant of a well-known ransomware strain — and sometimes they are brand new, previously unseen, or original threats.
Blocking unknown threats requires an innovative approach to security. The Zscaler Zero Trust Exchange employs innovations such as Cloud Sandbox (CSB) and Cloud Browser Isolation (CBI) services to sequester the unknown bad. On top of that, the Zscaler ML/AI team has been collaborating closely with our security research and security engineering teams to develop advanced technology that complements the existing CSB and CBI solutions and combats unknown threats more easily, effectively, and efficiently.
Without a signal to act upon, unknown threats are understandably more difficult to block. We have to correlate and stitch together many other signals from transactions, sometimes over a long period. This can be difficult to put into practice, especially with such massive data volumes.
We need to transform this NP-hard-like problem into something more actionable and manageable without compromising unknown-threat detection. At Zscaler, we started using Machine Learning and AI technology to filter the petabytes of data down to a much smaller volume, so that deeper analysis of the remaining transactions (some of it based on tried-and-true conventional technology and some on AI models) is feasible, practical, and effective.
Let’s look at some examples.
The first example is AI/ML-based categorization of unknown web pages. At Zscaler, we use AI to categorize an unknown page as “Education”, “Weapons/Bombs”, or any other category. We can do this even if we have never seen or labeled the page before. This capability, honed from complex algorithmic AI logic running behind the scenes, enables us to separate some of the “good” sites from the “bad” ones simply by using our proprietary categorization model.
The second example is AI/ML-based risk analysis of unknown web pages. The difference is subtle: here we assess risk rather than assign a category. In fact, the Zscaler cloud service has been scoring web page destinations in production for years, and AI/ML can now make that quantification better and stronger via a domain reputation model. Let's look at this example in detail.
We’ve created an ML-based domain reputation model that pre-filters outbound domains, which makes a downstream threat-detection module (e.g., a command-and-control model) more practical and effective.
Users visit many domains a day. If a domain is known to be bad (because it was associated with a known threat, for example) then it will be blocked. But there will still be unknown domains, and even though they might represent a small percentage of the total, the scale is still huge. Zscaler traffic includes visits to many millions of such domains a day.
In this case, ML aids threat detection (including but not limited to phishing and command-and-control detection) by identifying whether a given domain is suspicious. This pre-filtering step reduces the list of unknown domains to a size that is manageable for further advanced analysis.
In our setup, the ML-based domain reputation model returns a value between 0 and 100 that reflects the likelihood that a domain is good. The lower the score, the more likely the domain is bad.
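The pre-filtering flow described above can be sketched in a few lines of Python. This is only an illustration, not Zscaler's implementation: the function name `triage_domains`, the bucket names, the toy domains, the made-up scores, and the threshold of 30 are all assumptions introduced for the example. The only details taken from the text are that known-bad domains are blocked outright and that a lower score on the 0 to 100 scale means a domain is more likely bad.

```python
from typing import Callable, Dict, Iterable, List, Set


def triage_domains(
    domains: Iterable[str],
    known_bad: Set[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Split domains into blocked, suspicious, and passed buckets.

    score_fn is a stand-in for a reputation model that returns 0-100,
    where a lower score means the domain is more likely bad.
    """
    buckets: Dict[str, List[str]] = {"blocked": [], "suspicious": [], "passed": []}
    for domain in domains:
        if domain in known_bad:
            # Known threats are blocked without any scoring.
            buckets["blocked"].append(domain)
        elif score_fn(domain) < threshold:
            # Low reputation: forward to deeper downstream analysis.
            buckets["suspicious"].append(domain)
        else:
            buckets["passed"].append(domain)
    return buckets


# Toy data: hypothetical domains and made-up scores, for illustration only.
toy_scores = {"news.example": 88.0, "cdn.example": 71.0, "xk3f9.example": 12.0}
result = triage_domains(
    domains=["bad.example", "news.example", "cdn.example", "xk3f9.example"],
    known_bad={"bad.example"},
    score_fn=lambda d: toy_scores[d],
    threshold=30.0,
)
print(result)
```

Only the "suspicious" bucket would go on to the more expensive downstream models, which is what keeps the deeper analysis tractable at scale.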
Figure 1. Zscaler patent-pending domain reputation AI/ML model.
As shown in Figure 1, the total domain reputation score is calculated from its sub-component scores. A machine learning method is used to automatically adjust the weights of these scores to make sure that the final reputation scores follow a Gaussian distribution. (See Figure 2 below.) This allows us to set a threshold to control the fraction of “suspicious” domains to be sent for further analysis.
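To make the thresholding idea concrete, here is a minimal sketch using only the Python standard library. It assumes a plain weighted average of sub-component scores and a threshold read off a fitted Gaussian; the sub-component names (`sub_a`, `sub_b`), the fixed weights, the synthetic score sample, and the 5% target fraction are all hypothetical. The actual model adjusts its weights with a machine learning method, which this sketch does not attempt to reproduce.

```python
import random
from statistics import NormalDist
from typing import Dict, List


def reputation_score(sub_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of sub-component scores, clamped to the 0-100 range."""
    total_weight = sum(weights.values())
    raw = sum(weights[name] * sub_scores[name] for name in weights) / total_weight
    return max(0.0, min(100.0, raw))


def suspicious_threshold(scores: List[float], fraction: float) -> float:
    """Fit a Gaussian to the scores and return the cutoff below which
    roughly `fraction` of domains fall (0 < fraction < 1)."""
    fitted = NormalDist.from_samples(scores)
    return fitted.inv_cdf(fraction)


# Hypothetical sub-components and fixed weights, for illustration only.
weights = {"sub_a": 1.0, "sub_b": 3.0}
print(reputation_score({"sub_a": 80.0, "sub_b": 40.0}, weights))

# Synthetic, roughly Gaussian scores standing in for real traffic.
random.seed(0)
sample = [random.gauss(60.0, 15.0) for _ in range(10_000)]
print(suspicious_threshold(sample, fraction=0.05))
```

Because the final scores are approximately Gaussian, choosing the fraction of domains to send downstream reduces to picking a quantile of the fitted curve, which is the role the red line plays in Figure 2.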
This patent-pending ML model enables us to keep the volume of “suspicious” domains low enough to be analyzed deeply and practically without compromising threat posture. The outcome is that we can filter out clean transactions effectively and focus our analytical energy on only the suspicious ones. Figure 2 below shows sample output based on recent real traffic plus test data sets, including domains from the recent SolarWinds supply-chain attack.
Figure 2. Frequency distribution of the domain reputation AI model scores (blue bars) and its fitted Gaussian model (orange curve). Domains to the left of the red line are considered “suspicious” and will be passed to the downstream command and control detection AI model.
In the lab test shown in Figure 2 above, our model assessed twelve (12) domains related to the SolarWinds attack and correctly classified all of them as “suspicious”.
From here, we ran a command-and-control detection model (which I will discuss in a future blog post) on all of the suspicious domains. We discovered real unknown threats. Here is one example that was added to the Zscaler threat database and blocked around the world:
Figure 3. One malicious domain found in a downstream Zscaler command-and-control detection AI/ML model.
Finding the needle in the haystack is difficult enough as it is due to the scale and speed requirements.
Finding the unknown threat in massive volumes of data traffic is like trying to find a single needle in a million haystacks...and you don’t know what a needle looks like.
Every day, more and more unknown threats appear, introducing new risks to modern enterprises. The Zscaler Machine Learning and AI team’s advanced technology, including the AI/ML-based URL categorization and AI/ML-based domain reputation scoring described in this blog, mitigates the risk of unknown threats. And in the milliseconds it takes for those models to identify an unknown threat as an actual threat to be blocked, security improves for every single Zscaler customer.
For more on Zscaler Machine Learning and AI technology blocking unknown threats in practice, watch my presentation from the recent Virtual Zscaler CXO Summit.