By: ThreatLabz

Fun With N-gram Analysis

Analysis

A former colleague and I had discussed character-based analysis of domains as a way to flag suspicious or malicious usage. For example, the Avalanche botnet has a domain registration and fast-flux infrastructure that is frequently used for hosting Zeus, phishing, and money-mule recruitment sites. Avalanche typically bulk-registers domains that do not correspond to any particular word or alias, for example:

erasv
jjkk
yuuikom
yy1azsva
zah7kio
zyuojli
...


The theory is that character-based analysis of malicious and benign domains/URLs would surface indicators: if certain patterns were present or absent, the domain/URL would be worth investigating further. Additionally, a character/sequence-based score could be incorporated into a page risk scoring algorithm.

Using a technique called N-gram analysis, it is possible to very quickly extract sequences of characters from text and calculate how frequently each sequence occurs within the text.
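To make the idea concrete, here is a minimal Python sketch of character N-gram extraction and counting. It is not the Perl/Text::Ngrams code used for this analysis; the function names char_ngrams and ngram_counts are hypothetical, and the sample labels are taken from the Avalanche examples above.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return every contiguous n-character sequence in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_counts(labels, n):
    """Count how often each character n-gram occurs across all labels."""
    counts = Counter()
    for label in labels:
        # Lowercasing is an assumption; domain names are case-insensitive.
        counts.update(char_ngrams(label.lower(), n))
    return counts


if __name__ == "__main__":
    sample = ["erasv", "jjkk", "yuuikom"]
    for gram, count in ngram_counts(sample, 2).most_common(5):
        print(gram, count)
```

A sliding window like this produces len(text) - n + 1 sequences per string, so a string shorter than n simply contributes nothing.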

In order to have an adequate sample size for this analysis, I built a list of 250,000 malicious URLs and 250,000 benign URLs over the past few months. I used the CPAN module Text::Ngrams within Perl scripts to calculate the frequency of sequence occurrences and compute the frequency differences between the malicious and benign domains. I computed the top character sequences from 1 character (1-gram) to 6 characters (6-gram). A draft of the report can be read HERE.
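The comparison step can be sketched roughly as follows. This is a Python illustration, not the original Perl scripts, and it assumes the "frequency difference" is a simple subtraction of relative frequencies (each N-gram's share of all N-grams in its set); the exact measure used in the report may differ. The names relative_frequencies and frequency_differences are hypothetical.

```python
from collections import Counter


def relative_frequencies(domains, n):
    """Map each character n-gram to its share of all n-grams in domains."""
    counts = Counter()
    for domain in domains:
        domain = domain.lower()
        counts.update(domain[i:i + n] for i in range(len(domain) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def frequency_differences(malicious, benign, n):
    """Rank n-grams by (malicious share - benign share), largest first.

    Assumption: plain subtraction of relative frequencies. Positive values
    mean the n-gram is more prevalent in the malicious set.
    """
    mal = relative_frequencies(malicious, n)
    ben = relative_frequencies(benign, n)
    diffs = {
        gram: mal.get(gram, 0.0) - ben.get(gram, 0.0)
        for gram in set(mal) | set(ben)
    }
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)
```

Normalizing to relative frequencies before subtracting keeps the comparison meaningful even when the two sets contain different total numbers of characters.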

Results Summary

1-Gram:
The "-" (dash character) showed the largest excess in malicious domains relative to benign domains. "h", "r", and "p" were the top three letters that were much more prevalent within malicious domains than benign domains. Other characters that are less common in English were also much more prevalent within malicious domains: "j", "z", "w", "y", and "q".

2-Gram:
"te", "on", and "os" were the top three two-character sequences that were much more prevalent within malicious domains than benign. In addition, there were some notable TLD sequences that showed up much more frequently within malicious domains: "ch", "nl", "fr", and "ru."

3-Gram:
The top three-character sequences within malicious versus benign domains were sequences making up the words "online" and "watch". Some other character patterns that were much more frequent within malicious domains appeared to be related to QWERTY keyboard sequences, such as "fds", "asd", and "dfd". Also, as expected, some of the X-rated themed character sequences ("xxx", "sex") were more prevalent within malicious domains.

4-Gram through 6-gram:
Some of the top 4-character word results for malicious versus benign domains include: "porn", "free", "host", "game", "scan", "anti", "evil", and "best." 5-character malicious versus benign sequences results include: "watch", "forum", "gamer", "adult", "virus", "music", "video", "-sell-", "free-", and "phone." 6-character malicious versus benign sequence results include: "online", "funpic", "video", "hostin(g)", "spywar(e)", "archiv(e)", "antivi(rus)", "system", "scanne(r)", and "securi(ty)." Many of these character sequences occur within FakeAV and Fake Codec malware sites.

Future Work

In the future, I'd like to take a larger sample size (e.g., 1 million malicious and 10 million benign), and analyze the malicious domains/URLs by threat (e.g., Zeus, Koobface, various exploit kits, phishing sites, etc.) to compare against benign domains/URLs. This should provide more specific character patterns for malicious sites. From this work, a dictionary of regular expressions can be built to contribute to the weighting of domain/URL risk.
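As a rough illustration of what such a dictionary might look like, here is a minimal Python sketch. The patterns are drawn from the sequences reported above, but the weights are entirely made up for illustration and are not derived from the data; PATTERN_WEIGHTS and domain_score are hypothetical names.

```python
import re

# Hypothetical dictionary: (compiled pattern, weight). The sequences come
# from the results summary; the weights are placeholder assumptions.
PATTERN_WEIGHTS = [
    (re.compile(r"antivi|spywar|scanne"), 3.0),  # FakeAV-style 6-grams
    (re.compile(r"asd|fds|dfd"), 2.0),           # QWERTY keyboard runs
    (re.compile(r"-"), 1.0),                     # dash character
]


def domain_score(domain):
    """Sum the weights of all patterns that match anywhere in the domain."""
    domain = domain.lower()
    return sum(
        weight for pattern, weight in PATTERN_WEIGHTS if pattern.search(domain)
    )


if __name__ == "__main__":
    for name in ["free-antivirus-scanner", "example"]:
        print(name, domain_score(name))
```

Each pattern contributes at most once per domain here; a real scoring scheme would need weights learned from the frequency differences and would likely be only one input among several to the overall page risk score.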

Once again, a draft of my work on this can be read HERE.
