Distinguishing exploited from malicious domain names using COMAR: key findings and future directions

04/15/2021

By Sourena Maroofi (Grenoble Alps University), Maciej Korczyński (Grenoble Alps University), Thymen Wabeke (SIDN Labs), Cristian Hesselman (SIDN Labs), Benoît Ampeau (AFNIC), Andrzej Duda (Grenoble Alps University)

In our previous blog post, we introduced the Franco-Dutch research project on automatic classification of domain name abuse using COMAR, explained the importance of the project and described its objectives. Today, we’ll discuss the key findings of our research so far and its future directions.

What is COMAR?

In a nutshell, COMAR is an experimental system capable of distinguishing between domain names registered by cybercriminals for malicious purposes, and domain names exploited by taking advantage of vulnerabilities in web applications.

COMAR enables various intermediaries, such as registrars, hosting providers and top-level domain (TLD) registries, to further optimise their anti-abuse processes. A domain name classified as maliciously registered can be blocked by the registry or registrar, according to applicable policies, by removing the name from the zone file. A legitimate but compromised domain name should not be blocked; instead, the malicious content should be removed by the hosting provider or domain owner (registrant).

Classification using COMAR

COMAR makes automatic classification decisions in near-real time using publicly available data (e.g. WHOIS, DNS or hosting data) on domains derived from URL blacklists. In our research we used the OpenPhish community feed, PhishTank, APWG and URLhaus data sets, but the system can use other types of blacklist, such as lists of fake webshops.
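As an illustration of the first step, going from blacklisted URLs to the domains to be classified, the following is a minimal sketch in Python. It is not COMAR's actual implementation: the function name `hostnames_from_blacklist` and the sample URLs are hypothetical, and the sketch only extracts hostnames. Mapping a hostname to its registered domain would additionally require something like the Public Suffix List, which is omitted here.

```python
from urllib.parse import urlsplit


def hostnames_from_blacklist(urls):
    """Return the unique hostnames found in a list of blacklisted URLs.

    Hypothetical helper for illustration only. Hostnames are lower-cased
    by urlsplit(); entries without a parsable hostname are skipped.
    """
    hostnames = []
    for url in urls:
        host = urlsplit(url.strip()).hostname
        if host and host not in hostnames:
            hostnames.append(host)
    return hostnames


# Made-up sample entries in the style of a URL blacklist feed.
sample_urls = [
    "http://login.example.com/verify/index.php",
    "https://Example.org/wp-content/uploads/invoice.exe",
    "http://login.example.com/other/page.html",
]

print(hostnames_from_blacklist(sample_urls))
```

Each hostname obtained this way could then be enriched with publicly available data (WHOIS, DNS, hosting) before classification.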

COMAR does not leverage the raw data. Instead, its decisions are based on extracted indicators, also called features, that we have studied extensively.

For example, features indicating that a domain name has been registered by a cybercriminal include special keywords in the domain name, such as ‘verification’, ‘account’ and ‘support’ (e.g. suportaccount-services.com). Based on our in-depth word frequency analysis, we find that cybercriminals tend to incorporate such words into domain names to lure victims into entering their credentials (e.g. login and password).
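A keyword feature of this kind could be sketched as follows. This is a simplified illustration rather than the feature extraction used in COMAR: the keyword list contains only the three words mentioned above, the names `SUSPICIOUS_KEYWORDS` and `keyword_hits` are hypothetical, and the TLD is dropped naively by removing the last label.

```python
# Only the three example words from the text; a real list would be
# derived from a word frequency analysis and be considerably longer.
SUSPICIOUS_KEYWORDS = ("verification", "account", "support")


def keyword_hits(domain):
    """Return the suspicious keywords that occur in a domain name.

    Hypothetical helper: plain substring matching on everything except
    the last label (assumed to be the TLD).
    """
    labels = domain.lower().rstrip(".").split(".")
    name = ".".join(labels[:-1]) if len(labels) > 1 else labels[0]
    return [kw for kw in SUSPICIOUS_KEYWORDS if kw in name]


print(keyword_hits("suportaccount-services.com"))  # ['account']
print(keyword_hits("example.org"))                 # []
```

Note that exact substring matching finds ‘account’ in the example domain but misses the deliberately misspelt ‘suport’. A more robust feature would need some form of fuzzy matching, which is beyond this sketch.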

Features indicating that a domain name was registered by a benign user but has since been compromised include, for example, the number of technologies used in building the website (i.e. frameworks and libraries, or a content management system such as WordPress). The intuition is that legitimate domain owners put more effort into creating content to increase user interest and, therefore, website popularity, i.e. the amount of web traffic the site receives. Such efforts are not generally required for the correct operation of malicious domain names, which typically use a smaller set of technologies than benign sites.

In total, we have proposed 38 features in 7 categories, which are explained in detail in our research article published at the 2020 IEEE European Symposium on Security and Privacy^[1].

Key findings

We evaluated COMAR^[1] extensively using phishing and malware blacklists and showed that it can achieve a high degree of accuracy: 97% of domain names were correctly labelled by the classifier, without using any privileged or non-publicly available data, which makes it suitable for use by any organisation.
We found that, in the sample of phishing domains manually labelled by us, 58% were maliciously registered and 42% were compromised. In the sample of malware domain names, 57% were compromised and 43% were registered by cybercriminals^[1].
We showed that so-called content-based features (e.g. the number of technologies used to build the website and the content length of the domain homepage) were the most effective in determining the ‘level of benignness’ of domain names^[1].
We introduced a new method of estimating the domain creation time in cases where there is no access to WHOIS information, which outperforms standard statistical methods in filling missing values^[1].
We discussed the ways attackers could possibly bypass the COMAR system^[1]. Evading it requires considerable cost and effort, which may discourage malicious actors.
We found that, when used alone, the key heuristics proposed in the APWG phishing survey^[2] may not be able to correctly classify maliciously registered domain names, particularly if they contain no famous brand name, include a misleading string in the domain name, or are not used within a short time after registration^[1].
Previous research indicates that some miscreants register domains and wait for as long as several months before using them in phishing attacks, in order to ensure a higher reputation score from security organisations. The tactic is known as “domain aging”^[2, 3]. We showed that approximately 12% of the domains in the analysed set were compromised in the first three months after registration. The two findings suggest that domain reputation systems based only on domain age may not be capable of accurately distinguishing maliciously registered domains from compromised domains^[1].
Over the course of the project, we also investigated new anti-phishing evasion techniques used by malicious actors. By manually visiting malicious URLs, we noticed that cybercriminals tend to use Google reCAPTCHA^[1] to hide the real content of malicious pages. You can learn more about this research from our article recently published at the 2020 ACM Internet Measurement Conference^[4].

Future directions

Distinguishing between malicious and compromised domains may help to reveal the practices or profit-maximising behaviours commonly used by attackers. Therefore, one attractive direction for future research is to separately study the patterns associated with domain names classified as malicious, and those associated with domain names classified as compromised, with a view to answering questions such as: Do malicious actors tend to deploy TLS certificates on maliciously registered domains so that they look more legitimate? What proportion of all blacklisted domains is accounted for by maliciously registered domains in different DNS ecosystems (e.g. country-code TLDs, new and legacy generic TLDs)? Such research may help intermediaries to identify the ways their domain name ecosystems are exploited and take more effective preventive measures.

The COMAR API already supports the submission of malicious URLs for analysis and classification. We plan to post-evaluate the method by manually labelling the .fr and .nl domains and cross-checking them against the labels automatically assigned by COMAR. The ultimate goal of the COMAR project is to make the system available to the support teams at AFNIC and SIDN – the registries for two leading European country-code TLDs – and to set up an early notification system to facilitate the remediation of blacklisted URLs.

References

[1] “COMAR: Classification of Compromised versus Maliciously Registered Domains”, Sourena Maroofi, Maciej Korczyński, Cristian Hesselman, Benoit Ampeau and Andrzej Duda, IEEE European Symposium on Security and Privacy (IEEE EuroS&P 2020), Virtual Conference, September 2020.

[2] “Global Phishing Survey: Trends and Domain Name Use in 2016”, Greg Aaron and Rod Rasmussen, June 2017.

[3] “Cybercrime After the Sunrise: A Statistical Analysis of DNS Abuse in New gTLDs”, Maciej Korczynski, Maarten Wullink, Samaneh Tajalizadehkhoob, Giovane C.M. Moura, Arman Noroozian, Drew Bagley and Cristian Hesselman, ACM Asia Conference on Computer and Communications Security (ACM AsiaCCS 2018), South Korea, June 2018.

[4] “Are You Human? Resilience of Phishing Detection to Evasion Techniques Based on Human Verification”, Sourena Maroofi, Maciej Korczyński and Andrzej Duda, ACM Internet Measurement Conference (ACM IMC 2020), Virtual Conference, October 2020.