TY - JOUR
T1 - Scalable malware classification with multifaceted content features and threat intelligence
AU - Hu, X.
AU - Jang, J.
AU - Wang, T.
AU - Ashraf, Z.
AU - Stoecklin, M. Ph.
AU - Kirat, D.
PY - 2016/7/1
Y1 - 2016/7/1
N2 - Recent years have witnessed the very rapid increase in both the volume and sophistication of malware programs. Malware authors invest heavily in technologies and capabilities to streamline the process of building and mutating existing malware programs to evade traditional protection. One major challenge currently faced by the antivirus industry is to efficiently process the vast amount of incoming suspicious samples. Since most new malware is a variation of an existing malware family with the same forms of malicious behavior, automatic clustering and classification of malware programs into families have become valuable tools for malware analysts. Such grouping criteria not only allow analysts to prioritize the allocation of their investigation efforts but may also be applied to detect new malware samples based on their association with existing families. In this paper, we address the multi-class malware classification challenge from a scalability perspective. We present the design, development, and evaluation of a novel machine learning classifier trained on multifaceted content features (e.g., instruction sequences, strings, section information, and other malware features) as well as threat intelligence gathered from external sources (e.g., antivirus output). Our experiments on a dataset of 21,741 malware samples demonstrate the efficacy and precision of the proposed algorithm and also provide insights into the utility of various features.
UR - http://www.scopus.com/inward/record.url?scp=84982685104&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84982685104&partnerID=8YFLogxK
U2 - 10.1147/JRD.2016.2559378
DO - 10.1147/JRD.2016.2559378
M3 - Article
AN - SCOPUS:84982685104
SN - 0018-8646
VL - 60
JO - IBM Journal of Research and Development
JF - IBM Journal of Research and Development
IS - 4
M1 - 7523365
ER -