MalwareHunt

semantics-based malware diffing speedup by normalized basic block memoization

Jiang Ming, Dongpeng Xu, Dinghao Wu

Research output: Contribution to journal (Article)

Abstract

The key challenge of software reverse engineering is that the source code of the program under investigation is typically not available. Identifying differences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for code pairs with the same semantics, even though they are ostensibly different in syntax. Such a semantics-based method captures intrinsic differences/similarities of binary code, making it a compelling choice for analyzing highly obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based binary diffing suffers from significant performance slowdown, which prevents it from analyzing large numbers of malware samples. In this paper, we attempt to mitigate the high overhead of semantics-based binary diffing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce a union-find set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the number of constraint solver invocations. We have implemented our technique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra-family comparisons for the purpose of malware lineage inference. Our experimental results show that MalwareHunt can accelerate symbolic execution by 2.8X to 5.3X (with an average of 4.1X) and reduce constraint solver invocations by a factor of 3.0X to 6.0X (with an average of 4.5X).
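The memoization idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the block representation, the `normalize` rule (lowercasing and whitespace collapsing), the `EquivalenceMemo` class, and the `toy_check` function standing in for symbolic execution plus constraint solving are all hypothetical, chosen only to show how a union-find structure lets later comparisons reuse earlier equivalence results.

```python
from typing import Callable, Dict, Sequence, Tuple

Block = Tuple[str, ...]  # assumed representation: a basic block as instruction strings


def normalize(block: Sequence[str]) -> Block:
    """Placeholder normalization (an assumption): lowercase and collapse whitespace."""
    return tuple(" ".join(ins.lower().split()) for ins in block)


def toy_check(x: Block, y: Block) -> bool:
    """Hypothetical stand-in for the expensive symbolic-execution equivalence check."""
    return sorted(x) == sorted(y)


class EquivalenceMemo:
    """Union-find over normalized blocks; each set holds blocks already proven equivalent."""

    def __init__(self, solver_check: Callable[[Block, Block], bool]) -> None:
        self._parent: Dict[Block, Block] = {}
        self._solver_check = solver_check
        self.solver_calls = 0  # counts how often the expensive check actually ran

    def _find(self, block: Block) -> Block:
        self._parent.setdefault(block, block)
        root = block
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited node directly at the root.
        while self._parent[block] != root:
            self._parent[block], block = root, self._parent[block]
        return root

    def equivalent(self, a: Sequence[str], b: Sequence[str]) -> bool:
        ra, rb = self._find(normalize(a)), self._find(normalize(b))
        if ra == rb:
            return True  # same set: reuse an earlier result, no solver call
        self.solver_calls += 1
        if self._solver_check(ra, rb):
            self._parent[ra] = rb  # merge the two equivalence classes
            return True
        return False
```

In this sketch, once blocks A and B and then B and C have been proven equivalent, a later query for A and C is answered from the union-find structure alone. Only positive results are cached here; whether and how negative results are stored is left open.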

Original language: English (US)
Pages (from-to): 167-178
Number of pages: 12
Journal: Journal of Computer Virology and Hacking Techniques
Volume: 13
Issue number: 3
DOIs: 10.1007/s11416-016-0279-x
State: Published - Aug 1 2017

Fingerprint

Semantics
Binary codes
Reverse engineering
Syntactics
Software engineering
Malware

All Science Journal Classification (ASJC) codes

  • Computer Science (miscellaneous)
  • Software
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

@article{31765a6d2b294d958a3a9a224aef803b,
title = "{MalwareHunt}: semantics-based malware diffing speedup by normalized basic block memoization",
abstract = "The key challenge of software reverse engineering is that the source code of the program under investigation is typically not available. Identifying differences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for code pairs with the same semantics, even though they are ostensibly different in syntax. Such a semantics-based method captures intrinsic differences/similarities of binary code, making it a compelling choice for analyzing highly obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based binary diffing suffers from significant performance slowdown, which prevents it from analyzing large numbers of malware samples. In this paper, we attempt to mitigate the high overhead of semantics-based binary diffing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce a union-find set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the number of constraint solver invocations. We have implemented our technique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra-family comparisons for the purpose of malware lineage inference. Our experimental results show that MalwareHunt can accelerate symbolic execution by 2.8X to 5.3X (with an average of 4.1X) and reduce constraint solver invocations by a factor of 3.0X to 6.0X (with an average of 4.5X).",
author = "Jiang Ming and Dongpeng Xu and Dinghao Wu",
year = "2017",
month = "8",
day = "1",
doi = "10.1007/s11416-016-0279-x",
language = "English (US)",
volume = "13",
pages = "167--178",
journal = "Journal of Computer Virology and Hacking Techniques",
issn = "2274-2042",
publisher = "Springer Science + Business Media",
number = "3",

}

MalwareHunt: semantics-based malware diffing speedup by normalized basic block memoization. / Ming, Jiang; Xu, Dongpeng; Wu, Dinghao.

In: Journal of Computer Virology and Hacking Techniques, Vol. 13, No. 3, 01.08.2017, p. 167-178.


TY - JOUR

T1 - MalwareHunt

T2 - semantics-based malware diffing speedup by normalized basic block memoization

AU - Ming, Jiang

AU - Xu, Dongpeng

AU - Wu, Dinghao

PY - 2017/8/1

Y1 - 2017/8/1

AB - The key challenge of software reverse engineering is that the source code of the program under investigation is typically not available. Identifying differences between two executable binaries (binary diffing) can reveal valuable information in the absence of source code, such as vulnerability patches, software plagiarism evidence, and malware variant relations. Recently, a new binary diffing method based on symbolic execution and constraint solving has been proposed to look for code pairs with the same semantics, even though they are ostensibly different in syntax. Such a semantics-based method captures intrinsic differences/similarities of binary code, making it a compelling choice for analyzing highly obfuscated malicious programs. However, due to the nature of symbolic execution, semantics-based binary diffing suffers from significant performance slowdown, which prevents it from analyzing large numbers of malware samples. In this paper, we attempt to mitigate the high overhead of semantics-based binary diffing with application to malware lineage inference. We first study the key obstacles that contribute to the performance bottleneck. Then we propose normalized basic block memoization to speed up semantics-based binary diffing. We introduce a union-find set structure that records semantically equivalent basic blocks. Managing the union-find structure during successive comparisons allows direct reuse of previously computed results. Moreover, we utilize a set of enhanced optimization methods to further cut down the number of constraint solver invocations. We have implemented our technique, called MalwareHunt, on top of a trace-oriented binary diffing tool and evaluated it on 15 polymorphic and metamorphic malware families. We perform intra-family comparisons for the purpose of malware lineage inference. Our experimental results show that MalwareHunt can accelerate symbolic execution by 2.8X to 5.3X (with an average of 4.1X) and reduce constraint solver invocations by a factor of 3.0X to 6.0X (with an average of 4.5X).

UR - http://www.scopus.com/inward/record.url?scp=84969131774&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84969131774&partnerID=8YFLogxK

U2 - 10.1007/s11416-016-0279-x

DO - 10.1007/s11416-016-0279-x

M3 - Article

VL - 13

SP - 167

EP - 178

JO - Journal of Computer Virology and Hacking Techniques

JF - Journal of Computer Virology and Hacking Techniques

SN - 2274-2042

IS - 3

ER -