In-memory fuzzing for binary code similarity analysis

Shuai Wang, Dinghao Wu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

5 Citations (Scopus)

Abstract

Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. To date, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture program syntax or semantics-level features for comparison. However, previous work has multiple design limitations, which result in relatively high cost, low accuracy, and poor scalability, and thus severely impede practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to analyze every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated, whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model on labeled feature vectors; later, given a feature vector obtained by comparing two functions, the trained model produces a final score representing the similarity of the two functions. We evaluate IMF-SIM against binaries compiled by different compilers, optimizations, and commonly-used obfuscation methods, in total over one thousand binary executables. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.
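The trace-comparison step described in the abstract can be illustrated with a minimal Python sketch. It is not IMF-SIM's implementation: the abstract only states that trace similarity is derived from the longest common subsequence (LCS), so the normalization used here (LCS length divided by the longer trace length), the integer encoding of trace events, and the trace-kind names `"mem"` and `"call"` are all assumptions made for illustration. The trained model that turns the feature vector into a final score is omitted.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[int], b: Sequence[int]) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def trace_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Similarity in [0, 1]. ASSUMPTION: normalize by the longer trace length."""
    if not a and not b:
        return 1.0  # two empty traces are treated as identical
    return lcs_length(a, b) / max(len(a), len(b))


def feature_vector(
    traces_f: Dict[str, List[int]],
    traces_g: Dict[str, List[int]],
    kinds: List[str],
) -> List[float]:
    """One similarity score per behavior-trace kind (kind names are hypothetical)."""
    return [
        trace_similarity(traces_f.get(kind, []), traces_g.get(kind, []))
        for kind in kinds
    ]


# Hypothetical traces for two functions, one list of events per trace kind.
f = {"mem": [1, 2, 3], "call": [7]}
g = {"mem": [1, 3], "call": [8]}
vector = feature_vector(f, g, ["mem", "call"])
```

In a pipeline of the kind the abstract outlines, a vector like `vector` would be passed to a classifier trained on labeled function pairs, which would produce the final similarity score.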

Original language: English (US)
Title of host publication: ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering
Editors: Tien N. Nguyen, Grigore Rosu, Massimiliano Di Penta
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 319-330
Number of pages: 12
ISBN (Electronic): 9781538626849
DOI: https://doi.org/10.1109/ASE.2017.8115645
State: Published - Nov 20 2017
Event: 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017 - Urbana-Champaign, United States
Duration: Oct 30 2017 to Nov 3 2017

Publication series

Name: ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering

Other

Other: 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017
Country: United States
City: Urbana-Champaign
Period: 10/30/17 to 11/3/17

Fingerprint

Binary codes
Binary Code
Data storage equipment
Feature Vector
Trace
Binary
Obfuscation
Compiler Optimization
Longest Common Subsequence
Leverage
Reuse
Learning systems
Scalability
Machine Learning
High Accuracy
Semantics
Similarity
Prototype
Evaluate
Evaluation

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Optimization

Cite this

Wang, S., & Wu, D. (2017). In-memory fuzzing for binary code similarity analysis. In T. N. Nguyen, G. Rosu, & M. Di Penta (Eds.), ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering (pp. 319-330). [8115645] (ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ASE.2017.8115645
Wang, Shuai ; Wu, Dinghao. / In-memory fuzzing for binary code similarity analysis. ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. editor / Tien N. Nguyen ; Grigore Rosu ; Massimiliano Di Penta. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 319-330 (ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering).
@inproceedings{53c08942f23b4c52a6e7c6b68929c71f,
title = "In-memory fuzzing for binary code similarity analysis",
abstract = "Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. By far, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture program syntax or semantics-level features for comparison. However, there exist multiple design limitations in previous work, which result in relatively high cost, low accuracy and scalability, and thus severely impede their practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to launch analysis towards every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated, whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model through labeled feature vectors; later, for a given feature vector by comparing two functions, the trained model gives a final score, representing the similarity score of the two functions. We evaluate IMF-SIM against binaries compiled by different compilers, optimizations, and commonly-used obfuscation methods, in total over one thousand binary executables. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.",
author = "Shuai Wang and Dinghao Wu",
year = "2017",
month = "11",
day = "20",
doi = "10.1109/ASE.2017.8115645",
language = "English (US)",
series = "ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "319--330",
editor = "Nguyen, {Tien N.} and Rosu, Grigore and {Di Penta}, Massimiliano",
booktitle = "ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering",
address = "United States",

}

Wang, S & Wu, D 2017, In-memory fuzzing for binary code similarity analysis. in TN Nguyen, G Rosu & M Di Penta (eds), ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, 8115645, ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering, Institute of Electrical and Electronics Engineers Inc., pp. 319-330, 32nd IEEE/ACM International Conference on Automated Software Engineering, ASE 2017, Urbana-Champaign, United States, 10/30/17. https://doi.org/10.1109/ASE.2017.8115645

In-memory fuzzing for binary code similarity analysis. / Wang, Shuai; Wu, Dinghao.

ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. ed. / Tien N. Nguyen; Grigore Rosu; Massimiliano Di Penta. Institute of Electrical and Electronics Engineers Inc., 2017. p. 319-330 8115645 (ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering).

TY - GEN

T1 - In-memory fuzzing for binary code similarity analysis

AU - Wang, Shuai

AU - Wu, Dinghao

PY - 2017/11/20

Y1 - 2017/11/20

AB - Detecting similar functions in binary executables serves as a foundation for many binary code analysis and reuse tasks. By far, recognizing similar components in binary code remains a challenge. Existing research employs either static or dynamic approaches to capture program syntax or semantics-level features for comparison. However, there exist multiple design limitations in previous work, which result in relatively high cost, low accuracy and scalability, and thus severely impede their practical use. In this paper, we present a novel method that leverages in-memory fuzzing for binary code similarity analysis. Our prototype tool IMF-SIM applies in-memory fuzzing to launch analysis towards every function and collect traces of different kinds of program behaviors. The similarity score of two behavior traces is computed according to their longest common subsequence. To compare two functions, a feature vector is generated, whose elements are the similarity scores of the behavior trace-level comparisons. We train a machine learning model through labeled feature vectors; later, for a given feature vector by comparing two functions, the trained model gives a final score, representing the similarity score of the two functions. We evaluate IMF-SIM against binaries compiled by different compilers, optimizations, and commonly-used obfuscation methods, in total over one thousand binary executables. Our evaluation shows that IMF-SIM notably outperforms existing tools with higher accuracy and broader application scopes.

UR - http://www.scopus.com/inward/record.url?scp=85041433639&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85041433639&partnerID=8YFLogxK

U2 - 10.1109/ASE.2017.8115645

DO - 10.1109/ASE.2017.8115645

M3 - Conference contribution

AN - SCOPUS:85041433639

T3 - ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering

SP - 319

EP - 330

BT - ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering

A2 - Nguyen, Tien N.

A2 - Rosu, Grigore

A2 - Di Penta, Massimiliano

PB - Institute of Electrical and Electronics Engineers Inc.

ER -

Wang S, Wu D. In-memory fuzzing for binary code similarity analysis. In Nguyen TN, Rosu G, Di Penta M, editors, ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering. Institute of Electrical and Electronics Engineers Inc. 2017. p. 319-330. 8115645. (ASE 2017 - Proceedings of the 32nd IEEE/ACM International Conference on Automated Software Engineering). https://doi.org/10.1109/ASE.2017.8115645