Semantics-aware machine learning for function recognition in binary code

Shuai Wang, Pei Wang, Dinghao Wu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

6 Citations (Scopus)

Abstract

Function recognition in program binaries serves as the foundation for many binary instrumentation and analysis tasks. However, because binaries are usually stripped before distribution, function information is absent from most of them. To date, identifying functions in stripped binaries remains a challenge. Recent research proposes recognizing functions in binary code through machine learning techniques. The recognition model, including typical function entry point patterns, is constructed automatically through learning. However, we observed that because previous work leverages only syntax-level features to train the model, binary obfuscation techniques can undermine the pre-learned models in real-world usage scenarios. In this paper, we propose FID, a semantics-based method to recognize functions in stripped binaries. We leverage symbolic execution to generate semantic information and learn the function recognition model through well-performing machine learning techniques. Because FID extracts semantic information from binary code, it adapts effectively to different compilers and optimizations. Moreover, we demonstrate that FID has high recognition accuracy on binaries transformed by widely used obfuscation techniques. We evaluate FID with over four thousand test cases. Our evaluation shows that FID is comparable with previous work on normal binaries and notably outperforms existing tools on obfuscated code.
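As a rough illustration of why syntax-level entry-point patterns can be brittle, the sketch below scans for the conventional 32-bit x86 prologue bytes (push ebp; mov ebp, esp) and then tries the same scan on a hand-written variant that sets up the frame with substituted instructions. It is not FID and not any tool evaluated in the paper; the function name `find_syntactic_entries` and both byte sequences are hypothetical examples chosen for this note.

```python
# Illustrative only: a naive syntax-level prologue matcher, not FID's method.
# Assumption: 32-bit x86, where "push ebp; mov ebp, esp" encodes as 55 89 E5.
PROLOGUE = bytes.fromhex("5589e5")


def find_syntactic_entries(code: bytes) -> list:
    """Return every offset where the literal prologue byte pattern occurs."""
    hits = []
    start = code.find(PROLOGUE)
    while start != -1:
        hits.append(start)
        start = code.find(PROLOGUE, start + 1)
    return hits


# Plain function body: push ebp; mov ebp, esp; pop ebp; ret
plain = bytes.fromhex("5589e5" "5dc3")

# Hypothetical obfuscated variant with the same effect on the stack and ebp:
# sub esp, 4; mov [esp], ebp; mov ebp, esp; pop ebp; ret
obfuscated = bytes.fromhex("83ec04" "892c24" "89e5" "5dc3")

print(find_syntactic_entries(plain))       # [0]
print(find_syntactic_entries(obfuscated))  # []
```

The byte pattern disappears even though the frame setup behaves the same way, which is the kind of gap a semantics-based recognizer is meant to close.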

Original language: English (US)
Title of host publication: Proceedings - 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 388-398
Number of pages: 11
ISBN (Electronic): 9781538609927
DOI: https://doi.org/10.1109/ICSME.2017.59
State: Published - Nov 2 2017
Event: 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017 - Shanghai, China
Duration: Sep 19 2017 to Sep 22 2017

Other

Other: 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017
Country: China
City: Shanghai
Period: 9/19/17 to 9/22/17

Fingerprint

Binary codes
Learning systems
Semantics
Distribution functions

All Science Journal Classification (ASJC) codes

  • Safety, Risk, Reliability and Quality
  • Software

Cite this

Wang, S., Wang, P., & Wu, D. (2017). Semantics-aware machine learning for function recognition in binary code. In Proceedings - 2017 IEEE International Conference on Software Maintenance and Evolution, ICSME 2017 (pp. 388-398). [8094438] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICSME.2017.59

