Enhancing DNN-Based Binary Code Function Search With Low-Cost Equivalence Checking

Bibliographic Details
Published in	IEEE transactions on software engineering Vol. 49; no. 1; pp. 226 - 250
Main Authors	Wang, Huaijin, Ma, Pingchuan, Yuan, Yuanyuan, Liu, Zhibo, Wang, Shuai, Tang, Qiyi, Nie, Sen, Wu, Shi
Format	Journal Article
Language	English
Published	New York IEEE 01.01.2023 IEEE Computer Society
Subjects	Applications programs; Artificial neural networks; Assembly; Binary codes; Clustering; Codes; Compilers; Constraints; Cybersecurity; deep learning; Equivalence; False alarms; Flow graphs; Low cost; Machine learning; Malware; Optimization; Reverse engineering; Searching; Security; Semantics; Software engineering; software similarity; symbolic execution; Task analysis

More Information
Summary:	Binary code function search has been used as the core basis of various security and software engineering applications, including malware clustering, code clone detection, and vulnerability audits. Recognizing logically similar assembly functions, however, remains a challenge. Most binary code search tools rely on program structure-level information, such as control flow and data flow graphs, that is extracted using program analysis techniques or deep neural networks (DNNs). However, DNN-based techniques capture lexical-, control structure-, or data flow-level information of binary code for representation learning, which is often too coarse-grained and does not accurately denote program functionality. Additionally, it may exhibit low robustness to a variety of challenging settings, such as compiler optimizations and obfuscations. This paper proposes a general solution for enhancing the top-k ranked candidates in DNN-based binary code function search. The key idea is to design a low-cost and comprehensive equivalence check that quickly exposes functionality deviations between the target function and its top-k matched functions. Functions that fail this equivalence check can be shaved from the top-k list, and functions that pass the check can be revisited to move ahead on the top-k ranked candidates, in a deliberate way. We design a practical and efficient equivalence check, named BinUSE, using under-constrained symbolic execution (USE). USE, a variant of symbolic execution, improves scalability by initiating symbolic execution directly from function entry points and relaxing constraints on function parameters. It eliminates the overhead incurred by path explosion and costly constraints. BinUSE is specifically designed to deliver an assembly function-level equivalence check, enhancing DNN-based binary code search by reducing its false alarms with low cost. Our evaluation shows that BinUSE can enable a general and effective enhancement of four state-of-the-art DNN-based binary code search tools when confronted with challenges posed by different compilers, optimizations, obfuscations, and architectures.
ISSN:	0098-5589 1939-3520
DOI:	10.1109/TSE.2022.3149240
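
Illustration:	The re-ranking idea in the summary (drop candidates that fail the equivalence check, move candidates that pass it ahead) can be sketched as below. This is a minimal sketch and not the paper's implementation: the names `Verdict` and `rerank_top_k`, the `UNKNOWN` verdict for inconclusive checks, and the policy of placing every passing candidate ahead of every undecided one are assumptions made for illustration. The summary only says that passing functions are moved ahead "in a deliberate way" and does not give the exact policy. The `check` callable stands in for an equivalence check such as BinUSE and is purely hypothetical here.

```python
from enum import Enum
from typing import Callable, List, Tuple


class Verdict(Enum):
    """Hypothetical outcomes of an equivalence check (assumed, not from the paper)."""
    PASS = "pass"        # no functionality deviation found
    FAIL = "fail"        # a functionality deviation was exposed
    UNKNOWN = "unknown"  # check was inconclusive, e.g. it hit a budget


def rerank_top_k(
    candidates: List[Tuple[str, float]],
    check: Callable[[str], Verdict],
) -> List[Tuple[str, float]]:
    """Re-rank (name, similarity) pairs that are already sorted best-first.

    Assumed policy: failing candidates are removed, passing candidates are
    moved ahead of undecided ones, and the DNN order is kept within each group.
    """
    passed: List[Tuple[str, float]] = []
    undecided: List[Tuple[str, float]] = []
    for name, score in candidates:
        verdict = check(name)
        if verdict is Verdict.PASS:
            passed.append((name, score))
        elif verdict is Verdict.UNKNOWN:
            undecided.append((name, score))
        # Verdict.FAIL: the candidate is dropped from the list.
    return passed + undecided


if __name__ == "__main__":
    # Made-up candidate names and scores, for illustration only.
    top_k = [("cand_a", 0.93), ("cand_b", 0.91), ("cand_c", 0.88)]
    verdicts = {
        "cand_a": Verdict.FAIL,
        "cand_b": Verdict.UNKNOWN,
        "cand_c": Verdict.PASS,
    }
    print(rerank_top_k(top_k, verdicts.__getitem__))
```

In this sketch the check acts purely as a post-processing filter over the DNN's output, which is one way to read the summary's claim that the approach applies generally across different DNN-based search tools.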