Binary Function Clustering Using Semantic Hashes
Published in | 2012 Eleventh International Conference on Machine Learning and Applications Vol. 1; pp. 386 - 391 |
---|---|
Main Authors | Jin, W.; Chaki, S.; Cohen, C.; Gurfinkel, A.; Havrilla, J.; Hines, C.; Narasimhan, P. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.12.2012 |
ISBN | 1467346519 9781467346511 |
DOI | 10.1109/ICMLA.2012.70 |
Abstract | The ability to identify semantically related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, Value-Set analysis at critical program points, API call matching, etc.). However, these methods are unscalable for clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based upon "hashing". We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as Min Hashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate. |
---|---|
Author | Jin, W.; Chaki, S.; Cohen, C.; Gurfinkel, A.; Havrilla, J.; Hines, C.; Narasimhan, P. |
Author affiliation | Carnegie Mellon Univ., Pittsburgh, PA, USA (all seven authors) |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
Discipline | Computer Science |
EISBN | 9780769549132 0769549136 |
EndPage | 391 |
ExternalDocumentID | 6406693 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PageCount | 6 |
PublicationDate | 2012-Dec. |
PublicationDateYYYYMMDD | 2012-12-01 |
PublicationTitle | 2012 Eleventh International Conference on Machine Learning and Applications |
PublicationTitleAbbrev | icmla |
PublicationYear | 2012 |
Publisher | IEEE |
StartPage | 386 |
SubjectTerms | Benchmark testing; binary static analysis; clustering; Concrete; Feature extraction; Malware; malware detection; Registers; reverse engineering; semantic comparison; Semantics |
Title | Binary Function Clustering Using Semantic Hashes |
URI | https://ieeexplore.ieee.org/document/6406693 |
Volume | 1 |
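
The hashing idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature strings, the signature length `NUM_HASHES`, the band count `BANDS`, and all function names below are assumptions chosen for illustration. Each function is modeled as a non-empty set of feature strings (standing in for basic-block input-output behavior), a MinHash signature is computed per function, and locality-sensitive banding groups functions whose signatures agree on at least one band, so that only functions sharing a bucket need to be compared.

```python
import hashlib
from collections import defaultdict

NUM_HASHES = 32  # signature length (assumed value)
BANDS = 8        # number of LSH bands (assumed value); 4 rows per band


def _h(seed: int, feature: str) -> int:
    """Deterministic 64-bit hash of a feature under a given seed."""
    data = f"{seed}:{feature}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features: set[str]) -> tuple[int, ...]:
    """Minimum hash value per seeded hash function; features must be non-empty."""
    return tuple(min(_h(seed, f) for f in features) for seed in range(NUM_HASHES))


def estimated_jaccard(sig_a: tuple[int, ...], sig_b: tuple[int, ...]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def lsh_buckets(signatures: dict[str, tuple[int, ...]]) -> dict[tuple, list[str]]:
    """Place each function into one bucket per band of its signature."""
    rows = NUM_HASHES // BANDS
    buckets = defaultdict(list)
    for name, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * rows:(band + 1) * rows])].append(name)
    return buckets


def candidate_pairs(buckets: dict[tuple, list[str]]) -> set[tuple[str, str]]:
    """Pairs of functions that share at least one bucket."""
    pairs = set()
    for names in buckets.values():
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                pairs.add(tuple(sorted((names[i], names[j]))))
    return pairs


# Hypothetical feature sets; the strings are placeholders, not real semantic hashes.
functions = {
    "f1": {"bb:a", "bb:b", "bb:c", "bb:d"},
    "f2": {"bb:a", "bb:b", "bb:c", "bb:d"},
    "f3": {"bb:w", "bb:x", "bb:y", "bb:z"},
}
signatures = {name: minhash_signature(feats) for name, feats in functions.items()}
pairs = candidate_pairs(lsh_buckets(signatures))
```

Building the signatures and buckets takes one pass over the functions, which is the source of the linear behavior the abstract refers to; in this sketch the remaining cost depends on how many functions land in the same bucket.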