Binary Function Clustering Using Semantic Hashes

Bibliographic Details
Published in 2012 Eleventh International Conference on Machine Learning and Applications Vol. 1; pp. 386-391
Main Authors Jin, W., Chaki, S., Cohen, C., Gurfinkel, A., Havrilla, J., Hines, C., Narasimhan, P.
Format Conference Proceeding
Language English
Published IEEE 01.12.2012
ISBN 1467346519
9781467346511
DOI 10.1109/ICMLA.2012.70

Abstract The ability to identify semantically related functions in large collections of binary executables is important for malware detection. Intuitively, two pieces of code are similar if they have the same effect on a machine's state. Current state-of-the-art tools employ a variety of pairwise comparisons (e.g., template matching using SMT solvers, value-set analysis at critical program points, API call matching). However, these methods are unscalable for clustering large datasets of size N, since they require O(N^2) comparisons. In this paper, we present an alternative approach based on hashing. We propose a scheme that captures the semantics of functions as semantic hashes. Our approach treats a function as a set of features, each of which represents the input-output behavior of a basic block. Using a form of locality-sensitive hashing known as MinHashing, functions with many common features can be quickly identified, and the complexity of clustering is reduced to O(N). Experiments on functions extracted from the CERT malware catalog indicate that we are able to cluster closely related code with a low false positive rate.
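The hashing idea in the abstract can be illustrated with a minimal MinHash sketch. This is not the authors' implementation: the feature strings (such as "bb:a"), the helper names, the hash function, and the parameters num_hashes and bands are all assumptions chosen for illustration. Each function is modeled as a set of opaque feature strings, a fixed-length signature is computed per function, and functions whose signatures agree on a whole band land in the same bucket, so candidate groups are found without comparing every pair.

```python
import hashlib
from collections import defaultdict


def _hash(seed: int, feature: str) -> int:
    # Deterministic 64-bit hash; the seed selects one of many hash functions.
    data = f"{seed}:{feature}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(features, num_hashes=64):
    # For each seeded hash function, keep the minimum hash over the feature set.
    if not features:
        raise ValueError("feature set must not be empty")
    return tuple(min(_hash(seed, f) for f in features) for seed in range(num_hashes))


def estimate_jaccard(sig_a, sig_b):
    # The fraction of matching signature positions estimates Jaccard similarity.
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def lsh_candidates(signatures, bands=16):
    # Split each signature into bands; equal bands share a bucket.
    # Returns the buckets that hold more than one function name.
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(name)
    return [members for members in buckets.values() if len(members) > 1]


# Hypothetical feature sets: one opaque string per basic-block behavior.
functions = {
    "f1": {"bb:a", "bb:b", "bb:c", "bb:d"},
    "f2": {"bb:a", "bb:b", "bb:c", "bb:d"},
    "f3": {"bb:x", "bb:y"},
}
signatures = {name: minhash_signature(feats) for name, feats in functions.items()}
candidates = lsh_candidates(signatures)
```

Each function is hashed once and placed into a fixed number of buckets, which is where the linear cost in N comes from in this sketch; how the paper derives its features and tunes the hashing is not described in this record.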
Author Affiliation Carnegie Mellon Univ., Pittsburgh, PA, USA (all seven authors)
CODEN IEEPAD
Discipline Computer Science
EISBN 9780769549132
0769549136
EndPage 391
ExternalDocumentID 6406693
Genre orig-research
PageCount 6
SubjectTerms Benchmark testing
binary static analysis
clustering
Concrete
Feature extraction
Malware
malware detection
Registers
reverse engineering
semantic comparison
Semantics
URI https://ieeexplore.ieee.org/document/6406693