Source File Set Search for Clone-and-Own Reuse Analysis

Bibliographic Details
Main Authors Ishio, Takashi; Sakaguchi, Yusuke; Ito, Kaoru; Inoue, Katsuro
Format Journal Article
Language English
Published 26.04.2017

Abstract The clone-and-own approach is a natural way for software developers to reuse source code. To assess how known bugs and security vulnerabilities of a cloned component affect an application, developers and security analysts need to identify the original version of the component and understand how the cloned component differs from it. Although developers may record the original version information in a version control system and/or directory names, such information is often unavailable or incomplete. In this research, we propose a code search method that takes a set of source files as input and extracts all the components that include similar files from a software ecosystem (i.e., a collection of existing versions of software packages). Our method employs an efficient file similarity computation using the b-bit minwise hashing technique, and ranks components by an aggregated file similarity. To evaluate the effectiveness of the tool, we analyzed 75 cloned components in the Firefox and Android source code. The tool took about two hours to report the original components from 10 million files in Debian GNU/Linux packages. Recall of the top-five components in the extracted lists is 0.907, while recall of a baseline using SHA-1 file hashes is 0.773, according to the ground truth recorded in the source code repositories.
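
The search idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' tool: the token 3-gram features, the 256 hash functions with b = 1, the 0.7 threshold, the scoring rule (sum of each query file's best match), and all function names are assumptions chosen for illustration. The similarity estimate uses the simplified b-bit correction J ≈ (P - 2^-b) / (1 - 2^-b), where P is the fraction of matching signature positions, which is only an approximation that holds when files are small relative to the feature universe.

```python
import hashlib
import re

NUM_HASHES = 256  # assumed number of hash functions (k)
B_BITS = 1        # assumed number of low bits kept per min-hash value (b)


def trigrams(text):
    """Token 3-grams of a file's content (an assumed feature choice)."""
    tokens = re.findall(r"\w+|[^\w\s]", text)
    return {" ".join(tokens[i:i + 3]) for i in range(max(len(tokens) - 2, 1))}


def _hash(seed, item):
    """One member of a seeded hash family, built from BLAKE2b."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def signature(features):
    """b-bit minwise signature: lowest b bits of each of the k min-hashes."""
    mask = (1 << B_BITS) - 1
    return [min(_hash(seed, f) for f in features) & mask
            for seed in range(NUM_HASHES)]


def estimate_similarity(sig_a, sig_b):
    """Approximate Jaccard similarity from two b-bit signatures."""
    matches = sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
    chance = 1.0 / (1 << B_BITS)  # match rate expected by chance alone
    return max(0.0, (matches - chance) / (1.0 - chance))


def rank_components(query_files, components, threshold=0.7):
    """Rank components by an aggregated file similarity (assumed scoring).

    components maps a component name to a list of file contents. Each
    query file contributes its best similarity within the component,
    provided that similarity reaches the threshold.
    """
    query_sigs = [signature(trigrams(t)) for t in query_files]
    scores = {}
    for name, files in components.items():
        comp_sigs = [signature(trigrams(t)) for t in files]
        total = 0.0
        for q in query_sigs:
            best = max((estimate_similarity(q, c) for c in comp_sigs),
                       default=0.0)
            if best >= threshold:
                total += best
        scores[name] = total
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Calling `rank_components(query_files, components)` with a dictionary of hypothetical package versions returns (name, score) pairs with the best-matching component first. An exact-hash baseline such as SHA-1 would credit only byte-identical files, whereas this estimate can still credit files that were modified after cloning.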
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI 10.48550/arxiv.1704.08395
OpenAccessLink https://arxiv.org/abs/1704.08395
SubjectTerms Computer Science - Software Engineering