A deep learning approach to identifying source code in images and video

Bibliographic Details
Published in 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 376-386
Main Authors Ott, Jordan; Atchison, Abigail; Harnack, Paul; Bergh, Adrienne; Linstead, Erik
Format Conference Proceeding
Language English
Published New York, NY, USA ACM 28.05.2018
Series ACM Conferences
ISBN 9781450357166
1450357164
ISSN 2574-3864
DOI 10.1145/3196398.3196402

Abstract While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.
Author Ott, Jordan
Atchison, Abigail
Linstead, Erik
Harnack, Paul
Bergh, Adrienne
Author_xml – sequence: 1
  givenname: Jordan
  surname: Ott
  fullname: Ott, Jordan
  email: ott109@mail.chapman.edu
  organization: Chapman University
– sequence: 2
  givenname: Abigail
  surname: Atchison
  fullname: Atchison, Abigail
  email: atchi102@mail.chapman.edu
  organization: Chapman University
– sequence: 3
  givenname: Paul
  surname: Harnack
  fullname: Harnack, Paul
  email: harna100@mail.chapman.edu
  organization: Chapman University
– sequence: 4
  givenname: Adrienne
  surname: Bergh
  fullname: Bergh, Adrienne
  email: abergh@chapman.edu
  organization: Chapman University
– sequence: 5
  givenname: Erik
  surname: Linstead
  fullname: Linstead, Erik
  email: linstead@chapman.edu
  organization: Chapman University
CODEN IEEPAD
ContentType Conference Proceeding
Copyright 2018 ACM
Copyright_xml – notice: 2018 ACM
DOI 10.1145/3196398.3196402
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Computer Science
EISBN 9781450357166
1450357164
EISSN 2574-3864
EndPage 386
ExternalDocumentID 8595221
Genre orig-research
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed false
IsScholarly true
Keywords deep learning
video mining
convolutional neural networks
programming tutorials
Language English
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
LinkModel DirectLink
MeetingName ICSE '18: 40th International Conference on Software Engineering
OpenAccessLink https://dl.acm.org/doi/pdf/10.1145/3196398.3196402
PageCount 11
PublicationCentury 2000
PublicationDate 20180528
2018-May
PublicationDateYYYYMMDD 2018-05-28
2018-05-01
PublicationDate_xml – month: 05
  year: 2018
  text: 20180528
  day: 28
PublicationDecade 2010
PublicationPlace New York, NY, USA
PublicationPlace_xml – name: New York, NY, USA
PublicationSeriesTitle ACM Conferences
PublicationTitle 2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR)
PublicationTitleAbbrev MSR
PublicationYear 2018
Publisher ACM
Publisher_xml – name: ACM
SourceID ieee
acm
SourceType Publisher
StartPage 376
SubjectTerms Computer systems organization -- Architectures -- Other architectures -- Neural networks
Computing methodologies -- Machine learning -- Machine learning approaches
Convolutional neural networks
Data mining
Deep learning
Information systems -- Information retrieval -- Specialized information retrieval -- Multimedia and multimodal retrieval -- Video search
Optical character recognition software
programming tutorials
Software and its engineering -- Software notations and tools -- Software libraries and repositories
Tutorials
video mining
Title A deep learning approach to identifying source code in images and video
URI https://ieeexplore.ieee.org/document/8595221