A deep learning approach to identifying source code in images and video

Bibliographic Details
Published in	2018 IEEE/ACM 15th International Conference on Mining Software Repositories (MSR), pp. 376-386
Main Authors	Ott, Jordan; Atchison, Abigail; Harnack, Paul; Bergh, Adrienne; Linstead, Erik
Format	Conference Proceeding
Language	English
Published	New York, NY, USA: ACM, 28.05.2018
Series	ACM Conferences
Subjects	Computer systems organization > Architectures > Other architectures > Neural networks; Computing methodologies > Machine learning > Machine learning approaches; Convolutional neural networks; Data mining; Deep learning; Information systems > Information retrieval > Specialized information retrieval > Multimedia and multimodal retrieval > Video search; Optical character recognition software; Programming tutorials; Software and its engineering > Software notations and tools > Software libraries and repositories; Tutorials; Video mining
ISBN	9781450357166 1450357164
ISSN	2574-3864
DOI	10.1145/3196398.3196402

More Information
Summary:	While substantial progress has been made in mining code on an Internet scale, efforts to date have been overwhelmingly focused on data sets where source code is represented natively as text. Large volumes of source code available online and embedded in technical videos have remained largely unexplored, due in part to the complexity of extraction when code is represented with images. Existing approaches to code extraction and indexing in this environment rely heavily on computationally intense optical character recognition. To improve the ease and efficiency of identifying this embedded code, as well as identifying similar code examples, we develop a deep learning solution based on convolutional neural networks and autoencoders. Focusing on Java for proof of concept, our technique is able to identify the presence of typeset and handwritten source code in thousands of video images with 85.6%-98.6% accuracy based on syntactic and contextual features learned through deep architectures. When combined with traditional approaches, this provides a more scalable basis for video indexing that can be incorporated into existing software search and mining tools.