Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection


Bibliographic Details
Published in Conquering Big Data with High Performance Computing, pp. 269-286
Main Authors Arora, Ritu; Trelogan, Jessica; Ba, Trung Nguyen
Format Book Chapter
Language English
Published Switzerland Springer International Publishing AG 2016
Springer International Publishing
Abstract The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to do manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings, in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and is widely applicable to image-heavy collections within any HPC platform that has general-purpose processors.
Author Ba, Trung Nguyen
Arora, Ritu
Trelogan, Jessica
Author_xml – sequence: 1
  givenname: Ritu
  surname: Arora
  fullname: Arora, Ritu
  email: rauta@tacc.utexas.edu
  organization: University of Texas at Austin, Texas Advanced Computing Center, Austin, USA
– sequence: 2
  givenname: Jessica
  surname: Trelogan
  fullname: Trelogan, Jessica
  organization: University of Texas, Austin, USA
– sequence: 3
  givenname: Trung Nguyen
  surname: Ba
  fullname: Ba, Trung Nguyen
  organization: University of Texas, Austin, USA
ContentType Book Chapter
Copyright Springer International Publishing Switzerland 2016
Copyright_xml – notice: Springer International Publishing Switzerland 2016
DEWEY 005.7
DOI 10.1007/978-3-319-33742-5_13
Discipline Computer Science
EISBN 9783319337425
3319337424
Editor Arora, Ritu
Editor_xml – sequence: 1
  fullname: Arora, Ritu
EndPage 286
ISBN 3319337408
9783319337401
IsPeerReviewed false
IsScholarly false
LCCallNum QA76.9.D3
Language English
OCLC 958864781
PageCount 18
PublicationCentury 2000
PublicationDate 2016
20160917
PublicationDateYYYYMMDD 2016-01-01
2016-09-17
PublicationDate_xml – year: 2016
  text: 2016
PublicationDecade 2010
PublicationPlace Switzerland
PublicationPlace_xml – name: Switzerland
– name: Cham
PublicationTitle Conquering Big Data with High Performance Computing
PublicationYear 2016
Publisher Springer International Publishing AG
Springer International Publishing
Publisher_xml – name: Springer International Publishing AG
– name: Springer International Publishing
SSID ssj0001737314
Score 1.5068637
Snippet The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging,...
SourceID springer
proquest
SourceType Publisher
StartPage 269
SubjectTerms Algorithms & data structures
Color Histogram
High Performance Computing
Related Image
Research Collection
Systems analysis & design
Template Image
Title Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection