Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection


Bibliographic Details
Published in	Conquering Big Data with High Performance Computing, pp. 269-286
Main Authors	Arora, Ritu; Trelogan, Jessica; Ba, Trung Nguyen
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2016
Subjects	Algorithms & data structures; Color Histogram; High Performance Computing; Related Image; Research Collection; Systems analysis & design; Template Image

Abstract	The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to do manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings, in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and is widely applicable to image-heavy collections within any HPC platform that has general-purpose processors.
Author	Ba, Trung Nguyen; Arora, Ritu; Trelogan, Jessica
Author_xml	– sequence: 1 givenname: Ritu surname: Arora fullname: Arora, Ritu email: rauta@tacc.utexas.edu organization: University of Texas at Austin, Texas Advanced Computing Center, Austin, USA – sequence: 2 givenname: Jessica surname: Trelogan fullname: Trelogan, Jessica organization: University of Texas, Austin, USA – sequence: 3 givenname: Trung Nguyen surname: Ba fullname: Ba, Trung Nguyen organization: University of Texas, Austin, USA
ContentType	Book Chapter
Copyright	Springer International Publishing Switzerland 2016
DEWEY	005.7
DOI	10.1007/978-3-319-33742-5_13
Discipline	Computer Science
EISBN	9783319337425 3319337424
Editor	Arora, Ritu
EndPage	286
ISBN	3319337408 9783319337401
IsPeerReviewed	false
IsScholarly	false
LCCallNum	QA76.9.D3
Language	English
OCLC	958864781
PageCount	18
PublicationCentury	2000
PublicationDate	2016 20160917
PublicationDateYYYYMMDD	2016-01-01 2016-09-17
PublicationDate_xml	– year: 2016 text: 2016
PublicationDecade	2010
PublicationPlace	Switzerland
PublicationPlace_xml	– name: Switzerland – name: Cham
PublicationTitle	Conquering Big Data with High Performance Computing
PublicationYear	2016
Publisher	Springer International Publishing AG Springer International Publishing
Publisher_xml	– name: Springer International Publishing AG – name: Springer International Publishing
StartPage	269