Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection
Published in | Conquering Big Data with High Performance Computing pp. 269 - 286 |
---|---|
Main Authors | Arora, Ritu; Trelogan, Jessica; Ba, Trung Nguyen |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer International Publishing AG (also listed as Springer International Publishing), 2016 |
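Subjects | Algorithms & data structures; Color Histogram; High Performance Computing; Related Image; Research Collection; Systems analysis & design; Template Image |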
Abstract | The detection of duplicate and related content is a critical data curation task in the context of digital research collections. This task can be challenging, if not impossible, to do manually in large, unstructured, and noisy collections. While there are many automated solutions for deduplicating data that contain large numbers of identical copies, it can be particularly difficult to find a solution for identifying redundancy within image-heavy collections that have evolved over a long span of time or have been created collaboratively by large groups. These types of collections, especially in academic research settings, in which the datasets are used for a wide range of publication, teaching, and research activities, can be characterized by (1) large numbers of heterogeneous file formats, (2) repetitive photographic documentation of the same subjects in a variety of conditions, (3) multiple copies or subsets of images with slight modifications (e.g., cropping or color-balancing), and (4) complex file structures and naming conventions that may not be consistent throughout. In this chapter, we present a scalable and automated approach for detecting duplicate, similar, and related images, along with subimages, in digital data collections. Our approach can assist in efficiently managing redundancy in any large image collection on High Performance Computing (HPC) resources. While we illustrate the approach with a large archaeological collection, it is domain-neutral and is widely applicable to image-heavy collections within any HPC platform that has general-purpose processors. |
---|---|
Author | Ba, Trung Nguyen; Arora, Ritu; Trelogan, Jessica |
Author_xml | 1. Arora, Ritu (email: rauta@tacc.utexas.edu; University of Texas at Austin, Texas Advanced Computing Center, Austin, USA); 2. Trelogan, Jessica (University of Texas, Austin, USA); 3. Ba, Trung Nguyen (University of Texas, Austin, USA) |
ContentType | Book Chapter |
Copyright | Springer International Publishing Switzerland 2016 |
Copyright_xml | – notice: Springer International Publishing Switzerland 2016 |
DBID | FFUUA |
DEWEY | 005.7 |
DOI | 10.1007/978-3-319-33742-5_13 |
DatabaseName | ProQuest Ebook Central - Book Chapters - Demo use only |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9783319337425; 3319337424 |
Editor | Arora, Ritu |
Editor_xml | – sequence: 1 fullname: Arora, Ritu |
EndPage | 286 |
ExternalDocumentID | EBC4690733_242_270 |
ID | FETCH-LOGICAL-p1583-dbbaf0f69e08cfd71c377e422322c51577a2e9604e78a3c0f89c2384fd0b7a783 |
ISBN | 3319337408; 9783319337401 |
IngestDate | Tue Jul 29 19:43:24 EDT 2025; Thu May 29 00:28:57 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | false |
LCCallNum | QA76.9.D3 |
Language | English |
LinkModel | OpenURL |
MergedId | FETCHMERGED-LOGICAL-p1583-dbbaf0f69e08cfd71c377e422322c51577a2e9604e78a3c0f89c2384fd0b7a783 |
OCLC | 958864781 |
PQID | EBC4690733_242_270 |
PageCount | 18 |
ParticipantIDs | springer_books_10_1007_978_3_319_33742_5_13; proquest_ebookcentralchapters_4690733_242_270 |
PublicationCentury | 2000 |
PublicationDate | 2016; 20160917 |
PublicationDateYYYYMMDD | 2016-01-01; 2016-09-17 |
PublicationDate_xml | – year: 2016 text: 2016 |
PublicationDecade | 2010 |
PublicationPlace | Switzerland |
PublicationPlace_xml | – name: Switzerland – name: Cham |
PublicationTitle | Conquering Big Data with High Performance Computing |
PublicationYear | 2016 |
Publisher | Springer International Publishing AG; Springer International Publishing |
Publisher_xml | – name: Springer International Publishing AG – name: Springer International Publishing |
SSID | ssj0001737314 |
Score | 1.5068637 |
SourceID | springer proquest |
SourceType | Publisher |
StartPage | 269 |
SubjectTerms | Algorithms & data structures; Color Histogram; High Performance Computing; Related Image; Research Collection; Systems analysis & design; Template Image |
Title | Using High Performance Computing for Detecting Duplicate, Similar and Related Images in a Large Data Collection |
URI | http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=4690733&ppg=270; http://link.springer.com/10.1007/978-3-319-33742-5_13 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | Library Specific Holdings |