DART: An automated end-to-end object detection pipeline with data Diversification, open-vocabulary bounding box Annotation, pseudo-label Review, and model Training

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 258, Article 125124
Main Authors: Xin, Chen; Hartel, Andreas; Kasneci, Enkelejda
Format: Journal Article
Language: English
Published: Elsevier Ltd, 15 December 2024
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.125124
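The DOI above can be dereferenced programmatically. A minimal sketch against the public Crossref REST API; the endpoint and field names follow Crossref's documented schema, and network access is assumed:

    import json
    import urllib.request

    DOI = "10.1016/j.eswa.2024.125124"

    # Fetch this record's metadata from the public Crossref REST API.
    with urllib.request.urlopen(f"https://api.crossref.org/works/{DOI}") as resp:
        work = json.load(resp)["message"]

    print(work["title"][0])                 # article title
    print(work["container-title"][0])       # journal name
    print(work["issued"]["date-parts"][0])  # publication date, e.g. [2024, 12]
    print("; ".join(f"{a['given']} {a['family']}" for a in work["author"]))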

Abstract
Accurate real-time object detection is vital across numerous industrial applications, from safety monitoring to quality control. Traditional approaches, however, are hindered by arduous manual annotation and data collection, and struggle to adapt to ever-changing environments and novel target objects. To address these limitations, this paper presents DART, an automated end-to-end pipeline that streamlines object detection workflows from data collection to model evaluation. It eliminates the need for laborious human labeling and extensive data collection while achieving outstanding accuracy across diverse scenarios. DART encompasses four key stages: (1) Data Diversification using subject-driven image generation (DreamBooth with SDXL); (2) Annotation via open-vocabulary object detection (Grounding DINO) to generate bounding-box and class labels; (3) Review of generated images and pseudo-labels by large multimodal models (InternVL-1.5 and GPT-4o) to guarantee credibility; and (4) Training of real-time object detectors (YOLOv8 and YOLOv10) on the verified data. We apply DART to a self-collected dataset of construction machines, named Liebherr Product, which contains over 15K high-quality images across 23 categories. The current instantiation of DART significantly increases average precision (AP) from 0.064 to 0.832. Its modular design ensures easy exchangeability and extensibility, allowing for future algorithm upgrades, seamless integration of new object categories, and adaptability to customized environments without manual labeling or additional data collection. The code and dataset are released at https://github.com/chen-xin-94/DART.

Highlights
• DART is an automated end-to-end object detection pipeline that requires no manual labeling.
• DART streamlines the data diversification, annotation, review, and training stages.
• DART boosts AP from 0.064 to 0.832 on a collected dataset of construction machines.
• DART's modular design ensures flexibility, robustness, and customization.
• Code and dataset are made publicly available for further research.
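To make the four stages concrete, here is a minimal control-flow sketch in Python. It is illustrative rather than the authors' released implementation: diversify, annotate, and review are hypothetical stubs standing in for the DreamBooth/SDXL, Grounding DINO, and InternVL-1.5/GPT-4o steps, and the class names and dataset YAML filename are invented; only the training step uses the real Ultralytics YOLOv8 API.

    from pathlib import Path

    from ultralytics import YOLO  # real package: pip install ultralytics


    def diversify(seed_images: list[Path]) -> list[Path]:
        # Stage 1 (hypothetical stub): the paper fine-tunes SDXL with
        # DreamBooth per object instance to generate diverse images.
        return seed_images


    def annotate(images: list[Path], class_names: list[str]) -> dict[Path, list]:
        # Stage 2 (hypothetical stub): the paper prompts Grounding DINO with
        # the class names to produce bounding-box pseudo-labels.
        return {img: [] for img in images}


    def review(labels: dict[Path, list]) -> dict[Path, list]:
        # Stage 3 (hypothetical stub): the paper asks InternVL-1.5 / GPT-4o to
        # reject implausible images and boxes; here everything passes.
        return labels


    def train(data_yaml: str) -> None:
        # Stage 4: train a real-time detector on the verified data.
        # This call is the actual Ultralytics API.
        model = YOLO("yolov8n.pt")  # pretrained checkpoint
        model.train(data=data_yaml, epochs=100, imgsz=640)


    if __name__ == "__main__":
        images = diversify(sorted(Path("seed_images").glob("*.jpg")))
        labels = review(annotate(images, ["crane", "excavator", "wheel loader"]))
        # After exporting `labels` in YOLO format and writing a dataset YAML:
        train("liebherr_product.yaml")

The structure mirrors the paper's stated modularity: each stage is a function with a plain input/output contract, so a newer generator, detector, or reviewer can be swapped in without touching the rest of the pipeline.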
Authors
– Xin, Chen (ORCID 0000-0001-6860-7907); chen.xin@tum.de; Technical University of Munich, Arcisstraße 21, 80333 München, Germany
– Hartel, Andreas; andreas.hartel@liebherr.com; Liebherr-Electronics and Drives GmbH, Peter-Dornier-Straße 11, 88131 Lindau (Bodensee), Germany
– Kasneci, Enkelejda (ORCID 0000-0003-3146-4484); enkelejda.kasneci@tum.de; Technical University of Munich, Arcisstraße 21, 80333 München, Germany
Cited by: doi:10.3389/fenvs.2024.1486212
Copyright: 2024 The Author(s)
Discipline: Computer Science
Open Access: Yes
Peer Reviewed: Yes
Keywords: YOLO; Large multimodal model (LMM); Data diversification; Stable diffusion; Open-vocabulary object detection (OVD); Pseudo-label
License: This is an open access article under the CC BY-NC license.
Open Access Link: https://www.sciencedirect.com/science/article/pii/S0957417424019912
URI: https://dx.doi.org/10.1016/j.eswa.2024.125124