DART: An automated end-to-end object detection pipeline with data Diversification, open-vocabulary bounding box Annotation, pseudo-label Review, and model Training

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 258, Article 125124
Main Authors: Xin, Chen; Hartel, Andreas; Kasneci, Enkelejda
Format: Journal Article
Language: English
Published: Elsevier Ltd, 15 December 2024
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2024.125124
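The DOI above can be dereferenced programmatically. A minimal sketch against the public Crossref REST API; the endpoint and field names follow Crossref's documented schema, and network access is assumed:

    import json
    import urllib.request

    DOI = "10.1016/j.eswa.2024.125124"

    # Fetch this record's metadata from the public Crossref REST API.
    with urllib.request.urlopen(f"https://api.crossref.org/works/{DOI}") as resp:
        work = json.load(resp)["message"]

    print(work["title"][0])                 # article title
    print(work["container-title"][0])       # journal name
    print(work["issued"]["date-parts"][0])  # publication date, e.g. [2024, 12]
    print("; ".join(f"{a['given']} {a['family']}" for a in work["author"]))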

Abstract
Accurate real-time object detection is vital across numerous industrial applications, from safety monitoring to quality control. Traditional approaches, however, are hindered by arduous manual annotation and data collection, and struggle to adapt to ever-changing environments and novel target objects. To address these limitations, this paper presents DART, an automated end-to-end pipeline that streamlines object detection workflows from data collection to model evaluation. It eliminates the need for laborious human labeling and extensive data collection while achieving outstanding accuracy across diverse scenarios. DART encompasses four key stages: (1) Data Diversification using subject-driven image generation (DreamBooth with SDXL); (2) Annotation via open-vocabulary object detection (Grounding DINO) to generate bounding-box and class labels; (3) Review of generated images and pseudo-labels by large multimodal models (InternVL-1.5 and GPT-4o) to guarantee credibility; and (4) Training of real-time object detectors (YOLOv8 and YOLOv10) on the verified data. We apply DART to a self-collected dataset of construction machines, named Liebherr Product, which contains over 15K high-quality images across 23 categories. The current instantiation of DART significantly increases average precision (AP) from 0.064 to 0.832. Its modular design ensures easy exchangeability and extensibility, allowing for future algorithm upgrades, seamless integration of new object categories, and adaptability to customized environments without manual labeling or additional data collection. The code and dataset are released at https://github.com/chen-xin-94/DART.

Highlights
• DART is an automated end-to-end object detection pipeline that requires no manual labeling.
• DART streamlines the data diversification, annotation, review, and training stages.
• DART boosts AP from 0.064 to 0.832 on a collected dataset of construction machines.
• DART's modular design ensures flexibility, robustness, and customization.
• Code and dataset are made publicly available for further research.
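To make the four stages concrete, here is a minimal control-flow sketch in Python. It is illustrative rather than the authors' released implementation: diversify, annotate, and review are hypothetical stubs standing in for the DreamBooth/SDXL, Grounding DINO, and InternVL-1.5/GPT-4o steps, and the class names and dataset YAML filename are invented; only the training step uses the real Ultralytics YOLOv8 API.

    from pathlib import Path

    from ultralytics import YOLO  # real package: pip install ultralytics


    def diversify(seed_images: list[Path]) -> list[Path]:
        # Stage 1 (hypothetical stub): the paper fine-tunes SDXL with
        # DreamBooth per object instance to generate diverse images.
        return seed_images


    def annotate(images: list[Path], class_names: list[str]) -> dict[Path, list]:
        # Stage 2 (hypothetical stub): the paper prompts Grounding DINO with
        # the class names to produce bounding-box pseudo-labels.
        return {img: [] for img in images}


    def review(labels: dict[Path, list]) -> dict[Path, list]:
        # Stage 3 (hypothetical stub): the paper asks InternVL-1.5 / GPT-4o to
        # reject implausible images and boxes; here everything passes.
        return labels


    def train(data_yaml: str) -> None:
        # Stage 4: train a real-time detector on the verified data.
        # This call is the actual Ultralytics API.
        model = YOLO("yolov8n.pt")  # pretrained checkpoint
        model.train(data=data_yaml, epochs=100, imgsz=640)


    if __name__ == "__main__":
        images = diversify(sorted(Path("seed_images").glob("*.jpg")))
        labels = review(annotate(images, ["crane", "excavator", "wheel loader"]))
        # After exporting `labels` in YOLO format and writing a dataset YAML:
        train("liebherr_product.yaml")

The structure mirrors the paper's stated modularity: each stage is a function with a plain input/output contract, so a newer generator, detector, or reviewer can be swapped in without touching the rest of the pipeline.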
Authors
– Xin, Chen (ORCID 0000-0001-6860-7907); chen.xin@tum.de; Technical University of Munich, Arcisstraße 21, 80333 München, Germany
– Hartel, Andreas; andreas.hartel@liebherr.com; Liebherr-Electronics and Drives GmbH, Peter-Dornier-Straße 11, 88131 Lindau (Bodensee), Germany
– Kasneci, Enkelejda (ORCID 0000-0003-3146-4484); enkelejda.kasneci@tum.de; Technical University of Munich, Arcisstraße 21, 80333 München, Germany
Cited by: doi:10.3389/fenvs.2024.1486212
Copyright: 2024 The Author(s)
Discipline: Computer Science
Open Access: Yes
Peer Reviewed: Yes
Keywords: YOLO; Large multimodal model (LMM); Data diversification; Stable diffusion; Open-vocabulary object detection (OVD); Pseudo-label
License: This is an open access article under the CC BY-NC license.
Open Access Link: https://www.sciencedirect.com/science/article/pii/S0957417424019912
URI: https://dx.doi.org/10.1016/j.eswa.2024.125124