ConfigILM: A general purpose configurable library for combining image and language models for visual question answering

ConfigILM is an open-source Python library for rapid, iterative development of image-language models for visual question answering in PyTorch. It provides a convenient implementation for seamlessly combining image and language models from two popular PyTorch libraries, timm and huggingface.
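The idea of combining an image encoder and a language encoder through a single configuration can be sketched in a few lines. The sketch below is illustrative only and is not ConfigILM's API: `ILMConfig`, `build_model`, the registries, and the toy encoders are hypothetical names, and plain Python functions stand in for the timm and huggingface models so that the example runs with the standard library alone.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Vector = List[float]

# Toy stand-ins for real encoders (hypothetical; not timm or huggingface models).
def mean_max_image_encoder(pixels: List[float]) -> Vector:
    return [sum(pixels) / len(pixels), max(pixels)]

def min_range_image_encoder(pixels: List[float]) -> Vector:
    return [min(pixels), max(pixels) - min(pixels)]

def word_stats_text_encoder(question: str) -> Vector:
    words = question.split()
    return [float(len(words)), sum(len(w) for w in words) / len(words)]

# Fusion strategies that merge the two feature vectors.
def concat(a: Vector, b: Vector) -> Vector:
    return a + b

def product(a: Vector, b: Vector) -> Vector:
    # Element-wise product; assumes both vectors have the same length.
    return [x * y for x, y in zip(a, b)]

# Registries map configuration strings to components.
IMAGE_ENCODERS: Dict[str, Callable[[List[float]], Vector]] = {
    "mean_max": mean_max_image_encoder,
    "min_range": min_range_image_encoder,
}
TEXT_ENCODERS: Dict[str, Callable[[str], Vector]] = {
    "word_stats": word_stats_text_encoder,
}
FUSIONS: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concat": concat,
    "product": product,
}

@dataclass(frozen=True)
class ILMConfig:
    """Hypothetical configuration naming the components of one model."""
    image_model: str
    text_model: str
    fusion: str = "concat"

def build_model(config: ILMConfig) -> Callable[[List[float], str], Vector]:
    """Look up each component by name and compose them into one callable."""
    image_encoder = IMAGE_ENCODERS[config.image_model]
    text_encoder = TEXT_ENCODERS[config.text_model]
    fuse = FUSIONS[config.fusion]

    def forward(pixels: List[float], question: str) -> Vector:
        return fuse(image_encoder(pixels), text_encoder(question))

    return forward

baseline = ILMConfig(image_model="mean_max", text_model="word_stats")
# Exchanging components only requires changing configuration fields.
variant = replace(baseline, image_model="min_range", fusion="product")

pixels = [0.0, 0.5, 1.0]
question = "is there water"
print(build_model(baseline)(pixels, question))
print(build_model(variant)(pixels, question))
```

In this sketch, swapping an encoder or the fusion step touches only the configuration object, never the composition code. That is the property the description attributes to a single interface over interchangeable encoders.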

Bibliographic Details
Published in	SoftwareX Vol. 26; p. 101731
Main Authors	Hackel, Leonard; Clasen, Kai Norman; Demir, Begüm
Format	Journal Article
Language	English
Published	Elsevier B.V., 01.05.2024
Subjects	Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering; 68T07; 68T45; 68T50
Abstract	ConfigILM is an open-source Python library for rapid, iterative development of image-language models for visual question answering in PyTorch. It provides a convenient implementation for seamlessly combining image and language models from two popular PyTorch libraries, timm and huggingface. These libraries allow a variety of model configurations without additional implementation effort. The monolithic interface provided by ConfigILM simplifies exchanging the components of a given model and makes it possible to develop new image-language models by recombining the selected encoders. Additionally, the library provides pre-built, throughput-optimized PyTorch dataloaders. We also provide a guideline document that contains installation instructions, tutorial examples, and a complete discussion of the library's monolithic interface. ConfigILM is released under the MIT License, encouraging its use in academic and commercial environments. The source code and documentation of ConfigILM are available at https://github.com/lhackel-tub/ConfigILM.
ArticleNumber	101731
Author	Clasen, Kai Norman; Demir, Begüm; Hackel, Leonard
Author_xml	– sequence: 1 givenname: Leonard orcidid: 0000-0002-5831-1237 surname: Hackel fullname: Hackel, Leonard email: l.hackel@tu-berlin.de – sequence: 2 givenname: Kai Norman surname: Clasen fullname: Clasen, Kai Norman – sequence: 3 givenname: Begüm surname: Demir fullname: Demir, Begüm
ContentType	Journal Article
Copyright	2024 The Author(s)
DOI	10.1016/j.softx.2024.101731
DatabaseName	ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef DOAJ Directory of Open Access Journals
Discipline	Computer Science
EISSN	2352-7110
ISSN	2352-7110
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Keywords	Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering; 68T07; 68T45; 68T50
Language	English
License	This is an open access article under the CC BY license.
ORCID	0000-0002-5831-1237
OpenAccessLink	https://www.sciencedirect.com/science/article/pii/S235271102400102X
PublicationCentury	2000
PublicationDate	2024-05-01 (May 2024)
PublicationDateYYYYMMDD	2024-05-01
PublicationDate_xml	– month: 05 year: 2024 text: May 2024
PublicationDecade	2020
PublicationTitle	SoftwareX
PublicationYear	2024
Publisher	Elsevier B.V.
StartPage	101731
SubjectTerms	Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering
Title	ConfigILM: A general purpose configurable library for combining image and language models for visual question answering
URI	https://dx.doi.org/10.1016/j.softx.2024.101731 https://doaj.org/article/b331d46c8b80461e845b76125c5673f8
Volume	26