ConfigILM: A general purpose configurable library for combining image and language models for visual question answering
Published in | SoftwareX Vol. 26; p. 101731 |
---|---|
Main Authors | Leonard Hackel, Kai Norman Clasen, Begüm Demir |
Format | Journal Article |
Language | English |
Published | Elsevier B.V., 01.05.2024 |
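Subjects | Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering |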
Abstract | ConfigILM is an open-source Python library for rapid, iterative development of image-language models for visual question answering in PyTorch. It provides a convenient implementation for seamlessly combining image and language models from two popular PyTorch libraries, timm and huggingface. Together, these libraries allow a wide variety of model configurations without additional implementation effort. ConfigILM's monolithic interface simplifies exchanging the components of a model and supports developing new image-language models by recombining the selected encoders. Additionally, the library provides pre-built, throughput-optimized PyTorch dataloaders. The authors also provide a guideline document that contains installation instructions, tutorial examples, and a complete discussion of the library's monolithic interface. ConfigILM is released under the MIT License, encouraging its use in academic and commercial environments. The source code and documentation of ConfigILM are available at https://github.com/lhackel-tub/ConfigILM. |
---|---|
ArticleNumber | 101731 |
Author | Clasen, Kai Norman; Demir, Begüm; Hackel, Leonard |
Author_xml | 1. Hackel, Leonard (ORCID: 0000-0002-5831-1237; email: l.hackel@tu-berlin.de); 2. Clasen, Kai Norman; 3. Demir, Begüm |
ContentType | Journal Article |
Copyright | 2024 The Author(s) |
DOI | 10.1016/j.softx.2024.101731 |
Discipline | Computer Science |
EISSN | 2352-7110 |
ISSN | 2352-7110 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering; MSC codes: 68T45, 68T07, 68T50 |
Language | English |
License | This is an open access article under the CC BY license. |
ORCID | 0000-0002-5831-1237 |
OpenAccessLink | https://www.sciencedirect.com/science/article/pii/S235271102400102X |
PublicationDate | 2024-05-01 |
PublicationTitle | SoftwareX |
PublicationYear | 2024 |
Publisher | Elsevier B.V. |
StartPage | 101731 |
SubjectTerms | Image analysis; Machine learning; Natural language processing; Open source; Python; Visual question answering |
Title | ConfigILM: A general purpose configurable library for combining image and language models for visual question answering |
URI | https://dx.doi.org/10.1016/j.softx.2024.101731 ; https://doaj.org/article/b331d46c8b80461e845b76125c5673f8 |
Volume | 26 |
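
The abstract above describes exchanging model components through configuration. The following is a minimal sketch of that general idea only, and it does not use or reproduce ConfigILM's actual API. Every name in it (`ToyConfig`, `build_model`, the toy encoders and registries) is hypothetical and introduced purely for illustration; the encoders are plain Python stand-ins for what would be a timm image backbone and a huggingface text model in practice.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List

Vector = List[float]


# Hypothetical stand-ins for real encoders; they only mimic
# the "input -> feature vector" contract of an encoder.
def mean_image_encoder(pixels: Vector) -> Vector:
    return [sum(pixels) / len(pixels), max(pixels)]


def range_image_encoder(pixels: Vector) -> Vector:
    return [min(pixels), max(pixels)]


def length_text_encoder(question: str) -> Vector:
    return [float(len(question.split())), float(len(question))]


# Registries that map configuration strings to encoder functions.
IMAGE_ENCODERS: Dict[str, Callable[[Vector], Vector]] = {
    "mean": mean_image_encoder,
    "range": range_image_encoder,
}
TEXT_ENCODERS: Dict[str, Callable[[str], Vector]] = {
    "length": length_text_encoder,
}


@dataclass(frozen=True)
class ToyConfig:
    """Hypothetical configuration naming the components to combine."""
    image_model: str
    text_model: str
    fusion: str = "concat"


def build_model(cfg: ToyConfig) -> Callable[[Vector, str], Vector]:
    """Look up both encoders by name and return a fused forward function."""
    encode_image = IMAGE_ENCODERS[cfg.image_model]
    encode_text = TEXT_ENCODERS[cfg.text_model]

    def forward(pixels: Vector, question: str) -> Vector:
        v, q = encode_image(pixels), encode_text(question)
        if cfg.fusion == "concat":
            return v + q  # list concatenation of both feature vectors
        if cfg.fusion == "product":
            return [a * b for a, b in zip(v, q)]  # element-wise product
        raise ValueError(f"unknown fusion: {cfg.fusion}")

    return forward


base = ToyConfig(image_model="mean", text_model="length")
# Exchanging one component only requires changing one configuration field.
swapped = replace(base, image_model="range")
```

In this sketch, switching the image encoder means changing a single string in the configuration while the rest of the pipeline stays untouched, which is the kind of component exchange the abstract attributes to a configuration-driven interface.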