SLIP: Self-supervision Meets Language-Image Pre-training
Published in | Computer Vision – ECCV 2022, Vol. 13686, pp. 529–544 |
---|---|
Main Authors | Mu, Norman; Kirillov, Alexander; Wagner, David; Xie, Saining |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer Nature Switzerland, 2022 |
Series | Lecture Notes in Computer Science |
Online Access | Get full text |
ISBN | 9783031198083; 3031198085 |
ISSN | 0302-9743 (print); 1611-3349 (electronic) |
DOI | 10.1007/978-3-031-19809-0_30 |
Abstract | Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at: github.com/facebookresearch/SLIP. |
---|---|
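The abstract's core technical idea is a multi-task objective that trains a single image encoder with both a CLIP-style image-text contrastive loss and an image-only self-supervised loss (the paper pairs CLIP with a SimCLR-style view-contrastive term). The sketch below is an illustrative reconstruction, not the authors' implementation; the function names, temperatures, and the `ssl_scale` weight are assumptions for illustration. See github.com/facebookresearch/SLIP for the official code.

```python
# Minimal sketch of a SLIP-style multi-task objective (illustrative only).
# Assumes embeddings come from an image encoder (e.g. a ViT), a text
# encoder, and projection heads; each image contributes one captioned
# view plus two augmented views per batch.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two augmented views of each image."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    n = z1.size(0)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))  # exclude self-similarity
    # Positive for row i is its other view: i+n for the first half, i-n after.
    labels = torch.cat([torch.arange(n, 2 * n),
                        torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, labels)

def slip_loss(img_emb, txt_emb, z1, z2, ssl_scale=1.0):
    # Both objectives are optimized jointly on the shared image encoder.
    return clip_loss(img_emb, txt_emb) + ssl_scale * simclr_loss(z1, z2)
```

The point of the joint loss is that the language-supervised and self-supervised terms regularize each other, which is what the abstract's "+8.1% linear accuracy / +5.2% zero-shot accuracy" comparison measures.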
Author | Mu, Norman; Kirillov, Alexander; Wagner, David; Xie, Saining |
ContentType | Book Chapter |
Copyright | The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
DEWEY | 006.37 |
DOI | 10.1007/978-3-031-19809-0_30 |
Discipline | Applied Sciences Computer Science |
EISBN | 9783031198090 3031198093 |
EISSN | 1611-3349 |
Editor | Avidan, Shai; Cissé, Moustapha; Farinella, Giovanni Maria; Brostow, Gabriel; Hassner, Tal |
EndPage | 544 |
ISBN | 9783031198083 3031198085 |
ISSN | 0302-9743 |
IsPeerReviewed | true |
IsScholarly | true |
LCCallNum | TA1634 |
Language | English |
Notes | Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19809-0_30. |
OCLC | 1350794521 |
PageCount | 16 |
PublicationDate | 2022-11-01 |
PublicationPlace | Cham, Switzerland |
PublicationSeriesTitle | Lecture Notes in Computer Science |
PublicationSeriesTitleAlternate | Lect. Notes Comput. Sci. |
PublicationSubtitle | 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI |
PublicationTitle | Computer Vision – ECCV 2022 |
PublicationYear | 2022 |
Publisher | Springer Nature Switzerland |
RelatedPersons | Goos, Gerhard; Hartmanis, Juris; Bertino, Elisa; Gao, Wen; Steffen, Bernhard; Yung, Moti |
StartPage | 529 |
Title | SLIP: Self-supervision Meets Language-Image Pre-training |
URI | http://link.springer.com/10.1007/978-3-031-19809-0_30 |
Volume | 13686 |