SLIP: Self-supervision Meets Language-Image Pre-training
Published in | Computer Vision – ECCV 2022, Vol. 13686, pp. 529–544 |
---|---|
Main Authors | Mu, Norman; Kirillov, Alexander; Wagner, David; Xie, Saining |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer Nature Switzerland, 2022 |
Series | Lecture Notes in Computer Science |
Online Access | Get full text |
ISBN | 9783031198083; 3031198085 |
ISSN | 0302-9743 (print); 1611-3349 (electronic) |
DOI | 10.1007/978-3-031-19809-0_30 |
Abstract | Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at: github.com/facebookresearch/SLIP. |
---|---|
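The abstract's core technical idea is a multi-task objective that trains a single image encoder with both a CLIP-style image-text contrastive loss and an image-only self-supervised loss (the paper pairs CLIP with a SimCLR-style view-contrastive term). The sketch below is an illustrative reconstruction, not the authors' implementation; the function names, temperatures, and the `ssl_scale` weight are assumptions for illustration. See github.com/facebookresearch/SLIP for the official code.

```python
# Minimal sketch of a SLIP-style multi-task objective (illustrative only).
# Assumes embeddings come from an image encoder (e.g. a ViT), a text
# encoder, and projection heads; each image contributes one captioned
# view plus two augmented views per batch.
import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between L2-normalized image/text embeddings."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.t(), labels)) / 2

def simclr_loss(z1, z2, temperature=0.1):
    """SimCLR-style NT-Xent loss over two augmented views of each image."""
    z = F.normalize(torch.cat([z1, z2]), dim=-1)
    n = z1.size(0)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))  # exclude self-similarity
    # Positive for row i is its other view: i+n for the first half, i-n after.
    labels = torch.cat([torch.arange(n, 2 * n),
                        torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, labels)

def slip_loss(img_emb, txt_emb, z1, z2, ssl_scale=1.0):
    # Both objectives are optimized jointly on the shared image encoder.
    return clip_loss(img_emb, txt_emb) + ssl_scale * simclr_loss(z1, z2)
```

The point of the joint loss is that the language-supervised and self-supervised terms regularize each other, which is what the abstract's "+8.1% linear accuracy / +5.2% zero-shot accuracy" comparison measures.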
Author | Mu, Norman; Kirillov, Alexander; Wagner, David; Xie, Saining |
ContentType | Book Chapter |
Copyright | The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
DEWEY | 006.37 |
DOI | 10.1007/978-3-031-19809-0_30 |
Discipline | Applied Sciences Computer Science |
EISBN | 9783031198090 3031198093 |
EISSN | 1611-3349 |
Editor | Avidan, Shai; Cissé, Moustapha; Farinella, Giovanni Maria; Brostow, Gabriel; Hassner, Tal |
EndPage | 544 |
ISBN | 9783031198083 3031198085 |
ISSN | 0302-9743 |
IsPeerReviewed | true |
IsScholarly | true |
LCCallNum | TA1634 |
Language | English |
Notes | Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19809-0_30. |
OCLC | 1350794521 |
PageCount | 16 |
PublicationDate | 2022-11-01 |
PublicationPlace | Cham, Switzerland |
PublicationSeriesTitle | Lecture Notes in Computer Science |
PublicationSeriesTitleAlternate | Lect. Notes Comput. Sci. |
PublicationSubtitle | 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI |
PublicationTitle | Computer Vision – ECCV 2022 |
PublicationYear | 2022 |
Publisher | Springer Nature Switzerland |
RelatedPersons | Goos, Gerhard; Hartmanis, Juris; Bertino, Elisa; Gao, Wen; Steffen, Bernhard; Yung, Moti |
StartPage | 529 |
Title | SLIP: Self-supervision Meets Language-Image Pre-training |
URI | http://link.springer.com/10.1007/978-3-031-19809-0_30 |
Volume | 13686 |