SLIP: Self-supervision Meets Language-Image Pre-training

Bibliographic Details
Published in Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXVI; Vol. 13686, pp. 529-544
Main Authors Mu, Norman; Kirillov, Alexander; Wagner, David; Xie, Saining
Format Book Chapter
Language English
Published Cham, Switzerland: Springer Nature Switzerland, 2022
Series Lecture Notes in Computer Science
Online Access http://link.springer.com/10.1007/978-3-031-19809-0_30
ISBN 9783031198083; 3031198085
EISBN 9783031198090; 3031198093
ISSN 0302-9743
EISSN 1611-3349
DOI 10.1007/978-3-031-19809-0_30

Abstract Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning with Vision Transformers. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy). Our code is available at: github.com/facebookresearch/SLIP.
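The abstract describes SLIP's core mechanism: a multi-task objective that sums the CLIP image-text contrastive loss and a SimCLR-style self-supervised loss computed over two augmented views of each image. As an illustration only, here is a minimal PyTorch sketch of that combined objective; the function names and the ssl_scale weight are assumptions of this sketch, not the authors' implementation (see github.com/facebookresearch/SLIP for the official code).

    # Minimal sketch of SLIP's multi-task objective: the CLIP image-text
    # contrastive loss plus a SimCLR-style loss on two augmented views of
    # the same images. Names and ssl_scale are illustrative assumptions.
    import torch
    import torch.nn.functional as F

    def clip_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE over matched image-text pairs (CLIP objective)."""
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = image_emb @ text_emb.t() / temperature
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def simclr_loss(z1, z2, temperature=0.1):
        """NT-Xent loss between two augmented views of the same images."""
        z = F.normalize(torch.cat([z1, z2], dim=0), dim=-1)
        n = z1.size(0)
        sim = z @ z.t() / temperature
        sim.fill_diagonal_(float('-inf'))             # mask self-similarity
        targets = torch.cat([torch.arange(n, 2 * n),  # view 1 -> view 2
                             torch.arange(0, n)]).to(z.device)
        return F.cross_entropy(sim, targets)

    def slip_loss(image_emb, text_emb, z1, z2, ssl_scale=1.0):
        """SLIP objective: CLIP loss plus a scaled SimCLR loss."""
        return clip_loss(image_emb, text_emb) + ssl_scale * simclr_loss(z1, z2)

In this sketch, image_emb and text_emb come from the CLIP branch over image-caption pairs, while z1 and z2 are projections of two independently augmented views of the same images through a separate SSL head.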
Copyright The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
DEWEY 006.37
Discipline Applied Sciences
Computer Science
Editors Avidan, Shai; Cissé, Moustapha; Farinella, Giovanni Maria; Brostow, Gabriel; Hassner, Tal
IsPeerReviewed true
IsScholarly true
LCCallNum TA1634
Notes Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-19809-0_30.
OCLC 1350794521
PageCount 16
Series Editors Goos, Gerhard; Hartmanis, Juris; Bertino, Elisa; Gao, Wen; Steffen, Bernhard; Yung, Moti