MotionCLIP: Exposing Human Motion Generation to CLIP Space

Bibliographic Details
Published in Computer Vision - ECCV 2022 Vol. 13682; pp. 358-374
Main Authors Tevet, Guy; Gordon, Brian; Hertz, Amir; Bermano, Amit H.; Cohen-Or, Daniel
Format Book Chapter
Language English
Published Switzerland Springer 01.01.2022
Springer Nature Switzerland
Series Lecture Notes in Computer Science
Abstract We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label’s position in CLIP-space. We further leverage CLIP’s unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt “couch” is decoded into a sitting down motion, due to lingual similarity, and the prompt “Spiderman” results in a web-swinging-like solution that is far from anything seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition (see our project page: https://guytevet.github.io/motionclip-page/).
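The training signal described in the abstract (reconstruct the motion while pulling its latent code toward the CLIP embedding of its text label and of a rendered frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean squared error for reconstruction, the cosine distance for alignment, and the weights `w_text` and `w_image` are all assumptions made for illustration, and the vectors stand in for outputs of encoders that are not shown.

```python
import math

def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)

def mse(x, y):
    """Mean squared error between two flattened motion sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def alignment_objective(motion, recon, z_motion, z_text, z_image,
                        w_text=1.0, w_image=1.0):
    """Hypothetical combined loss: reconstruction plus CLIP-space alignment.

    motion, recon : flattened input motion and its decoded reconstruction
    z_motion      : latent code produced by the motion encoder
    z_text        : CLIP embedding of the text label (assumed precomputed)
    z_image       : CLIP embedding of a rendered frame (assumed precomputed)
    """
    recon_loss = mse(motion, recon)
    text_loss = cosine_distance(z_motion, z_text)
    image_loss = cosine_distance(z_motion, z_image)
    return recon_loss + w_text * text_loss + w_image * image_loss
```

Under these assumptions, a perfectly reconstructed motion whose latent code points in the same direction as both CLIP embeddings gives a loss near zero, while a latent code pointing the opposite way contributes a distance near 2 per alignment term.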
Author_xml – sequence: 1
  givenname: Guy
  surname: Tevet
  fullname: Tevet, Guy
  email: guytevet@mail.tau.ac.il
– sequence: 2
  givenname: Brian
  surname: Gordon
  fullname: Gordon, Brian
– sequence: 3
  givenname: Amir
  surname: Hertz
  fullname: Hertz, Amir
– sequence: 4
  givenname: Amit H.
  surname: Bermano
  fullname: Bermano, Amit H.
– sequence: 5
  givenname: Daniel
  surname: Cohen-Or
  fullname: Cohen-Or, Daniel
ContentType Book Chapter
Copyright The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
DEWEY 006.4
DOI 10.1007/978-3-031-20047-2_21
Discipline Applied Sciences
Computer Science
EISBN 9783031200472
3031200470
EISSN 1611-3349
Editor Farinella, Giovanni Maria
Avidan, Shai
Cissé, Moustapha
Brostow, Gabriel
Hassner, Tal
Editor_xml – sequence: 1
  fullname: Avidan, Shai
– sequence: 2
  fullname: Cissé, Moustapha
– sequence: 3
  fullname: Farinella, Giovanni Maria
– sequence: 4
  fullname: Brostow, Gabriel
– sequence: 5
  fullname: Hassner, Tal
EndPage 374
ISBN 9783031200465
3031200462
ISSN 0302-9743
IsPeerReviewed true
IsScholarly true
LCCallNum TA1634
Language English
Notes G. Tevet and B. Gordon—The authors contributed equally.
Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20047-2_21.
OCLC 1351297748
PageCount 17
PublicationCentury 2000
PublicationDate 2022-01-01
PublicationDateYYYYMMDD 2022-01-01
PublicationDate_xml – month: 01
  year: 2022
  text: 2022-01-01
  day: 01
PublicationDecade 2020
PublicationPlace Switzerland
PublicationPlace_xml – name: Switzerland
– name: Cham
PublicationSeriesTitle Lecture Notes in Computer Science
PublicationSeriesTitleAlternate Lect.Notes Computer
PublicationSubtitle 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII
PublicationTitle Computer Vision - ECCV 2022
PublicationYear 2022
Publisher Springer
Springer Nature Switzerland
Publisher_xml – name: Springer
– name: Springer Nature Switzerland
RelatedPersons Hartmanis, Juris
Gao, Wen
Steffen, Bernhard
Bertino, Elisa
Goos, Gerhard
Yung, Moti
RelatedPersons_xml – sequence: 1
  givenname: Gerhard
  surname: Goos
  fullname: Goos, Gerhard
– sequence: 2
  givenname: Juris
  surname: Hartmanis
  fullname: Hartmanis, Juris
– sequence: 3
  givenname: Elisa
  surname: Bertino
  fullname: Bertino, Elisa
– sequence: 4
  givenname: Wen
  surname: Gao
  fullname: Gao, Wen
– sequence: 5
  givenname: Bernhard
  orcidid: 0000-0001-9619-1558
  surname: Steffen
  fullname: Steffen, Bernhard
– sequence: 6
  givenname: Moti
  orcidid: 0000-0003-0848-0873
  surname: Yung
  fullname: Yung, Moti
StartPage 358
Title MotionCLIP: Exposing Human Motion Generation to CLIP Space
http://link.springer.com/10.1007/978-3-031-20047-2_21
Volume 13682