MotionCLIP: Exposing Human Motion Generation to CLIP Space

Bibliographic Details
Published in	Computer Vision - ECCV 2022 Vol. 13682; pp. 358 - 374
Main Authors	Tevet, Guy; Gordon, Brian; Hertz, Amir; Bermano, Amit H.; Cohen-Or, Daniel
Format	Book Chapter
Language	English
Published	Switzerland: Springer, Springer Nature Switzerland, 01.01.2022
Series	Lecture Notes in Computer Science

Abstract	We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label's position in CLIP space. We further leverage CLIP's unique visual understanding and inject an even stronger signal by aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt "couch" is decoded into a sitting-down motion, due to lingual similarity, and the prompt "Spiderman" results in a web-swinging-like solution that is far from anything seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition (see our project page: https://guytevet.github.io/motionclip-page/).
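The training signal described in the abstract (reconstruct the motion while pulling its latent code toward the CLIP embedding of its text label and of a rendered frame) can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption made for illustration: the function names, the use of mean squared error for reconstruction, cosine distance for alignment, and the weights lambda_text and lambda_image are all hypothetical, and the CLIP embeddings are taken as precomputed arrays rather than produced by a real CLIP model.

```python
import numpy as np

def cosine_distance(a, b, eps=1e-8):
    """Row-wise 1 - cosine similarity for two batches of vectors."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return 1.0 - np.sum(a_unit * b_unit, axis=-1)

def alignment_objective(motion, reconstruction, motion_latent,
                        clip_text, clip_image,
                        lambda_text=1.0, lambda_image=1.0):
    """Hypothetical combined objective: reconstruction plus CLIP alignment.

    motion, reconstruction: arrays of identical shape (batch, frames, features).
    motion_latent, clip_text, clip_image: arrays of shape (batch, dim).
    """
    # Reconstruction term: the auto-encoder should reproduce its input motion.
    recon_term = np.mean((motion - reconstruction) ** 2)
    # Text term: the motion latent should sit near its label's CLIP text embedding.
    text_term = np.mean(cosine_distance(motion_latent, clip_text))
    # Image term: the motion latent should sit near the CLIP embedding of a rendered frame.
    image_term = np.mean(cosine_distance(motion_latent, clip_image))
    return recon_term + lambda_text * text_term + lambda_image * image_term
```

Under these assumptions the objective is near zero when the reconstruction is exact and the latent points in the same direction as both CLIP embeddings, and each alignment term grows toward 1 as the latent becomes orthogonal to its target.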
Author	Cohen-Or, Daniel; Hertz, Amir; Tevet, Guy; Bermano, Amit H.; Gordon, Brian
Author_xml	– sequence: 1 givenname: Guy surname: Tevet fullname: Tevet, Guy email: guytevet@mail.tau.ac.il – sequence: 2 givenname: Brian surname: Gordon fullname: Gordon, Brian – sequence: 3 givenname: Amir surname: Hertz fullname: Hertz, Amir – sequence: 4 givenname: Amit H. surname: Bermano fullname: Bermano, Amit H. – sequence: 5 givenname: Daniel surname: Cohen-Or fullname: Cohen-Or, Daniel
ContentType	Book Chapter
Copyright	The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
Copyright_xml	– notice: The Author(s), under exclusive license to Springer Nature Switzerland AG 2022
DBID	FFUUA
DEWEY	006.4
DOI	10.1007/978-3-031-20047-2_21
DatabaseName	ProQuest Ebook Central - Book Chapters - Demo use only
DatabaseTitleList
DeliveryMethod	fulltext_linktorsrc
Discipline	Applied Sciences Computer Science
EISBN	9783031200472 3031200470
EISSN	1611-3349
Editor	Farinella, Giovanni Maria; Avidan, Shai; Cissé, Moustapha; Brostow, Gabriel; Hassner, Tal
Editor_xml	– sequence: 1 fullname: Avidan, Shai – sequence: 2 fullname: Cissé, Moustapha – sequence: 3 fullname: Farinella, Giovanni Maria – sequence: 4 fullname: Brostow, Gabriel – sequence: 5 fullname: Hassner, Tal
EndPage	374
ExternalDocumentID	EBC7120771_340_414
GroupedDBID	38. AABBV AAZWU ABSVR ABTHU ABVND ACBPT ACHZO ACPMC ADNVS AEDXK AEJLV AEKFX AHVRR ALMA_UNASSIGNED_HOLDINGS BBABE CZZ FFUUA IEZ SBO TPJZQ TSXQS Z5O Z7R Z7S Z7U Z7W Z7X Z7Y Z7Z Z81 Z82 Z83 Z84 Z85 Z87 Z88 -DT -~X 29L 2HA 2HV ACGFS ADCXD EJD F5P LAS LDH P2P RSU ~02
ID	FETCH-LOGICAL-j309t-34f15d86d1bb4e91eb1582d48ef63df792f5c6738883c1ae3baa61f8a9059afa3
ISBN	9783031200465 3031200462
ISSN	0302-9743
IngestDate	Tue Jul 29 20:34:50 EDT 2025 Tue Jul 22 07:44:20 EDT 2025
IsPeerReviewed	true
IsScholarly	true
LCCallNum	TA1634
Language	English
LinkModel	OpenURL
MergedId	FETCHMERGED-LOGICAL-j309t-34f15d86d1bb4e91eb1582d48ef63df792f5c6738883c1ae3baa61f8a9059afa3
Notes	G. Tevet and B. Gordon: The authors contributed equally. Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20047-2_21.
OCLC	1351297748
PQID	EBC7120771_340_414
PageCount	17
ParticipantIDs	springer_books_10_1007_978_3_031_20047_2_21 proquest_ebookcentralchapters_7120771_340_414
PublicationCentury	2000
PublicationDate	2022-01-01
PublicationDateYYYYMMDD	2022-01-01
PublicationDate_xml	– month: 01 year: 2022 text: 2022-01-01 day: 01
PublicationDecade	2020
PublicationPlace	Switzerland
PublicationPlace_xml	– name: Switzerland – name: Cham
PublicationSeriesTitle	Lecture Notes in Computer Science
PublicationSeriesTitleAlternate	Lect.Notes Computer
PublicationSubtitle	17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII
PublicationTitle	Computer Vision - ECCV 2022
PublicationYear	2022
Publisher	Springer Springer Nature Switzerland
Publisher_xml	– name: Springer – name: Springer Nature Switzerland
RelatedPersons	Hartmanis, Juris Gao, Wen Steffen, Bernhard Bertino, Elisa Goos, Gerhard Yung, Moti
RelatedPersons_xml	– sequence: 1 givenname: Gerhard surname: Goos fullname: Goos, Gerhard – sequence: 2 givenname: Juris surname: Hartmanis fullname: Hartmanis, Juris – sequence: 3 givenname: Elisa surname: Bertino fullname: Bertino, Elisa – sequence: 4 givenname: Wen surname: Gao fullname: Gao, Wen – sequence: 5 givenname: Bernhard orcidid: 0000-0001-9619-1558 surname: Steffen fullname: Steffen, Bernhard – sequence: 6 givenname: Moti orcidid: 0000-0003-0848-0873 surname: Yung fullname: Yung, Moti
SSID	ssj0002731114 ssj0002792
Score	2.618062
SourceID	springer proquest
SourceType	Publisher
StartPage	358
Title	MotionCLIP: Exposing Human Motion Generation to CLIP Space
URI	http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=7120771&ppg=414&c=UERG http://link.springer.com/10.1007/978-3-031-20047-2_21
Volume	13682