MotionCLIP: Exposing Human Motion Generation to CLIP Space
We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) m...
Published in | Computer Vision - ECCV 2022 Vol. 13682; pp. 358 - 374 |
---|---|
Main Authors | Tevet, Guy; Gordon, Brian; Hertz, Amir; Bermano, Amit H.; Cohen-Or, Daniel |
Format | Book Chapter |
Language | English |
Published | Switzerland: Springer; Springer Nature Switzerland, 01.01.2022 |
Series | Lecture Notes in Computer Science |
Abstract | We introduce MotionCLIP, a 3D human motion auto-encoder featuring a latent embedding that is disentangled, well behaved, and supports highly semantic textual descriptions. MotionCLIP gains its unique power by aligning its latent space with that of the Contrastive Language-Image Pre-training (CLIP) model. Aligning the human motion manifold to CLIP space implicitly infuses the extremely rich semantic knowledge of CLIP into the manifold. In particular, it helps continuity by placing semantically similar motions close to one another, and disentanglement, which is inherited from the CLIP-space structure. MotionCLIP comprises a transformer-based motion auto-encoder, trained to reconstruct motion while being aligned to its text label’s position in CLIP-space. We further leverage CLIP’s unique visual understanding and inject an even stronger signal through aligning motion to rendered frames in a self-supervised manner. We show that although CLIP has never seen the motion domain, MotionCLIP offers unprecedented text-to-motion abilities, allowing out-of-domain actions, disentangled editing, and abstract language specification. For example, the text prompt “couch” is decoded into a sitting down motion, due to lingual similarity, and the prompt “Spiderman” results in a web-swinging-like solution that is far from seen during training. In addition, we show how the introduced latent space can be leveraged for motion interpolation, editing and recognition (see our project page: https://guytevet.github.io/motionclip-page/). |
---|---|
Author | Cohen-Or, Daniel Hertz, Amir Tevet, Guy Bermano, Amit H. Gordon, Brian |
Author_xml | – sequence: 1 givenname: Guy surname: Tevet fullname: Tevet, Guy email: guytevet@mail.tau.ac.il – sequence: 2 givenname: Brian surname: Gordon fullname: Gordon, Brian – sequence: 3 givenname: Amir surname: Hertz fullname: Hertz, Amir – sequence: 4 givenname: Amit H. surname: Bermano fullname: Bermano, Amit H. – sequence: 5 givenname: Daniel surname: Cohen-Or fullname: Cohen-Or, Daniel |
ContentType | Book Chapter |
Copyright | The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
Copyright_xml | – notice: The Author(s), under exclusive license to Springer Nature Switzerland AG 2022 |
DBID | FFUUA |
DEWEY | 006.4 |
DOI | 10.1007/978-3-031-20047-2_21 |
DatabaseName | ProQuest Ebook Central - Book Chapters - Demo use only |
DatabaseTitleList | |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Applied Sciences Computer Science |
EISBN | 9783031200472 3031200470 |
EISSN | 1611-3349 |
Editor | Farinella, Giovanni Maria Avidan, Shai Cissé, Moustapha Brostow, Gabriel Hassner, Tal |
Editor_xml | – sequence: 1 fullname: Avidan, Shai – sequence: 2 fullname: Cissé, Moustapha – sequence: 3 fullname: Farinella, Giovanni Maria – sequence: 4 fullname: Brostow, Gabriel – sequence: 5 fullname: Hassner, Tal |
EndPage | 374 |
ExternalDocumentID | EBC7120771_340_414 |
ISBN | 9783031200465 3031200462 |
ISSN | 0302-9743 |
IngestDate | Tue Jul 29 20:34:50 EDT 2025 Tue Jul 22 07:44:20 EDT 2025 |
IsPeerReviewed | true |
IsScholarly | true |
LCCallNum | TA1634 |
Language | English |
LinkModel | OpenURL |
Notes | G. Tevet and B. Gordon contributed equally. Supplementary Information: The online version contains supplementary material available at https://doi.org/10.1007/978-3-031-20047-2_21. |
OCLC | 1351297748 |
PQID | EBC7120771_340_414 |
PageCount | 17 |
ParticipantIDs | springer_books_10_1007_978_3_031_20047_2_21 proquest_ebookcentralchapters_7120771_340_414 |
PublicationCentury | 2000 |
PublicationDate | 2022-01-01 |
PublicationDateYYYYMMDD | 2022-01-01 |
PublicationDate_xml | – month: 01 year: 2022 text: 2022-01-01 day: 01 |
PublicationDecade | 2020 |
PublicationPlace | Switzerland |
PublicationPlace_xml | – name: Switzerland – name: Cham |
PublicationSeriesTitle | Lecture Notes in Computer Science |
PublicationSeriesTitleAlternate | Lect.Notes Computer |
PublicationSubtitle | 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXII |
PublicationTitle | Computer Vision - ECCV 2022 |
PublicationYear | 2022 |
Publisher | Springer Springer Nature Switzerland |
Publisher_xml | – name: Springer – name: Springer Nature Switzerland |
RelatedPersons | Hartmanis, Juris Gao, Wen Steffen, Bernhard Bertino, Elisa Goos, Gerhard Yung, Moti |
RelatedPersons_xml | – sequence: 1 givenname: Gerhard surname: Goos fullname: Goos, Gerhard – sequence: 2 givenname: Juris surname: Hartmanis fullname: Hartmanis, Juris – sequence: 3 givenname: Elisa surname: Bertino fullname: Bertino, Elisa – sequence: 4 givenname: Wen surname: Gao fullname: Gao, Wen – sequence: 5 givenname: Bernhard orcidid: 0000-0001-9619-1558 surname: Steffen fullname: Steffen, Bernhard – sequence: 6 givenname: Moti orcidid: 0000-0003-0848-0873 surname: Yung fullname: Yung, Moti |
SSID | ssj0002731114 ssj0002792 |
Score | 2.618062 |
SourceID | springer proquest |
SourceType | Publisher |
StartPage | 358 |
Title | MotionCLIP: Exposing Human Motion Generation to CLIP Space |
URI | http://ebookcentral.proquest.com/lib/SITE_ID/reader.action?docID=7120771&ppg=414&c=UERG http://link.springer.com/10.1007/978-3-031-20047-2_21 |
Volume | 13682 |
thumbnail_s | http://utb.summon.serialssolutions.com/2.0.0/image/custom?url=https%3A%2F%2Febookcentral.proquest.com%2Fcovers%2F7120771-l.jpg |
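
The training objective described in the abstract (reconstruct the motion while aligning its latent code with the CLIP embedding of the text label and of a rendered frame) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of mean squared error for reconstruction, cosine distance for alignment, and the weights `w_text` and `w_image` are all assumptions made for illustration, and the embeddings are plain Python lists rather than real CLIP outputs.

```python
import math


def cosine_distance(u, v, eps=1e-8):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + eps)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


def alignment_loss(motion, recon, z_motion, z_text, z_image,
                   w_text=1.0, w_image=1.0):
    """Hypothetical combined objective: reconstruction + CLIP-space alignment.

    motion   -- flattened input motion (assumed representation)
    recon    -- decoder output for the same motion
    z_motion -- latent code produced by the motion encoder
    z_text   -- CLIP text embedding of the motion's label (stand-in values)
    z_image  -- CLIP image embedding of a rendered frame (stand-in values)
    """
    recon_term = mse(motion, recon)
    text_term = cosine_distance(z_motion, z_text)
    image_term = cosine_distance(z_motion, z_image)
    return recon_term + w_text * text_term + w_image * image_term


if __name__ == "__main__":
    # Toy values only; real embeddings would come from a CLIP model.
    loss = alignment_loss(
        motion=[0.0, 1.0, 2.0],
        recon=[0.1, 0.9, 2.0],
        z_motion=[1.0, 0.0],
        z_text=[0.9, 0.1],
        z_image=[0.8, 0.2],
    )
    print(f"combined loss: {loss:.4f}")
```

Minimizing the two alignment terms pulls the motion latent toward the positions that CLIP assigns to the label and the rendered frame, which is the mechanism the abstract credits for the text-to-motion behavior.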