SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Yue, Tongtian, Cheng, Jie, Guo, Longteng, Dai, Xingyuan, Zhao, Zijia, He, Xingjian, Xiong, Gang, Lv, Yisheng, Liu, Jing
Year of Publication 19.03.2024
Journal Article
Exploring the Design Space of Visual Context Representation in Video MLLMs
Du, Yifan, Huo, Yuqi, Zhou, Kun, Zhao, Zijia, Lu, Haoyu, Huang, Han, Zhao, Wayne Xin, Wang, Bingning, Chen, Weipeng, Wen, Ji-Rong
Year of Publication 17.10.2024
Journal Article
Towards Event-oriented Long Video Understanding
Du, Yifan, Zhou, Kun, Huo, Yuqi, Li, Yifan, Zhao, Wayne Xin, Lu, Haoyu, Zhao, Zijia, Wang, Bingning, Chen, Weipeng, Wen, Ji-Rong
Year of Publication 20.06.2024
Journal Article
OneDiff: A Generalist Model for Image Difference Captioning
Hu, Erdong, Guo, Longteng, Yue, Tongtian, Zhao, Zijia, Xue, Shuning, Liu, Jing
Published in arXiv.org (16.07.2024)
Paper
VL-Mamba: Exploring State Space Models for Multimodal Learning
Qiao, Yanyuan, Zheng, Yu, Guo, Longteng, Chen, Sihan, Zhao, Zijia, Sun, Mingzhen, Wu, Qi, Liu, Jing
Published in arXiv.org (20.03.2024)
Paper
Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs
Zhao, Zijia, Lu, Haoyu, Huo, Yuqi, Du, Yifan, Yue, Tongtian, Guo, Longteng, Wang, Bingning, Chen, Weipeng, Liu, Jing
Published in arXiv.org (24.10.2024)
Paper
Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining
Huang, Han, Huo, Yuqi, Zhao, Zijia, Lu, Haoyu, Wu, Shu, Wang, Bingning, Liu, Qiang, Chen, Weipeng, Wang, Liang
Published in arXiv.org (21.10.2024)
Paper
OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation
Liu, Jing, Zhu, Xinxin, Liu, Fei, Guo, Longteng, Zhao, Zijia, Sun, Mingzhen, Wang, Weining, Lu, Hanqing, Zhou, Shiyu, Zhang, Jiajun, Wang, Jinqiao
Year of Publication 01.07.2021
Journal Article
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Chen, Sihan, Li, Handong, Wang, Qunbo, Zhao, Zijia, Sun, Mingzhen, Zhu, Xinxin, Liu, Jing
Published in arXiv.org (29.05.2023)
Paper
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zhao, Zijia, Guo, Longteng, Yue, Tongtian, Chen, Sihan, Shao, Shuai, Zhu, Xinxin, Yuan, Zehuan, Liu, Jing
Published in arXiv.org (25.05.2023)
Paper
Exploring the Design Space of Visual Context Representation in Video MLLMs
Du, Yifan, Huo, Yuqi, Zhou, Kun, Zhao, Zijia, Lu, Haoyu, Huang, Han, Zhao, Wayne Xin, Wang, Bingning, Chen, Weipeng, Wen, Ji-Rong
Published in arXiv.org (17.10.2024)
Paper
Towards Event-oriented Long Video Understanding
Du, Yifan, Zhou, Kun, Huo, Yuqi, Li, Yifan, Zhao, Wayne Xin, Lu, Haoyu, Zhao, Zijia, Wang, Bingning, Chen, Weipeng, Wen, Ji-Rong
Published in arXiv.org (20.06.2024)
Paper