Cross-Modal Processing For Vision And Language


Bibliographic Details
Main Authors: LIU, Bei; FU, Jianlong
Format: Patent
Language: English
Published: 06.06.2024

Summary: According to implementations of the present disclosure, a solution for cross-modal processing is provided. In this solution, a set of visual features of a training image is extracted by a visual feature extraction sub-model in a target model. Each visual feature corresponds to a pixel block in the training image. A set of visual semantic features corresponding to the set of visual features is determined based on a visual semantic dictionary. A set of text features of a training text corresponding to the training image is extracted by a text feature extraction sub-model in the target model. Each text feature corresponds to at least one word in the training text. The target model is trained on the set of visual semantic features and the set of text features to determine association information between an input text and an input image.
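The core step the abstract describes, mapping each per-pixel-block visual feature to an entry in a visual semantic dictionary, can be sketched as a nearest-neighbour lookup. The following is a minimal illustration only, not the patented implementation: the dimensions, the random projection standing in for the visual feature extraction sub-model, and the function names are all hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the disclosure does not specify these.
feat_dim = 8        # dimensionality of each visual feature
dict_size = 16      # number of entries in the visual semantic dictionary

# Visual semantic dictionary: each row is one semantic embedding
# (learnable in the real model; random here for illustration).
dictionary = rng.normal(size=(dict_size, feat_dim))

def extract_visual_features(image, block=4):
    """Stand-in for the visual feature extraction sub-model:
    produces one feature vector per pixel block, here via a
    fixed random projection of the flattened block."""
    h, w = image.shape
    proj = rng.normal(size=(block * block, feat_dim))
    feats = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = image[i:i + block, j:j + block].reshape(-1)
            feats.append(patch @ proj)
    return np.stack(feats)

def to_semantic_features(visual_feats, dictionary):
    """Map each visual feature to its nearest dictionary entry
    (a simple nearest-neighbour quantization)."""
    # Squared Euclidean distance from every feature to every entry.
    d = ((visual_feats[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return dictionary[idx], idx

image = rng.normal(size=(8, 8))
vis = extract_visual_features(image)       # one feature per 4x4 block
sem, idx = to_semantic_features(vis, dictionary)
print(vis.shape, sem.shape)                # (4, 8) (4, 8)
```

In the full method, the resulting semantic features would be paired with the per-word text features and used to train the target model on image-text association; that training loop is omitted here.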
Bibliography: Application Number: US202218278356