Cross-Modal Processing For Vision And Language


Bibliographic Details
Main Authors: LIU, Bei; FU, Jianlong
Format: Patent
Language: English
Published: 06.06.2024

Summary: According to implementations of the present disclosure, a solution for cross-modal processing is provided. In this solution, a set of visual features of a training image is extracted by a visual feature extraction sub-model in a target model. Each visual feature corresponds to a pixel block in the training image. A set of visual semantic features corresponding to the set of visual features is determined based on a visual semantic dictionary. A set of text features of a training text corresponding to the training image is extracted by a text feature extraction sub-model in the target model. Each text feature corresponds to at least one word in the training text. The target model is trained on the set of visual semantic features and the set of text features to determine association information between an input text and an input image.
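The core step the abstract describes, mapping each per-pixel-block visual feature to an entry in a visual semantic dictionary, can be sketched as a nearest-neighbour lookup. The following is a minimal illustration only, not the patented implementation: the dimensions, the random projection standing in for the visual feature extraction sub-model, and the function names are all hypothetical assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the disclosure does not specify these.
feat_dim = 8        # dimensionality of each visual feature
dict_size = 16      # number of entries in the visual semantic dictionary

# Visual semantic dictionary: each row is one semantic embedding
# (learnable in the real model; random here for illustration).
dictionary = rng.normal(size=(dict_size, feat_dim))

def extract_visual_features(image, block=4):
    """Stand-in for the visual feature extraction sub-model:
    produces one feature vector per pixel block, here via a
    fixed random projection of the flattened block."""
    h, w = image.shape
    proj = rng.normal(size=(block * block, feat_dim))
    feats = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            patch = image[i:i + block, j:j + block].reshape(-1)
            feats.append(patch @ proj)
    return np.stack(feats)

def to_semantic_features(visual_feats, dictionary):
    """Map each visual feature to its nearest dictionary entry
    (a simple nearest-neighbour quantization)."""
    # Squared Euclidean distance from every feature to every entry.
    d = ((visual_feats[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return dictionary[idx], idx

image = rng.normal(size=(8, 8))
vis = extract_visual_features(image)       # one feature per 4x4 block
sem, idx = to_semantic_features(vis, dictionary)
print(vis.shape, sem.shape)                # (4, 8) (4, 8)
```

In the full method, the resulting semantic features would be paired with the per-word text features and used to train the target model on image-text association; that training loop is omitted here.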
Bibliography: Application Number: US202218278356