Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images
Published in | Neural Processing Letters, Vol. 53, No. 5, pp. 3095–3121 |
---|---|
Main Authors | , , , , |
Format | Journal Article |
Language | English |
Published | New York: Springer US, 01.10.2021 (Springer Nature B.V.; Springer Verlag) |
Subjects | |
Summary | Although documents are increasingly multimodal, their automatic processing is often monomodal. In particular, natural language processing tasks are typically performed on the textual modality only. This work extends the syntactic parsing task to the image modality in addition to text. Specifically, we address the prepositional phrase attachment problem, a hard, semantic problem for syntactic parsers. Given an image and a caption, the proposed approach resolves the syntactic attachment of prepositions in the parse tree using both visual and lexical features. Visual features are derived from the nature and position of objects detected in the image and aligned to textual phrases in the caption. A reranker uses this information to reorder the syntactic trees produced by a shift-reduce syntactic parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, this approach improves over a text-only syntactic parser, in particular on the subset of prepositions that encode location, yielding gains of up to 17 points of attachment accuracy. |
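The reranking idea summarized above can be illustrated with a minimal sketch: a text-only parser proposes ranked attachment candidates, and spatial features of detected objects aligned to caption phrases are used to rescore them. All names, the distance-based scoring rule, and the example boxes below are illustrative assumptions, not the authors' actual model or the Flickr-PP data.

```python
# Hypothetical sketch of multimodal PP-attachment reranking.
# Assumption: an object detector has produced bounding boxes aligned to
# caption phrases; a smaller distance between the candidate head's box and
# the preposition object's box is taken as evidence for that attachment.

def box_distance(a, b):
    """Euclidean distance between centers of two boxes (x1, y1, x2, y2)."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return ((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5

def rerank(candidates, boxes):
    """Return the best attachment candidate.

    candidates: list of (head_phrase, prep_object_phrase, parser_score),
                in the text-only parser's preference order.
    boxes:      phrase -> bounding box for phrases grounded in the image.
    Candidates whose phrases lack boxes keep their parser score unchanged.
    """
    def score(cand):
        head, obj, parser_score = cand
        if head in boxes and obj in boxes:
            # Combine the parser score with a spatial penalty (illustrative
            # weight 0.01): nearby objects favor the attachment.
            return parser_score - 0.01 * box_distance(boxes[head], boxes[obj])
        return parser_score
    return max(candidates, key=score)

# Caption: "a man on a bench with a dog" -- does "with a dog" attach to
# "man" or to "bench"? The detected dog box sits next to the man box.
boxes = {"man": (100, 50, 160, 200),
         "bench": (300, 120, 500, 220),
         "dog": (170, 140, 230, 210)}
candidates = [("bench", "dog", 0.6),   # text-only parser's first choice
              ("man", "dog", 0.5)]
print(rerank(candidates, boxes)[0])    # -> man
```

With the example boxes, the visual evidence overturns the parser's preferred low attachment to "bench" in favor of "man", which is the kind of correction the summary reports for location-encoding prepositions.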
ISSN: | 1370-4621 1573-773X |
DOI: | 10.1007/s11063-020-10314-8 |