Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images

Bibliographic Details
Published in: Neural Processing Letters, Vol. 53, No. 5, pp. 3095–3121
Main Authors: Delecraz, Sébastien; Becerra-Bonache, Leonor; Favre, Benoît; Nasr, Alexis; Béchet, Frédéric
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2021
Summary: Although documents are increasingly multimodal, their automatic processing is often monomodal. In particular, natural language processing tasks are typically performed on the textual modality alone. This work extends the syntactic parsing task to the image modality in addition to text. Specifically, we address the prepositional phrase attachment problem, a hard problem for syntactic parsers because resolving it requires semantic knowledge. Given an image and a caption, the proposed approach resolves the syntactic attachment of prepositions in the parse tree using both visual and lexical features. Visual features are derived from the nature and position of objects detected in the image and aligned with textual phrases in the caption. A reranker uses this information to reorder the syntactic trees produced by a shift-reduce parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, the approach yields improvements over a text-only syntactic parser, in particular for the subset of prepositions that encode location, with gains of up to 17 points of attachment accuracy.
ISSN: 1370-4621
EISSN: 1573-773X
DOI: 10.1007/s11063-020-10314-8
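The reranking idea in the summary can be sketched in miniature: for each candidate attachment head, combine the parser's lexical preference with a visual proximity feature between the image regions aligned to the head phrase and to the prepositional object, and pick the highest-scoring head. This is an illustrative toy, not the authors' model — the `iou` feature, the linear combination, and the weight `w_vis` are assumptions for the sketch, and the paper's reranker operates over full parse trees rather than isolated heads.

```python
# Illustrative sketch of visually informed PP-attachment reranking.
# Feature choice (IoU of aligned bounding boxes), the linear score
# combination, and w_vis are hypothetical, not the paper's exact model.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def rerank_attachment(prep_obj_box, candidate_heads, lexical_scores, boxes, w_vis=0.6):
    """Pick the attachment head whose combined lexical + visual score is
    highest. The visual feature is the overlap between the box aligned to
    the candidate head phrase and the box aligned to the PP object."""
    def score(head):
        vis = iou(boxes[head], prep_obj_box)
        return (1 - w_vis) * lexical_scores[head] + w_vis * vis
    return max(candidate_heads, key=score)

# "a man on a bench with a dog": does "with a dog" attach to "man" or "bench"?
boxes = {"man": (100, 50, 180, 260), "bench": (60, 220, 300, 280)}
dog_box = (120, 160, 190, 270)          # the dog's box overlaps the man's box most
lexical = {"man": 0.45, "bench": 0.55}  # text-only parser slightly prefers "bench"
print(rerank_attachment(dog_box, ["man", "bench"], lexical, boxes))  # prints: man
```

The point of the toy example is the failure mode the paper targets: the text-only score prefers the wrong head ("bench"), and the visual feature derived from detected-object positions flips the decision to the correct one ("man").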