Multimodal Machine Learning for Natural Language Processing: Disambiguating Prepositional Phrase Attachments with Images

Bibliographic Details
Published in: Neural Processing Letters, Vol. 53, No. 5, pp. 3095–3121
Main Authors: Delecraz, Sébastien; Becerra-Bonache, Leonor; Favre, Benoît; Nasr, Alexis; Béchet, Frédéric
Format: Journal Article
Language: English
Published: New York: Springer US, 01.10.2021
Summary: Although documents are increasingly multimodal, their automatic processing is often monomodal. In particular, natural language processing tasks are typically performed on the textual modality alone. This work extends the syntactic parsing task to the image modality in addition to text. Specifically, we address the prepositional phrase attachment problem, a hard problem for syntactic parsers because resolving it requires semantic knowledge. Given an image and a caption, the proposed approach resolves the syntactic attachment of prepositions in the parse tree using both visual and lexical features. Visual features are derived from the nature and position of objects detected in the image and aligned with textual phrases in the caption. A reranker uses this information to reorder the syntactic trees produced by a shift-reduce parser. Trained on the Flickr-PP corpus, which contains multimodal gold-standard attachments, the approach yields improvements over a text-only syntactic parser, in particular for the subset of prepositions that encode location, with gains of up to 17 points of attachment accuracy.
ISSN: 1370-4621
EISSN: 1573-773X
DOI: 10.1007/s11063-020-10314-8
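The reranking idea in the summary can be sketched in miniature: for each candidate attachment head, combine the parser's lexical preference with a visual proximity feature between the image regions aligned to the head phrase and to the prepositional object, and pick the highest-scoring head. This is an illustrative toy, not the authors' model — the `iou` feature, the linear combination, and the weight `w_vis` are assumptions for the sketch, and the paper's reranker operates over full parse trees rather than isolated heads.

```python
# Illustrative sketch of visually informed PP-attachment reranking.
# Feature choice (IoU of aligned bounding boxes), the linear score
# combination, and w_vis are hypothetical, not the paper's exact model.

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def rerank_attachment(prep_obj_box, candidate_heads, lexical_scores, boxes, w_vis=0.6):
    """Pick the attachment head whose combined lexical + visual score is
    highest. The visual feature is the overlap between the box aligned to
    the candidate head phrase and the box aligned to the PP object."""
    def score(head):
        vis = iou(boxes[head], prep_obj_box)
        return (1 - w_vis) * lexical_scores[head] + w_vis * vis
    return max(candidate_heads, key=score)

# "a man on a bench with a dog": does "with a dog" attach to "man" or "bench"?
boxes = {"man": (100, 50, 180, 260), "bench": (60, 220, 300, 280)}
dog_box = (120, 160, 190, 270)          # the dog's box overlaps the man's box most
lexical = {"man": 0.45, "bench": 0.55}  # text-only parser slightly prefers "bench"
print(rerank_attachment(dog_box, ["man", "bench"], lexical, boxes))  # prints: man
```

The point of the toy example is the failure mode the paper targets: the text-only score prefers the wrong head ("bench"), and the visual feature derived from detected-object positions flips the decision to the correct one ("man").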