Characterizing Intrinsic Compositionality in Transformers with Tree Projections
When trained on language data, do transformers learn some arbitrary computation that utilizes the full capacity of the architecture or do they learn a simpler, tree-like computation, hypothesized to underlie compositional meaning systems like human languages? There is an apparent tension between com...
Saved in:
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published |
02.11.2022
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | When trained on language data, do transformers learn some arbitrary
computation that utilizes the full capacity of the architecture or do they
learn a simpler, tree-like computation, hypothesized to underlie compositional
meaning systems like human languages? There is an apparent tension between
compositional accounts of human language understanding, which are based on a
restricted bottom-up computational process, and the enormous success of neural
models like transformers, which can route information arbitrarily between
different parts of their input. One possibility is that these models, while
extremely flexible in principle, in practice learn to interpret language
hierarchically, ultimately building sentence representations close to those
predictable by a bottom-up, tree-structured model. To evaluate this
possibility, we describe an unsupervised and parameter-free method to
\emph{functionally project} the behavior of any transformer into the space of
tree-structured networks. Given an input sentence, we produce a binary tree
that approximates the transformer's representation-building process and a score
that captures how "tree-like" the transformer's behavior is on the input. While
calculation of this score does not require training any additional models, it
provably upper-bounds the fit between a transformer and any tree-structured
approximation. Using this method, we show that transformers for three different
tasks become more tree-like over the course of training, in some cases
unsupervisedly recovering the same trees as supervised parsers. These trees, in
turn, are predictive of model behavior, with more tree-like models generalizing
better on tests of compositional generalization. |
---|---|
DOI: | 10.48550/arxiv.2211.01288 |