Unsupervised Discovery of Biographical Structure from Text
Published in | Transactions of the Association for Computational Linguistics Vol. 2; pp. 363 - 376 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | MIT Press, One Rogers Street, Cambridge, MA 02142-1209, USA, 01.12.2014 |
ISSN | 2307-387X |
DOI | 10.1162/tacl_a_00189 |
Summary: | We present a method for discovering abstract event classes in biographies, based
on a probabilistic latent-variable model. Taking as input timestamped text, we
exploit latent correlations among events to learn a set of event classes (such
as BORN, GRADUATES HIGH SCHOOL, and BECOMES CITIZEN), along with the typical
times in a person’s life when those events occur. In a quantitative evaluation
at the task of predicting a person’s age for a given event, we find that our
generative model outperforms a strong linear regression baseline, along with
simpler variants of the model that ablate some features. The abstract event
classes that we learn allow us to perform a large-scale analysis of 242,970
Wikipedia biographies. Though it is known that women are greatly
underrepresented on Wikipedia, not only as editors (Wikipedia, 2011) but also as
subjects of articles (Reagle and Rhue, 2011), we find that there is a bias in
their characterization as well, with biographies of women containing
significantly more emphasis on events of marriage and divorce than biographies
of men. |
---|---|
Bibliography: | Volume, 2014 |