Treebank-Based Acquisition of Multilingual Unification Grammar Resources
Published in | Research on language and computation Vol. 3; no. 2-3; pp. 247 - 279 |
---|---|
Main Authors | , , , , , , |
Format | Journal Article |
Language | English |
Published | 01.07.2005 |
Summary: | Deep unification-based (constraint-based) grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive, and the problem can be exacerbated in multilingual, broad-coverage grammar development. Cahill et al. (2002, 2004) and O'Donovan et al. (2004) present an automatic f-structure annotation-based methodology for acquiring broad-coverage, deep Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper, we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to reuse the lexical extraction and parsing modules from O'Donovan et al. (2004) and Cahill et al. (2004), respectively, across languages. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus, and against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve an f-score of 81.79% against the PARC 700, a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al. (2004); 74.6% against the 2000 held-out TIGER dependency structures; and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003). 14 Tables, 12 Figures, 49 References. Adapted from the source document |
---|---|
ISSN: | 1570-7075, 1572-8706 |
DOI: | 10.1007/s11168-005-1296-y |