Treebank-Based Acquisition of Multilingual Unification Grammar Resources

Bibliographic Details
Published in: Research on Language and Computation, Vol. 3, no. 2-3, pp. 247-279
Main Authors: Cahill, Aoife; Burke, Michael; Forst, Martin; O'Donovan, Ruth; Rohrer, Christian; van Genabith, Josef; Way, Andy
Format: Journal Article
Language: English
Published: 01.07.2005
Summary: Deep unification- (constraint-)based grammars are usually hand-crafted. Scaling such grammars from fragments to unrestricted text is time-consuming and expensive. This problem can be exacerbated in multilingual broad-coverage grammar development scenarios. Cahill et al. (2002, 2004) and O'Donovan et al. (2004) present an automatic f-structure annotation-based methodology to acquire broad-coverage, deep, Lexical-Functional Grammar (LFG) resources for English from the Penn-II Treebank. In this paper, we show how this model can be adapted to a multilingual grammar development scenario to induce robust, wide-coverage, PCFG-based LFG approximations for German from the TIGER Treebank. We show how the architecture of LFG, in particular the distinction between c-structure and f-structure representations, facilitates multilingual, treebank-based unification grammar induction, allowing us to cross-linguistically reuse the lexical extraction and parsing modules from O'Donovan et al. (2004) and Cahill et al. (2004), respectively. We evaluate our grammars against the PARC 700 Dependency Bank (King et al., 2003), against dependency structures for 2000 held-out sentences from the TIGER Corpus, and against a hand-crafted dependency gold standard for 100 TIGER trees. Currently, our resources achieve an f-score of 81.79% against the PARC 700 (a 2.19% improvement over the best result reported for a hand-crafted grammar in Kaplan et al., 2004), 74.6% against the 2000 held-out TIGER dependency structures, and 71.08% against the 100-sentence TIGER gold standard, with substantially improved coverage compared to hand-crafted resources. We have since applied our methodology to induce wide-coverage LFG resources for Chinese (Burke et al., 2004b) from the Penn Chinese Treebank (Xue et al., 2002) and for Spanish from the CAST3LB Treebank (Civit, 2003). 14 Tables, 12 Figures, 49 References. Adapted from the source document.
ISSN: 1570-7075, 1572-8706
DOI: 10.1007/s11168-005-1296-y
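
The summary reports f-scores against dependency banks but does not say how they are computed. As a rough illustration only, the sketch below computes precision, recall, and f-score over dependency triples. The (relation, head, dependent) triple format, the function name `dependency_f_score`, and the toy sentence are assumptions made for this example; they are not taken from the paper, whose evaluation procedure may differ in detail.

```python
from collections import Counter


def dependency_f_score(gold, predicted):
    """Return (precision, recall, f_score) for two lists of dependency triples.

    Each triple is assumed to be (relation, head, dependent); this format is
    an illustrative assumption, not the representation used in the paper.
    """
    gold_counts = Counter(gold)
    pred_counts = Counter(predicted)

    # Multiset intersection: a predicted triple counts as correct at most
    # as many times as it occurs in the gold standard.
    matched = sum((gold_counts & pred_counts).values())

    total_pred = sum(pred_counts.values())
    total_gold = sum(gold_counts.values())
    precision = matched / total_pred if total_pred else 0.0
    recall = matched / total_gold if total_gold else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data (hypothetical): gold and parser-produced triples for one sentence.
gold = [
    ("subj", "sleep", "dog"),
    ("det", "dog", "the"),
    ("tense", "sleep", "past"),
]
predicted = [
    ("subj", "sleep", "dog"),
    ("det", "dog", "a"),
    ("tense", "sleep", "past"),
    ("adjunct", "sleep", "soundly"),
]

precision, recall, f_score = dependency_f_score(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} f-score={f_score:.4f}")
```

In this toy case two of the four predicted triples match two of the three gold triples, so precision and recall differ and the f-score falls between them. A corpus-level score would typically pool the triples of all sentences before dividing, rather than averaging per-sentence scores.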