Automatically Mining Parallel Corpora for Minority Languages from Web Pages

Bibliographic Details
Published in	2012 International Conference on Asian Language Processing, pp. 97-100
Main Authors	Zede Zhu, Miao Li, Lei Chen, Weihui Zeng
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2012
Subjects	Data mining; extracting content; Feature extraction; HTML; identifying parallel pairs; minority languages; Natural language processing; parallel corpora; Support vector machines; web mining; Web pages

Summary:	Parallel corpora are indispensable resources for a variety of multilingual natural language processing tasks. This paper describes a system that automatically mines parallel corpora from web pages in order to overcome the shortage of parallel corpora for minority languages. Building on existing techniques for mining bilingual corpora from the web, and taking into account the characteristics of bilingual websites in minority languages, the paper proposes a method for mining minority-language parallel corpora based on heuristic information extracted from content. Experiments on the Chinese-Mongolian language pair show that the system successfully identifies a significant amount of parallel text on the World Wide Web.
ISBN:	9781467361132 1467361135
DOI:	10.1109/IALP.2012.29