Making choices in Russian: pros and cons of statistical methods for rival forms / Выбор вариантных форм в русском языке: плюсы и минусы различных моделей статистического анализа

Bibliographic Details
Published in	Russian linguistics Vol. 37; no. 3; pp. 253 - 291
Main Authors	Baayen, R. Harald; Endresen, Anna; Janda, Laura A.; Makarova, Anastasia; Nesset, Tore
Format	Journal Article
Language	English
Published	Springer 01.01.2013
Subjects	Coefficients; Datasets; Discrimination learning; Logistic regression; Logistics; Modeling; P values; Regression analysis; Trees; Verbs
Summary:	Languages sometimes present speakers with a choice among rival forms, such as the Russian forms ostrič' vs. obstrič' 'cut hair' and proniknuv vs. pronikši 'having penetrated'. The choice of a given form is often influenced by considerations involving both meaning and environment (syntax, morphology, phonology). Understanding the behavior of rival forms is crucial to understanding the form-meaning relationship in language, yet this topic has not received as much attention as it deserves. Because a variety of factors can influence the choice between rival forms, statistical models are needed to determine which factors are significant and to what extent. The traditional model for this kind of data is logistic regression, but two newer models, called 'tree & forest' and 'naive discriminative learning', have recently emerged as alternatives. We compare the performance of logistic regression against the two new models on four datasets reflecting rival forms in Russian. We find that the three models generally provide converging analyses, with complementary advantages. After identifying the significant factors for each dataset, we show that different sets of rival forms occupy different regions in a space defined by variance in meaning and environment.
ISSN:	0304-3487 1572-8714