Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech

Bibliographic Details
Published in Speech and Computer Vol. 11096; pp. 596-605
Main Author Sherstinova, Tatiana
Format Book Chapter
Language English
Published Switzerland Springer International Publishing AG 01.01.2018
Springer International Publishing
Series Lecture Notes in Computer Science
ISBN 3319995782
9783319995786
ISSN 0302-9743
1611-3349
DOI 10.1007/978-3-319-99579-3_61

Summary: The paper presents statistical data on the distribution of parts of speech (POS) at the beginnings and ends of utterances in everyday Russian speech. The material for the study was a morphologically annotated subcorpus of the ORD corpus of spoken Russian, comprising 149,737 tokens and containing fragments of everyday speech from 213 people of different genders, ages, and professional groups. The study uses n-gram analysis, a method typically employed in computational linguistics to construct probabilistic language models. In the subcorpus as a whole, the most frequent POS are verbs (17.23%), personal pronouns (15.60%), nouns (14%), particles (13%), and conjunctions (9%). In utterance-initial position, however, the most frequent POS are particles (19.99%) and conjunctions (12%), while in utterance-final position verbs and nouns occur more often than other POS. Utterance-final verbs are more typical of interrogative (27.66%) and narrative (25.42%) utterances, and utterance-final nouns are frequently used in exclamative (29.95%) and narrative (24.28%) utterances. The paper also presents the most typical utterance-initial bigrams and trigrams that start with a particle, together with their probabilities. The high percentage of syntactic models with a particle in utterance-initial position suggests that these units have special pragmatic functions associated with marking phrase boundaries. The statistical data obtained may be used to model everyday utterances for a variety of dialogue systems and to improve Russian speech recognition systems.
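
The kind of positional count the summary describes can be illustrated with a minimal Python sketch. It is not the author's implementation: the toy utterances, the tag names (PART, CONJ, PRON, VERB, NOUN), and the helper functions are all hypothetical, and the ORD corpus uses its own annotation scheme. The sketch only shows how relative frequencies of utterance-initial and utterance-final POS tags, and of utterance-initial POS n-grams, might be computed.

```python
from collections import Counter

# Hypothetical toy data: each utterance is a list of POS tags.
# Tag names are illustrative only and are not the ORD corpus tagset.
utterances = [
    ["PART", "PRON", "VERB"],
    ["CONJ", "PRON", "VERB", "NOUN"],
    ["PART", "VERB", "NOUN"],
    ["PART", "PRON", "VERB"],
]

def relative_freq(counter):
    """Turn raw counts into relative frequencies that sum to 1."""
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

def position_distribution(utts, index):
    """POS distribution at one position (0 = initial, -1 = final)."""
    return relative_freq(Counter(u[index] for u in utts if u))

def initial_ngrams(utts, n):
    """Distribution of the first n POS tags over utterances of length >= n."""
    return relative_freq(Counter(tuple(u[:n]) for u in utts if len(u) >= n))

initial = position_distribution(utterances, 0)
final = position_distribution(utterances, -1)
bigrams = initial_ngrams(utterances, 2)

print("initial:", initial)
print("final:", final)
print("initial bigrams:", bigrams)
```

On this toy data, three of the four utterances begin with a particle, so the utterance-initial share of PART is 0.75. The percentages reported in the summary come from the annotated subcorpus and cannot be reproduced from this sketch.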