Quantitative Data on POS Distribution in the Beginnings and the Ends of Utterances in Everyday Russian Speech

Bibliographic Details
Published in	Speech and Computer, Vol. 11096, pp. 596-605
Main Author	Sherstinova, Tatiana
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 01.01.2018
Series	Lecture Notes in Computer Science
Subjects	Corpus linguistics; Everyday speech; N-gram analysis; Parts of speech; Pragmatic Markers; Probability; Russian; Syntax
ISBN	3319995782 9783319995786
ISSN	0302-9743 1611-3349
DOI	10.1007/978-3-319-99579-3_61

Summary:	This paper presents statistical data on the distribution of parts of speech (POS) at the beginnings and ends of utterances in everyday Russian speech. The material was a morphologically annotated subcorpus of the ORD corpus of spoken Russian, comprising 149,737 tokens and containing fragments of everyday speech from 213 people of different genders, ages, and professional groups. The study used n-gram analysis, a method typically employed in computational linguistics to construct probabilistic language models. In the subcorpus as a whole, the most frequent POS were verbs (17.23%), personal pronouns (15.60%), nouns (14%), particles (13%), and conjunctions (9%). In utterance-initial position, however, the most frequent POS are particles (19.99%) and conjunctions (12%), while in utterance-final position verbs and nouns occur more often than other POS. Final verbs are most typical of interrogative (27.66%) and narrative (25.42%) utterances, and final nouns of exclamative (29.95%) and narrative (24.28%) utterances. The paper also presents the most typical utterance-initial bigrams and trigrams that start with a particle, together with their probabilities. The high percentage of syntactic models with a particle in initial position suggests that these units have special pragmatic functions associated with marking phrase boundaries. The statistical data obtained may be used to model everyday utterances for a variety of dialogue systems and to improve Russian speech recognition systems.
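The kind of positional counting the summary describes can be illustrated with a minimal sketch. It is not the author's implementation: the toy utterances, the tag labels (`PART`, `CONJ`, and so on), and the function names `position_distribution` and `initial_ngram_probs` are all hypothetical, and the real study worked on the annotated ORD subcorpus rather than on data like this.

```python
from collections import Counter

# Hypothetical toy data: each utterance is a list of POS tags.
# These tags and sequences are invented for illustration only.
utterances = [
    ["PART", "PRON", "VERB"],
    ["CONJ", "PRON", "VERB", "NOUN"],
    ["PART", "VERB", "NOUN"],
    ["PART", "PRON", "VERB"],
]


def position_distribution(utts, index):
    """Relative frequency of POS tags at one position (0 = first, -1 = last)."""
    counts = Counter(u[index] for u in utts if u)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def initial_ngram_probs(utts, n):
    """Relative frequency of the first n tags, over utterances with >= n tags."""
    counts = Counter(tuple(u[:n]) for u in utts if len(u) >= n)
    total = sum(counts.values())
    return {gram: c / total for gram, c in counts.items()}


initial = position_distribution(utterances, 0)   # utterance-initial POS
final = position_distribution(utterances, -1)    # utterance-final POS
bigrams = initial_ngram_probs(utterances, 2)     # utterance-initial bigrams

for tag, p in sorted(initial.items(), key=lambda kv: -kv[1]):
    print(f"initial {tag}: {p:.2%}")
```

Calling `initial_ngram_probs(utterances, 3)` gives the corresponding estimate for utterance-initial trigrams; restricting the result to keys whose first element is the particle tag yields the particle-initial models the summary mentions.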