The Tomsk Dialect Corpus: a comprehensively annotated database of a Siberian Russian dialect from material collected over the last 70 years
The paper offers the first full description of the Tomsk Dialect Corpus, an electronic resource based on recordings of Russian dialect speech from the Tomsk and Kemerovo regions (West Siberia) that have been collected since 1946. The corpus contains 3,350,272 tokens, which makes it the largest electronic collection of dialect speech in Russia. The originality of the resource lies in the uniqueness of the collected materials and their multifaceted annotation. Topic and pragmatic annotations were created manually: topic annotation covers the entire corpus, whereas pragmatic annotation is available for 45,445 speech acts. Grammatical annotation was performed automatically with the PhpMorphy parser, with additional manual correction for some dialect words. Metalinguistic annotation includes the year and place of each recording and the speakers’ age, gender, and educational level. All annotated parameters are searchable. The corpus also includes a lexicographic component, i.e. definitions of dialect lexemes.