On the Problem of Lemmatizing Out-of-Vocabulary Words
The article discusses the most recent trends in the development of the English progressive. A corpus-based approach is seen as an effective means of assessing the reliability of the retrieved data, and it helps track the major diachronic trend: the increase in the frequency of the progressive aspect since the beginning of the 20th century. The article deals specifically with the extension of the progressive to new constructions, such as the modal, present perfect and past perfect passive progressive, and also accounts for the use of progressive forms in contexts not generally characteristic of them.
The paper discusses sociolinguistic applications of statistical analysis of the spoken subcorpus of the Russian National Corpus. Given the considerable size of the corpus (about 10 million tokens), an analysis of the co-variation of various linguistic parameters with one of the few sociolinguistic parameters available, the speaker’s gender, may give rich and interesting results. One specific example of co-variation is considered in detail: the mean length of the utterance (in tokens). In public communication, this parameter shows a statistically significant difference between the speech of men and women (men talk more), while the same difference is absent in private communication. Another important parameter is the gender of the addressee. Again, co-variation is quite different in public and private discourse. In private communication, utterances are longer when addressing someone of the same sex, and the difference between men and women is not statistically significant. In public communication, utterances are longer when addressing a woman, whether the speaker is a man or a woman. These conclusions are consistent with the results of sociolinguistic gender studies obtained elsewhere and by other methods. Linguistic differences between men and women are not absolute but depend on the communicative situation (public vs. private). Public discourse is a playground for linguistic competition in which men are the winning party. In private discourse, competition dissolves.
Four electronic corpora created in 2011 within the framework of the “Corpus Linguistics: the Albanian, Kalmyk, Lezgian, and Ossetic Languages” Program of Fundamental Research of the RAS are presented. The interface and functionality of these corpora are described, the engineering problems to be solved in creating them are outlined, and the prospects for their development are discussed. Particular emphasis is placed on the compilation of dictionaries and the automatic grammatical markup of the corpora.