WORD VECTOR MODELS AS AN OBJECT OF LINGUISTIC RESEARCH
This article launches a series of studies in which popular word2vec vector models are considered not as a component of the architecture of an NLP application but as an independent object of linguistic research. From a linguist's perspective, a vector model can be treated as a surrogate for the contexts of words in a corpus; this view yields new information about the distribution of individual semantic groups of vocabulary and new knowledge about the corpus from which the models are derived. In particular, it is shown that such layers of English and Russian vocabulary as the names of professions, nationalities, toponyms, personal qualities, and time periods are the most independent of the choice of model and retain their positions relative to their neighbouring words; that is, they have the most stable contexts regardless of the corpus. It is also shown that vocabulary from the Swadesh list is statistically more resistant to a change of model than high-frequency vocabulary, and it is determined which word2vec models for Russian best preserve ontological structures in the vocabulary.
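The stability of a word's position relative to its neighbours across models can be operationalized in various ways; one minimal sketch, assuming each model is simply a mapping from words to dense vectors, is to compare the Jaccard overlap of the word's top-k cosine-similarity neighbour sets in two models (the function and variable names below are illustrative, not taken from the article):

```python
from math import sqrt

def cosine(u, v):
    # standard cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_neighbours(model, word, k):
    # rank every other word in the model by cosine similarity to `word`
    others = [(w, cosine(model[word], v)) for w, v in model.items() if w != word]
    others.sort(key=lambda pair: pair[1], reverse=True)
    return {w for w, _ in others[:k]}

def neighbour_stability(model_a, model_b, word, k=10):
    """Jaccard overlap of the word's top-k neighbour sets in two models.

    1.0 means the neighbourhood is identical across models (a maximally
    stable context); 0.0 means the two neighbour sets are disjoint.
    """
    na = top_k_neighbours(model_a, word, k)
    nb = top_k_neighbours(model_b, word, k)
    union = na | nb
    return len(na & nb) / len(union) if union else 1.0

# toy example: two tiny "models" with differently oriented spaces
# in which "cat" nevertheless keeps "dog" as its nearest neighbour
model_a = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
model_b = {"cat": [0.0, 1.0], "dog": [0.1, 0.9], "car": [1.0, 0.0]}
print(neighbour_stability(model_a, model_b, "cat", k=1))  # → 1.0
```

Averaging such a score over the words of a semantic group (professions, toponyms, the Swadesh list) gives one possible numeric measure of that group's independence from the choice of model.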