Corpora as indicators of (non-)existence

A. Piperski

P. 494–500.

This paper discusses the notions of acceptability, occurrence, grammaticality and existence, and focuses on the relationship between corpus linguistics and the question of whether lexical items exist. Since corpora are almost always samples from larger populations, I argue that they cannot provide evidence for the non-existence of words, collocations or constructions. The reason is that the upper limit of a confidence interval for a frequency estimated from a sample is always greater than zero, regardless of the sample frequency. The rule of thumb is as follows: anything that does not occur in a corpus might have occurred zero to five times in a similar corpus of the same size. If an item occurs in a corpus, this fact can serve as proof of its existence in the language, but the final decision depends on whether the relevant contexts from the corpus are judged representative of the language variety of interest. In conclusion, I claim that a corpus-based study cannot prove the non-existence of a linguistic item, although it can be used to prove its existence. However, the latter type of proof involves assessing the representativeness of the corpus, which may introduce subjectivity and value judgments.
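The confidence-interval argument in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes a simple binomial model in which every token is an independent trial, and it uses the exact one-sided upper bound for a zero count, obtained by solving (1 - p)^n = alpha for p. The function names, the corpus size and the confidence levels are all hypothetical choices made for illustration; the abstract does not say which interval or level underlies the "zero to five" rule of thumb.

```python
import math


def zero_count_upper_bound(n_tokens: int, alpha: float = 0.05) -> float:
    """Exact one-sided upper bound on an item's per-token probability
    when it was observed 0 times in a sample of n_tokens tokens.

    Assumes a binomial model and solves (1 - p) ** n_tokens = alpha for p.
    """
    if n_tokens <= 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need n_tokens > 0 and 0 < alpha < 1")
    # -expm1(x) evaluates 1 - exp(x) without losing precision for tiny x.
    return -math.expm1(math.log(alpha) / n_tokens)


def max_plausible_hits(n_tokens: int, alpha: float = 0.05) -> float:
    """Expected count in another corpus of the same size if the true
    probability sat exactly at the upper bound computed above."""
    return n_tokens * zero_count_upper_bound(n_tokens, alpha)


if __name__ == "__main__":
    corpus_size = 100_000_000  # hypothetical corpus size in tokens
    for alpha in (0.05, 0.01):
        hits = max_plausible_hits(corpus_size, alpha)
        print(f"alpha={alpha}: up to about {hits:.2f} occurrences")
```

Under these assumptions the bound is strictly positive for any corpus size, and for large corpora the plausible count in a same-sized corpus approaches -ln(alpha): roughly 3 at the 95% level and roughly 4.6 at the 99% level. Both values fall within the range named by the rule of thumb.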

Language: English

Keywords: existence, corpus linguistics, acceptability, occurrence, grammaticality

In book

Computational Linguistics and Intellectual Technologies: Proceedings of the Annual International Conference "Dialogue" (2015)

Moscow: RGGU Publishing House, 2015.

Two approaches to differentiating terms in migration studies (based on corpus analysis)

Permyakova T. M., Smirnova E. A., Новые исследования Тувы 2025 No. 4 P. 122–136

The article presents a quantitative and qualitative analysis of English-language terms related to the study of migration. The sources used were research articles in the social sciences published between 2018 and 2020 in international first-quartile journals indexed in the Scopus database. The corpus-linguistic study addresses two objectives: to identify functioning systems of terms in scientific articles ...

Added: December 1, 2025

Preposition drop in Russian spoken by Mari and Beserman bilinguals

Yakovleva A., Kosheliuk N., Moroz G., International Journal of Bilingualism 2025 P. 1–19

Aims and Research Questions: In this paper, we present a corpus-based study of preposition drop (p-drop) in the speech of Mari-Russian and Beserman-Russian bilinguals compared to the speech of Russian monolinguals. Based on data from spoken corpora, we demonstrate that the prepositions v ‘in’, k ‘to’, s ‘with’ are omitted in the speech of bilinguals ...

Added: November 26, 2025

Variation between годов and лет in Russian dialects: a corpus study

Zemicheva S., Moroz G., Naccarato C., Вопросы языкознания 2025 No. 6 P. 7–34

The suppletive form лет in the paradigm of the noun год 'year' distinguishes Russian from the other East Slavic languages. In Russian dialects, however, the variant годов may be used instead of лет. Data from the panchronic subcorpus of the Russian National Corpus (НКРЯ) show that the form годов, first attested in the 15th century, remained peripheral throughout the history of Russian; in the 17th–18th centuries it was used predominantly in non-fiction texts, and in ...

Added: November 12, 2025

Automatic Annotation of Discourse and Speech Formulas in Internet Communication: A Telegram Comment Corpus

Maslenikova A., Tatiana I. Popova, in: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part I. Speech and Computer. Lecture Notes in Artificial Intelligence, Vol. 16187. Springer, 2025. P. 278–292.

This article presents a system for the automatic processing of user comments aimed at annotating speech and discourse formulas that actively function in everyday interaction, including digital communication. A Python-based program using the Telegram API was developed to automate the collection, filtering, and annotation of empirical data. In addition to building a user corpus, the ...

Added: October 19, 2025

27th International Conference, SPECOM 2025, Szeged, Hungary, October 13–15, 2025, Proceedings, Part II. Speech and Computer. Lecture Notes in Artificial Intelligence 16188

Springer, 2025.

Added: October 19, 2025

Variation in a Narrative Corpus of Mano and Kpelle: Contact-Induced or Not?

Khachaturyan M., Konoshenko M., Moroz G. et al., in: N’yng-dyuumgu, n’yng-ngafq: Festschrift for Ekaterina Gruzdeva. Vol. 126. Helsinki: Studia Orientalia, 2025. P. 35–59.

This paper explores a corpus of spontaneous narratives and narrative retellings told by children and adults in Mano and Kpelle, two contacting Mande languages. It focuses on quotative constructions as a key point of grammatical dissimilarity between Mano and Kpelle. In the Mano speech of some bilingual children, however, these constructions are found to manifest ...

Added: September 5, 2025

Analysing the topics of everyday conversations: an expert approach and automatic methods

Sherstinova T., Вепринцева Д. А., Человек: образ и сущность. Гуманитарные аспекты 2025 No. 2(62) P. 89–108

The article examines three different approaches to studying the topics of everyday conversations: expert thematic annotation and two automatic methods (topic modelling and clustering). The material consists of transcripts of everyday spoken Russian from the ORD corpus, prepared from audio recordings of spontaneous conversations made in natural communicative settings (at home, at work, at an educational institution, in a shop, at a clinic ...

Added: September 3, 2025

Russian and Foreign Philology in the Dialogue of Cultures: Proceedings of the All-Russian Research-to-Practice Conference with International Participation (Rostov-on-Don, October 19–21, 2023)

Southern Federal University Press, 2024.

The volume contains articles on current issues in linguistics, literary studies, digital philology and linguistics, journalism and media communications, and translation and translation studies. ...

Added: July 31, 2025

The correspondence between N. S. Khrushchev and F. Castro during the Caribbean Crisis: an attempt at computer-assisted analysis

Герцен А. С., in: Fourth Winter School on Humanities Informatics. Immanuel Kant Baltic Federal University, 2020. P. 92–97.

The article analyzes the letters of N. S. Khrushchev, First Secretary of the Central Committee of the CPSU and Chairman of the Council of Ministers of the USSR, and of F. Castro Ruz, leader of the Cuban revolution, written between October 26 and 31, 1962, on the topic of the Caribbean crisis and ...

Added: July 15, 2025

An overview of morphosyntactic variation in the speech of Russian-Chuvash bilinguals: number, gender, case assignment and preposition drop

Grishanova A., Russian linguistics 2025 Vol. 49 Article 10

The purpose of this study is to present a summary of morphosyntactic variation and a detailed analysis of the phenomenon of preposition drop in the Russian speech of Chuvash bilinguals. Specifically, I investigate what underlying factors might condition the variation. I conduct a qualitative analysis of the data extracted from the corpus of Russian spoken ...

Added: July 10, 2025

Do Formal Stance Strategies Reveal Disciplinary Variation in Professional Scientific Writing?

Smirnova E. A., Pérez-Guerra J., International Journal of Applied Linguistics 2025 Vol. 35 No. 3 P. 1242–1261

Stance in academic discourse has been extensively studied, with numerous investigations indicating that its expression varies across disciplines, depending on the authors’ intention to either enhance or diminish their voice or presence (e.g. It seems fairly certain versus This is based on the belief that...). This paper hypothesises that stance can be viewed as a ...

Added: April 10, 2025

Russian in contact: Turkic-Russian language interaction. Part 1. A sociolinguistic and corpus study

Резанова З. И., Artemenko E., Диброва В. С. et al., Tomsk: Tomsk State University Press, 2024.

The monograph presents the linguistic, sociolinguistic and psycholinguistic aspects of the interaction between Russian and three Turkic languages: Shor, Khakas and Tatar (the Siberian variety). It characterizes the ways in which the Turkic languages influence the speech practice of Russian-speaking bilinguals and their cognitive processes of speech production and perception. It also presents the methods used to collect and process data when building a sociolinguistic database and a morphologically annotated bimodal corpus of bilinguals' spoken Russian, ...

Added: April 7, 2025

The ‘adverb-ly adjective’ construction in English: meanings, distribution and discourse functions

Taboada M., Goddard C., Trnavac R., English Language and Linguistics 2025 Vol. 29 No. 1 P. 102–131

We investigate a class of adjective phrases composed of a deadjectival adverb ending in -ly and an adjective head (e.g. staggeringly incompetent, absolutely terrific, fiscally responsible), a compact construction whereby two adjectives may jointly contribute to evaluative meaning. Using corpus methodologies on more than 1 million examples and relying on semantic analyses of about 1,000 instances, we propose that the ...

Added: April 4, 2025

On the Russian National Corpus

Rakhilina E. V., Вестник Российской академии наук 2024 Vol. 94 No. 9 P. 795–803

The article is devoted to the project of creating the Russian National Corpus (НКРЯ), a powerful reference and information system for the Russian language developed by a consortium of Russian Academy of Sciences organizations with the participation of Yandex. It describes the history of the Corpus, its core functionality and ways of improving it, as well as its most technologically advanced subcorpora (poetic, parallel and multimedia), with examples of how they work. Particular attention is paid to the latest developments, which ...

Added: February 25, 2025

Creation and Analysis of the Multimedia Russian Corpus for Gesture Research

Rakhilina E. V., Cienki A., in: The Cambridge Handbook of Gesture Studies. Cambridge University Press, 2024. P. 249–272.

The chapter considers gesture studies in relation to corpus linguistic work. The focus is on the Multimedia Russian Corpus (MURCO), part of the Russian National Corpus. The chapter includes a brief biography of the creator of this corpus, Elena Grishina. The compilation of the corpus out of a set of Russian classic feature films and ...

Added: February 13, 2025

Using computational linguistics methods to analyse literary texts

Аванесян Н. Л., Fokina A., Chepovskiy A., in: Enterprise Engineering and Knowledge Management (ИП&УЗ-2024): Proceedings of the XXVII Russian Scientific Conference, November 28–29, 2024 / ed. by Ю. Ф. Тельнов. Moscow: Plekhanov Russian University of Economics, 2024. P. 15–18.

The article is devoted to applying mathematical methods of corpus analysis to the study of literary texts. Using purpose-built corpora as examples, it demonstrates how correspondence analysis and the analysis of pairwise rank correlation coefficients can be used to compare the frequency characteristics of texts from different subcorpora. The methods described give correlated results. They can be used both for linguistic research and for creating sound training text sets for artificial intelligence tasks. ...

Added: December 19, 2024