Data-driven models and computational tools for neurolinguistics: a language technology perspective
In this paper, our focus is the connection between language technologies and research in neurolinguistics, and the influence they exert on each other. We present a review of brain imaging-based neurolinguistic studies with a focus on natural language representations, such as word embeddings and pre-trained language models. The mutual enrichment of neurolinguistics and language technologies leads to the development of brain-aware natural language representations. The importance of this research area is underscored by its medical applications.
The 19th Annual Meeting of the Organization for Human Brain Mapping (OHBM) was held June 16-20, 2013 at the Washington State Convention Center in Seattle, WA, USA. OHBM draws between 2,500 and 3,000 attendees each year. Membership in the organization is growing, and the meeting continues to be one of the most significant neuroimaging conferences in the field. The OHBM meeting offers a combination of exciting scientific programs and social events, all tailored to the host city. Unique, innovative, and full of surprises, Seattle is a diverse city with a laid-back approach to life: a world-class metropolis with a fast-paced city life set within wild, beautiful natural surroundings.
Navigated transcranial magnetic stimulation (nTMS) can be applied to locate and outline cortical motor representations. This may be important, e.g., when planning neurosurgery or focused nTMS therapy, or when assessing plastic changes during neurorehabilitation. Conventionally, a cortical location is considered to belong to the motor cortex if the maximum electric field (E-field) targeted there evokes a motor-evoked potential in a muscle. However, the cortex is affected by a broad E-field distribution, which tends to broaden estimates of representation areas by stimulating neighboring areas in addition to the location of the maximum E-field. Our aim was to improve the estimation of nTMS-based motor maps by taking into account the E-field distribution of the stimulation pulse. The effect of the E-field distribution was incorporated by calculating the minimum-norm estimate (MNE) of the motor representation area. We tested the method on simulated data and then applied it to recordings from six healthy volunteers and one stroke patient. We compared the motor representation areas obtained with the MNE method and with a previously introduced interpolation method. The MNE hotspots and centers of gravity were close to those obtained with the interpolation method. The areas of the maps, however, depend on the thresholds used for outlining them. The MNE method may improve the definition of cortical motor areas, but its accuracy should be validated by comparison with maps obtained by direct cortical stimulation, where the E-field can be better focused.
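The MNE idea above can be illustrated with a toy linear model. Assuming (for illustration only; the paper's actual forward model and data differ) that each pulse's motor-evoked potential (MEP) amplitude is the induced E-field at every cortical site weighted by that site's unknown representation strength, the map is the regularized minimum-norm solution of an underdetermined linear system. All names and data below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pulses, n_sites = 20, 50            # more unknown sites than pulses
E = rng.random((n_pulses, n_sites))   # E-field of pulse i at cortical site j
true_map = np.zeros(n_sites)
true_map[10:15] = 1.0                 # a small simulated "motor representation"
mep = E @ true_map                    # simulated MEP amplitudes

def minimum_norm_estimate(E, mep, reg=1e-3):
    """Regularized minimum-norm solution of E @ x = mep:
    x = E.T @ inv(E @ E.T + reg * I) @ mep."""
    gram = E @ E.T
    return E.T @ np.linalg.solve(gram + reg * np.eye(len(mep)), mep)

est = minimum_norm_estimate(E, mep)   # estimated representation map
hotspot = int(np.argmax(est))         # peak of the estimated map
```

The estimate reproduces the measured MEPs while spreading the least possible energy over the cortical sites, which is why thresholding is still needed to outline the area.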
This paper studies how word embeddings trained on the British National Corpus interact with part-of-speech boundaries. Our work targets the Universal PoS tag set, which is currently being actively used for the annotation of a range of languages. We experiment with training classifiers to predict PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words whose distributional patterns differ from those of other words of the same part of speech. These data often reveal hidden inconsistencies in the annotation process or guidelines. At the same time, they support the notion of 'soft' or 'graded' part-of-speech affiliation. Finally, we show that information about PoS is distributed among dozens of vector components rather than limited to one or two features.
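The classifier experiment can be sketched in miniature: train a classifier on the vectors of words with known tags, then inspect how it handles words that sit between the tag groups. The toy 2-d vectors and the nearest-centroid classifier below are illustrative stand-ins for real BNC embeddings and the paper's classifiers.

```python
from collections import defaultdict

# toy embeddings with known PoS tags (illustrative data)
train = {
    "run":   ([0.9, 0.1], "VERB"),
    "eat":   ([0.8, 0.2], "VERB"),
    "table": ([0.1, 0.9], "NOUN"),
    "chair": ([0.2, 0.8], "NOUN"),
}

def centroids(data):
    """Mean embedding per PoS tag."""
    sums = defaultdict(lambda: [0.0, 0.0])
    counts = defaultdict(int)
    for vec, tag in data.values():
        for i, v in enumerate(vec):
            sums[tag][i] += v
        counts[tag] += 1
    return {t: [v / counts[t] for v in s] for t, s in sums.items()}

def predict(vec, cents):
    """Nearest-centroid PoS prediction (squared Euclidean distance)."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(cents, key=lambda t: dist(vec, cents[t]))

cents = centroids(train)
tag = predict([0.85, 0.15], cents)  # a verb-like vector
```

A vector lying between the two centroids would be classified with low margin, which is the toy analogue of the 'graded' PoS affiliation the paper discusses.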
The problem of functional localization in the brain is one of the most fundamental in neuroscience. For this problem, two opposing ideologies, the "modular" versus the "holistic" view of the brain, also known as "localism" and "holism", have been debated for a long time (Flourens 1825; Luria 1967). The debate in favor of one or the other ideology can still be traced at all methodological levels, from the cell to the system. In this opinion paper, we want to raise a question: what is nowadays meant by mapping of the brain? In addition, we want to highlight the need to be aware of the discontinuities that occasionally arise between research conducted at different methodological scales.
The efficiency and scalability of plagiarism detection systems have become a major challenge due to the vast amount of textual data available in several languages over the Internet. Plagiarism occurs at different levels of obfuscation, ranging from exact copies of original material to text summarization. Consequently, algorithms designed to detect plagiarism should be robust to diverse languages and to the different types of obfuscation found in plagiarism cases. In this paper, we employ text embedding vectors to compare similarity among documents to detect plagiarism. Word vectors are combined by a simple aggregation function to represent a text document. This representation comprises semantic and syntactic information of the text and leads to efficient text alignment between suspicious and original documents. By comparing the representations of sentences in source and suspicious documents, the sentence pairs with the highest similarity are considered candidates, or seeds, of plagiarism cases. To filter and merge these seeds, a set of parameters, including a Jaccard similarity threshold and a merging threshold, is tuned by two different approaches: offline tuning and online tuning. The offline method, used as the benchmark, fixes a single set of parameters for all types of plagiarism through several trials on the training corpus. Experiments show improvements in performance when the obfuscation type is considered during threshold tuning. In this regard, our proposed online approach uses two statistical methods to automatically filter outlier candidates by their scale of obfuscation. With the online tuning approach, no separate training dataset is required to train the system. We applied the proposed method to available datasets in English, Persian, and Arabic on the text alignment task to evaluate its robustness from the language perspective as well.
As our experimental results confirm, our efficient approach achieves considerable performance on different datasets in various languages. Our online threshold tuning approach, which requires no training data, works as well as, and in some cases better than, the training-based method.
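The seeding step described above can be sketched as follows: each sentence is represented by averaging its word vectors, and source/suspicious sentence pairs whose cosine similarity exceeds a threshold become seed candidates. The tiny vocabulary, vectors, and threshold are illustrative; the paper additionally filters and merges seeds using the tuned Jaccard and merging thresholds.

```python
from math import sqrt

WORD_VECS = {  # toy word embeddings (illustrative)
    "the": [0.5, 0.5],
    "cat": [1.0, 0.0], "feline": [0.95, 0.05],
    "stock": [0.0, 1.0], "market": [0.05, 0.95],
}

def sent_vec(sentence):
    """Average the word vectors of a sentence (zero vector for OOV words)."""
    vecs = [WORD_VECS.get(w, [0.0, 0.0]) for w in sentence.split()]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def seeds(source, suspicious, threshold=0.95):
    """Index pairs of highly similar sentences: the plagiarism-seed candidates."""
    return [
        (i, j)
        for i, s in enumerate(source)
        for j, t in enumerate(suspicious)
        if cosine(sent_vec(s), sent_vec(t)) >= threshold
    ]

candidates = seeds(["the cat", "the stock"], ["the feline", "the market"])
```

Here only the topically matching pairs survive the threshold, while cross-topic pairs fall well below it.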
We present an approach to detecting differences in lexical semantics across English language registers, using word embedding models from the distributional semantics paradigm. Models trained on register-specific subcorpora of the British National Corpus (BNC) are employed to compare lists of nearest associates for particular words and to draw conclusions about their semantic shifts depending on the register in which they are used. The models are evaluated on the task of register classification with the help of the deep inverse regression approach.
Additionally, we present a demo web service featuring most of the described models, which allows users to explore word meanings in different English registers and to detect the register affiliation of arbitrary texts. The code for the service can be easily adapted to any set of underlying models.
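The associate-comparison idea can be sketched in miniature: the same word gets different nearest associates in models trained on different registers, and the overlap of those associate lists quantifies the semantic shift. The two tiny hand-made "models" below stand in for embeddings trained on register-specific BNC subcorpora.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b)))

def nearest(word, model, n=2):
    """The n nearest associates of `word` by cosine similarity."""
    others = [w for w in model if w != word]
    return sorted(others, key=lambda w: -cosine(model[word], model[w]))[:n]

# "cell" in a fiction-like register vs. an academic-like register (toy data)
fiction  = {"cell": [0.9, 0.1], "prison": [0.85, 0.15],
            "door": [0.8, 0.2], "tissue": [0.1, 0.9]}
academic = {"cell": [0.1, 0.9], "tissue": [0.15, 0.85],
            "membrane": [0.2, 0.8], "prison": [0.9, 0.1]}

assoc_fic = nearest("cell", fiction)
assoc_aca = nearest("cell", academic)
# Jaccard distance between the associate lists as a shift score
shift = 1 - len(set(assoc_fic) & set(assoc_aca)) / len(set(assoc_fic) | set(assoc_aca))
```

A shift score near 1 indicates that the word's neighborhood, and hence its dominant sense, differs sharply between registers.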
In natural language processing, distributional semantic models are known as an efficient data-driven approach to word and text representation, which allows meaning to be computed directly from large text corpora into word embeddings in a vector space. This paper addresses the role of linguistic preprocessing in enhancing the performance of distributional models, and in particular studies pronominal anaphora resolution as a way to exploit more co-occurrence data without directly increasing the size of the training corpus. We replace three different types of anaphoric pronouns with their antecedents in the training corpus and evaluate the extent to which this affects the performance of the resulting models on lexical similarity tasks. CBOW and SkipGram distributed models trained on the Russian National Corpus are the focus of our research, although the results are potentially applicable to other distributional semantic frameworks and languages as well. The trained models are evaluated against the RUSSE'15 and SimLex-999 gold-standard data sets. We find that models trained on corpora with pronominal anaphora resolved perform significantly better than their counterparts trained on the baseline corpora.
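The preprocessing step described above can be sketched as follows: anaphoric pronouns in the training corpus are replaced with their antecedents, so the antecedent noun accumulates the pronoun's co-occurrence contexts before the embedding model is trained. The resolver below is a stub driven by a hand-made pronoun-to-antecedent map; the paper relies on actual anaphora resolution, not on such a lookup, and the English example stands in for the Russian corpus.

```python
import re

def resolve_pronouns(sentence, antecedents):
    """Replace each pronoun token with its resolved antecedent string."""
    tokens = re.findall(r"\w+", sentence.lower())
    return " ".join(antecedents.get(t, t) for t in tokens)

corpus = ["The scientist ran an experiment.", "She published it quickly."]
# stub resolution for the second sentence (illustrative only)
resolved_sentence = resolve_pronouns(
    corpus[1], {"she": "scientist", "it": "experiment"}
)
```

After this substitution, "scientist" and "experiment" co-occur with "published" and "quickly", context that the baseline corpus would have assigned only to the pronouns.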