Normalization Issues in Digital Literary Studies: Spelling, Literary Themes and Biographical Description of Writers
Digital literary studies is a branch of digital humanities that deals with national or world literatures. In this paper, we discuss normalization issues that are crucial for compiling eCulture resources designed for cultural analytics, social and literary studies, and other areas of digital humanities. One such resource is the Corpus of Russian short stories of 1900–1930s, which includes detailed information about the Russian writers of that period and is intended for stylometric, linguistic, and literary studies of Russian prose. Our aim is to create a literary resource based on a systemic approach to the literature of a given period, which means including texts by as many writers active in that period as possible, both well-known and peripheral. The paper addresses data normalization, a prerequisite for the statistical processing of data of any kind. We describe how we handle spelling variation, how we normalize an expert's manual annotation of literary themes, and how we standardize biographical descriptions of authors. The resulting normalized data can be used for research in literary studies, digital humanities, computational linguistics, and cultural heritage studies.