Revised entries in the multi-volume edition and TEI encoding: a case of the historical dictionary of Russian
The Dictionary of the Russian Language of the 11th–17th Centuries (DRL11–17), which covers both the Old and Middle Russian periods, is an ongoing project of the Russian Academy of Sciences, with volumes 1–31 published in hardcopy (1975–2019). So far, only volumes 28–30 have been converted into a database and published freely online (http://web-corpora.net/wsgi/oldrus.wsgi/). The online edition allows users to search for entries by grammatical properties, phraseological units, etymological sources, the texts and sources cited in the entry, the historical periods they represent, etc. (Aksyonov et al. 2015, Vechkaeva 2016). This paper presents a new initiative aimed at digitizing the earlier volumes, which includes OCR, encoding the dictionary according to a TEI-compatible XML scheme, improving the integrity of entries, and additional data mining and enrichment using external resources. We focus on the issue of how to represent revised entries, namely those that were added, deleted, or corrected in subsequent volumes and in a supplementary volume.
Changes to the entries are usually motivated by newly considered sources, by new interpretations of the source documents, or by changes in editorial policy. The typology of revisions made by the authors and editors of later volumes includes: (i) adding or deleting entries; (ii) adding or deleting certain parts of an entry (senses, examples, etymology, etc.); (iii) correcting one or several fields of an entry (definition, example, grammatical properties, bibliographic description of citations, etc.). More complex changes are decomposed into the components listed above.
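The typology above can be illustrated with a minimal Python sketch. It is not the project's actual data model: the names `Op`, `Revision`, and `decompose_move`, the path-like field labels, and the placeholder headword are all assumptions introduced here for illustration. The sketch shows one way a more complex change (moving an example from one sense to another) could be decomposed into the elementary operations of the typology.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Op(Enum):
    """Elementary revision operations (hypothetical labels)."""
    ADD = "add"
    DELETE = "delete"
    CORRECT = "correct"


@dataclass(frozen=True)
class Revision:
    op: Op
    headword: str                # headword of the affected entry
    field: Optional[str] = None  # None: the whole entry is added or deleted
    old: Optional[str] = None    # content in the earlier volume, if any
    new: Optional[str] = None    # content in the later volume, if any


def decompose_move(headword: str, src: str, dst: str, text: str) -> List[Revision]:
    """Express 'move a part of an entry' as a deletion followed by an addition."""
    return [
        Revision(Op.DELETE, headword, field=src, old=text),
        Revision(Op.ADD, headword, field=dst, new=text),
    ]


# Placeholder data only; not taken from the dictionary.
steps = decompose_move("headword-x", "sense[1]/example", "sense[2]/example", "citation text")
for step in steps:
    print(step.op.value, step.field)
```

Keeping each record limited to one operation on one field means that the history of an entry is simply an ordered list of such records.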
The TEI-based scheme of the dictionary addresses two ways of presenting the content: (i) an online searchable version and (ii) a retro-digitized version that preserves the layout of the published volumes. In the first case, the revised entry is represented as one merged entry (Target) that incorporates data from the Source (the entry published in an earlier volume) and the Revision (the entry published as an addendum in a later or supplementary volume). As neither the Source nor the Revision presents the correct content of the entry in full, the TEI-based representation of the Target has to be generated. In addition, advanced users may be given access to the history of changes made by the editors and to deleted entries. We use the critical apparatus module of TEI to track the history of changes: the lemma contains the “preferred”, corrected reading, and another reading corresponds to the content provided in earlier volumes. From the perspective of the retro-digitized version, the Source and the Revision are two separate entries with different metadata. Nevertheless, these two entries are linked to each other by means of reference tags. Taken as a whole, the proposed schema outlines the principles for documenting the genetic relationships between different versions of edited lexicographic material.
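A merged Target entry of this kind might be generated as in the following Python sketch, which uses only the standard `xml.etree.ElementTree` module. The elements `entry`, `form`, `orth`, `sense`, `def`, `app`, `lem`, and `rdg` and the `wit` attribute are standard TEI, but placing the apparatus inside the definition, the helper names `build_target` and `preferred_text`, and the witness pointers `#vol1` and `#suppl` are assumptions made here, not the project's actual schema.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
XML_NS = "http://www.w3.org/XML/1998/namespace"
ET.register_namespace("", TEI_NS)  # serialize TEI as the default namespace


def tei(tag: str) -> str:
    """Return the namespace-qualified name of a TEI element."""
    return f"{{{TEI_NS}}}{tag}"


def build_target(entry_id, headword, old_def, new_def, source_wit, revision_wit):
    """Build a merged entry whose definition records both readings."""
    entry = ET.Element(tei("entry"), {f"{{{XML_NS}}}id": entry_id})
    form = ET.SubElement(entry, tei("form"), {"type": "lemma"})
    ET.SubElement(form, tei("orth")).text = headword
    sense = ET.SubElement(entry, tei("sense"))
    definition = ET.SubElement(sense, tei("def"))
    app = ET.SubElement(definition, tei("app"))
    # <lem>: the preferred, corrected reading taken from the Revision
    ET.SubElement(app, tei("lem"), {"wit": revision_wit}).text = new_def
    # <rdg>: the reading printed in the earlier volume (the Source)
    ET.SubElement(app, tei("rdg"), {"wit": source_wit}).text = old_def
    return entry


def preferred_text(entry) -> str:
    """Collect the preferred readings, as the searchable version would show them."""
    return "".join(lem.text or "" for lem in entry.iter(tei("lem")))


# Placeholder data only; not taken from the dictionary.
target = build_target("e1", "headword-x", "earlier definition",
                      "corrected definition", "#vol1", "#suppl")
print(ET.tostring(target, encoding="unicode"))
print(preferred_text(target))
```

Under this layout the searchable version only needs the `lem` content, while the `rdg` elements preserve the history of changes for advanced users.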