Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features
Authorship attribution is an important task in online security. Recent work has reported successful authorship attribution for various European languages. Character n-grams are reported to be the best choice of features for this task, as they encode both style and content information. We evaluate different types of character n-gram features in an authorship attribution task on a noisy real-world dataset of Russian forum posts. We also supplement them with a number of new simple n-gram features that capture syntactic and discourse patterns. We perform authorship attribution in both a single-topic and a cross-topic setting, because our research question is whether character n-grams capture both style and content information. Our results show that character n-grams are indeed very successful for authorship attribution of Russian forum posts. However, there is no clear distinction between style and content n-grams, as the same types of n-grams work well in both the single-topic and the cross-topic setting. In our experiments, the generalized simple n-gram features, which reveal syntactic and discourse patterns, also proved very important for authorship attribution of short informal Russian texts. They represent a different kind of authorship information and are a successful addition to character n-grams for authorship attribution of forum texts in Russian.
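
As a rough illustration of what a character n-gram feature is, the sketch below counts overlapping character n-grams in a single post and converts the counts to relative frequencies. It is a minimal example under our own assumptions, not the feature extraction used in the experiments: the function names `char_ngrams` and `relative_frequencies`, the choice of n = 3, and the sample post are all hypothetical.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text (n=3 is an illustrative choice)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_frequencies(counts):
    """Normalize raw n-gram counts so that posts of different length are comparable."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


# Hypothetical sample post; punctuation and spaces are kept, since they
# also end up inside character n-grams.
post = "привет, как дела?"
profile = relative_frequencies(char_ngrams(post, n=3))
```

A profile like this can be computed for each post and passed to any standard classifier. N-grams that span punctuation or word boundaries are counted the same way as n-grams inside a word.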