Automatic Arabic Dialect Classification
The paper presents work on automatic Arabic dialect classification and proposes machine learning classification method where training dataset consists of two corpora. The first one is a small corpus of manually dialectannotated instances. The second one contains big amount of instances that were grabbed from the Web automatically using word-marks—most unique and frequent dialectal words used as dialect identifiers. In the paper we considered four dialects that are mostly used by Arabic people: Levantine, Egyptian, Saudi and Iraq. The most important benefit of that approach is the fact that it reduces time expenses on manual annotation of data from social media, because the accent is made on the corpus created automatically. Best results that we got were achieved with Naïve Bayes classifier trained using character-based bigrams, trigrams and word-marks vocabulary: precision of classification reaches 0.92 with F1 -measure equal to 0.91 on the test set of instances taken from manually annotated corpus.