The MRMR Criterion and Feature Space Dimensionality Reduction in the Search Engine Spam Classification Problem
Web spam is one of the key problems facing modern web search engines. In this paper we investigate the effectiveness of various dimensionality reduction methods applied to the spam classifier of the go.mail.ru search engine. Effective use of such techniques makes it possible to significantly increase the number of features, and with it the quality of the classifier, without sacrificing training or classification speed. We conducted a series of experiments with the PCA (Principal Component Analysis) and RP (Random Projection) dimensionality reduction methods. Unfortunately, these methods proved ineffective for this problem, mainly because the feature space is already low-dimensional. However, the experiments revealed the need for a detailed analysis of the features involved in the training process. For this analysis we chose the MRMR (Minimum Redundancy Maximum Relevance) criterion. Applying this criterion allowed us to detect redundant features and to estimate the usefulness of each feature involved in training. As a result, we significantly increased the quality of our web spam classifier without increasing the number of features. This paper demonstrates the practical effectiveness of feature selection criteria and once again emphasizes the importance of a detailed analysis of the data and of the informative features selected for training.
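To make the idea behind the MRMR criterion concrete, the following is a minimal sketch of greedy selection in the commonly used difference form: at each step the candidate with the largest mutual information with the class label (relevance) minus its mean mutual information with the already selected features (redundancy) is chosen. The abstract does not state which variant of the criterion was used, so this form is an assumption. The function names `mutual_information` and `mrmr_select`, the restriction to discrete features, and the toy data are also illustrative assumptions and are not taken from the go.mail.ru classifier.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information I(X; Y) in bits for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names using relevance minus mean redundancy.

    features: dict mapping a (hypothetical) feature name to a list of
              discrete values; labels: list of class labels.
    """
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected = []
    candidates = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while candidates and len(selected) < k:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data (illustrative only): "a_copy" duplicates "a", "b" is independent of "a".
toy_features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
}
toy_labels = [0, 0, 0, 0, 0, 0, 1, 1]  # label = a AND b
chosen = mrmr_select(toy_features, toy_labels, k=2)
```

In this toy setup all three features are equally relevant to the label, but the duplicate is penalized for its redundancy, so the selection keeps `b` together with only one of `a` and `a_copy`. Continuous features would first need to be discretized or paired with a different mutual information estimator.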