Machine Learning Applications for Genomic Pattern Recognition Problem
DNA secondary structures are important functional elements that may influence cellular processes. One of their possible functions is the regulation of nucleosome positioning. Here, MNase-seq and ssDNA-seq data were used to define patterns in the positional relationship between nucleosomes and DNA structures such as Z-DNA, H-DNA, and G-quadruplexes. Three types of patterns were found: a structure flanked by nucleosomes on both sides, a structure flanked by a nucleosome on one side, and a structure located in a nucleosome-free region. Machine-learning models based on the random forest algorithm and XGBoost were trained to recognize 500 bp DNA regions containing a nucleosome positioning pattern for each of the three types of DNA structures (Z-DNA, H-DNA, and G-quadruplexes) from DNA sequence compositional properties. The best performance (above 86% for ROC-AUC, accuracy, recall, and precision) was reached for G-quadruplexes. The 500 bp regions containing G-quadruplexes have distinct compositional properties and point to the preferential locations of the defined patterns, whose regulatory functions require further investigation. For the other DNA structures, region composition is a less powerful predictive factor, and other physical and structural DNA properties should be taken into account to improve recognition of nucleosome-DNA-structure patterns.
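To make "sequence compositional properties" concrete, the sketch below turns a DNA region into a fixed-length feature vector of GC content and normalized k-mer frequencies. The abstract does not specify which compositional features were used, so the choice of k-mers, the values of `ks`, and the function names `kmer_frequencies`, `gc_content`, and `composition_features` are all illustrative assumptions rather than the authors' pipeline.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies over ACGT, in lexicographic k-mer order."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows with N or other ambiguous symbols
            counts[window] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def gc_content(seq):
    """Fraction of G and C among unambiguous (A, C, G, T) bases."""
    seq = seq.upper()
    unambiguous = sum(seq.count(base) for base in "ACGT")
    if unambiguous == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / unambiguous


def composition_features(seq, ks=(1, 2, 3)):
    """Concatenate GC content with k-mer frequencies for each k in ks.

    Assumption: k-mer composition stands in for the unspecified
    'compositional properties' mentioned in the abstract.
    """
    features = [gc_content(seq)]
    for k in ks:
        features.extend(kmer_frequencies(seq, k))
    return features


# Hypothetical usage: a short G-rich fragment stands in for a 500 bp region.
example = composition_features("GGGTTAGGGTTAGGGTTAGGG")
print(len(example))  # 1 + 4 + 16 + 64 features
```

Vectors of this kind, one per 500 bp region with a label for the nucleosome positioning pattern, could then be passed to a tree-based classifier such as a random forest or XGBoost model.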