Proceedings of the 42nd Interdisciplinary School-Conference of IITP RAS "Information Technologies and Systems 2018"
With advances in sequencing technology, the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have collected genome-wide data on more than 16 000 tumor-normal tissue pairs, providing a valuable resource for studying cancer mutations. In this research we focus on a preliminary evaluation of the relationship between cancer breakpoint hotspots and DNA regions that can potentially form secondary structures such as stem-loops (cruciforms) and quadruplexes. We analyzed 2 234 samples covering 10 cancer types and built machine-learning models that predict the distribution of cancer breakpoints along a chromosome from the density distribution of stem-loops and quadruplexes. Because the data are extremely imbalanced and a reliable estimate of predictive power is required, we developed a procedure for building and evaluating the models. We conducted a set of experiments to select the most appropriate resampling scheme, class balancing technique, and machine-learning algorithm parameters. The best final models were applied to the cancer breakpoint data. The analysis indicates that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists; in general, it is weak for stem-loops and stronger for quadruplexes. We also found that model predictive power differs across cancer types: the stem-loop-based model performs better for pancreatic, prostate, ovary, uterus, brain and liver cancer, whereas the quadruplex-based model works better for blood, bone, skin and breast cancer.
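The abstract does not state which resampling scheme or balancing technique was finally selected. As a minimal sketch of one common option, assuming binary hotspot labels per genomic window, stratified k-fold splitting, and random undersampling of the majority class in the training folds only, the evaluation splits could be produced as follows. The function names (`stratified_folds`, `undersample`, `train_test_splits`) and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, seed=0):
    """Assign sample indices to folds so each fold keeps roughly the class ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def undersample(indices, labels, seed=0):
    """Randomly drop samples until every class is as small as the rarest one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in indices:
        by_class[labels[idx]].append(idx)
    target = min(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, target))
    return sorted(balanced)


def train_test_splits(labels, n_folds=5, seed=0):
    """Yield (balanced training indices, untouched test indices) for each fold."""
    folds = stratified_folds(labels, n_folds, seed)
    for k, test_idx in enumerate(folds):
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Balance only the training part; the test fold keeps the real class ratio.
        yield undersample(train_idx, labels, seed + k), sorted(test_idx)


# Toy example: 5 hotspot windows (label 1) among 100 windows.
toy_labels = [1] * 5 + [0] * 95
for train, test in train_test_splits(toy_labels, n_folds=5):
    print(len(train), len(test))
```

Leaving the test folds unbalanced is a deliberate choice in this sketch: the performance estimate then reflects the class ratio a model would face on real data, which matters when hotspots are rare.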
Non-B DNA structures have great potential to form and to influence various genomic processes, including transcription. One mechanism of transcription regulation is nucleosome positioning. Although only B-DNA can be wrapped around a nucleosome, non-B DNA structures can compete with a nucleosome for a genomic location. Here we used permanganate/S1 nuclease footprinting data on non-B DNA structures, such as Z-DNA, H-DNA, G-quadruplexes and stress-induced duplex destabilization (SIDD) sites, together with MNase-seq data on nucleosome positioning in the mouse genome. We found three patterns of nucleosome positioning around non-B DNA structures: the structure is flanked by nucleosomes on both sides, flanked on one side, or located in a nucleosome-free region. Machine learning models based on the random forest and XGBoost algorithms were constructed to recognize 1 kb DNA regions containing a particular nucleosome positioning pattern for each of the four types of DNA structures (Z-DNA, H-DNA, G-quadruplexes and SIDD sites), using di- and trinucleotide statistics as features. The best performance (94% accuracy) was reached for G-quadruplexes, while for the other structure types accuracy was under 70%. We conclude that 1 kb regions containing G-quadruplexes have distinct compositional properties; this points to preferential locations of such patterns in the genome and requires further investigation. For the other DNA structures, region composition is not a sufficient predictive factor, and other physical and structural DNA properties should be taken into account to improve recognition of nucleosome-DNA-structure patterns.
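The abstract does not define the di- and trinucleotide statistics precisely. A minimal sketch, assuming they are normalized frequencies of overlapping k-mers (k = 2 and k = 3) over the region, could compute the composition features like this. The function names `kmer_frequencies` and `composition_features` are hypothetical, and the short example sequence stands in for a 1 kb region.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_frequencies(sequence, k):
    """Return normalized frequencies of all 4**k k-mers, in lexicographic order."""
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Windows containing N or other ambiguous symbols are skipped.
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def composition_features(sequence):
    """Concatenate dinucleotide (16) and trinucleotide (64) frequencies."""
    return kmer_frequencies(sequence, 2) + kmer_frequencies(sequence, 3)


# Toy example on a short G-rich sequence.
features = composition_features("GGGTTAGGGTTAGGGTTAGGG")
print(len(features))
```

Under these assumptions each region is described by 80 numbers, which can be passed as one row of a feature matrix to a random forest or gradient boosting classifier.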
We found earlier that L1-Alu transposons in the human genome contain a conserved stem-loop structure in their 3’-UTR. We built a machine-learning model that could distinguish L1 3’-UTR stem-loop structures from stem-loops at other genomic locations. Later we found that all LINE transposons contain stem-loops at their 3’-end. Since the 3’-end stem-loop structure was experimentally shown to play an important role in recognition of transposon RNA by the LINE-encoded reverse transcriptase in several species [2-4], we hypothesize that this structure could be preserved for that purpose in other species. Here we built a machine learning model using the random forest algorithm to study structural properties of 3’-end transposon stem-loops. The model is based on physical, chemical and structural RNA characteristics such as enthalpy, entropy, Gibbs free energy, hydrophilicity, and helical structural parameters of dinucleotides (Shift, Roll, Slide, Rise, Tilt, Bend). Each stem-loop structure was split into 30 positions and each position was described by 23 characteristics, so that the final property vector contained 602 position-specific characteristics for each stem-loop. A total of 2200 sequences of all available LINE transposons from species across the tree of life were extracted from the RepBase database. The resulting random forest model distinguished 3’-end LINE stem-loops from random stem-loops with 78% accuracy. Analysis of predictor importance revealed that enthalpy and entropy in loop positions and hydrophilicity and stacking energy in stem positions were the most influential factors for the model's predictive power. The results support the idea that 3’-end transposon stem-loops share similar structural properties, which are probably required for transposition.
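The abstract does not describe how a stem-loop is mapped onto a fixed number of positions. The sketch below illustrates one possible layout of a position-specific property vector, assuming the dinucleotide steps of the sequence are resampled to a fixed number of positions and every property is evaluated at every position. The two properties used here (GC fraction and purine fraction) are stand-ins that can be computed directly from the sequence; they are not the thermodynamic or helical parameters of the actual model, and all names are hypothetical.

```python
# Stand-in dinucleotide properties, computable from the sequence alone.
# The real model would use tabulated thermodynamic and helical parameters.
STAND_IN_PROPERTIES = {
    "gc_fraction": lambda step: sum(base in "GC" for base in step) / 2,
    "purine_fraction": lambda step: sum(base in "AG" for base in step) / 2,
}


def dinucleotide_steps(sequence):
    """Split a sequence into its overlapping dinucleotide steps."""
    seq = sequence.upper()
    return [seq[i:i + 2] for i in range(len(seq) - 1)]


def resample_positions(values, n_positions):
    """Pick n_positions evenly spaced items from a variable-length list."""
    if not values or n_positions < 1:
        raise ValueError("need a non-empty list and at least one position")
    return [values[p * len(values) // n_positions] for p in range(n_positions)]


def position_specific_vector(sequence, n_positions, properties):
    """Flatten the properties into one vector, ordered position by position."""
    steps = resample_positions(dinucleotide_steps(sequence), n_positions)
    return [prop(step) for step in steps for prop in properties.values()]


# Toy example: a short RNA hairpin described at 5 positions by 2 properties.
vector = position_specific_vector("GGCCAUUAGGCC", 5, STAND_IN_PROPERTIES)
print(len(vector))
```

With this layout every feature index corresponds to one property at one position, which is what allows predictor importance to be read out separately for loop and stem positions.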