Machine-learning models for cancer breakpoints prediction based on DNA structure distributions
With advances in sequencing technology, the International Cancer Genome Consortium (ICGC) and The Cancer Genome Atlas (TCGA) have collected data on more than 16 000 genome-wide tumor-normal tissue pairs, providing a valuable resource for studying cancer mutations. In this research we focus on a preliminary evaluation of the relationship between cancer breakpoint hotspots and DNA regions that can potentially form secondary structures such as stem-loops (cruciforms) and quadruplexes. We analyzed 2 234 samples covering 10 cancer types and built machine-learning models that predict the distribution of cancer breakpoints along a chromosome from the density distribution of stem-loops and quadruplexes. Because the data are extremely imbalanced and a reliable estimate of predictive power is required, we developed a procedure for building and evaluating the machine-learning models. We conducted a set of experiments to select the most appropriate resampling scheme, class balancing technique and parameters of the machine-learning algorithms. The best final models were then applied to the cancer breakpoint data. The analysis indicates that a relationship between cancer breakpoint hotspots and the studied DNA secondary structures exists; however, it is generally weak for stem-loops and stronger for quadruplexes. We also found that model predictive power differs across cancer types. Specifically, the stem-loop-based model performs better for pancreatic, prostate, ovarian, uterine, brain and liver cancers, while the quadruplex-based model works better for blood, bone, skin and breast cancers.
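The abstract does not specify which resampling scheme or class balancing technique was finally selected, so the following is only a minimal sketch of one common combination: stratified k-fold splitting with random undersampling of the majority class applied to the training part only, leaving the test fold at its natural class ratio so that the performance estimate is not inflated. All function names, the 3% hotspot rate and the toy labels are illustrative assumptions, not the authors' procedure or data.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds, rng):
    """Assign indices to folds so each fold keeps roughly the class ratio."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class round-robin across folds.
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def undersample(indices, labels, rng):
    """Randomly drop majority-class items until both classes are equal in size."""
    pos = [i for i in indices if labels[i] == 1]
    neg = [i for i in indices if labels[i] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    return minority + rng.sample(majority, len(minority))


def balanced_cv_splits(labels, n_folds=5, seed=0):
    """Yield (balanced_train, untouched_test) index lists, one pair per fold."""
    rng = random.Random(seed)
    folds = stratified_folds(labels, n_folds, rng)
    for k in range(n_folds):
        test = folds[k]
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        # Balance only the training part; the test fold stays imbalanced.
        yield undersample(train, labels, rng), test


# Hypothetical toy data: 1000 genomic windows, 3% labelled as hotspots (1).
labels = [1] * 30 + [0] * 970
splits = list(balanced_cv_splits(labels, n_folds=5, seed=42))
```

Each `(train, test)` pair can then be passed to any classifier. Evaluating on the untouched test fold matters because balancing the test data as well would misrepresent how rare hotspots are in practice.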