Classification of Arabidopsis thaliana gene sequences: clustering of coding sequences into two groups according to codon usage improves gene prediction

Journal of Molecular Biology. 1999. Vol. 285. No. 5. P. 1977–1991.

Mathe C., Peresetsky A., Dehais P., Van Montagu M., Rouze P.

While genomic sequences are accumulating, finding the location of the genes remains a major issue that can be solved only for about a half of them by homology searches. Prediction methods are thus required, but unfortunately are not fully satisfying. Most prediction methods implicitly assume a unique model for genes. This is an oversimplification as demonstrated by the possibility to group coding sequences into several classes in Escherichia coli and other genomes. As no classification existed for Arabidopsis thaliana, we classified genes according to the statistical features of their coding sequences. A clustering algorithm using a codon usage model was developed and applied to coding sequences from A. thaliana, E. coli, and a mixture of both. By using it, Arabidopsis sequences were clustered into two classes. The CU1 and CU2 classes differed essentially by the choice of pyrimidine bases at the codon silent sites: CU2 genes often use C whereas CU1 genes prefer T. This classification discriminated the Arabidopsis genes according to their expressiveness, highly expressed genes being clustered in CU2 and genes expected to have a lower expression, such as the regulatory genes, in CU1. The algorithm separated the sequences of the Escherichia-Arabidopsis mixed data set into five classes according to the species, except for one class. This mixed class contained 89 % Arabidopsis genes from CU1 and 11 % E. coli genes, mostly horizontally transferred. Interestingly, most genes encoding organelle-targeted proteins, except the photosynthetic and photoassimilatory ones, were clustered in CU1. By tailoring the GeneMark CDS prediction algorithm to the observed coding sequence classes, its quality of prediction was greatly improved. Similar improvement can be expected with other prediction systems.
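The C-versus-T contrast described in the abstract can be illustrated with a minimal Python sketch. It is not the clustering algorithm from the paper, which used a codon usage model. The third codon position is used here only as a rough proxy for silent sites, and the function names and the 0.5 threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def third_position_c_fraction(cds: str) -> float:
    """Fraction of C among third-position pyrimidines (C or T) in a coding sequence.

    Assumption: the third codon position is treated as a proxy for silent sites.
    """
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    thirds = Counter(cds[i + 2] for i in range(0, usable, 3))
    pyrimidines = thirds["C"] + thirds["T"]
    if pyrimidines == 0:
        raise ValueError("no codon ends in C or T")
    return thirds["C"] / pyrimidines


def rough_class(cds: str, threshold: float = 0.5) -> str:
    """Label a sequence by its C/T preference; the threshold is hypothetical."""
    if third_position_c_fraction(cds) > threshold:
        return "CU2-like"
    return "CU1-like"


if __name__ == "__main__":
    print(rough_class("GCCGCTGCC"))  # codons GCC, GCT, GCC
    print(rough_class("GCTGCTGCC"))  # codons GCT, GCT, GCC
```

A single-number threshold is used only to make the contrast visible. A model-based clustering of full codon usage tables, as the abstract describes, would take all 64 codon frequencies into account.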

Language: English

Flexible Stock Market Algorithm

Rubchinskiy A., Chubarova D., Technology and Investment 2025 Vol. 16 No. 4 P. 211–240

The article considers one of the most famous examples of socio-economic systems characterized by significant uncertainty: the S&P-500 stock market, where shares of the 500 largest US companies are traded. A flexible algorithm for daily trading has been developed. It is based on known fixed data about the cost of shares in previous days as well as on ...

Added: December 19, 2025

An Effective Stock Market Trading Algorithm: A Retrospective Analysis Based on S&P-500 Data

Rubchinskiy A., Chubarova D. / Series WP7 "Mathematical Methods for Decision Making in Economics, Business and Politics". 2025. No. WP7/2025/01.

The article examines one of the most famous examples of socio-economic systems, characterized by significant uncertainty – the S&P-500 stock market, where shares of 500 largest US companies are traded. No assumptions are made about the probabilistic characteristics of the stock market. A flexible algorithm for daily trading has been developed, based on both known fixed data ...

Added: November 9, 2025

Computer tools in mental disorders diagnostics by oral speech

Khomenko A., Komratova A., Isakov D. et al., in: Computational linguistics and intellectual technologies. Papers from the Annual International Conference "Dialogue" (2025). Vol. 23. [s.n.], 2025. P. 147–157.

The integration of automated speech analysis in diagnosing mental health disorders is becoming increasingly significant in both clinical and computational linguistics. This study aims to construct linguistic profiles for individuals with neurocognitive and affective mental disorders. Using speech transcriptions and computational techniques relevant to the study, such as lexical clustering and stylostatistical analysis, this research looks ...

Added: October 19, 2025

Tracking Fracture Development by Clustering Thermally Stimulated Acoustic Emission Pulses in the Absence of Source Location

Индаков Г. С., Казначеев П. А., Майбук З. Я. et al., Геофизические исследования 2025 Vol. 26 No. 2 P. 99–124

The paper studies the clusterability of acoustic emission pulses during high-temperature heating of sandstone samples previously subjected to mechanical loading. Mechanical loading was applied in uniaxial mode up to a load close to the destructive one, with signs of large cracks appearing on the surface. After that, samples were subjected to thermal treatment up to 650 °C ...

Added: September 19, 2025

Analysis of the Topics of Everyday Conversations: An Expert Approach and Automatic Methods

Sherstinova T., Вепринцева Д. А., Человек: образ и сущность. Гуманитарные аспекты 2025 No. 2(62) P. 89–108

The article considers three different approaches to studying the topics of everyday conversations: expert topic annotation and two automatic methods (topic modeling and clustering). The material for the study consisted of transcripts of Russian everyday spoken language from the ORD corpus, prepared from audio recordings of spontaneous conversations made in natural communicative situations (at home, at work, at an educational institution, in a shop, at an outpatient clinic ...

Added: September 3, 2025

Polyvinylpyrrolidone–Alginate Film Barriers for Abdominal Surgery: Anti-Adhesion Effect in Murine Model

Forysenkova Anna A., Konovalova M., Fadeeva I. et al., Materials 2023 Vol. 16 No. 16 P. 5532–5549

Surgical operations on the peritoneum are often associated with the formation of adhesions, which can interfere with the normal functioning of the internal organs. The effectiveness of existing barrier materials is relatively low. In this work, the effectiveness of soluble alginate–polyvinylpyrrolidone (PVP-Alg) and non-soluble Ca ion cross-linked (PVP-Alg-Ca) films in preventing these adhesions was evaluated. ...

Added: August 12, 2025

Changes in the Expression of Genes Regulating the Response to Hypoxia, Inflammation, Cell Cycle, Apoptosis, and Epithelial Barrier Functioning during Colitis-Associated Colorectal Cancer Depend on Individual Hypoxia Tolerance

Dzhalilova D., Silina M., Tsvetkov I. et al., International Journal of Molecular Sciences 2024 Vol. 25 No. 14 Article 7801

One of the factors contributing to colorectal cancer (CRC) development is inflammation, which is mostly hypoxia-associated. This study aimed to characterize the morphological and molecular biological features of colon tumors in mice that were tolerant and susceptible to hypoxia based on colitis-associated CRC (CAC). Hypoxia tolerance was assessed through a gasping time evaluation in a ...

Added: March 25, 2025

Maksimov A. G., Telezhkina M. / NRU Higher School of Economics. Series EC "Economics". 2024. No. 271.

The paper examines similarity of models with structural changes among heterogeneous panel data units. We propose applying a cosine metric to compare angles between vectors of weighted coefficients as a measure of closeness of economic models. Testing whether the cosine metric value is zero against nonzero, positive, and negative alternatives enriches traditional testing results. The ...

Added: March 10, 2025

Tunnel Clustering Method

Aleskerov F. T., Myachin A. L., Yakuba V. I., Доклады Российской академии наук. Математика, информатика, процессы управления (formerly Доклады Академии Наук. Математика) 2024 Vol. 520 No. 1 P. 29–34

A new method for fast pattern search in high-dimensional numerical data, called "tunnel clustering", is proposed. The main advantages of the new method are relatively low computational complexity, endogenous determination of the composition and number of clusters, and a high degree of interpretability of the final results. Three different variations are described: with fixed hyperparameters, with adaptive hyperparameters, and a combined approach. Three main properties of tunnel clustering are considered. Practical application is demonstrated on synthetic ...

Added: March 3, 2025

Tunnel Clustering Method

F. T. Aleskerov, A. L. Myachin, V. I. Yakuba, Doklady Mathematics 2024 Vol. 110 No. 3 P. 474–479

We propose a novel method for rapid pattern analysis of high-dimensional numerical data, termed tunnel clustering. The main advantages of the method are its relatively low computational complexity, endogenous determination of cluster composition and number, and a high degree of interpretability of final results. We present descriptions of three different variations: one with fixed hyperparameters, ...

Added: March 3, 2025

Using Z-Numbers to Describe a Data Set

Гусейнов О., Degtyarev K. Y., IRETC MTÜ PAHTEI - Proceedings of Azerbaijan High Technical Educational Institutions 2025 Vol. 48 No. 1 P. 360–370

The concept of Z-number was proposed by Prof. Lotfi Zadeh to describe partial reliability of information, and it is a kind of fusion of fuzziness and probabilistic uncertainty. Z-number can be presented as a pair of fuzzy numbers Z(A,B) used to describe a value of a random variable X. The first component (A) is a ...

Added: February 20, 2025

International Legal Framework for the Application of Genetic Technologies: Main Features and Issues Open for Discussion

Gazina N., Teymurov E., Zakharova L., Kutafin Law Review 2022 Vol. 9 No. 1 P. 39–72

The objective of the present article is to determine the specific characteristics of the established international legal framework for the application of genetic technologies and to identify general guidelines that influence states' policies in this area. Genetic technologies evolve rapidly, raising a number of ethical and legal issues and directly affecting human rights. At the universal ...

Added: January 27, 2025

Gradient descent clustering with regularization to recover communities in transformed attributed networks

Shalileh S., Social Network Analysis and Mining 2025 Vol. 15212 P. 137–148

Community detection in attributed networks aims to recover clusters in which the within-community nodes are as interconnected and as homogeneous as possible, while the between-communities nodes are as disconnected and as heterogeneous as possible. The current research proposes a straightforward data-driven model with an integrated regularization term to recover communities. For further improvement of the ...

Added: November 30, 2024

An empirical scrutinization of four crisp clustering methods with four distance metrics and one straightforward interpretation rule

T. A. Alvandyan, S. Shalileh, Doklady Mathematics 2024 Vol. 110 No. S1 P. S236–S250

Clustering has always been in great demand by scientific and industrial communities. However, due to the lack of ground truth, interpreting its obtained results can be debatable. The current research provides an empirical benchmark on the efficiency of three popular and one recently proposed crisp clustering methods. To this end, we extensively analyzed these (four) ...

Added: November 30, 2024