Background: Affymetrix microarray technology allows one to investigate the expression of thousands of genes simultaneously under a variety of conditions. In the popular U133A microarray platform, the expression of 37% of genes is measured by more than one probeset. Discordant expression observed for two different probesets that match the same gene is a widespread phenomenon that is usually underestimated, ignored or disregarded. Results: Here we evaluate the prevalence of discordant expression in data collected using the Affymetrix HG-U133A microarray platform. In U133A, about 30% of genes annotated by two different probesets demonstrate a substantial correlation between independently measured expression values. To our surprise, sorting the probesets according to the nature of the discrepancy in their expression levels allowed the classification of the respective genes according to their fundamental functional properties, including observed enrichment in tissue-specific transcripts and alternatively spliced variants. On the other hand, an absence of discrepancies in probesets that simultaneously match several different genes allowed us to pinpoint non-expressed pseudogenes and gene groups with highly correlated expression patterns. Nevertheless, in many cases, the nature of discordant expression of two probesets that match the same transcript remains unexplained. It is possible that these probesets report differently regulated sets of transcripts or, in the best-case scenario, two different sets of transcripts that represent the same gene. Conclusion: The majority of absolute gene expression values collected using Affymetrix microarrays may not be suitable for typical interpretative downstream analysis.
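As an illustration of how concordance between two probesets annotated to the same gene might be assessed, the following minimal sketch computes the Pearson correlation of their expression values across samples. The function names and the 0.7 cutoff are hypothetical choices made for this example; the abstract does not state which correlation measure or threshold was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / sqrt(var_x * var_y)


def classify_pair(x, y, threshold=0.7):
    """Label a probeset pair; the threshold is an assumed, illustrative cutoff."""
    return "concordant" if pearson(x, y) >= threshold else "discordant"
```

For example, `classify_pair(values_probeset_a, values_probeset_b)` would be called once per gene that is represented by two probesets, with each argument holding that probeset's values over the same ordered set of arrays.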
Background. Restriction-modification (R-M) systems protect bacteria and archaea from attacks by bacteriophages and archaeal viruses. An R-M system specifically recognizes short sites in foreign DNA and cleaves it, while such sites in the host DNA are protected by methylation. Prokaryotic viruses have developed a number of strategies to overcome this host defense. The simplest anti-restriction strategy is the elimination of recognition sites in the viral genome: no sites, no DNA cleavage. Even a decrease of the number of recognition sites can help a virus to overcome this type of host defense. Recognition site avoidance has been a known anti-restriction strategy of prokaryotic viruses for decades. However, recognition site avoidance has not been systematically studied with the currently available sequence data. We analyzed the complete genomes of almost 4000 prokaryotic viruses with known host species and more than 17,000 restriction endonucleases with known specificities in terms of recognition site avoidance.
Results. We observed considerable limitations of recognition site avoidance as an anti-restriction strategy. Namely, the avoidance of recognition sites is specific for dsDNA and ssDNA prokaryotic viruses. Avoidance is much more pronounced in the genomes of non-temperate bacteriophages than in the genomes of temperate ones. Avoidance is not observed for the sites of Type I and Type IIG systems and is very rarely observed for the sites of Type III systems. The vast majority of avoidance cases concern recognition sites of orthodox Type II restriction-modification systems. Even under these constraints, complete or almost complete elimination of sites is observed for approximately one-tenth of viral genomes and a significant under-representation for approximately one-fourth of them.
Conclusions. Avoidance of recognition sites of restriction-modification systems is a widespread but not universal anti-restriction strategy of prokaryotic viruses.
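Under-representation of a recognition site can be illustrated by comparing its observed count in a genome with the count expected by chance. The sketch below assumes a simple independent-base (mononucleotide composition) model for the expectation and counts matches on one strand only, which is sufficient for palindromic sites such as GAATTC. This is an assumption made for illustration and is not necessarily the statistic used in the study; all function names are hypothetical.

```python
from collections import Counter


def count_overlapping(seq, site):
    """Count occurrences of `site` in `seq`, allowing overlaps."""
    k = len(site)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == site)


def expected_count(seq, site):
    """Expected count under an independent-base model (an assumed null model)."""
    freqs = Counter(seq)
    n = len(seq)
    prob = 1.0
    for base in site:
        prob *= freqs[base] / n
    return prob * (n - len(site) + 1)


def avoidance_ratio(seq, site):
    """Observed/expected ratio; values well below 1 suggest site avoidance."""
    expected = expected_count(seq, site)
    if expected == 0:
        raise ValueError("site cannot occur given the base composition")
    return count_overlapping(seq, site) / expected
```

A ratio of 0 corresponds to complete elimination of the site. Deciding whether a ratio below 1 is statistically significant requires an additional test, which is outside the scope of this sketch.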
Error correction of sequenced reads remains a difficult task, especially in single-cell sequencing projects with extremely non-uniform coverage. While existing error correction tools designed for standard (multi-cell) sequencing data usually come up short in single-cell sequencing projects, the algorithms actually used for single-cell error correction have so far been very simplistic.
We introduce several novel algorithms based on Hamming graphs and Bayesian subclustering in our new error correction tool BAYESHAMMER. While BAYESHAMMER was designed for single-cell sequencing, we demonstrate that it also improves on existing error correction tools for multi-cell sequencing data while working much faster on real-life datasets. We benchmark BAYESHAMMER on both k-mer counts and actual assembly results with the SPADES genome assembler.
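The notion of a Hamming graph on k-mers can be sketched as follows: k-mers are vertices, two k-mers are joined when they differ in at most `tau` positions, and connected components group k-mers that may derive from the same true sequence. The code below is a naive quadratic-time illustration using a union-find structure. It is not BAYESHAMMER's implementation, and the parameter name `tau` and the function names are assumptions made for this example.

```python
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("k-mers must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_components(kmers, tau=1):
    """Connected components of the graph joining k-mers within distance tau."""
    kmers = sorted(set(kmers))
    parent = {k: k for k in kmers}

    def find(k):
        # Follow parent links to the root, compressing the path on the way.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Naive all-pairs comparison; real tools avoid this quadratic step.
    for a, b in combinations(kmers, 2):
        if hamming(a, b) <= tau:
            parent[find(a)] = find(b)

    groups = {}
    for k in kmers:
        groups.setdefault(find(k), []).append(k)
    return sorted(groups.values())
```

Each component returned here would then be a candidate for further subclustering, for instance to separate a high-count k-mer from its likely erroneous neighbors.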
BACKGROUND: Chlamydia are ancient intracellular pathogens with a reduced, though strikingly conserved, genome. Despite their parasitic lifestyle and isolated intracellular environment, these bacteria have managed to avoid the accumulation of deleterious mutations and the subsequent genome degradation characteristic of many parasitic bacteria. RESULTS: We report a pan-genomic analysis of sixteen species from the genus Chlamydia, including identification and functional annotation of orthologous genes, and characterization of gene gains, losses, and rearrangements. We demonstrate the overall genome stability of these bacteria, as indicated by a large fraction of common genes with conserved genomic locations. On the other hand, extreme evolvability is confined to several paralogous gene families, such as polymorphic membrane proteins and phospholipase D, and is likely caused by pressure from the host immune system. CONCLUSIONS: This combination of a large, conserved core genome and a small, evolvable periphery likely reflects the balance between the selective pressure towards genome reduction and the need to adapt in order to escape host immunity.
A vast amount of DNA variation is being identified by increasingly large-scale exome and genome sequencing projects. To be useful, variants require accurate functional annotation, and a wide range of tools are available to this end. McCarthy et al. recently demonstrated large differences in the prediction of loss-of-function (LoF) variation when RefSeq and Ensembl transcripts are used for annotation, highlighting the importance of the reference transcripts on which variant functional annotation is based.
We describe a detailed analysis of the similarities and differences between the gene and transcript annotation in the GENCODE and RefSeq genesets. We demonstrate that the GENCODE Comprehensive set is richer in alternative splicing, novel CDSs, novel exons and has higher genomic coverage than RefSeq, while the GENCODE Basic set is very similar to RefSeq. Using RNAseq data we show that exons and introns unique to one geneset are expressed at a similar level to those common to both. We present evidence that the differences in gene annotation lead to large differences in variant annotation where GENCODE and RefSeq are used as reference transcripts, although this is predominantly confined to non-coding transcripts and UTR sequence, with at most ~30% of LoF variants annotated discordantly. We also describe an investigation of dominant transcript expression, showing that it both supports the utility of the GENCODE Basic set in providing a smaller set of more highly expressed transcripts and provides a useful, biologically-relevant filter for further reducing the complexity of the transcriptome.
The reference transcripts selected for variant functional annotation do have a large effect on the outcome. The GENCODE Comprehensive transcripts contain more exons, have greater genomic coverage and capture many more variants than RefSeq in both genome and exome datasets, while the GENCODE Basic set shows a higher degree of concordance with RefSeq and has fewer unique features. We propose that the GENCODE Comprehensive set has great utility for the discovery of new variants with functional potential, while the GENCODE Basic set is more suitable for applications demanding less complex interpretation of functional variants.
In the process of retrotransposition, LINEs use their own machinery for copying and inserting themselves into new genomic locations, while SINEs are parasitic and require the machinery of LINEs. The exact mechanism by which a LINE-encoded reverse transcriptase (RT) recognizes its own RNA and SINE RNA remains unclear. However, it has been shown for the stringent-type LINEs that recognition of a stem-loop at the 3'UTR by RT is essential for retrotransposition. For the relaxed-type LINEs, it is believed that the poly-A tail is a common recognition element between LINE and SINE RNA. However, polyadenylation is a property of any messenger RNA, and how the LINE RT distinguishes transposon from non-transposon RNAs remains an open question. It is likely that RNA secondary structures play an important role in RNA recognition by LINE-encoded proteins.
Here we selected a set of L1 and Alu elements from the human genome and investigated their sequences for the presence of position-specific stem-loop structures. We found highly conserved stem-loop positions at the 3'UTR. Comparative structural analyses of a human L1 3'UTR stem-loop showed a similarity to the 3'UTR stem-loops of the stringent-type LINEs, which were experimentally shown to be recognized by LINE RT. The consensus stem-loop structure consists of a 5-7 bp loop and an 8-10 bp stem with a bulge at a distance of 4-6 bp from the loop. The results show that a stem-loop with a bulge exists at the 3'-end of Alu. We also found conserved stem-loop positions at the 5'UTR and at the end of ORF2, and discuss their possible role.
Here we present evidence for the presence of a highly conserved 3'UTR stem-loop structure in L1 and Alu retrotransposons in the human genome. Both stem-loops show structural similarity to the stem-loops of the stringent-type LINEs that were experimentally confirmed as essential for retrotransposition. We hypothesize that both L1 and Alu RNA are recognized by L1 RT via the 3'-end RNA stem-loop structure. Other conserved stem-loop positions in L1 suggest possible functions in protein-RNA interactions, but to date no experimental evidence has been reported.
Background: Rust fungi are biotrophic basidiomycete plant pathogens that cause major diseases on plants and trees world-wide, affecting agriculture and forestry. Their biotrophic nature precludes many established molecular genetic manipulations and lines of research. The generation of genomic resources for these microbes is leading to novel insights into biology such as interactions with the hosts and guiding directions for breakthrough research in plant pathology. Results: To support gene discovery and gene model verification in the genome of the wheat leaf rust fungus, Puccinia triticina (Pt), we have generated Expressed Sequence Tags (ESTs) by sampling several life cycle stages. We focused on several spore stages and isolated haustorial structures from infected wheat, generating 17,684 ESTs. We produced sequences from both the sexual (pycniospores, aeciospores and teliospores) and asexual (germinated urediniospores) stages of the life cycle. From pycniospores and aeciospores, produced by infecting the alternate host, meadow rue (Thalictrum speciosissimum), 4,869 and 1,292 reads were generated, respectively. We generated 3,703 ESTs from teliospores produced on the senescent primary wheat host. Finally, we generated 6,817 reads from haustoria isolated from infected wheat as well as 1,003 sequences from germinated urediniospores. Along with 25,558 previously generated ESTs, we compiled a database of 13,328 non-redundant sequences (4,506 singlets and 8,822 contigs). Fungal genes were predicted using the EST version of the self-training GeneMarkS algorithm. To refine the EST database, we compared EST sequences by BLASTN to a set of 454 pyrosequencing-generated contigs and Sanger BAC-end sequences derived both from the Pt genome, and to ESTs and genome reads from wheat. A collection of 6,308 fungal genes was identified and compared to sequences of the cereal rusts, Puccinia graminis f. sp. tritici (Pgt) and stripe rust, P. striiformis f. sp. tritici (Pst), and poplar leaf rust Melampsora species, and the corn smut fungus, Ustilago maydis (Um). While extensive homologies were found, many genes appeared novel and species-specific; over 40% of genes did not match any known sequence in existing databases. Focusing on spore stages, direct comparison to Um identified potential functional homologs, possibly allowing heterologous functional analysis in that model fungus. Many potentially secreted protein genes were identified by similarity searches against genes and proteins of Pgt and Melampsora spp., revealing apparent orthologs. Conclusions: The current set of Pt unigenes contributes to gene discovery in this major cereal pathogen and will be invaluable for gene model verification in the genome sequence.
BACKGROUND: Escherichia coli (E. coli) has been increasingly implicated in the pathogenesis of Crohn's disease (CD). The phylogeny of E. coli isolated from Crohn's disease patients (CDEC) was controversial, and while genotyping results suggested heterogeneity, the sequenced strains of E. coli from CD patients were closely related. RESULTS: We performed shotgun genome sequencing of 28 E. coli isolates from ten CD patients and compared the genomes of these isolates with already published genomes of CD strains and other pathogenic and non-pathogenic strains. CDEC was shown to belong to the A, B1, B2 and D phylogenetic groups. The plasmid and several operons from the reference CD-associated E. coli strain LF82 were demonstrated to be present more often in CDEC genomes belonging to different phylogenetic groups than in genomes of commensal strains. The operons include the carbon-source induced invasion GimA island, prophage I, iron uptake operons I and II, capsular assembly pathogenetic island IV and the propanediol and galactitol utilization operons. CONCLUSIONS: Our findings suggest that CDEC are phylogenetically diverse. However, some strains isolated from independent sources possess highly similar chromosomes or plasmids. Though no CD-specific genes or functional domains were present in all CD-associated strains, some genes and operons are found more often in the genomes of CDEC than in commensal E. coli. They are principally linked to gut colonization and utilization of propanediol and other sugar alcohols.
BACKGROUND: The genus Burkholderia consists of species that occupy remarkably diverse ecological niches. Its best known members are the important pathogens B. mallei and B. pseudomallei, which cause glanders and melioidosis, respectively. Burkholderia genomes are unusual due to their multichromosomal organization, generally comprising 2-3 chromosomes. RESULTS: We performed an integrated genomic analysis of 127 Burkholderia strains. The pan-genome is open, with saturation to be reached between 86,000 and 88,000 genes. The reconstructed rearrangements indicate a strong avoidance of intra-replichore inversions that is likely caused by selection against the transfer of large groups of genes between the leading and the lagging strands. Translocated genes also tend to retain their position in the leading or the lagging strand, and this selection is stronger for large syntenies. Integrated reconstruction of chromosome rearrangements in the context of strain phylogeny reveals parallel rearrangements that may indicate inversion-based phase variation and integration of new genomic islands. In particular, we detected parallel inversions in the second chromosomes of B. pseudomallei with breakpoints formed by genes encoding membrane components of a multidrug resistance complex, which may be linked to a phase variation mechanism. Two genomic islands, spreading horizontally between chromosomes, were detected in the B. cepacia group. CONCLUSIONS: This study demonstrates the power of integrated analysis of pan-genomes, chromosome rearrangements, and selection regimes. Non-random inversion patterns indicate selective pressure; inversions are particularly frequent in the recent pathogen B. mallei and, together with periods of positive selection at other branches, may indicate adaptation to new niches. One such adaptation could be a possible phase variation mechanism in B. pseudomallei.
Differential RNA-sequencing (dRNA-seq) is indispensable for determination of primary transcriptomes. However, using dRNA-seq data to map transcriptional start sites (TSSs) and promoters genome-wide is a bioinformatics challenge. We performed dRNA-seq of Bradyrhizobium japonicum USDA 110, the nitrogen-fixing symbiont of soybean, and developed algorithms to map TSSs and promoters.
A specialized machine learning procedure for TSS recognition allowed us to map 15,923 TSSs: 14,360 in free-living bacteria, 4329 in symbiosis with soybean and 2766 in both conditions. Further, we provide proteomic evidence for 4090 proteins, among them 107 proteins corresponding to new genes and 178 proteins with N-termini different from the existing annotation (72 and 109 of them with TSS support, respectively). Guided by proteomic evidence, previously identified TSSs and TSSs experimentally validated here, we assign a score threshold to flag 14% of the mapped TSSs as a class of lower confidence. However, this lower-confidence class contains valid TSSs of low-abundance transcripts. Moreover, we developed a de novo algorithm to identify promoter motifs upstream of mapped TSSs, which is publicly available, and found motifs mainly used in symbiosis (similar to RpoN-dependent promoters) or under both conditions (similar to RpoD-dependent promoters). Mapped TSSs and putative promoters, proteomic evidence and updated gene annotation were combined into an annotation file.
The genome-wide TSS and promoter maps along with the extended genome annotation of B. japonicum represent a valuable resource for future systems biology studies and for detailed analyses of individual non-coding transcripts and ORFs. Our data will also provide new insights into bacterial gene regulation during the agriculturally important symbiosis between rhizobia and legumes.
The general structure and action of all eukaryotic and archaeal RNA polymerase machinery are astonishingly similar despite the diversity of core promoter sequences in different species. The goal of our work is to find common characteristics of a DNA region that define it as a promoter for RNA polymerase II (Pol II).
The profiles of a large number of physical and structural characteristics, averaged over representative sets of Pol II minimal core promoters from evolutionarily divergent species of animals, plants and unicellular fungi, were analysed. In addition to the characteristics defined at the base-pair steps, we use, for the first time, profiles of the ultrasonic cleavage and DNase I cleavage indexes, which are informative about the internal properties of each complementary strand.
DNA of the core promoters of metazoans and Schizosaccharomyces pombe has a similar structural organization. Its mechanical and 3D structural characteristics have singular properties at the positions of the TATA-box. The minor groove is broadened and conformational motion is decreased in that region. Special characteristics of conformational behavior are revealed in metazoans in the region that connects the end of the TATA-box and the transcription start site (TSS). The intensities of conformational motions in the complementary strands change periodically in opposite phases. These features are most noticeable in mammals. Such conformational features are lacking in the core promoters of S. pombe. The profiles of Saccharomyces cerevisiae core promoters differ significantly: their singular region is shifted down, thus pointing to the uniqueness of their structural organization. The results obtained may be useful in genetic engineering for artificial modulation of promoter strength.