Our goal is to build scientific software capable of handling hundreds of thousands of genomes to gain analytical insights for predictive molecular biology research.

Our software collection can be found at: .

Software Overview:


Description: The rapidly growing number of sequenced genomes enables a new type of biological research. Compared with one another, these genomes reveal how biological information is encoded at the molecular level and how this information changes over evolutionary time. The first step of any genome-based study, however, is to retrieve genomes and their annotation from databases. To automate this retrieval at a metagenomic scale, the biomartr package provides interface functions for genomic sequence retrieval and functional annotation retrieval for thousands of species. The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses. In addition, biomartr aims to address the genome version crisis: many large-scale genomics studies do not document which genome versions they used, which makes reproducing and interpreting their results nearly impossible. With biomartr, users are informed about, and can control, the genome versions they retrieve automatically. In particular, biomartr automates the retrieval of genome, proteome, CDS, RNA, repeat, GFF/GTF (annotation), genome assembly quality, and metagenome project data from the major biological databases and performs quality checks of the underlying genome assemblies.



Description: The R package metablastr harnesses the power of sequence search tools by providing interface functions between R and the standalone (command-line tool) version of BLAST for sequence searches against thousands of genomes or proteomes. In addition to providing interface functions, metablastr provides a scalable database backend infrastructure and analytics tools to store and handle the extensive search output generated by BLAST when handling thousands of genomes.



Description: Developed by Benjamin Buchfink, DIAMOND2 is an ultra-fast sequence aligner for protein and translated DNA searches, designed for high-performance analysis of big sequence data. The key features are:

  • Pairwise alignment of proteins and translated DNA at 500x to 20,000x the speed of BLAST.
  • Frameshift alignments for long read analysis.
  • Low resource requirements, suitable for running on standard desktops or laptops.
  • Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.
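
The tabular output listed above is straightforward to post-process. The following is a minimal Python sketch, assuming the standard 12-column BLAST tabular layout (qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name `best_hits` and the example records are hypothetical illustrations and are not part of DIAMOND itself.

```python
import csv
import io

# Column names of the standard 12-column BLAST tabular format (assumed layout).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle):
    """Return the highest-bitscore hit per query from tab-separated search output."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


# Hypothetical example records, made up for illustration only.
example = io.StringIO(
    "q1\ts1\t98.0\t100\t2\t0\t1\t100\t1\t100\t1e-50\t200.0\n"
    "q1\ts2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t90.0\n"
    "q2\ts3\t75.5\t80\t19\t1\t5\t84\t3\t82\t1e-10\t60.5\n"
)
hits = best_hits(example)
print({query: hit["sseqid"] for query, hit in hits.items()})
```

Keeping only one record per query keeps memory use proportional to the number of queries rather than the number of hits, which matters when search output spans thousands of genomes.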

The initial DIAMOND version is described in the following publication:

Buchfink B, Xie C, Huson DH, “Fast and sensitive protein alignment using DIAMOND”, Nature Methods 12, 59-60 (2015). doi:10.1038/nmeth.3176



Description: The comparative method is a powerful approach in genomics research. Based on our knowledge of the phylogenetic relationships between species, we can study the evolution, diversification, and constraints of biological processes by comparing genomes, genes, and other genomic loci across species. The orthologr package aims to provide a framework for large-scale comparative genomics studies with thousands of genomes. It aims to be as easy to use as possible, from genomic data retrieval to orthology inference and dN/dS estimation between several genomes. In combination with the R package biomartr, users can retrieve thousands of genomes, proteomes, or coding sequences for diverse species and use them as input for orthology inference and dN/dS estimation with orthologr. Combining biomartr with orthologr supports computational reproducibility in genomics studies and helps address the problem of comparing thousands of genomes with different genome assembly qualities.
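
The description above does not say which orthology inference method is applied. One widely used heuristic is the reciprocal best hit criterion: two genes from different species are treated as putative orthologs when each is the other's best sequence-search hit. The Python sketch below illustrates that idea only; the function name and gene identifiers are hypothetical, and it is not a description of orthologr's interface.

```python
def reciprocal_best_hits(a_to_b, b_to_a):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit.

    a_to_b maps each gene of species A to its best hit in species B;
    b_to_a maps each gene of species B to its best hit in species A.
    """
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


# Hypothetical best-hit tables, made up for illustration only.
a_to_b = {"geneA1": "geneB1", "geneA2": "geneB2", "geneA3": "geneB2"}
b_to_a = {"geneB1": "geneA1", "geneB2": "geneA3"}

pairs = reciprocal_best_hits(a_to_b, b_to_a)
print(pairs)
```

Here geneA2 is dropped because its best hit, geneB2, points back to geneA3 instead, so the relationship is not reciprocal.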



Description: Evolutionary transcriptomics studies can serve as a first approach to screen in silico for the potential existence of evolutionary constraints within a biological process of interest. This is achieved by combining comparative genomics output with transcriptome data to quantify transcriptome conservation patterns and their underlying gene sets in biological processes. The exploratory analysis functions implemented in the R package myTAI provide users with an efficient, scalable, and automated framework to detect patterns of evolutionary constraints in any transcriptome dataset of interest.
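
One way to combine gene age estimates with expression data is the transcriptome age index (TAI): for each stage, the mean gene age (phylostratum) weighted by expression level. The description above does not define a specific measure, so the Python sketch below is an assumption-based illustration of that weighted mean. The function name and all values are hypothetical, and it does not reproduce myTAI's implementation.

```python
def transcriptome_age_index(phylostrata, expression):
    """Expression-weighted mean phylostratum for a single stage.

    phylostrata: gene age classes (1 = oldest), one per gene.
    expression:  non-negative expression levels, one per gene.
    """
    if len(phylostrata) != len(expression):
        raise ValueError("phylostrata and expression must have the same length")
    total = sum(expression)
    if total <= 0:
        raise ValueError("total expression must be positive")
    return sum(ps * e for ps, e in zip(phylostrata, expression)) / total


# Hypothetical toy data: four genes across three stages.
phylostrata = [1, 1, 2, 3]
stages = {
    "early": [10.0, 10.0, 5.0, 5.0],
    "mid": [20.0, 20.0, 2.0, 1.0],
    "late": [5.0, 5.0, 10.0, 10.0],
}
tai_profile = {
    stage: transcriptome_age_index(phylostrata, expr)
    for stage, expr in stages.items()
}
print(tai_profile)
```

A lower value indicates that the transcriptome at that stage is dominated by evolutionarily older genes; in this toy profile the middle stage has the lowest value.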



Description: The LTRpred pipeline performs de novo annotation of intact retrotransposons within any given genome assembly and is optimized to do so across thousands of genomes. A classical annotation aims to annotate as much of the transposable element (TE) space as possible; as a result, most annotated TE loci consist of truncated, overlapping, or nested TEs. In most cases, such TEs are no longer mobile and instead reflect the historic activity of TEs in the respective genome. A functional annotation of intact elements, in contrast, targets young and functional elements, which is what efforts to mobilize extant TEs require. LTRpred aims to assist such mobilization efforts by providing a comprehensive pipeline able to detect intact and possibly functional retrotransposons in any genome assembly.



Description: The retrocombinator package provides a comprehensive framework to simulate the molecular evolution of retrotransposon recombination. Given a set of input sequences, the package simulates the reproductive biology of retrotransposons, taking their extrachromosomal reverse-transcription-mediated recombination capacity into account. This allows users to simulate the sequence divergence of retrotransposon families over long evolutionary time scales.



Description: The R package philentropy implements more than 46 fundamental distance and similarity measures to quantify distances between probability density functions, as well as traditional information theory measures. Comparison is a fundamental method of scientific research, leading to more general insights about the processes that generate similarity or dissimilarity. In statistical terms, comparisons between probability functions are performed to infer connections, correlations, or relationships between samples. The philentropy package implements optimized distance and similarity measures for such comparisons, which have their foundations in a broad range of scientific disciplines from mathematics to ecology. The aim of this package is to provide a base framework for clustering, classification, statistical inference, goodness-of-fit, non-parametric statistics, information theory, and machine learning tasks that are based on comparing univariate or multivariate probability functions.
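
As an illustration of what a comparison between two probability functions looks like, the Python sketch below computes the Jensen-Shannon divergence between two discrete probability vectors. It is a minimal, assumption-based example of one well-known measure; the function names and input vectors are hypothetical, and it does not reproduce philentropy's R interface.

```python
import math


def kullback_leibler(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: symmetric and bounded between 0 and 1."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0, so no division by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)


# Hypothetical probability vectors, made up for illustration only.
p = [0.5, 0.5]
q = [0.9, 0.1]
jsd = jensen_shannon(p, q)
print(jsd)
```

Unlike the Kullback-Leibler divergence it is built from, this measure is symmetric and stays finite when one distribution assigns zero probability to an outcome, which makes it convenient for clustering and classification tasks.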



Description: The seqstats package provides a comprehensive framework for biological sequence statistics to allow researchers to design null hypotheses and random controls when performing exploratory genomics studies.
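
To make the idea of a random control concrete, the Python sketch below tests whether a motif occurs more often in a sequence than expected by chance. The null model is an assumption chosen for illustration: shuffled copies of the sequence that preserve its nucleotide composition. All function names, the sequence, and the motif are hypothetical, and this is not the seqstats interface.

```python
import random


def count_motif(sequence, motif):
    """Count occurrences of motif in sequence, including overlapping ones."""
    width = len(motif)
    return sum(
        1
        for i in range(len(sequence) - width + 1)
        if sequence[i:i + width] == motif
    )


def shuffled(sequence, rng):
    """Return a random permutation of sequence (same composition, random order)."""
    letters = list(sequence)
    rng.shuffle(letters)
    return "".join(letters)


def empirical_p_value(sequence, motif, n_controls=1000, seed=0):
    """Fraction of shuffled controls with at least as many motif hits as observed."""
    rng = random.Random(seed)  # fixed seed so the control set is reproducible
    observed = count_motif(sequence, motif)
    at_least = sum(
        1
        for _ in range(n_controls)
        if count_motif(shuffled(sequence, rng), motif) >= observed
    )
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least + 1) / (n_controls + 1)


# Hypothetical sequence and motif, made up for illustration only.
sequence = "ATGCGCGCGCGCATATATTA"
p_value = empirical_p_value(sequence, "GCGC")
print(p_value)
```

Shuffling single nucleotides preserves base composition but not dinucleotide frequencies; a stricter null model would need to preserve those as well.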