Software

Our goal is to build scientific software capable of handling millions of genomes to gain analytical insights for predictive molecular biology research.

Our software collection can be found at:

Software Overview:

Comparative Genomics

DIAMOND (Double Index Alignment of Next-Generation Sequencing Data)

Description: DIAMOND is an ultra-fast and sensitive sequence aligner for protein and translated DNA searches at tree-of-life scale. When run in --very-sensitive or --ultra-sensitive mode, DIAMOND matches the alignment sensitivity of the gold-standard tool BLASTP while achieving up to a 360x computational speed-up.

The key features are:

  • Pairwise alignment of proteins and translated DNA at 500x to 10,000x the speed of BLAST.
  • Multiple sensitivity modes for pairwise protein alignment, ranging from fast to ultra-sensitive, with speed-ups of 80x to 10,000x over BLASTP.
  • Frameshift alignments for long read analysis.
  • Low resource requirements; suitable for running on standard desktops or laptops.
  • Various output formats, including BLAST pairwise, tabular and XML, as well as taxonomic classification.
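
As a rough illustration of how the sensitivity modes and the tabular output format are selected on the command line, the following Python sketch assembles a DIAMOND blastp call. The helper name diamond_blastp_command and the file names are hypothetical placeholders; the options used (--query, --db, --out, --outfmt 6 and the sensitivity switches) are taken from the DIAMOND command-line interface, but the DIAMOND documentation remains the authoritative reference.

```python
def diamond_blastp_command(query, database, output, sensitivity="very-sensitive"):
    """Build the argument list for a DIAMOND protein search (hypothetical helper)."""
    allowed = {"fast", "sensitive", "very-sensitive", "ultra-sensitive"}
    if sensitivity not in allowed:
        raise ValueError(f"unknown sensitivity mode: {sensitivity}")
    return [
        "diamond", "blastp",
        "--query", query,      # protein queries in FASTA format
        "--db", database,      # database built beforehand with `diamond makedb`
        "--out", output,       # file that receives the alignments
        "--outfmt", "6",       # BLAST tabular output
        f"--{sensitivity}",    # sensitivity mode switch
    ]


# File names are placeholders; pass the list to subprocess.run() to execute it.
cmd = diamond_blastp_command("queries.faa", "reference", "matches.tsv")
print(" ".join(cmd))
```

Building the call as a list rather than a single string keeps file names with spaces intact when the command is later handed to subprocess.run.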

Webpage: https://github.com/bbuchfink/diamond

Maintainer: Benjamin Buchfink

Initially developed by Benjamin Buchfink in Daniel Huson’s lab.

(see B Buchfink, C Xie, DH Huson. Nature Methods 12 (1), 59-60).

biomartr

Description: Fully automated genomic data retrieval at tree-of-life scale

The major aim of biomartr is to facilitate computational reproducibility and large-scale handling of genomic data for (meta-)genomic analyses. In addition, biomartr aims to address the genome version crisis: with biomartr, users can control and are informed about the genome versions they retrieve automatically. Many large-scale genomics studies do not document this information, which makes reproducing and interpreting their results nearly impossible. In particular, biomartr automates the retrieval of genome, proteome, CDS, RNA, repeat, GFF/GTF (annotation), genome assembly quality, and metagenome project data from the major biological databases and performs quality checks of the underlying genome assemblies.

Webpage: https://ropensci.github.io/biomartr/

Maintainer: Hajk-Georg Drost

orthologr

Description: Genome-wide orthology inference and dN/dS estimation

The comparative method is a powerful approach in genomics research. Based on our knowledge of the phylogenetic relationships between species, we can study the evolution, diversification, and constraints of biological processes by comparing genomes, genes, and other genomic loci across species. The orthologr package provides a framework for large-scale comparative genomics studies with thousands of genomes. It aims to be as easy to use as possible, from genomic data retrieval to orthology inference and dN/dS estimation between several genomes. In combination with the R package biomartr, users can retrieve thousands of genomes, proteomes, or coding sequences for diverse species and use them as input for orthology inference and dN/dS estimation with orthologr. Combining biomartr with orthologr supports computational reproducibility in genomics studies and helps address the problem of comparing thousands of genomes with different genome assembly qualities.

Webpage: https://drostlab.github.io/orthologr/

Maintainer: Hajk-Georg Drost

metablastr

Description: Massive BLAST-fuelled sequence searches with R

The R package metablastr harnesses the power of sequence search tools by providing interface functions between R and the standalone (command-line tool) version of BLAST for sequence searches against thousands of genomes or proteomes. In addition to providing interface functions, metablastr provides a scalable database backend infrastructure and analytics tools to store and handle the extensive search output generated by BLAST when handling thousands of genomes.

Webpage: https://drostlab.github.io/metablastr/

Maintainer: Hajk-Georg Drost

Evolutionary Transcriptomics

myTAI

Description: Evolutionary transcriptomics with R

Evolutionary transcriptomics studies can serve as a first approach to screen in silico for the potential existence of evolutionary constraints within a biological process of interest. This is achieved by combining comparative genomics output with transcriptome data to quantify transcriptome conservation patterns and their underlying gene sets in biological processes. The exploratory analysis functions implemented in R package myTAI provide users with an efficient, scalable, and automated framework to detect patterns of evolutionary constraints in any transcriptome dataset of interest.
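
To make the idea of combining comparative genomics output with transcriptome data more concrete, here is a minimal Python sketch of the transcriptome age index (TAI), the expression-weighted mean of gene ages from which the package takes its name. It is not myTAI's implementation; the function name and the toy values are assumptions made for illustration, with gene ages given as phylostratum ranks (1 = oldest).

```python
def transcriptome_age_index(phylostrata, expression):
    """Expression-weighted mean gene age for one sample (illustrative sketch).

    phylostrata: age rank of each gene, e.g. 1 for the oldest stratum
    expression:  expression level of each gene in the same order
    """
    if len(phylostrata) != len(expression):
        raise ValueError("phylostrata and expression must have the same length")
    total = sum(expression)
    if total == 0:
        raise ValueError("total expression must be non-zero")
    weighted = sum(ps * e for ps, e in zip(phylostrata, expression))
    return weighted / total


# Toy data: two old genes (stratum 1) and one young gene (stratum 3).
tai = transcriptome_age_index([1, 1, 3], [10, 30, 60])
print(round(tai, 2))
```

A higher value indicates that the sample's transcriptome is dominated by evolutionarily younger genes; comparing the value across developmental stages or conditions is what reveals a conservation pattern.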

Webpage: https://github.com/drostlab/myTAI

Maintainer: Hajk-Georg Drost

Causal Inference of Gene Regulatory Networks

edgynode

Description: Evolutionary simulation and statistical assessment of gene regulatory networks

Webpage: https://github.com/drostlab/edgynode

Maintainer: Hajk-Georg Drost

Network Inference Pipeline

Description: Network Inference Steps using bulk- or single-cell RNAseq data as input

Webpage: https://github.com/drostlab/network-inference-pipeline

Maintainer: Lukas Maischak

Network Inference Toolbox

Description: Singularity containers wrapping the most prominent bulk- and single-cell network inference tools

Webpage: https://github.com/drostlab/network-inference-toolbox

Maintainer: Lukas Maischak

Transposable Element Annotation at Tree-of-Life Scale

LTRpred

Description: De novo annotation of young and intact retrotransposons

The LTRpred pipeline performs de novo annotation of intact retrotransposons within any given genome assembly and is optimized to do so across thousands of genomes. A classical annotation differs from a functional annotation of intact elements in its goal: a classical annotation tries to cover as much of the TE space as possible, and as a result most annotated TE loci consist of truncated, overlapping, or nested TEs. In most cases, such TEs are no longer mobile and instead reflect the historic activity of TEs in the respective genome. However, when aiming to mobilize extant TEs, we need a high-quality annotation of young and functional elements within a genome of interest. LTRpred assists such mobilization efforts by providing a comprehensive pipeline that detects intact and possibly functional retrotransposons in any genome assembly.

Webpage: https://hajkd.github.io/LTRpred/

Maintainer: Hajk-Georg Drost

retrocombinator

Description: Simulating the molecular evolutionary process of retrotransposon recombination

The retrocombinator package provides a comprehensive framework to simulate the molecular evolution of retrotransposon recombination. Given a set of input sequences, the package simulates the reproductive biology of retrotransposons, taking their extrachromosomal reverse-transcription-mediated recombination capacity into account. This allows users to simulate the sequence divergence of retrotransposon families over long evolutionary time scales.

Webpage: https://drostlab.github.io/retrocombinator/

Maintainers: Anindya Sharma and Hajk-Georg Drost

Statistical Learning and Information Theory

philentropy

Description: Information Theory and Distance Quantification in R

The R package philentropy implements more than 46 fundamental distance and similarity measures for quantifying distances between probability density functions, as well as traditional information theory measures. Comparison is a fundamental method of scientific research, leading to more general insights about the processes that generate similarity or dissimilarity. In statistical terms, comparisons between probability functions are performed to infer connections, correlations, or relationships between samples, and such comparisons have their foundations in a broad range of scientific disciplines from mathematics to ecology. The philentropy package implements optimized versions of these measures. Its aim is to provide a base framework for clustering, classification, statistical inference, goodness-of-fit, non-parametric statistics, information theory, and machine learning tasks that are based on comparing univariate or multivariate probability functions.
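
As an example of the kind of measure involved, the following Python sketch computes the Jensen-Shannon divergence between two discrete probability vectors. It only illustrates the concept and is not philentropy's implementation or interface; the function names and the example vectors are assumptions.

```python
import math


def kullback_leibler(p, q):
    """KL divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two probability vectors."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    for vec in (p, q):
        if not math.isclose(sum(vec), 1.0, abs_tol=1e-9):
            raise ValueError("each vector must sum to 1")
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the logarithms in kullback_leibler are always defined.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kullback_leibler(p, m) + 0.5 * kullback_leibler(q, m)


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(jensen_shannon(p, q))
```

With base-2 logarithms the divergence lies between 0 (identical distributions) and 1 (distributions with no overlap), which makes it convenient as a bounded, symmetric alternative to the Kullback-Leibler divergence.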

Webpage: https://drostlab.github.io/philentropy/

Maintainer: Hajk-Georg Drost