The microbiome can be defined as the community of microorganisms that live in a particular environment. Metagenomics is the practice of sequencing DNA from the genomes of all organisms present in a particular sample, and has become a common method for the study of microbiome population structure and function. Increasingly, researchers are finding novel genes encoded within metagenomes, many of which may be of interest to the biotechnology and pharmaceutical industries. However, such “bioprospecting” requires a suite of sophisticated bioinformatics tools to make sense of the data. This review summarizes the most commonly used bioinformatics tools for the assembly and annotation of metagenomic sequence data with the aim of discovering novel genes.
The term microbiome refers to the entire community of micro-organisms that exist within any particular ecosystem, and includes bacteria, archaea, viruses, phages, fungi, and protozoa; though the majority of microbiome studies focus only on the bacteria and archaea. There are two main methods for studying the microbiome using high-throughput sequencing: marker-gene studies and whole-genome-shotgun (WGS) metagenomics. In marker-gene studies, generic primers are designed to PCR amplify a particular gene (e.g., 16S rRNA for bacteria/archaea, 18S for fungi) from all genomes present in a sample, and the resulting product is sequenced. The sequences are clustered into operational-taxonomic-units (OTUs) and these are compared across samples. Whilst fast and cheap, this method does not reveal anything else about the hundreds of thousands of genes encoded in the parts of the (meta) genomes that remained unsequenced.
Metagenomics, also referred to as WGS- or shotgun- metagenomics, can offer an alternative and complementary method. Handelsman et al. (1998) first coined the term as the functional analysis of a collection of microbial DNA extracted from soil samples. Metagenomics refer to the application of sequencing techniques to the entirety of the genomic material in the microbiome of a sample. Crucially, by sequencing the genomes of all organisms rather than a single marker gene, metagenomic studies can provide information about the function of genes, the structure and organization of genomes, identification of novel genes and biocatalysts, community structure and evolutionary relationships within the microbial community.
Advances in metagenomics have themselves been driven by advances in second- and third- generation sequencing technologies, which are now capable of producing hundreds of gigabases of DNA sequenced data at a very low cost (Watson, 2014). The high sequencing depth offered by such advances, means that even the least abundant microorganisms in an environment is possible to be represented. Modern sequencing technologies, in combination with continuing improvements in bioinformatics, have made metagenomic analysis an approachable, affordable and fast technique for most labs.
The microbiome can potentially provide a wide range of novel enzymes and biocatalysts with major applications in the marketplace, for example the biotechnology, biofuels and pharmaceutical industry (Cowan et al., 2004). Hess et al. (2011), through an extended metagenomic study, reported over 2.5 million novel genes and identified more than 27,000 putative carbohydrate-active enzymes with cellulolytic function. They also revealed the nearly complete genomes of 15 microorganisms which had never cultured in the lab. Samples were taken from the rumen of fistulated cows and sequenced using Illumina sequencing. The data were assembled using a de novo assembler and screened against public databases to define novelty. Wallace et al. (2015) also sequenced ruminal digesta samples using Illumina sequencing, assembling the data de novo. Annotation of the resulting contigs revealed over 1.5 million putative genes, with 58% having no known protein domain. Of over 2700 genes associated with methane emissions, only 0.6% had an exact match in the non-redundant protein database of the NCBI (Roehe et al., 2016).
Venter et al. (2004) discovered over 1.2 million unknown genes using metagenomic sequencing of the Sargasso Sea. Genomic libraries were sequenced, assembled into scaffolds and annotated using gene prediction software and sequence similarity tools. These data were estimated to be derived from more than 1800 different species including many newly discovered bacterial groups. Similarly, the global ocean sampling survey (Sunagawa et al., 2015) described 40 million non-redundant sequences from over 35000 species, only 0.44% of which overlapped with known reference genomes, highlighting the huge “unexplored genomic potential in our oceans.”
The above studies, and many others like them, used similar bioinformatics analysis pipelines: (a) the assembly of sequenced data (directly from environmental samples) in order to construct contiguous sequences (contigs and scaffolds), (b) the prediction of genes (and putative proteins) based on the assembled data, and (c) prediction of domains, functions and pathways for the putative proteins (Figure 1). Here, we review a collection of tools for the analysis of metagenomic microbiome sequence data with a focus on the prediction of novel genes and proteins.