CABBIO course Metagenomics hands-on

Bas E. Dutilh

In the steps below we will assemble viral metagenomes derived from twelve human gut samples (Reyes et al Nature 2010). We will use de novo cross-assembly so that we can discover sequence elements that are shared between the metagenomes (Dutilh et al Bioinformatics 2012).
We will use the Linux command line as much as possible. Explanations and help are shown on the left; commands are shown in the gray boxes on the right.

Steps

  1. Download the twelve human gut virome datasets into your working directory and unzip them. (For the record, I got the original files from the Sequence Read Archive here.)
  2. Do a de novo cross-assembly of all twelve datasets with the assembly program idba (Peng et al Bioinformatics 2012). idba is based on de Bruijn graph assembly and allows for uneven depths, making it suitable for metagenome assembly and assembly of randomly amplified datasets. This cross-assembly will combine the metagenomic sequencing reads from all twelve viromes into contigs.
  3. Each of the assembled contigs represents a viral sequence that is present in one or more of the metagenomes. A contig can be a complete viral genome or a fragment of one. Let's take a look at the output.
  4. Some assembly programs provide a file containing information about which reads were assembled into each contig, for example in a SAM/BAM or ACE file. A disadvantage of idba is that this file is not generated, but we can easily generate such a file by mapping the metagenomic reads back to the contigs. We will do this with the read mapping tool Bowtie2 (Langmead and Salzberg Nature Methods 2012).
  5. We want to identify sequences that are shared between the viral metagenomes de novo, without depending on which sequences align to a reference database. We will do this by using the cross-assembly tool crAss (Dutilh et al Bioinformatics 2012).
  6. Next, we want to discover which contig sequences are present in many different metagenomes. Open the file "crAss.contigs2reads.txt" in a spreadsheet program such as Excel or OpenOffice. This file shows how many reads from each of the twelve gut viromes were aligned to each contig.
  7. Now let's take a look at the depth profiles of these contigs.
  8. We will now extract the sequences belonging to this one chromosome.
    Note: if you got this far, you are a pro. Commands below are sketchy and might need fine tuning! Reload the page regularly (press F5) to see my latest changes.
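
The idea behind steps 4 to 6 (map the reads back to the contigs, then count how many reads from each sample land on each contig) can be sketched in a few lines of Python. This is only an illustration, not how crAss itself works: it assumes one SAM file per sample, as a read mapper such as Bowtie2 can write, and the function names, sample names and contig names are made up for the example. Only the standard SAM columns (read name, FLAG, reference name) and FLAG bits are used.

```python
from collections import Counter, defaultdict


def count_mapped_reads(sam_lines):
    """Count primary mapped alignments per contig in the lines of one SAM file."""
    counts = Counter()
    for line in sam_lines:
        # Skip blank lines and SAM header lines (header lines start with "@").
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag, contig = int(fields[1]), fields[2]
        # FLAG bits: 0x4 = read unmapped, 0x100 = secondary, 0x800 = supplementary.
        if flag & 0x4 or flag & 0x100 or flag & 0x800:
            continue
        counts[contig] += 1
    return counts


def contig_by_sample_table(sam_by_sample):
    """Build {contig: {sample: read count}} from {sample: iterable of SAM lines}."""
    table = defaultdict(dict)
    for sample, sam_lines in sam_by_sample.items():
        for contig, n in count_mapped_reads(sam_lines).items():
            table[contig][sample] = n
    return {contig: dict(row) for contig, row in table.items()}


def samples_containing(table, min_reads=1):
    """For each contig, count the samples contributing at least min_reads reads."""
    return {
        contig: sum(1 for n in row.values() if n >= min_reads)
        for contig, row in table.items()
    }


# Tiny made-up example: two samples, two contigs (hypothetical names).
demo = {
    "sampleA": [
        "@SQ\tSN:contig-1\tLN:500",
        "r1\t0\tcontig-1\t10\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r2\t16\tcontig-1\t80\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",  # unmapped read, ignored
    ],
    "sampleB": [
        "r4\t0\tcontig-1\t5\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r5\t0\tcontig-2\t7\t42\t4M\t*\t0\t0\tACGT\tIIII",
        "r6\t256\tcontig-2\t9\t0\t4M\t*\t0\t0\tACGT\tIIII",  # secondary, ignored
    ],
}
table = contig_by_sample_table(demo)
print(table)
print(samples_containing(table))
```

With real data you would pass open file handles instead of lists, for example `{"sampleA": open("sampleA.sam")}` (a hypothetical file name). Contigs that recruit reads from many samples are the candidates for sequences shared between the viromes.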

  9. We will try to re-assemble these contigs together with the reads from one of the samples with the SPAdes assembly program. SPAdes accepts "trusted contigs" in an assembly. We will only use reads from one of the samples to minimize between-sample heterogeneity that may lead to breaks in the genome sequence.
  10. Next, I would like to BLAST the highly correlated contigs against the new SPAdes scaffolds to see if they were assembled into a single genome.
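
One possible way to pick out "highly correlated" contigs from the depth profiles of step 7 is to compute the Pearson correlation between the per-sample read counts of a seed contig and those of every other contig. The sketch below assumes that this is what the correlation refers to. The profiles, contig names and the 0.9 threshold are made up for illustration, and in practice you may want to normalize the counts by the number of reads per sample first.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles; None if either is constant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        return None  # correlation is undefined for a flat profile
    return sum(a * b for a, b in zip(dx, dy)) / sqrt(sxx * syy)


def correlated_contigs(profiles, seed, threshold=0.9):
    """Names of contigs whose profile correlates with the seed at or above threshold."""
    hits = {}
    for name, profile in profiles.items():
        if name == seed:
            continue
        r = pearson(profiles[seed], profile)
        if r is not None and r >= threshold:
            hits[name] = r
    # Strongest correlation first.
    return sorted(hits, key=hits.get, reverse=True)


# Made-up read counts per sample (four samples) for four hypothetical contigs.
profiles = {
    "contig-1": [10, 0, 25, 3],
    "contig-2": [20, 0, 50, 6],   # same shape as contig-1, twice the depth
    "contig-3": [0, 30, 1, 12],   # abundant where contig-1 is rare
    "contig-4": [5, 5, 5, 5],     # flat profile, correlation undefined
}
print(correlated_contigs(profiles, "contig-1"))
```

Contigs that rise and fall together across the samples are plausible fragments of the same genome, which is what the BLAST comparison against the SPAdes scaffolds would then confirm or refute.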