CABBIO course Metagenomics hands-on

Bas E. Dutilh

In the steps below we will assemble viral metagenomes derived from twelve human gut samples (Reyes et al Nature 2010). We will use de novo cross-assembly so that we can discover sequence elements that are shared between the metagenomes (Dutilh et al Bioinformatics 2012).
We will use the Linux command line as much as possible. Explanations and help are shown on the left; commands are shown in the gray boxes on the right.

Steps

Download the twelve human gut virome datasets into your working directory and unzip them. (For the record, I got the original files from the Short Read Archive here.)

I have prepared easy-to-use tgz or zip archives of the Fasta files. Download and unzip them. For more help on "wget" see here. For more help on "tar" see here. For most commands, you can also get more information from the command line manual: just type "man command" on the command line, replacing "command" with the name of the command. Quit the manual by pressing "q".	`wget http://tbb.bio.uu.nl/dutilh/CABBIO/Reyes_fasta.tgz` `tar -zxf Reyes_fasta.tgz`
You can check the statistics of the Fasta files with my fasta.pl script. Download it and make it executable. For more help on "chmod" see here.	`wget http://tbb.bio.uu.nl/dutilh/CABBIO/fasta.pl` `chmod u+x fasta.pl` `./fasta.pl stats F*.fna`

Do a de novo cross-assembly of all twelve datasets with the assembly program idba (Peng et al Bioinformatics 2012). idba is based on de Bruijn graph assembly and allows for uneven depths, making it suitable for metagenome assembly and assembly of randomly amplified datasets. This cross-assembly will combine the metagenomic sequencing reads from all twelve viromes into contigs.

Concatenate the reads from all twelve datasets into one file. For more help on "cat" see here.	`cat F*.fna > all_reads.fna`
Run idba on the concatenated reads file. This should take a few minutes. Note that idba exits with the error message "Segmentation fault (core dumped)" but you can still use the output. For more help on idba, type "idba_hybrid" without any options.	`idba_hybrid -r all_reads.fna -o cross-assembly`
The final output file in the output directory "cross-assembly" is the file "contig.fa". For more help on "ls" see here.	`ls -lrth cross-assembly`

Each of the assembled contigs represents a viral sequence that is present in one or more of the metagenomes. This sequence can consist of a complete viral genome, or fragments thereof. Let's take a look at the output.

Check the sequences in the contigs file with the fasta.pl script. You can also use the program "less" to look into the contigs file. Quit "less" by pressing "q". For more help on "less" see here.	`./fasta.pl stats cross-assembly/contig.fa` `less -S cross-assembly/contig.fa`
What are the longest and shortest contig lengths? The command "grep" gets lines from a file that match a certain search string, in this case a ">" sign at the beginning of the line. With the pipe "|", we can forward the output from "grep" to "less", so that we can inspect it. For more help on "grep" see here.	`grep "^>" cross-assembly/contig.fa | less`
How many contigs were created? The command "wc" counts the number of lines, words, and characters in a file. When used with the pipe, "wc" counts them in the standard input stream, which in this case is the output from "grep". For more help on "wc" see here.	`grep "^>" cross-assembly/contig.fa | wc`
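
The two questions above can also be answered without reading the numbers off the screen. The following is a minimal sketch, not part of the original exercise: it builds a tiny made-up Fasta file in a temporary directory (the contig names and sequences are invented) and then counts the contigs and finds the shortest and longest one. To try it on the real assembly, point the commands at "cross-assembly/contig.fa" instead.

```shell
# Toy stand-in for cross-assembly/contig.fa; names and sequences are made up
workdir=$(mktemp -d)
cat > "$workdir/contig.fa" <<'EOF'
>contig-100_0
ACGTACGTAC
>contig-100_1
ACGT
>contig-100_2
ACGTACG
EOF

# grep -c prints the number of matching lines, i.e. the number of contigs
n_contigs=$(grep -c "^>" "$workdir/contig.fa")

# Print one length per contig by summing the sequence lines under each header
lengths=$(awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }
               { len += length($0) }
               END { if (seen) print len }' "$workdir/contig.fa")

shortest=$(printf '%s\n' "$lengths" | sort -n | head -n 1)
longest=$(printf '%s\n' "$lengths" | sort -n | tail -n 1)
echo "contigs: $n_contigs  shortest: $shortest  longest: $longest"
```

Summing the sequence lines per header means the sketch also works when a sequence is wrapped over several lines.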

Some assembly programs provide a file containing information about which reads were assembled into each contig, for example in a SAM/BAM or ACE file. A disadvantage of idba is that this file is not generated, but we can easily generate such a file by mapping the metagenomic reads back to the contigs. We will do this with the read mapping tool Bowtie2 (Langmead and Salzberg Nature Methods 2012).

Build a Bowtie2 index from the contigs assembled from the viromes.	`bowtie2-build cross-assembly/contig.fa cross-assembly`

Align all reads to the contigs using Bowtie2.
The -p option sets the number of threads (CPUs) you want to use. This depends on what your computer has available, which you can find out by typing the command "nproc".
Bowtie2 prints its output (SAM format by default) to the standard output stream (your screen), but with the ">" redirection operator we can write it to the output file "all_reads.cross-assembly.bowtie2.sam" instead.
For more help on Bowtie2, see here.	`bowtie2 -f -p 4 -x cross-assembly -U all_reads.fna > all_reads.cross-assembly.bowtie2.sam`
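
To get a feel for what the SAM file holds, the sketch below counts how many reads were aligned to each contig. It is an illustration only: the SAM records are invented, and it assumes that an unaligned read can be recognised by a "*" in the third column (the reference name), which is how unaligned single-end reads normally appear in SAM output.

```shell
# Toy SAM file; read names, contig names and alignments are made up
workdir=$(mktemp -d)
sam="$workdir/toy.sam"
{
  printf '@SQ\tSN:contig-100_0\tLN:10\n'
  printf '@SQ\tSN:contig-100_1\tLN:4\n'
  printf 'read1\t0\tcontig-100_0\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read2\t16\tcontig-100_0\t5\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'read3\t4\t*\t0\t0\t*\t*\t0\t0\tTTTT\tIIII\n'
  printf 'read4\t0\tcontig-100_1\t1\t42\t4M\t*\t0\t0\tACGT\tIIII\n'
} > "$sam"

# Skip header lines (starting with @) and unaligned reads, then tally column 3
awk -F'\t' '!/^@/ && $3 != "*" { n[$3]++ }
            END { for (c in n) print c "\t" n[c] }' "$sam" |
  sort > "$workdir/reads_per_contig.txt"

cat "$workdir/reads_per_contig.txt"
```

This only gives totals per contig. Splitting the counts by metagenome is what crAss does in the next step.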

We want to identify sequences that are shared between the viral metagenomes de novo, without depending on which sequences align to a reference database. We will do this by using the cross-assembly tool crAss (Dutilh et al Bioinformatics 2012).

Before running crAss, we need to prepare a directory with the read files from the individual gut metagenomes (in Fasta format) and the file containing the reads mapped to contigs (e.g. in SAM or ACE format). For more help on "mv" see here.	`mkdir crAss_directory` `mv F*.fna all_reads.cross-assembly.bowtie2.sam crAss_directory`
Only the latest version, crAss_v2.0, reads SAM files.	Download it from SourceForge. Go to the "Files" tab, download the latest version, and unzip it.
Then we can run crAss_v2.0.	`crAss_v2.0/crAss.pl crAss_directory`
One of the files in the output directory "crAss_directory" is the file "output.contigs2reads.txt". This file contains a list of all the contigs with the number of aligned reads from each metagenome. At the end of this file, crAss also lists the unassembled reads, but we do not need those. We can use "grep" to create a new file "crAss.contigs2reads.txt" that contains only the lines that do not start with "F".	`grep -v "^F" crAss_directory/output.contigs2reads.txt > crAss.contigs2reads.txt`

Next, we want to discover which contig sequences were present in many different metagenomes. Open the file "crAss.contigs2reads.txt" in a spreadsheet program like Excel or Open Office. This file shows how many reads from each of the twelve gut viromes were aligned to each contig.

Count, for each contig, the number of metagenomes from which it contains reads.	To do this, you need to create a new column in the file, where for each contig (row) you count the number of metagenomes that have at least one read mapped to it. In Excel, you could use a "countif" in column N for this: `=COUNTIF(B2:M2,">0")`
Contigs with reads from the most metagenomes are derived from the most widespread viral sequences in these twelve gut viromes. Sort the contigs by the count in the new column, so that the most widespread contigs come first.
From what organism are the most widespread viral sequences derived?	Use "less" or "grep" to find the sequence in the contigs file "cross-assembly/contig.fa" and then Blast it at NCBI.
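
If you prefer to stay on the command line, the counting and sorting described above can be sketched with "awk" and "sort". This is only an illustration under assumptions: the table below is invented, and it assumes a tab-separated file with one header row, the contig name in the first column, and one column of read counts per metagenome, as in the spreadsheet layout used above.

```shell
# Toy stand-in for crAss.contigs2reads.txt; all names and counts are made up
workdir=$(mktemp -d)
table="$workdir/contigs2reads.txt"
{
  printf 'contig\tS1\tS2\tS3\n'
  printf 'contig-100_0\t5\t0\t2\n'
  printf 'contig-100_1\t1\t1\t1\n'
  printf 'contig-100_2\t0\t0\t7\n'
} > "$table"

# For every contig row, count the columns with more than zero reads
# (the equivalent of COUNTIF), then sort so the most widespread come first
awk -F'\t' 'NR > 1 { n = 0
                     for (i = 2; i <= NF; i++) if ($i > 0) n++
                     print n "\t" $1 }' "$table" |
  sort -k1,1nr > "$workdir/widespread.txt"

cat "$workdir/widespread.txt"
```

The first line of the output names the contig found in the largest number of samples.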

Now let's take a look at the depth profiles of these contigs.

Because the 12 datasets contained unequal numbers of reads, normalize the read counts by dividing the number in each cell by its column total. (Actually, we also need to correct for the contig length, but we will skip that for now.) For help on fixing cell references in Excel with $-signs see here.	For example, you could use: `=B2/SUM(B$2:B$7983)`.
Plot the depth profile of the top ~50 contigs across the 12 samples in a scatter plot. If all went well, you will see that almost all the widespread contigs have very similar depth profiles. Contigs with such highly correlated depth profiles are probably derived from one chromosome that had this abundance profile across the samples.
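
The division by column totals described above can be sketched on the command line as well. The example reads a made-up table twice: the first pass sums each column, the second pass divides every cell by its column total. The file layout (tab-separated, one header row, contig name in the first column) is an assumption, and like the spreadsheet formula it does not correct for contig length.

```shell
# Toy read-count table; all names and counts are made up
workdir=$(mktemp -d)
table="$workdir/contigs2reads.txt"
{
  printf 'contig\tS1\tS2\tS3\n'
  printf 'contig-100_0\t2\t0\t2\n'
  printf 'contig-100_1\t2\t1\t1\n'
  printf 'contig-100_2\t0\t3\t7\n'
} > "$table"

# The same file is given twice: NR == FNR is only true during the first pass
awk -F'\t' -v OFS='\t' '
  NR == FNR { if (FNR > 1) for (i = 2; i <= NF; i++) total[i] += $i; next }
  FNR == 1  { print; next }                      # copy the header unchanged
  { for (i = 2; i <= NF; i++) $i = (total[i] > 0) ? $i / total[i] : 0
    print }
' "$table" "$table" > "$workdir/normalized.txt"

cat "$workdir/normalized.txt"
```

After this step every sample column sums to 1, so samples with many reads no longer dominate the profiles.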

We will now extract the sequences belonging to this one chromosome.

Still within the spreadsheet program, calculate the correlation of the depth profile of all contigs with the depth profile of (one of) the most widespread contig(s).	For example, you could add a column: `=CORREL(B$2:M$2,B2:M2)`.
Then we can extract all the contigs with a high correlation (e.g. >0.9) with the widespread one. To do this, we can use Excel to create a list of "grep" commands.	In a new column, you could add: `=CONCATENATE("grep -A1 -w ",A2," cross-assembly/contig.fa")`.
The "grep" commands should look something like this:	`grep -A1 -w contig-100_1527 cross-assembly/contig.fa` `grep -A1 -w contig-100_2326 cross-assembly/contig.fa` `grep -A1 -w contig-100_4727 cross-assembly/contig.fa` `grep -A1 -w contig-100_2808 cross-assembly/contig.fa` `grep -A1 -w contig-100_1846 cross-assembly/contig.fa` `grep -A1 -w contig-100_6609 cross-assembly/contig.fa` ... etc.
Copy-paste the column containing the list of "grep" commands into a text file "grep_commands.txt" and copy it to your working directory. Make the file "grep_commands.txt" executable, and run it. Note that "grep" is a rather slow command, and there are much smarter and faster ways to do this for larger datasets. However, with a rather small dataset like the few thousand contigs generated in our cross-assembly, this will work fine.	`chmod u+x grep_commands.txt` `./grep_commands.txt > correl_0.9.fna`
From what organism are these sequences derived?
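
As an example of one of the faster ways mentioned above, the sketch below does the correlation and the extraction without a spreadsheet or a list of "grep" commands. It is an illustration under assumptions, not the procedure of the exercise: the table and sequences are invented, "contig-100_0" is an arbitrary choice of reference contig, and the table layout (tab-separated, one header row, contig name first) is assumed. The correlation is the Pearson correlation, the same statistic that CORREL computes.

```shell
# Toy depth-profile table and toy contigs; everything here is made up
workdir=$(mktemp -d)
table="$workdir/normalized.txt"
fasta="$workdir/contig.fa"
{
  printf 'contig\tS1\tS2\tS3\n'
  printf 'contig-100_0\t1\t2\t3\n'
  printf 'contig-100_1\t2\t4\t6\n'
  printf 'contig-100_2\t3\t2\t1\n'
} > "$table"
printf '>contig-100_0\nACGT\nACGT\n>contig-100_1\nGGGG\n>contig-100_2\nTTTT\n' > "$fasta"

ref="contig-100_0"   # hypothetical choice of the widespread reference contig

# Pass 1 stores the reference profile; pass 2 prints Pearson r for every contig
awk -F'\t' -v ref="$ref" '
  function pearson(a, b, n,    i, sa, sb, saa, sbb, sab, den) {
    for (i = 1; i <= n; i++) {
      sa += a[i]; sb += b[i]
      saa += a[i] * a[i]; sbb += b[i] * b[i]; sab += a[i] * b[i]
    }
    den = sqrt((n * saa - sa * sa) * (n * sbb - sb * sb))
    return (den > 0) ? (n * sab - sa * sb) / den : 0
  }
  NR == FNR { if ($1 == ref) for (i = 2; i <= NF; i++) r[i - 1] = $i; next }
  FNR > 1   { for (i = 2; i <= NF; i++) x[i - 1] = $i
              printf "%s\t%.3f\n", $1, pearson(r, x, NF - 1) }
' "$table" "$table" > "$workdir/correlations.txt"

# Keep the names of the contigs with a correlation above 0.9
awk -F'\t' '$2 > 0.9 { print $1 }' "$workdir/correlations.txt" > "$workdir/ids.txt"

# One pass over the Fasta file: switch printing on or off at every header line
awk 'NR == FNR { want[">" $1]; next }
     /^>/      { keep = ($1 in want) }
     keep' "$workdir/ids.txt" "$fasta" > "$workdir/correl_0.9.fna"

cat "$workdir/correl_0.9.fna"
```

Because the last command reads the contigs file only once, its running time does not grow with the number of selected contigs, and it also copies sequences that span more than one line, which "grep -A1" would cut short.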

Note: if you got this far, you are a pro. Commands below are sketchy and might need fine tuning! Reload the page regularly (press F5) to see my latest changes.

We will try to re-assemble these contigs together with the reads from one of the samples with the SPAdes assembly program. SPAdes accepts "trusted contigs" in an assembly. We will only use reads from one of the samples to minimize between-sample heterogeneity that may lead to breaks in the genome sequence.

SPAdes is not yet installed, so we will have to download it first. Search Google for "SPAdes assembly", go to the downloads page, and download the Linux binaries. When you unzip them, it is good to know that the file extension ".tgz" (that we saw above) is short for ".tar.gz".
Run SPAdes with Ion Torrent settings (454 settings are not available) using reads from a metagenome of your choice (fill in at FX). Use the highly correlating contigs as trusted contigs, run the assembler only (no error correction, since we have no read quality scores in the Fasta file), and ask for a "careful" assembly. Before running, rename the input files so that they end in ".fasta", because SPAdes does not recognize the extension ".fna". The output will be written to the directory "SPAdes_reassembly".	`./SPAdes-3.5.0-Linux/bin/spades.py --iontorrent --s1 ./FX.fasta --only-assembler --trusted-contigs ./correl_0.9.fasta --careful -o SPAdes_reassembly`
Check the "scaffolds.fasta" file in the SPAdes output directory with the fasta.pl script. Are the contigs longer than those derived from the idba assembly?

Next, I would like to Blast the highly correlating contigs against the new SPAdes scaffolds to see if they were assembled into a single genome.