In the steps below we will assemble viral metagenomes derived from twelve human gut samples (Reyes et al Nature 2010).
We will use de novo cross-assembly so that we can discover sequence elements that are shared between the metagenomes (Dutilh et al Bioinformatics 2012).
We will use the Linux command line as much as possible. Explanation and help are shown on the left, commands are shown in the gray boxes on the right.
For more help on "wget" see here. For more help on "tar" see here. For most commands, you can also get more information by accessing the command line manual. For any command, just type "man command" directly on the command line. Quit the command line manual by pressing the "q". |
wget http://tbb.bio.uu.nl/dutilh/CABBIO/Reyes_fasta.tgz
tar -zxf Reyes_fasta.tgz |
For more help on "chmod" see here. |
wget http://tbb.bio.uu.nl/dutilh/CABBIO/fasta.pl
chmod u+x fasta.pl ./fasta.pl stats F*.fna |
For more help on "cat" see here. |
cat F*.fna > all_reads.fna |
Note that idba exits with the error message "Segmentation fault (core dumped)" but you can still use the output. For more help on idba, type "idba_hybrid" without any options. |
idba_hybrid -r all_reads.fna -o cross-assembly |
For more help on "ls" see here. |
ls -lrth cross-assembly |
You can also use the program "less" to look into the contigs file. Quit "less" by pressing the "q". For more help on "less" see here. |
./fasta.pl stats cross-assembly/contig.fa
less -S cross-assembly/contig.fa |
For more help on "grep" see here. |
grep "^>" cross-assembly/contig.fa | less |
For more help on "wc" see here. |
grep "^>" cross-assembly/contig.fa | wc |
bowtie2-build cross-assembly/contig.fa cross-assembly | |
The -p option indicates how many threads (CPUs) you want to use. This depends on what your computer has available. You can discover this by typing the command "nproc". Bowtie2 prints the output (SAM format by default) to your the standard output stream (your screen), but by using the ">" command we can pipe it into the output file "all_reads.cross-assembly.bowtie2.sam" instead. For more help on Bowtie2, see here. |
bowtie2 -f -p 4 -x cross-assembly -U all_reads.fna > all_reads.cross-assembly.bowtie2.sam |
For more help on "mv" see here. |
mkdir crAss_directory
mv F*.fna all_reads.cross-assembly.bowtie2.sam crAss_directory |
Download it from SourceForge. Go to the "Files" tab, download the latest version, and unzip it. | |
crAss_v2.0/crAss.pl crAss_directory | |
grep -v "^F" crAss_directory/output.contigs2reads.txt > crAss.contigs2reads.txt |
To do this, you need to create a new column in the file, where for each contig (row) you count the number of metagenomes that have at least one read mapped to it. In Excel, you could use a "countif" in column N for this: =COUNTIF(B2:M2,">0") | |
Use "less" or "grep" to find the sequence in the contigs file "cross-assembly/contig.fa" and then Blast it at NCBI. |
For help on fixing cell references in Excel with $-signs see here. |
For example, you could use: =B2/SUM(B$2:B$7983). |
For example, you could add a column: =CORREL(B$2:M$2,B2:M2). | |
In a new column, you could add: =CONCATENATE("grep -A1 -w ",A2," cross-assembly/contig.fa"). | |
The "grep" commands should look something like this: | grep -A1 -w contig-100_1527 cross-assembly/contig.fa
grep -A1 -w contig-100_2326 cross-assembly/contig.fa grep -A1 -w contig-100_4727 cross-assembly/contig.fa grep -A1 -w contig-100_2808 cross-assembly/contig.fa grep -A1 -w contig-100_1846 cross-assembly/contig.fa grep -A1 -w contig-100_6609 cross-assembly/contig.fa ... etc. |
Note that "grep" is a rather slow command, and there are much smarter and faster ways to do this for larger datasets. However, with a rather small dataset like the few thousand contigs generated in our cross-assembly, this will work fine. |
chmod u+x grep_commands.txt
./grep_commands.txt > correl_0.9.fna |
./SPAdes-3.5.0-Linux/bin/spades.py --iontorrent --s1 ./FX.fasta --only-assembler --trusted-contigs ./correl_0.9.fasta --careful -o SPAdes_reassembly | |