Bioinformatic Pattern Analysis, Computer Exercise 1

Goals (learning objectives):

become familiar with multiple alignment software (ClustalW) available on the Internet,

learn the differences between protein and nucleotide alignments,

study the effect of the chosen parameters (such as similarity matrices and gap penalties) on the alignments.

Remarks: If you are having trouble with one of the questions, have a look at the All Hints section. Write down your answers, either on paper or in digital form (e.g. a Word document). It is important to make notes! Write down for yourself (broadly) what you have done, what your results were, and try to formulate a bottom line (or in other words: what have you learned from the exercise?). If one of the webservers linked in the questions is offline or too slow, you might find alternative servers in the links section. Do not log out before you discuss your results with your TA.

Webpages of the NCBI and the EBI

Before starting the computer exercise, make sure that you have gone through the protocols for Chapter 1 and Chapter 2. These protocols will prepare you to work with the NCBI and EBI servers. Both offer a wide range of services and databases, and have quite extensive help or tutorial sections (tutorial section of the NCBI, Bioinformatics tools of the EBI). If you run into problems during this Computer Exercise (or any of the following ones), please have a look at these pages.

Cytochrome

Cytochromes are mostly membrane-bound proteins that contain heme groups and carry out electron transport or catalyze redox reactions. In eukaryotes, cytochromes are found in the inner membrane of mitochondria and in the endoplasmic reticulum (more information: Campbell p. 172).

Cytochrome b

The file CytBProt contains the amino acid sequences of the cytochrome B proteins from the mitochondrial genomes of 16 vertebrate species. Each sequence is labeled with its species name. Take a look at these sequences. Will very long gaps be necessary for aligning these sequences? Why?

Use ClustalW at the EBI (or one of the other ClustalW servers in the Links section if the EBI server does not work) to make a multiple alignment of these protein sequences. View the alignment in color. First make the alignment using a BLOSUM matrix, and then using the identity matrix. What is the effect of the scoring matrix on your alignment? Can you identify conserved regions that are longer than 10 amino acids? (See the All Hints section.)

The file CytBDNA contains the nucleotide sequences encoding the cytochrome B proteins from the same species. Align these DNA sequences (DNA alignments take longer, so be patient). Remember to set the sequence type to DNA. Is the DNA alignment what you expected, given the protein alignment? What, if anything, is unexpected? For example, are there gaps within the sequences, and if so, how large are they? Why? Which parameters can you change to correct a possible mistake? (See the All Hints section.)

What is the difference between the DNA and the protein alignments? How do you explain this?

Let us now return to the cytochrome B alignment and look at a cytochrome B protein sequence from another kingdom, for example from Arabidopsis thaliana. Find this sequence at NCBI (use the pull-down menu to search the NCBI Protein database, and take the protein sequence with accession number CAA47966.1). Realign your vertebrate sequences together with this plant sequence. Are the regions you previously identified as conserved still conserved? Examine the Entrez entry for CAA47966.1. Can you conclude anything about the functional properties of the conserved regions? (See the All Hints section.)

Hexokinases are enzymes that phosphorylate hexoses (mainly glucose). After phosphorylation the sugar can enter intracellular metabolic processes. The hexokinase file contains the amino acid sequences of hexokinases from human and dog. Take a look at the sequences in this file. Will long gaps be necessary in this alignment? Perform the alignment.

What can you say about the conservation of hexokinases based on this limited data set? Which evolves faster, hexokinase or cytochrome B?

Links

Clustalw webservers:

All Hints

The most commonly used format for sequence files on ClustalW servers (and on many other bioinformatics servers) is the FASTA format. This format is described in the NCBI help pages.
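
If you want to check a FASTA file before pasting it into a server, the following minimal Python sketch may help. It assumes the simplest form of the format (a header line starting with ">" followed by one or more sequence lines). The function name parse_fasta and the toy records are made up for illustration and are not the content of CytBProt.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict mapping header -> sequence."""
    records = {}
    label = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A header line starts a new record; drop the leading ">".
            label = line[1:].strip()
            records[label] = []
        elif label is not None:
            # Sequence lines are collected and joined at the end.
            records[label].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up toy records, NOT the real exercise data.
example = """>species_A toy sequence
MTNIRKSHPL
IKIVNNAF
>species_B toy sequence
MTNIRKTHPL
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq))
```

Printing the name and length of every record is a quick way to confirm that all 16 species are present and that the sequences have similar lengths.
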
To align sequences using a webserver, open the FASTA file in your browser or in Notepad, and paste the sequences into the sequence box. You can select the scoring matrix (BLOSUM, identity) using the Weight Matrix drop-down menu in Step 2 and Step 3 (click on more options). For simplicity, use the same matrix for the pairwise and the multiple alignments.
The labels of most of the ClustalW options at the EBI website are links to the relevant parts of the ClustalW help pages.
Always use ClustalW's slow or full alignment algorithm, and not the fast one. Some servers default to the fast algorithm, so do not forget to change this.
You can compare the alignments using JalView, or if you use the EBI server, directly on the ClustalW Results page (scroll down to see them). Click the Show Colors button to color the amino acids according to their properties. This will also make it easier to compare the sequences.
To open more than 1 alignment at a time in JalView, you need to save the result and then open it again.
Remember that the differences between alignments are in the gaps! So focus on the gaps, heads and tails.
If a position is fully conserved, it is indicated with "*". Substitutions can fall into three categories: between very similar amino acids (":"), between relatively similar amino acids (".") and between non-similar ones (indicated without any symbol).
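
To look for conserved regions longer than 10 amino acids, you can also scan the alignment with a small script. The sketch below is a simplification: it only reproduces the "*" mark (fully conserved columns) and does not attempt the ":" and "." categories, which depend on ClustalW's own amino acid groups. The function names and the toy alignment are assumptions made for illustration.

```python
def identity_line(aligned):
    """Return a string with '*' under every fully conserved column."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")
    marks = []
    for column in zip(*aligned):
        first = column[0]
        # A column counts as conserved only if no sequence has a gap there.
        conserved = first != "-" and all(c == first for c in column)
        marks.append("*" if conserved else " ")
    return "".join(marks)


def conserved_runs(marks, min_length=10):
    """Return (start, end) pairs (0-based, end exclusive) of '*' runs."""
    runs = []
    start = None
    for i, m in enumerate(marks + " "):  # trailing sentinel closes a final run
        if m == "*" and start is None:
            start = i
        elif m != "*" and start is not None:
            if i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs


# Made-up toy alignment, NOT the real cytochrome B data.
aligned = [
    "MTNIRKSHPLIK-VNN",
    "MTNIRKSHPLIKIVNN",
    "MTNLRKSHPLIKIINN",
]
marks = identity_line(aligned)
print(marks)
print(conserved_runs(marks, min_length=5))  # → [(4, 12)]
```

With the default min_length of 10 the toy alignment has no qualifying run, because its longest fully conserved stretch is only 8 columns.
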
Remember that PAM or BLOSUM matrices cannot be used to align nucleotide sequences. (Do you know why?) Use the default or identity matrix when aligning DNA sequences (ClustalW will do this automatically).
Whether or not ClustalW decides that it is 'good' to have a gap in an alignment depends of course on the gap penalty. For very low gap penalties, ClustalW may easily insert a gap to get a 'better' alignment, even if a deletion or insertion would be unlikely (when could this be the case? Think of the relation between nucleotides and proteins). For extremely high gap penalties almost all gaps will go away (except at the beginning and end of the sequence), even if a deletion or insertion could easily have taken place.
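
The effect of the gap penalty can be tried out on a toy example. The sketch below is NOT ClustalW: it is a plain pairwise global alignment (Needleman-Wunsch) with a single linear gap penalty and identity-style scoring, whereas ClustalW builds a progressive multiple alignment and uses separate gap opening and gap extension penalties. The function name, scores and sequences are assumptions chosen for illustration.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]


# Made-up toy sequences that differ at one position.
a = "ACGTACGT"
b = "ACTTACGT"
for gap in (0, -5):
    aligned_a, aligned_b, total = global_align(a, b, gap=gap)
    print(f"gap penalty {gap}: score {total}")
    print(" ", aligned_a)
    print(" ", aligned_b)
```

With free gaps (penalty 0) the algorithm inserts one gap in each sequence to avoid the single mismatch; with a penalty of -5 it accepts the mismatch and returns a gap-free alignment. Compare this with what you see when you change the gap penalties on the ClustalW server.
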
You can limit your search results at NCBI by specifying the database field you want to search. For example, cytochrome B AND arabidopsis[orgn] returns only hits from the organism Arabidopsis. You will find more on the syntax of Entrez queries in the NCBI help pages.
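
The same query string can also be sent to NCBI from a script. The sketch below only builds the request URL for the E-utilities esearch service and does not contact the server. The exercise itself uses the web interface, so treat this as an optional illustration; the helper name build_esearch_url and the choice of retmax are assumptions.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="protein", retmax=20):
    """Build an esearch URL for an Entrez query (no network access here)."""
    params = {"db": db, "term": term, "retmax": retmax}
    # urlencode escapes spaces and the square brackets of field tags.
    return ESEARCH + "?" + urlencode(params)


url = build_esearch_url("cytochrome B AND arabidopsis[orgn]")
print(url)
```

Opening the printed URL in a browser returns the matching record identifiers as XML.
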