CHEM 440
Biochemistry I

J. D. Cronk   Syllabus [ Previous | Next ] Pick a lecture:
10.header

Lecture 10. Bioinformatics

Monday 28 September 2009

Bioinformatics, Sequence alignments

Reading: BTS6 - Ch.6, pp.164-177.


10. Bioinformatics

Lecture 10 Summary

Update for Fall 2009 in progress...

Protein structure prediction: Knowledge-based and ab initio methods.

Threading algorithms are an example of a knowlege-based approach
CASP: Critical Assessment of Techniques for Protein Structure Prediction

Homology

Structural homology compared and contrasted with sequence homology

Orthologs and paralogs

Orthologs are homologs from different species that have a common function, while paralogs are homologs from the same species whose functions may have diverged somewhat.

Dotplots and matrices

 
Ch.4 in Lesk, "Alignments and Phylogenetic Trees", begins (after a brief introduction to sequence alignment) with the dotplot. A correspondence can be made between a two-dimensional matrix and a pairwise sequence alignment. To see this, construct a two-dimensional grid with a sequence s (of length m) placed left of the first vertical column, and sequence t (of length n) above the first (top horizontal) row. The characters of s label the m rows of the grid, while the characters of t label the n columns. This can be thought of as an m × n matrix, with sequence-labeled rows and columns. The elements mij (the value placed in the cell of the grid at the intersection of row i and column j) can be determined simply as follows: if the character associated with row i (ti) of the matrix is the same as the character associated with column j (sj), then mij = 1 (or place a dot in the ij cell of the grid). If the characters differ, mij = 0 (place the null character or blank in cell ij).
  Sequence alignment example: The sequence s of length m = 10 is shown along the left margin of the grid, and the sequence t (n = 11) is represented along the top. Each entry of the matrix is determined by whether the characters associated with the intersecting row and column match. To derive an alignment from the matrix, we imagine moving from the top left corner to the bottom right corner. Each move is restricted to one of three possibilities: one cell to the right, one cell down, or to the cell that is diagonal by one unit to the right and one unit lower. The collection of all possible paths through the matrix that follow these rules corresponds to all possible alignments of s and t. One such alignment is shown below the grid, and it corresponds to the path indicated in the grid by the entries in red. Note that a horizontal move to the right introduces a gap in sequence s, while a vertical move one cell down would introduce a gap in the alignment in sequence t.  

Questions:
(1) How many possible alignments are there between s and t?
(2) Find the path through the matrix that produces the alignment shown at right.

 

Alignment scores, gap penalties

The alignment above reveals an obvious similarity between the two sequences, and in this sense is informative. Alignments can be ranked by a scoring scheme, and a reasonable goal of a scoring scheme would be for it to give high scores to informative alignments, while being based on biological mechanisms and data. A simple scoring scheme would be to sum the numbers in the matrix along possible paths corresponding to alignments. In the example above, the alignment (path shown in red) would have a score of 9, for the sum of the 1's from each of the 9 matches in the alignment. (Note: only 1's that occur along the path after a diagonal move correspond to matches and are included in the sum.) But this scheme leads to trouble when unrelated sequences are compared and we look for the highest scoring alignment. In the example below, the highest scoring alignment has a score of 4, but is uninformative.

The way to handle this problem is to introduce gap penalties into the scoring scheme. We could, for instance, subtract points for each gap position in the alignment. In fact, we can introduce very complex scoring schemes, assigning different penalties for different mismatches and distinguishing between gap initiation and gap extension. The penalties for mismatches, for example, can be based on actual frequency of their occurrence in the database in alignments between sequences up to a given threshold of similarity. This is the essential idea behind the so-called substitution matrices for protein sequence alignment, PAM and BLOSUM.

PAM matrices

Percent Accepted Mutation or PAM matrices describe the probability that one base or amino acid has changed during the course of evolution. Amino acid PAM matrices are derived from families of closely related sequences and are used to assess the similarity of sequences when performing alignments.

BLOSUM matrices

An alternative to PAM tables, BLOSUM tables were derived using local multiple alignments of more distantly related sequences than were used for the PAM matrix. These are used to assess the similarity of sequences when performing alignments

Protein sequence alignment example

We'll use Problem 3 from Ch.6 of Berg, Tymoczko, and Stryer (BTS6; Ref.3).

Identity based scoring (p.168): each identity = +10 Gap penalty: −25
Blosum-62 scores assigned according to Fig.6.9 (p.169) for identities and substitutions; Gap penalties are −12 for opening a gap, and extension of an existing gap is −2 per position.

Sequence alignments for Problem 6-3, Berg et al.   Score each of these two alignments according to (i) the identity based scoring scheme, and (ii) the Blosum-62 scoring scheme.
Click here for the solution.
 

Learning objectives

Page update in progress

 

References

  1. Ch. 4 in Introduction to Bioinformatics (2nd ed.) by Arthur M. Lesk.(Oxford, 2005)
  2. David W. Mount - Bioinformatics: Sequence and Genome Analysis (Cold Spring Harbor Press, 2002)
  3. Berg, Tymoczko, and Stryer. Biochemistry (BTS): 6th edition (2007, Freeman) pp.164-177.
footer

[ E-mail: cronk@gonzaga.edu ]