(a) Differentiate between: 1) a standard sequence consensus; 2) a one-dimensional, ‘regular-expression’ motif; 3) a simple, two-dimensional weight matrix; 4) a profile position-specific scoring matrix (PSSM); and 5) a profile Hidden Markov Model (HMM).
Discuss the pros and cons and the relative power of each, and explain why and when one would be used over the others.
(b) What do log-odds scoring matrices like the BLOSUM50 table have to do with the concept of ‘pseudocounts’ and background frequencies in most types of multiple sequence alignment profiles?
(a) Assume that we have N sequences, each L = 50 residues long, and that a pairwise alignment of two such sequences takes 1 second of CPU time on our computer. An alignment of four sequences (N = 4) takes $(2L)^{N-2} = 10^{2N-4} = 10^{4}$ seconds. If we had unlimited computer memory and could wait for the answer until just before the sun burns out in five billion years, what is the largest N that our computer could align?
(b) Outline the whole-genome re-sequencing pipeline for short reads produced by a Next Generation Sequencing platform.
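The arithmetic behind part (a) can be checked numerically. The sketch below is only one way to set it up: it assumes the (2L)^(N-2) cost model stated in the question, a year of 365.25 days, and hypothetical helper names (alignment_seconds, max_sequences) that do not come from the question.

```python
# Assumption: one year = 365.25 days; the question does not fix this value.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SEQ_LENGTH = 50                           # residues per sequence (from the question)
BUDGET_SECONDS = 5e9 * SECONDS_PER_YEAR   # five billion years, in seconds


def alignment_seconds(n, length=SEQ_LENGTH):
    """CPU seconds to align n sequences under the (2L)^(n-2) cost model."""
    return float((2 * length) ** (n - 2))


def max_sequences(budget=BUDGET_SECONDS, length=SEQ_LENGTH):
    """Largest n whose modelled alignment time still fits in the budget."""
    n = 2  # a pairwise alignment costs (2L)^0 = 1 second
    while alignment_seconds(n + 1, length) <= budget:
        n += 1
    return n


print(max_sequences())
```

Because each extra sequence multiplies the cost by 2L = 100, the result is insensitive to the exact length of a year.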
(a) Show how Hidden Markov Models (HMMs) are used to build profiles, using the following alignment:
LEVK
LDIR
LEIK
LDVE
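As a starting point for part (a), the match-state emission probabilities of a profile HMM can be estimated column by column from the alignment. The sketch below is illustrative only: it assumes every column is a match state (the alignment has no gaps), uses a simple add-one pseudocount over the 20 amino acids, and ignores transition probabilities. The function name match_emissions is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALIGNMENT = ["LEVK", "LDIR", "LEIK", "LDVE"]  # alignment from the question


def match_emissions(alignment, pseudocount=1.0):
    """Return one {residue: probability} dict per alignment column.

    Assumption: add-one (Laplace) pseudocounts; other schemes, such as
    background-weighted pseudocounts, would give different numbers.
    """
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)  # residues not seen in the column count as 0
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


profile = match_emissions(ALIGNMENT)
for position, column in enumerate(profile, start=1):
    top_two = sorted(column.items(), key=lambda item: -item[1])[:2]
    print(position, [(aa, round(p, 3)) for aa, p in top_two])
```

The pseudocount keeps unseen residues from receiving zero probability, which matters when only four sequences are available.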
(b) How do HMMs help to deal with gaps in protein families? Explain.
(a) Find the sum-of-pairs score for the alignment given below. Use the following scoring function: 4 points for a match, -1 point for a mismatch, -2 for s(-, base) or s(base, -), and 0 for s(-, -).
A-G
AC-
TCG
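The sum-of-pairs score adds the pairwise score of every pair of rows in every column. A minimal sketch, assuming the scoring function stated in the question and rows of equal length (the names pair_score and sum_of_pairs are hypothetical):

```python
from itertools import combinations


def pair_score(a, b, match=4, mismatch=-1, gap=-2):
    """Score one pair of aligned symbols using the question's scheme."""
    if a == "-" and b == "-":
        return 0          # s(-, -)
    if a == "-" or b == "-":
        return gap        # s(-, base) or s(base, -)
    return match if a == b else mismatch


def sum_of_pairs(alignment):
    """Sum pair scores over all row pairs in every column."""
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)          # walk the alignment column-wise
        for a, b in combinations(column, 2)    # every unordered pair of rows
    )


print(sum_of_pairs(["A-G", "AC-", "TCG"]))
```

The same function can be applied to any gapped alignment by passing its rows as a list of strings.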
(b) What is the Jukes-Cantor distance model and why is it more appropriate than a simple model that merely counts the number of mismatches?
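For reference, the Jukes-Cantor distance corrects the observed proportion of differing sites p for multiple substitutions at the same site: d = -(3/4) ln(1 - 4p/3). The sketch below compares the raw mismatch proportion with the corrected distance; the two sequences are made up for illustration, and the function names are hypothetical.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of differing sites, skipping columns that contain a gap."""
    pairs = [(a, b) for a, b in zip(seq1, seq2) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("distance is undefined for p >= 0.75 (saturation)")
    return -0.75 * math.log(1 - 4 * p / 3)


# Hypothetical sequences, not taken from the question.
p = p_distance("ACGTACGTAC", "ACGTTCGTAA")
print(p, jukes_cantor(p))
```

The corrected distance is always at least as large as p, and the gap widens as p approaches the saturation value of 0.75.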
(a) Explain the role of guide trees in progressive multiple sequence alignment algorithms.
What do the leaf and internal nodes of a guide tree represent?
(b) Determine the sum-of-pairs scores for the following multiple sequence alignment of DNA sequences, using the scoring matrix in which a match gets a value of +4, a mismatch gets a value of -1, a (base,-) pair gets a value of -2, and a (-,-) pair gets a value of 0.
GCAA
GT-A
C--A
What are some issues associated with adapting multiple sequence alignment programs to large genomic sequences?
(a) What is a sequence pattern? Explain the use of patterns for functional annotation.
(b) What are true positives, true negatives, false positives, and false negatives in the context of pattern searches in protein sequences? How are sequence patterns obtained?
(a) What are low-complexity regions? How are they handled in database searching, and why?
(b) What is the importance of the E-value in database searching?
(a) Distinguish between the programs BLASTP, BLASTN, BLASTX, TBLASTN, and TBLASTX of the BLAST package.
(b) Briefly discuss the programs PHI-BLAST and PSI-BLAST. How is PSI-BLAST used in multiple sequence alignment?
(a) What is RMSD and what is it used for?
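As a concrete reference for the basic definition, RMSD is the square root of the mean squared distance between corresponding atoms of two structures. The sketch below assumes that the two coordinate lists are already superimposed and listed in the same atom order; it does not perform an optimal superposition, and the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between paired 3D coordinates.

    Assumption: the structures are already superimposed and the i-th
    point in coords_a corresponds to the i-th point in coords_b.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


# Hypothetical coordinates: structure_b is structure_a shifted by 1 along z.
structure_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
structure_b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
print(rmsd(structure_a, structure_b))
```

Because every atom pair contributes equally to the mean, a single large displacement in a flexible loop can dominate the value.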
(b) Proteins are not rigid, but flexible. How could the RMSD definition be modified to cope with flexibility? Give the advantages and disadvantages of the basic RMSD definition and your modification.