BioMOO Transcript for 27-1-99

ClareS turns the ClareS_recorder on.

David.cavanaugh says "Would it be possible to say a few words about the BLAST algorithms ?"

ClareS says "we're recording now..."

ClareS [to david.cavanaugh] yes, of course. Do you have a specific question?

ClareS [to david.cavanaugh] ... or would you just like to know more in general?
ClareS drops the jan26tape.
GeorgF can answer questions especially on multiple alignment and phylogeny, and he needs to leave in about half an hour..

David.cavanaugh says "Yes. CLUSTALW, for example, uses a basic dynamic programming algorithm with a gap opening penalty for the alignment optimization. Is BLAST similar ?"

ClareS [to GeorgF] thanks for introducing yourself - sorry, I should have done so earlier

ClareS [to david.cavanaugh] Clustal, of course, is a multiple alignment program - BLAST is a database search program so they cannot really be compared directly
GeorgF knows much less about BLAST... I believe it's aligning chunk-by-chunk, so it doesn't need a gap opening penalty in the first place
ClareS . o O ( obvious, but probably best stated )

GeorgF says "chunk-by-chunk -> match-by-match"

David.cavanaugh says "CLUSTALW can align sequences from which one can extract polymorphic characters for phylogenetic analysis. Could a similar thing be done with BLAST ?"

ClareS says "the thing about BLAST is that it is FAST!!! It is effectively the fastest algorithm around for scanning a database with a sequence"

ClareS says "it is not the most accurate program around, by any means - but it's used a great deal as it's so efficient with large databases"

Gmocz [to GeorgF] As far as I know, BLAST has a new version which uses gap penalties for gap openings too

GeorgF says "yes, WU-BLAST has something like this..."
Gmocz The new BLAST is implemented at NCBI

David.cavanaugh says "I thought I saw gap penalties in PSY-BLAST during this module's assignments."

ClareS [to david.cavanaugh] what I would recommend is that you use BLAST (and possibly FastA, which is slower and more accurate) for extracting your sequence family - then a more precise algorithm for a multiple alignment
ClareS thinks it's PSI-BLAST in all the documentation she's ever seen
ClareS . o O ( not that that matters, of course )

David.cavanaugh says "I stand corrected"

GeorgF [to ClareS] does BLAST consider gaps during its first steps, or are the newer versions w/ gap penalty just using it for some final steps ?

ClareS says "Newer versions of Blast certainly include gap penalties"

ClareS [to GeorgF] only for the final stages - *I think* (don't quote me, please)

GeorgF [to ClareS] if you find out, post me as well...

ClareS says "the first scan through has to be pretty crude - that's why it can be so fast with the huge databases that are available now"
ClareS will do...

David.cavanaugh says "I noticed BLAST indicates + signs for certain AA substitutions it thinks are similar, what is the basis for this determination ?"

ClareS says "the important thing to note about PSI-BLAST, which makes it *so* valuable, is that you can search the database with a multiple alignment"

ClareS says "that will almost always pull out more distant members of a family"
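To make the idea of searching with a multiple alignment concrete, here is a toy sketch of a position-specific profile built from a small alignment and slid along a target sequence. It is not PSI-BLAST's actual procedure (which, among other things, weights sequences and uses more careful pseudocounts); the motif, the uniform background frequency and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict of log-odds scores per alignment column.

    Assumes equal-length, gap-free rows and a uniform background
    frequency of 1/20; both are simplifications for illustration.
    """
    n_rows = len(alignment)
    background = 1.0 / len(AMINO_ACIDS)
    total = n_rows + pseudocount * len(AMINO_ACIDS)
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score_window(profile, segment):
    """Sum the column scores for a segment the same length as the profile."""
    return sum(col[aa] for col, aa in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along a sequence; return (best score, start index)."""
    width = len(profile)
    return max(
        (score_window(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )

family = ["GDSGG", "GDSGA", "GDAGG", "GESGG"]  # made-up aligned motif
profile = build_profile(family)
score, start = best_hit(profile, "MKTLGDSGGPLV")
print(start, round(score, 2))
```

Because each column carries its own scores, a residue that is tolerated at one position of the family can still be penalised at another, which is one reason a profile can pick up relatives that a single query sequence misses.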

ClareS [to david.cavanaugh] I am not sure what criteria are used - sotty
ClareS . o O ( sorry, of course! )

ClareS can't type very well at any time let alone 2315

ClareS [to GeorgF] do you know what criteria are used?

David.cavanaugh says "I noticed also that several similarity matrices are used (e.g. PAM and BLOSUM); which would you recommend for general use ?"
Gmocz My computer hung up, so I probably missed something. But anyway, as far as I know, the Smith-Waterman algorithm can produce the best sequence comparison (although it is the slowest one). Does anybody know which server offers SW comparison?
ClareS doesn't - Georg??
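For readers unfamiliar with Smith-Waterman, the following is a minimal sketch of the local alignment score computed by dynamic programming. It assumes a simple match/mismatch scheme and a linear gap penalty (real servers typically use a substitution matrix and affine gap penalties), and the parameter values are arbitrary. Every cell of the table is filled in, so the work grows with the product of the two sequence lengths, which is why the method is slow on large databases.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)   # previous row of the DP table
    best = 0
    for ca in a:
        curr = [0]              # first column is always 0
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            up = prev[j] - gap          # gap in b
            left = curr[j - 1] - gap    # gap in a
            cell = max(0, diag, up, left)  # 0 lets an alignment restart
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best

print(smith_waterman_score("AAAGGG", "AAATGGG"))
```

The floor at zero is what makes the alignment local: a poorly matching stretch resets the score instead of dragging down a good region elsewhere.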

GeorgF says "usually, it's similar physical properties, or (more indirect) substitution probabilities-- but I don't know the exact answer for BLAST"
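On the "+" signs: in NCBI BLAST protein output, as far as I know, the middle line of an alignment repeats the residue where the two sequences are identical and shows "+" where they differ but the pair has a positive score in the scoring matrix being used. A small sketch of that rule follows; the dictionary holds only a handful of BLOSUM62 entries for illustration, and the function names are made up.

```python
# Small excerpt of off-diagonal BLOSUM62 scores, for illustration only.
BLOSUM62_EXCERPT = {
    ("D", "E"): 2, ("K", "R"): 2, ("I", "V"): 3, ("I", "L"): 2,
    ("F", "Y"): 3, ("S", "T"): 1, ("A", "G"): 0, ("G", "W"): -2,
    ("D", "L"): -4,
}

def pair_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a symmetric score; pairs missing from the excerpt raise KeyError."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

def midline(query, subject, matrix=BLOSUM62_EXCERPT):
    """Build a BLAST-style midline for two equal-length aligned strings."""
    symbols = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            symbols.append(" ")          # gap column
        elif q == s:
            symbols.append(q)            # identity: repeat the residue
        elif pair_score(q, s, matrix) > 0:
            symbols.append("+")          # positive-scoring substitution
        else:
            symbols.append(" ")          # zero or negative score
    return "".join(symbols)

print(midline("DKIAG", "ERVGW"))
```

So which substitutions get a "+" depends on the matrix chosen for the search, not on a separate list of similar amino acids.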

ClareS says "there are 3 different ways which are used for determining which proteins are "similar""

ClareS meant amino acids, not proteins (
ClareS blushes...

ClareS says "chemical similarity - pretty obvious, or something similar like secondary structure propensity (not often used)"

GeorgF says "Can quote again what he knows about similarity matrices.."

ClareS says "substitution frequency (observed...)"
GeorgF will paste...

ClareS says "and, thirdly, the minimum number of base changes to get from one a.a. to another"

ClareS says "of those, the second - observed substitution frequency - is the most widely used in matrices"

ClareS says "but I don't know if it's always the criterion applied to whether or not a little symbol appears in an alignment - it will depend on the program"

ClareS says "is that relatively clear?"
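The third criterion, the minimum number of base changes between two amino acids, can be worked out directly from the standard genetic code. This is a minimal sketch, assuming the standard code and counting single-base differences between every pair of codons; the function name is made up.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# ('*' marks a stop codon).
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each amino acid to the list of codons that encode it.
CODONS = {}
for (b1, b2, b3), aa in zip(product(BASES, repeat=3), CODE):
    CODONS.setdefault(aa, []).append(b1 + b2 + b3)

def min_base_changes(aa1, aa2):
    """Smallest number of base differences between any codon pair."""
    return min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in CODONS[aa1]
        for c2 in CODONS[aa2]
    )

for pair in [("D", "E"), ("D", "N"), ("M", "W"), ("F", "K")]:
    print(pair, min_base_changes(*pair))
```

The measure only takes the values 0 to 3, so on its own it separates amino acid pairs much more coarsely than observed substitution frequencies do.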

GeorgF [to `[...]] the order of superiority of GCB250 > Blosum62 > JTT250 GeorgF [to Gotoh]: O. (1996) "Significant improvement in accuracy of multiple protein

GeorgF says "oops..."
Gmocz How about the Gonnet matrix? Where does it fit?
ClareS doesn't know that one... gmocz, is there a reference?

This is the last thing I heard about the matrix issue. To quote Gotoh (1996):
``[...] the order of superiority of GCB250 > Blosum62 > JTT250
>= MDM250 appears definite''.
(MDM250 is another name for Dayhoff's original Mutation Data Matrix.)
GCB250 is the matrix referenced as Gonnet, Cohen and Benner, Science 256, 1443-1445, 1992
Blosum62 is the matrix described in Henikoff and Henikoff, PNAS 89, 10915-10919, 1992
See ``Table 2'' on page 827 of Gotoh, O. (1996) "Significant improvement in accuracy of multiple protein
sequence alignments by iterative refinement as assessed by reference to structural alignments." J. Mol. Biol. 264, 823-838

GeorgF [to gmocz] Yep, Gotoh says Gonnet's is best.

David.cavanaugh says "How does one interpret the similarity Figure Of Merit (FOM) in the similarity matrix ? Is it the log of a probability or something similar ?"

ClareS [to GeorgF] but, as far as I can remember, Blosum is still the most usual set of matrices used as the default on servers - would you agree?

Gmocz [to ClareS] I saw it a couple of years ago in Nature or Science. It is somewhat similar to PAM but it used all available sequences known at that time

ClareS [to gmocz] if you do dig out the full reference, could you post it to the list? I'm sure others would be interested

Gmocz [to ClareS] Sure, I will

ClareS [to gmocz] thanks ;)
ClareS will get hold of the Gotoh paper and read it thoroughly!
ClareS . o O ( the algorithms are improving all the time... )
Gmocz The text said somewhere that the Chou and Fasman method for secondary structure prediction is not accurate because its parameters were derived from a limited number of proteins. Did someone re-analyze those propensity parameters with the much larger dataset that we have today? If yes, was there any improvement in prediction accuracy?

ClareS [to gmocz] there have been various improvements in these methods over the years (with larger databases etc.) but the improvement has been fairly small - from say 65% of residues predicted accurately to say 72-73%

ClareS says "I would never use secondary structure prediction on its own, but it's quite often useful as a backup"

ClareS says "another tip with secondary structure prediction is to use several algorithms at once and generate a consensus prediction"

Gmocz [to ClareS] Does someone know if there is a database of secondary structure information in a simple linear text format (such as HHHHHHTTTTCCCEEEEECC...)? OWL has a nice graphic presentation but not suitable for mass retrieval. DSSP at EBI has text format but it is very complicated. 3DB lists secondary structure elements, but does not give a combined linear representation. Any clue?

ClareS says "(if the algorithms all agree that a particular segment is helix or sheet you can have more confidence in the prediction)"
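A consensus prediction of this kind can be sketched as a per-residue majority vote. The three prediction strings below are hypothetical outputs, not results from any real program, and the one-letter states (H = helix, E = strand, C = coil) and the "?" marker for unresolved positions are assumptions made for the example.

```python
from collections import Counter

def consensus_prediction(predictions, min_agreement=None):
    """Combine per-residue predictions from several methods.

    predictions: equal-length strings, one per method.
    Returns (consensus string, per-residue agreement fractions).
    Positions where no state reaches min_agreement votes get '?'.
    """
    n = len(predictions)
    if min_agreement is None:
        min_agreement = n // 2 + 1  # strict majority
    consensus, agreement = [], []
    for column in zip(*predictions):
        state, votes = Counter(column).most_common(1)[0]
        consensus.append(state if votes >= min_agreement else "?")
        agreement.append(votes / n)
    return "".join(consensus), agreement

methods = [
    "CHHHHHCCEEEC",   # hypothetical output of method 1
    "CCHHHHCCEEEC",   # hypothetical output of method 2
    "CHHHHCCCCEEC",   # hypothetical output of method 3
]
consensus, agreement = consensus_prediction(methods)
print(consensus)
print([round(a, 2) for a in agreement])
```

The agreement fractions are the useful part: residues where every method agrees deserve more confidence than those carried by a bare majority.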

ClareS [to gmocz] you are talking about *actual* secondary structure (calculated from coordinates), not predictions - right?

Gmocz [to ClareS] Of course, actual

David.cavanaugh says "How good a correlation between substitutional frequencies and secondary structure and/or minimum base transition criteria is there ?"

ClareS [to gmocz] I'm sure I've seen something with that representation but I'm trying to remember where...
ClareS . o O ( and I'm on dialup from home, so no web access )

Gmocz [to david.cavanaugh] I guess it is not better than the prediction accuracy

GeorgF says "-re- Gmocz' question on SW server: search for ``Smith'' in, and you'll get a lot of links"

ClareS says "I think the most likely place is the PDB summary for each PDB file stored at UCL"
GeorgF has to leave in a few minutes... (gotta catch the last bus...)

ClareS [to GeorgF] that's not that French URL that crashed my Netscape yesterday, is it??
ClareS can't remember

GeorgF [to ClareS] no, it isnt..

Gmocz [to david.cavanaugh]

ClareS says "there's a link to PDBsum from each main entry page in the PDB (any of the mirrors, of course)"

ClareS says "talking of the PDB, did you all know that it is *moving* this year???"

ClareS [to GeorgF] thanks very much for staying - greatly appreciated
Gmocz Really? Let me know more,please
GeorgF waves.

Gmocz [to GeorgF] thank you
GeorgF bows.

ClareS says "if you go to any PDB home page there is a link describing what is happening.. it's going to Rutgers, by the way"

ClareS says "no-one has said what will happen to the mirrors, but I think the mirror sites are very likely to stay"
GeorgF goes home.

ClareS says "back to gmocz' last question but a few - another place which may well have that secondary structure info in the form you want is one of the EMBL servers"

Gmocz [to ClareS] Thank you. I will look around.

ClareS [to david.cavanaugh] I haven't answered your last question yet...

ClareS says "there is a fair correlation between different measures of similarity, but it's not absolute..."

ClareS says "and even with chemical similarity, there are different ways of approaching it..."

ClareS says "like, is Asp most similar to Glu because of its charge, or most similar to Asn because of its size"

ClareS says "most probably the real answer is different in different (structural) circumstances - and that, of course, is *impossible* to describe mathematically in a matrix"

David.cavanaugh says "How about an approach like AA retention times in reversed phase polar solvent gradient HPLC for a chemical similarity measure?"
ClareS hasn't heard of that approach before - it's a novel idea

ClareS says "you're looking at a multi-component problem: size, charge, hydrophobicity, plus other features"

ClareS says "and special features of individual a.a.s e.g. Gly, Pro, Cys"

Gmocz says "to ClareS Has someone tried to use multidimensional matrices to account for many various factors at the same time"

David.cavanaugh says "Certainly bonding mechanism and solvent interaction also, I would think."

ClareS says "I don't think so. The maths would be *impossible*"

ClareS says "yes, patterns of h-bond donors and acceptors - and then there's aromaticity and beta-branching"

David.cavanaugh says "I have an approach I've been using for Phylogenetic analysis that might work"

ClareS [to david.cavanaugh] go ahead?

ClareS [to david.cavanaugh] have you got as far as testing it yet?

David.cavanaugh says "I have tested these methods with real biological characters"

ClareS [to david.cavanaugh] .... with promising results?

David.cavanaugh says "Yes"

ClareS says "are you willing to let us into some of the secrets of your algorithm?"

David.cavanaugh says "There are pattern recognition/projection methods for reducing high order pattern spaces to 1D, 2D or 3D points."

David.cavanaugh says "I am working with a friend to publish a method like this."

ClareS [to david.cavanaugh] I'd be interested to see a preprint when you get that far

David.cavanaugh says "I can send you a proto paper if you keep it under wraps."

ClareS says "OK - please do. I'll keep it to myself"

David.cavanaugh says "It's highly mathematical though."
ClareS . o O ( maths is good for me )

Gmocz [to david.cavanaugh] Once, I also tried to use a two-dimensional to one-dimensional binary pattern reconstruction method to predict structure. With little success

ClareS [to gmocz] sounds like hard work!

Gmocz [to ClareS] It was. I gave it up.
ClareS isn't at all surprised

David.cavanaugh says "Did anyone have any thoughts regarding my question on similarity matrix interpretation ?"

ClareS says "I'd like to call the meeting to an end reasonably soon - it's getting very late"
ClareS . o O ( here, at least )

ClareS says "so if anyone has another question, speak now or..."
ClareS . o O ( John? )

David.cavanaugh says "Alas, I have more questions than time!"

JohnN says "I'd lower the tone rather a lot!"

ClareS [to david.cavanaugh] no - sorry - Georg and I exchanged notes about your question. Neither of us knew the answer!

ClareS [to david.cavanaugh] anything we haven't covered - do post to the list. You may start an interesting discussion!

ClareS [to JohnN] don't worry about it. All you really need to know in this course is very elementary bioinformatics
ClareS wonders if that's not a contradiction in terms

JohnN says "I'm getting on fine then"

ClareS [to JohnN] if you have worked through and understood the tutorial you will know enough bioinfo for PPS

JohnN says "I am a bit overwhelmed with both the power of the databases and the wealth of information, but it's fun and I'm makiong progress"
ClareS looks round expectantly for the last question or two

JohnN [to thinks] except with the typing
Gmocz Each sequence database seems to use a different file format. Wouldn't it be nice to have one universal sequence file format with mandatory fields, which all databases can understand, and optional fields, which are specific to a particular database and which others can simply ignore? Am I unrealistic? Are there plans for such an arrangement at all?

ClareS [to gmocz] I'm afraid you are unrealistic! Everyone wants to standardise on *their* format, so I don't think it'll happen

ClareS [to gmocz] fortunately there are some good programs around which will reformat almost any format into almost any other

Gmocz [to ClareS] I think so too, I was just wondering...

ClareS says "there's one called Readseq, one called Babel..."

ClareS says "of course, these only work on individual sequences or at most alignments, not whole databases"
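As a rough illustration of what such a reformatting step involves, here is a minimal sketch that reads FASTA-formatted text and writes one tab-delimited line per sequence. It is not how Readseq or Babel work internally; the function names and the example sequences are made up, and the output layout (identifier, description, sequence) is just one arbitrary choice.

```python
def parse_fasta(text):
    """Yield (identifier, description, sequence) from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)      # sequence lines may be wrapped
    if header is not None:
        yield _record(header, chunks)

def _record(header, chunks):
    """Split the header at the first space and join the wrapped sequence."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)

def to_tab_delimited(records):
    """One line per record: identifier, description, sequence."""
    return "\n".join("\t".join(rec) for rec in records)

example = ">seq1 test protein\nMKT\nLLV\n>seq2\nGDSGG\n"  # made-up entries
print(to_tab_delimited(parse_fasta(example)))
```

Even in this tiny case the converter has to decide what counts as the identifier and what to do with a missing description, which hints at why agreeing on mandatory fields across databases is harder than it sounds.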

Gmocz [to ClareS] I guess these are public domain sw, we can find them on the web

ClareS says "babel is p.d., I'm not so sure about readseq"

ClareS says "and then, of course there is GCG which is a law unto itself"

Gmocz [to ClareS] GCG has only a limited set of file conversion programs

ClareS says "firstly it's almost the only expensive bioinfo program, secondly it uses a unique format and expects everyone else to use its format too"

Gmocz [to ClareS] but perhaps the most useful

ClareS says "yes, that's why it's so annoying that it costs so much!"

Gmocz [to ClareS] I know, my workplace has to pay thousands of $ yearly for it
Gmocz just one more quick question. Do the primary sequence databases (Genbank, EBI, DDBJ, GSDB) share all data? Or does each contain some unique information that cannot be found in the others?

ClareS says "if you're in the UK or EU you can join the HGMP resource centre and use GCG via its web pages for a mere UKP50"

ClareS says "all primary databases share data... eventually."

Gmocz [to ClareS] I am in the US so we have to pay

ClareS says "it may take a few days for a sequence submitted at one database to reach all the others"

ClareS says "so if you know of a really new seq it really would be worth searching every database"

ClareS [to gmocz] it is possible for Americans to join HGMP - sometimes (I don't know the details) and there may well be similar services I don't know of over your side of the pond
ClareS is propping her eyelashes open with matchsticks
David.cavanaugh Waves bye. Cheers!

ClareS says "thanks for making it such a lively meeting!"
JohnN similarly keels over
ClareS turns the ClareS_recorder off.