One Gary Forbis, by Gary Forbis: July 2013

Wednesday, July 31, 2013

Computational Molecular Evolution

OK, I trying to do the quiz for Bayesian Analysis. I'm stuck on the first problem. I'm doing the calculations as I understand them but instead of getting fractions for likelihood I'm getting whole numbers. That can't be. I coded:

as:


def fac(n):
    if n == 1: return 1
    else: return n * fac(n-1)

def prob(h,n,p):
    return float((fac(n)*p*(1-p)))/float(fac(h)*fac(n-h))

I'll figure it out. I thought I did it correctly. Clearly I did not. There is a math function factorial. I was too lazy to import the math library. It seems to me that the number of ways one can get heads from N flips doesn't really express the probability of it.

Ah, I see the problem. That superscript really does mean "Raised to the power". Now I can move on to more misunderstandings on my part.

Interestingly, Wikipedia gives a different but similar function where r is the uniform prior distribution, somewhat analogous to p in the prior formula. The Wikipedia example following this helped me identify my error.

While I know some about probabilities I hadn't dealt with conditional probabilities. It looks quite powerful and useful.

On another topic. When using Python one needs to know math.log(x) gives the natural log of x rather than the base 10 log of x. The base 10 log of x is math.log10(x).

It gets stranger. math.log1p(x) is the natural log of 1+x.

Tuesday, July 30, 2013

FoldIt, Protein folding as a game.

FoldIt is the game of folding proteins. I started playing in August 2010. After about a year I took a couple of years off. I started folding again this spring. I guess I'm helping science. I don't know.

Proteins rapidly fold into their natural state once built by Ribosomes. They can be denatured by temperature, PH, or in other ways but once conditions are right they will fold again into the same complex shape. Unfortunately science doesn't have good algorithms to predict their shape from the string of amino acids. The technology is developing rapidly. It has a long ways to go.

The assertion is humans are better at folding than software alone. Humans, using their intuitions, can see problems and adjust. By playing the game, science can build a huge database of moves humans make while folding proteins and use this database to improve the automation. The FoldIt players have scored well in global competition but mostly it's just a hand full of players. I don't know if the rest of us add any value at all. Here I am after about 45 minutes of play. Most of the score came in the first 15 minutes:

I'm in 94th place in the Global Soloist scoreboard. This shows the problem. While I can tell the protein fold is a mess I'm just learning about the tools available to me. A string of hydrophilic amino acids usually form a helix, In this game the hydrophilic AA have blue side chains and the hydrophobic AA have orange side chains. The hydrophobic AA usually end up in the middle of the protein grouped together and the hydrophilic AA end up on the outside. This puzzle says it's an Integrase. Wikipedia show them folding as a series of helixes. If I had a bunch of time on my hands I'd restructure the protein as a series of helixes and see but I'm just trying to maintain my global position so won't be spending much time. As it turns out there are websites I'll not mention here that will help you categorize a string of amino acids and provide their common motif. Rosalind is helping me learn about these databases and more.

I may have to quite folding for awhile so I can keep up on Coursera and Rosalind.

Sunday, July 28, 2013

Coding the Matrix -- Yikes, I'm quite behind.

OK, I'm having trouble keeping focused, as usual. I didn't start the class assignments for week 1(zero relative) of Coursera's "Coding the Matrix" until this after noon, about 6PM. It's now 10PM and I'm complete. The Politics lab was interesting. I don't know why I'm so slow at working these problems. I used to be able to handle them quickly. My age must be getting to me. Week 3 says

Week 3 is upon us, and with it we enter the middle third of the course. This week's topic, The Matrix, represents a transition. Most of it is, in a sense, mechanics, but it opens the door to the more math material that comes next.

There is a lot of work being assigned this week. If it is possible for you, I suggest you not try to get it all done in a week; take advantage of at least some of the three-week time period you have. Next week's assignment won't be as heavy, so you'll have time to catch up.

I've got to get moving. At least I've made the due date for week 1. week 4 is out now.

Now looking back at my work on week 1's assignments I realize I haven't taken advantage of the zip function. I coded:


  [a[x]*b[x] for x in range(len(a))]

when I should have coded:


  [x*y for x,y in zip(a,b)]

Learning the Python language takes time. I'm not used to thinking this way. I've mostly moved from procedural languages into SQL. This is a step sideways. SSIS is pictorial. One uses functions to get the results one wants. Python is textual.

Saturday, July 27, 2013

Completing my 32nd problem on Rosalind

I completed my 32nd problem on Rosalind. As is sometimes the case I over-thought the problem. Here's the short of it:

Give a count of nodes in a tree graph and a set of lines connecting particular pairs of nodes, how many additional lines will be needed to connect all of the nodes. So here's the thing, there's really only one way to connect three nodes. Here's the two ways to connect four nodes:

Count the nodes and count the lines. There's a pattern there. I don't know why I didn't see it. I got bogged down counting the nodes that weren't connected to any other nodes and the sets of connected nodes:

OK, that's 13 nodes, 3 sets of connected nodes and 3 unconnected nodes. Let's see, that's two lines to connect the three sets and three lines to connect the unconnected nodes for 5 lines. That's 13 - 10 connected nodes for 3 unconnected nodes and 3 sets of connected nodes and - 1 because one set counts as the base. 13 - 10 + 3 - 1 = 5. But how many nodes? 13 and how many lines are needed to connect them and how many lines have been defined?

That's as close as I'm going to get to providing the answer to Completing a Tree, my 32nd problem.

Thursday, July 25, 2013

On getting started solving Rosalind problems in the Python Village

Rosalind assumes some level of programming skills. Some people may come to the table with none whatsoever. The Python Village moves pretty quick. I'll try to slow it down for those who haven't programmed at all. I'm not going to rehash stuff Rosalind provides directly. For instance, in problem INI2 one sees:

That's pretty sparse. But that "click to expand" text needs to be clicked most of the time. Even if it doesn't address the particular problem it gives lots of information about why the particular problem is being asked to be solved.

That print command is version 2.x. In 3.x you would use:


print ('a - b is',c)

I don't mind helping people get started. Let me know if I'm being of any help. Otherwise I'll just move on. One last thing...
After you solve a problem buttons for "Solutions" and "Explanation" appear. "Explanation" gives some additional information. People seem more willing to give "Solutions". You will see many ways to solve the same problem.

Wednesday, July 24, 2013

Always check your work.

I posted a solution to a problem a few days back but my results weren't being accepted. It was quite frustrating. It turned out my posted solution wasn't the routine I was using to generate the results. I must have submitted my results 10 times. I was quite sure I had the correct solution. Finally this evening I reviewed the results I was submitting very closely. The error stuck out like a sore thumb.

Tag me embarrassed. There is confidence then over-confidence. I am humbled by my hubris.
Sorry for posting a working solution to a Rosalind problem and identifying it as such.

I normally produce my results by way of a wrapper routine like this:

def sseqf(filename):
    clearBioDb()
    loadFASTA(filename)
    print ' '.join(map(str,
                        sseq(namedSeq[namedOrder[0]],
                             namedSeq[namedOrder[1]])))

I write the actual routine sseq separately. If the output will be of substantial size I will write it to a file rather than print it. I'll discuss join and map in a few days.

Tuesday, July 23, 2013

Python 101, part 1

I was asked to provide some more in depth information about the Rosalind Python Village problems. Rosalind moves pretty quickly though the process. If one knows any computer languages Python will come pretty quickly but if you are new to computer programming all languages may be a bit hard. I'm assuming anyone who want to start programming is willing to do the first step, that is load the software on your computer. Here's the steps:

go to the Python Download page.
choose the correct version(s). V3.3.2 is the current version as of this writing but some Python code won't run on V3.3.2 so one might need to get V2.7.5. I have both version installed and use V2.7.5 for Rosalind and V3.3.2 for Computational Molecular Evolution. I haven't run across the need for V2.7.5 yet but that's the version Rosalind's ini1 asks you to load. I'm going to keep it simple but suggest you install both versions.

Download and Install V2.7.5. Let it use its default file folder. For Windows that's C:\Python27.
Download and Install V3.3.2. Let it use its default file folder. For Windows that's C:\Python33.

On windows both versions have their own folders in the start menu. You can run either version from the start menu and all is well. If you are like me you don't want to do that because the startup folder for both are their home directories. The .py extension is associated with V3.3.2 if you load both. Here's a simple trick to get the right version set up for a particular project.

Create a folder for each project. I have a folder for Rosalind under Documents.
If the project will be using V2.7.5:

Find the "IDLE (Python GUI)" Icon in the start menu under "Python 2.7".
Right click on it and select "Copy".
Open the projects folder.
Right Click on the folder's background and select "Paste".
Right Click on the newly pasted short cut and select "Properties".
Click on the "Shortcut" tab.
Change the "Start in: to the fully qualified path of the directory you just created.
Click on the "OK" button.

When you want to work on a V2.7.5 project use the "IDLE (Python GUI)" Icon in the folder to start Python then load modules using the "Open" item under the "File" menu. When you want to work on a V3.3.2 project right click on the module and select "Edit with IDLE".

I hope that gets you set up.

The first time I tried INI1 I didn't do the right thing. I don't know why. It said to type

  import this

in the command line and see what happens. I did that but for some reason I didn't get "Then, click the 'Download dataset' button below and copy the Zen of Python into the space provided." What a mess.

As it turns out this.py is a module in C:\Python33\Lib for V3.3.2 or in C:\Pythod27\Lib for V2.7.5. You can open these modules in the editor and see what they look like The only difference is the format of the print statement. The small difference in the print statement is one of the few differences between the two versions beginners will notice right away. There are other differences.

When you want to run a module from the shell you type "import " followed by the name of the module then press enter. I haven't tried putting the same module name in my working directory and in the library folder so I don't know what would happen.

About String constants.

String constants can be quoted by either single quote(') or double quote(").
While writing this I learned a little bit about Python string constants.
If you haven't played around with string constants try some of these:

>>> ''  ''
''
>>> '   ''    '
'       '
>>> "  "
'  '
>>> " ' "
" ' "
>>> ' \' '
" ' "
>>>

Then try some of your own. Have fun. Work Problem INI1 if you haven't done so already.

Protein Melting temerpature and ancient life.

Yesterday I was introduced to protein melting temperature. Many things can cause a protein to denature(unfold). Temperature is one of those causes. Life needs to have proteins within a certain range below their melting point. I don't know the details right now. The part I found most interesting is how scientists have built clades from DNA sequences then predicted the most likely ancestor sequences then used them to build proteins. From http://news.ufl.edu/2003/09/17/ancientprotein/, a 2003 paper concerning one such reconstruction, one can infer the temperature of the ancestor's habitat. Since that 2003 the field has progressed. It appears this technique maps closely to other techniques for determining world temperature in the distant past.

When introduced to Hamming distance, the number of changes needed to convert one sequence into another, I wrote this simple implementation:

def HamDist(s,t): return sum([1 for i in range(len (s)) if s[i] != t[i]])

It works and is easily understood. I found this from http://code.activestate.com/recipes/499304-hamming-distance/. I'm not sure I like summing truth values.

from itertools import imap 
import operator 
def HamDist(str1, str2): return sum(imap(operator.ne, str1, str2))

I'm learning pretty quickly. There's a lot to learn. I guess once one learns the language the faster code is readable.

Knowing the Hamming distance between a set of DNA, RNA, or AA sequences is the first step to building an un-rooted tree and this is the first step towards reconstructing the most likely ancestor sequences.

Monday, July 22, 2013

I've been sick. I had to take some time off. I'm caught up on Computation Molecular Evolution but am still a week behind on Coding the Matrix. I did a few more Rosalind problems but am currently stumped on one of them. It seems quite simple and I'm sure my answer is correct based upon my understanding of the problem. I've learned a lot about recursive programming in python. I've also learned a bit about markdown. Here's the function Rosalind doesn't accept:

def lev(lex,len,txt=''):
    for x in lex.split(' '):
        yield txt+x
        if len > 1:
            for y in lev(lex,len-1,txt+x):
                yield y

[x for x in lev('T C A G',3)]

I guess I still need to relearn html so I can set code blocks off more than I've just done.

Friday, July 12, 2013

Names confuse me.

I've gone back and reworked some of my earlier solutions to Rosalind problems. It turned out I lost several of the more complex solutions and all of the tables I had built. Rather than solving my 26th problem I got side traced on mRNA. I'm not sure why I have trouble remembering names like introns, exons, and spliceosomes. Following through Rosalind's links to Wikipedia I learned the codons 'GUG' and 'UUG' can also be start codons but not in humans. When bacteria start with these it is normally translated as formylmethionine rather than their normal Valine and Phenylalinine respectively. I'm having trouble memorizing the names of the various amino acids even though I've been playing FoldIt for years. The two I'm most likely to remember are Proline and Glycine. I remember Proline and Glycine because they are normally part of a loop rather than sheet or helix.
The protein can bend quite freely at Glycine. I should remember Cysteine because they can form a Disulfide Bridge. I think I'll have to mention the amino acids by name for awhile so I can remember their names. Heck, I probably need to do this with the nucleotides as well--It's easy to remember CATG but not so easy to remember Cytosine, Adenine, Thymine, and Guanine. Uracil will need to come along too.

I've never noticed in FoldIt that proteins normally start with Methionine. This is a side-effect
of the start codon AUG being translated, unlike the Stop codons. AUG translates as Methionine.

While I knew the codons weren't one to one with the amino acids I didn't realize just how many codons translated to the same amino and how this alone showed how ridiculous was the claim that even a single mutation would be fatal.

The whole 5' and 3' thing still gives me trouble. We'll see. I wasn't feeling well today so maybe that's causing it. I definitely need to get back to Coursera.

Thursday, July 11, 2013

An evening off

This morning when I woke up I noticed my computer had been rebooted. New windows updates had been installed. Unfortunately I hadn't saved the routines I built for the last few Rosalind problems. I was tired this evening so took a nap. I'm getting ready to go to bed and get a few additional hours sleep before heading off to work.

I spent a few minutes cleaning up some of the earlier solutions such as building the DNA complement. I also started gathering the statistics to build solution to the consensus problem. I'm still not using python comprehensions as must as I probably will in the future. More small steps.

def consensus(filename):
    clearBioDb()
    loadFASTA(filename)
    clist=[0 for x in namedSeq[list(namedSeq.keys())[1]]]
    counts={x:list(clist) for x in 'ACGT'}
    for x in list(namedSeq):
        seq=namedSeq[x]
        for y in range(seq):
            counts[seq[y]][y] +=1

I'm not happy with this temporary solution to building a code block. The ragged right edge doesn't look very pretty. Still it's better than nothing.

You might wonder about why I changed the name of my blog a few weeks ago.

I've started several blogs over the years but, as I said when I change the name, I have a hard time keeping focused on one issue for any length of time. I decided to choose a name that gave me freedom to switch topic as I saw fit. A few years back I got an advertisement saying "Gary Forbis" was a rare name. I found the web page http://gary-forbis.nameanalyzer.net/ that says there were only ten people with that name in the 1990 US census. Since the 1990 census hasn't been released I wonder how they got that information. In 1990 there were two of us working for Washington State. The other one was working as a computer programmer for the transportation department. I was a computer programmer working in Seattle. I saw an obituary for a Gary Forbis who owned a limousine service in Texas. About a year ago I saw another Gary Forbis on google+. That makes two of us on Google+. I don't know how google keeps us separate since we appear to have the same account name. Both of us are programmers and neither of us are the Gary Forbis who worked for the Washington State transportation department. So I know something about 4 of the 10 Gary Forbises in the 1990 census. At least three are computer programmers and two were living in Washington state.

Wednesday, July 10, 2013

Rosland day 7

I guess I'm going to have to learn some key html tags so I can embed code in my blog in a way that sets it apart from normal text. Blobspot only uses a very simple WYSIWYG editor.

I've only solved 19 "problems" but I've learned quite a bit. As I learn more I've quit writing my own code for stuff that is already written. I'm still writing code when I probably don't need to. Also my "old" code isn't the most efficient or elegant. Here's an example from an early Rosalind "problem"

def dna2rna(dna):
    #given a string of dna
    #return the transcribed
    rna=''
    for i in range(len(dna)-1):
        if dna[i] == 'T': rna = rna+'U'
        else: rna = rna+dna[i]
    return rna

Little did I know the same thing could be accomplished by:

def dna2rna(dna): return dna.replace('T','U')

These little routines let me find when messenger rna, mRNA, would start coding for a protein.

list(findall(dna2rna(dna),rnaStart))

Again, not much. Still, it's little steps.

The quiz for week three of "Computational Molecular Evolution" had lots of problems. The instructor developed a new quiz. It's pretty short. I think I'll do it tomorrow. I've really got to get started on the "Coding the Matrix" lab.

Tuesday, July 9, 2013

Working on Rosalind when I should be working on "Coding the Matrix"

Well, I just became the 1,203 person to solve 16 "problems" on Rosalind. Not much of an accomplishment. I see at least one person solved 32 in only 4 days. Each problem so far has only taken between 5 and 30 minutes. I guess I can be slower than I thought.

I finished the (zero relative) Week 1 lectures in "Coding the Matrix". This weeks lab looks to be interesting. By the end of it I will have defined a vector class with overloaded infix operators.
An infix operator is placed between the two expressions upon which it operates. For instance the plus sign(+) in "4+3" is an infix operator. While a vector can be written as a list, [1, 2, 3, 4] for instance, the class will be using two properties, D and f, for domain and function where given a vector x, x.D is the set of indices in the domain and f is a directory that map indices to their non-zero values. When an index in the domain isn't in the dictionary its value is zero.

In python a dictionary has this form {i:x, j:y, k:z} where the value before the colon(:) is the index and the value after the colon is the value at the index. Given a dictionary f={i:x, j:y, k:z}, f[i] would return the value x.

I've got to go to bed now.

Monday, July 8, 2013

A day for Coursera

The classes aren't ready until early afternoon my time zone. I made it through "Computational Molecular Evolution" but I hardly started "Coding the Matrix". It didn't really help that CME quizzes had several errors. It's hard enough to learn the material without having the quizzes graded incorrectly. It lead me to question what I was doing. I've learned to take notes from the PAUP quizzes, and this really helped keep me gounded. PAUP is a bioinformatics program written by David L. Swofford. He granted our instructor the right to use the program in our lessons. Since I don't plan on purchasing the product I'll have to learn quite a bit about the methods and terminology. Rosalind seems to be teaching some of it. I don't know how much. This weeks lessons dealt with building trees from pair differences, observed differences between sets of aligned strands of DNA. The same techniques work for proteins.

The "Coding the Matrix" class also had some problems with the popup quizzes during the lectures. I wasn't feeling lucky. Still, I keep learning lots of stuff at a very quick, for me, pace. I'll probably have to spend several evenings to get through the lab.

At least I got my python FASTA format importer to work in only 15 lines of code, including some limited format validation. It looks like lots of people have implemented nearly identical code so there's nothing special here. I'll probably only do a problem every couple of days until I get the Coursera work for the week done.

Sunday, July 7, 2013

Rosalind day 4

I'm not spending much time on this. I've worked through 13 of the "problems". I'm definitely learning python. Today's problems were related to Fibonacci numbers. I'm pretty sure I'll have to use this tomorrow in one of the two Coursera classes I'm currently taking. The Rosalind problems I've solved so far are pretty simple. I've reviewed some of the web pages concerning dynamic programming in python. The example at Math ∩ Programming deals with Fibonacci numbers. I'm not ready to create a complex dynamic solution. I ended up with something like this:

def fib(n):
   fibValues = [0,1]
   for i in range(2,n+1):
      fibValues.append(fibValues[i-1] + fibValues[i-2])
   return fibValues[n]

rather than something like this:

def memoize(f):
   cache = {}

   def memoizedFunction(*args):
      if args not in cache:
         cache[args] = f(*args)
      return cache[args]

   memoizedFunction.cache = cache
   return memoizedFunction

@memoize
def fib(n):
   if n <= 2:
      return 1
   else:
      return fib(n-1) + fib(n-2)

It doesn't matter that the latter is a more elegant solution. I'm just glad my solution was nearly identical to other people's solutions.

Thursday, July 4, 2013

On learning Python and more

Well, I've finished Rosalind's Python Village. It was pretty easy. It didn't go into much depth. I've learned more about reading and writing files. This complements my first "Coding the Matrix" lesson.
It appears Rosalind will now track more with my "Computational Molecular Evolution" course than with "Coding the Matrix", at least for now.

I'm near my limits right now. It's going to be hard to track and remember all of this stuff. I'll try to figure out how to format the additional pages. I don't particularly want to create separate pages as one big blob. I'd like to put the gadgets on the secondary pages rather than all on the front page. I'll do now and fix later, or rather I'm running out of steam for this evening, I'll do tomorrow night and fix later.

I don't think I'll have much time for foldit for awhile. I'm down to just getting some points on each puzzle. I'm guessing the courses will help me with protein alignments. Here's hoping.

Tuesday, July 2, 2013

A new beginning

OK, I've changed my focus yet again. I don't seem able to keep my attention on one area for very long. I keep coming back to the same topics.

I'll drop my musings on economic issues for now and chronicle what I've been learning most recently.

I finished the Coursera class, "Synapse, Neurons, and Brains" about a month ago. It covered a lot of material, none in any real depth. I had to learn a few electronic formula and concepts. There were no lab work; all I did was watch the lectures and take the tests. I'll expand my knowledge as I get up to speed in other areas. For now I'll set this aside then come back to neural network models later. NEURON is an open source implementation of Rall's Cable Theory, http://en.wikipedia.org/wiki/Cable_theory. I don't have the computing power to do much right now. The Blue Brain project, http://bluebrain.epfl.ch/, is using a 16k core supercomputer to simulate about 10k neurons. Just getting up to speed and expanding my knowledge about how to program these things simply will be my focus for now. Actual discovery will have to wait.

I'm in the second week of "Computation Molecular Evolution". This class covered the evolution theory basics and population frequencies in the first week and dealt with phylogenetic trees and some tools to help analyze them in the second week.

I started "Coding the Matrix" this week. It looks to be quite time intensive but should teach me a lot of new tricks and methods. The class lab work runs on a virtual machine under Oracle's VirturalBox using a disk image of an Umbutu

Many of the comments in the forums for these three classes mentioned edX and Rosalind. edX is much like Coursera in that one can take college level courses on line for free. I hope this model is sustainable. I may take courses from both and elsewhere. Rosalind focuses on Bioinformatics. I've started Rosalind. I'll back away for a few days while I verify I can handle both 2.x and 3.x versions of Python. I might have to just use two computers. I considered installing a Win7 virtual machine using Oracle's VirtualBox. I've backed away from that idea. I have enough on my plate.

Pages