Saturday, August 31, 2013

Starting Coursera course "Computational Investing Part 1"

As I previously mentioned, I ended up dropping "Coding the Matrix".  I'll likely try it again in the future.  The combination of taking two courses and dealing with some other issues was just too much for me.  The structure of the lectures and the focus on vocabulary didn't work for me either.

Well, I started "Computational Investing Part 1" today. 

I started installing the software early this morning.  It turns out, at least for now, I had to remove the 64 bit versions of Python 2.7.5 and 3.3.2.  It appears there is a virtual environment program I can use to keep all of the versions I use installed at the same time, and the various installers will work fine.  I may deal with that next time.  For now I have the 32 bit version of Python 2.7.5 installed.

I have anxiety issues.  Having to drop "Coding the Matrix" has set me back.  I nearly dropped this new course after having so much trouble getting the required software loaded.  It appears much of the 32 bit software is packaged in an executable installer.  The newer stuff is packaged in zip files where one runs python from the command line.  I don't know.  Maintaining so many versions of packages with several installers seems like a lot of work.  I guess the newer method won't be so scary once I learn it.

Even people who don't want to learn Python but are interested in the stock market will find the lectures interesting.  I already knew most of what was discussed in the Week 1 lectures.

Since I've installed all of the software, the Week 2 lectures and homework should go pretty quickly tomorrow.

I haven't spent as much time learning Python as I'd like.  Here's my sloppy documentation of the software needed for Computational Investing Part 1:

Install in this order:
Python 2.7.5
numpy
scipy
dateutil
pyparsing
matplotlib
pandas
setuptools
cvxopt
scikit-learn
statsmodels
qstk

From this set of files:
cvxopt-1.1.4.win32-py2.7.exe
matplotlib-1.3.0.win32-py2.7.exe
numpy-1.7.1.win32-py2.7.exe
pandas-0.12.0.win32-py2.7.exe
pyparsing-2.0.1.win32-py2.7.exe
python-2.7.5.msi
python-dateutil-1.5.win32.exe
QSTK-0.2.6.win32.exe
Readme.py
scikit-learn-0.14.1.win32-py2.7.exe
scipy-0.11.0-win32-superpack-python2.7.exe
setuptools-0.6c11.win32-py2.7.exe
statsmodels-0.5.0.win32-py2.7.exe
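
After running the installers, one way to confirm everything took is to try importing each package.  This is only a sketch and not part of the course materials.  The helper name check_imports is made up, and the import names in PACKAGES are assumptions (an import name can differ from the installer name, for example sklearn for scikit-learn and QSTK for qstk).

```python
import importlib

# Import names are assumptions; they can differ from the installer names.
PACKAGES = ['numpy', 'scipy', 'dateutil', 'pyparsing', 'matplotlib',
            'pandas', 'setuptools', 'cvxopt', 'sklearn', 'statsmodels', 'QSTK']

def check_imports(names):
    """Return a dict mapping each module name to True if it imports, else False."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status
```

Calling check_imports(PACKAGES) from the Python prompt returns a dictionary, and any name mapped to False is a package that still needs installing (or whose import name I guessed wrong).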

Friday, August 23, 2013

time off

I had to back away for a while.  I got irritated with "Coding the Matrix".  I dropped the course.  I have a short term memory problem.  I used to have a near eidetic memory with the exception of vocabulary.  I've lost much of that capability over the years.  My memory is still quite good as long as I'm not being asked to memorize vocabulary.  "Coding the Matrix" was mostly vocabulary.  Even though I would pause the lectures and replay sections several times, the presentation didn't work for me.  When he would say something like "There are three equivalent definitions for matrix/matrix multiplication..." my mind was already being overtaxed.  I need an anchor to control context switching, not an item to hold in short term memory.  When dealing with matrix/vector multiplication and vector/matrix multiplication he displayed the vector horizontally in both cases, so I was supposed to remember that there were three equivalent definitions, that vectors to the right of a matrix are to be mentally transposed from a row of numbers to a column of numbers, listen for the name of a particular definition, and remember the distinction between equivalent algorithms associated with those names.

Ah well.  I'll take the class again some other time and not try to handle two classes at the same time.  I'm not sure I will ever learn the distinction between equivalent named algorithms and appropriately construct distinct implementations.  I can view algorithms and know they are equivalent.  I would write a working solution in minutes then spend hours trying to figure out why it was graded incorrect.  Sometimes the problem was the order in which scalar numbers were multiplied, A*B vs B*A.  Getting this stuff down will be good for me even though I'd rather spend time applying Markov Chain Monte Carlo to several domains before the algorithm gets knocked out of my memory.  While "Computational Molecular Evolution" was hard and it too introduced lots of new vocabulary, I could absorb it.  The instructor had a different approach to vocabulary.

I think I'll pick up Rosalind in the evenings again.  Maybe that will clear my mind so I can come back to MCMC and Bayesian analysis without the echoes from the Matrix class.  I have several applications in mind.

Wednesday, August 7, 2013

Wrapping up Computational Molecular Evolution.

Computational Molecular Evolution turned out to be something different than I expected.  I'll be finishing the last assignment tomorrow evening.  It is a survey class.  There was lots of hands-on work with several programs.  Many techniques were discussed.  I'll have an easier time getting up to speed the next time I encounter these things, but I doubt I could use these programs by myself and get good results.  The field is quite in flux.  There are lots of good tools out there but they are changing all the time.

In the final assignment RevTrans 1.4 is mentioned to do DNA alignment.  RevTrans 2.0 is in beta release now.  It converts the DNA sequences into proteins, does the alignment on the proteins, then maps the DNA by codon to the aligned protein.  The Center For Biological Sequence Analysis was upgrading their servers.  Once I got the sequence I tried saving it to my (virtual) hard disk.  The download process failed.   I used copy and paste. 
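
The last step described above, mapping the DNA by codon onto the aligned protein, can be sketched in a few lines.  This is my own illustration and not RevTrans code.  The function name is made up, and it assumes the DNA has no trailing stop codon and that translation and protein alignment were already done by other tools.

```python
def map_dna_to_protein_alignment(aligned_protein, dna, gap='-'):
    """Thread unaligned DNA onto an aligned protein, one codon per residue."""
    residues = len(aligned_protein) - aligned_protein.count(gap)
    if len(dna) != 3 * residues:
        raise ValueError('DNA length must be 3 times the number of residues')
    pieces = []
    pos = 0  # index of the next unused codon in dna
    for symbol in aligned_protein:
        if symbol == gap:
            pieces.append(gap * 3)           # a protein gap becomes a three-base gap
        else:
            pieces.append(dna[pos:pos + 3])  # copy the codon for this residue
            pos += 3
    return ''.join(pieces)

print(map_dna_to_protein_alignment('M-K', 'ATGAAA'))  # ATG---AAA
```

Because gaps are only ever inserted three bases at a time, the DNA alignment keeps every codon intact, which is the point of aligning at the protein level first.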

While I was told to save the file with the .fasta extension, the file produced by RevTrans 1.4 was in the ClustalW multiple alignment format.  EBI readseq was able to detect the real format just fine.  The rule is let it autoselect.  It was also able to write the PHYLIP4 format used by BioPerl.

I guess I am getting a feel for the range of tools available even if I don't know how to use them very well.

Python 101, part 2

If you haven't loaded Python then you need to do that first.  Let me know if the instructions I wrote up in Python 101, part 1 are too confusing.  I've tried to be pretty thorough.

If you're new to programming don't get too worried.  You will get the hang of it quite quickly.

Python can be used like a very strange calculator.  When you start Python you see a short banner with the version information, followed by a line that starts with ">>>".
The cursor will be one space after the ">>>" characters.  These are the command prompt characters.
To use Python as a calculator one just types the formula after the prompt then presses the enter key:

>>> 5 + 1
6
>>>

Unlike the normal calculator you probably have around the house, Python follows the math rules of operator precedence.  What that means is the multiply and divide operators get executed before the add and subtract operators.

>>> 5 + 1 * 2
7
>>>

You can tell python you want certain operations to happen first by putting parens (parentheses) around the operations you want done first.

>>> (5 + 1) * 2
12
>>>

All of this is defined in the Python Tutorial under Using Python as a Calculator.  While the manual may seem frightening or confusing at first, you get used to these things.  Don't read manuals like you would a book.  Use the manual like you would a catalog or a web site like BestBuy.com.

Most calculators have MR and MS buttons.  These are Memory Recall and Memory Store.  Most programming languages let you name storage locations and Python is no exception.  If you've looked at the link to Using Python as a Calculator you've seen that these memory stores are called "Variables".  In Python you put the variable on the left side of the equal sign (=).  Remember this is just a convention to store the result of a calculation in the named location.  You can then recall the value by using its name.

>>> a = 5 + 1 * 2
>>> a
7
>>>

This, and the stuff one gets when one clicks on the "click to expand" text should be enough to solve Rosalind's Variables and Some Arithmetic.  I haven't really added much other than a lot of words.

If one were limited to numbers, programming languages wouldn't be very interesting.
Strings are covered quite nicely in the Python Tutorial.  If it's too much then just stick with the basic string for now and remember the backslash (\) needs to be doubled up in certain cases.  In the example below the \t in the first string is read as a tab character.

>>> a='this\that'
>>> a
'this\that'
>>> print(a)
this    hat
>>> a='this\\that'
>>> print(a)
this\that
>>>
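
Counting the characters makes the difference easier to see.  This is just a sketch.  The variable names are made up, and the raw string (the r prefix) is a standard Python alternative to doubling the backslash that I didn't cover above.

```python
tabbed = 'this\that'     # \t is read as a single tab character
doubled = 'this\\that'   # \\ is read as one literal backslash
raw = r'this\that'       # raw string: backslashes are left alone

print(len(tabbed))       # 8
print(len(doubled))      # 9
print(doubled == raw)    # True
```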

Rosalind suggests using v2.7.5.  Version 2 of Python doesn't require parens around what one wants to print.  Rosalind covers lists and strings together because strings are handled like lists of characters.  There is a never ending battle among programmers concerning zero relative or one relative indexing.  Python uses zero relative indexing.  Python also lets you index a range of items in a list at the same time; just remember that the range starts at the zero relative indexed item and stops one short of the second index.  One can "add" one string or list to another using the plus sign (+).  Be careful when dealing with lists.  Selecting a range returns a list, even when the range holds only one item.

>>> a='one rainy day'
>>> a[0]
'o'
>>> a[0:4]+ a[10:]
'one day'
>>> a=['one','rainy','day']
>>> a[2]+' '+a[0]
'day one'
>>> a[1:3]+a[0:1]
['rainy', 'day', 'one']
>>>
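
Here is a small sketch of why one needs to be careful with lists.  The variable names are only for illustration.  Indexing a list gives back the item, slicing gives back a list, and the plus sign will not mix the two.

```python
words = ['one', 'rainy', 'day']

first_item = words[0]     # indexing gives the item itself: 'one'
first_slice = words[0:1]  # slicing gives a list, even with one item: ['one']

print(words[1:3] + first_slice)   # list + list works

try:
    words[1:3] + first_item       # list + string does not
except TypeError as err:
    print('TypeError: %s' % err)
```

That is why the last example above uses a[0:1] rather than a[0].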

I hope this is helping you get started.

Monday, August 5, 2013

Coding the Matrix Lectures vs Assignments

It is probably about learning style.  I don't do well in lecture classes.  I have trouble with short term memory.  By the time I'm done listening to a lecture about two types of matrix-vector multiplication and vector-matrix multiplication I've been vectored into the state of confusion.  My lights are out.  I find myself reviewing Wikipedia to do the assignments.  I've tried going back into the videos and listening to them several times.  It doesn't sink in.  Mathematicians have a love affair with Greek letters.  Just seeing them does something to my mind.  It doesn't really matter how many times I'm told about how a vector space can be defined by a span; none of this is grounded.  I'm hearing word salad.  I fear I'm going to have some real trouble ahead.  I've been able to get the correct answer to all of the assignments so far.  When it came to vector space I ended up doing a Boolean search.  On two of the three tasks it took me 4 attempts to get the correct answer to two questions, with the answer to each question being True or False.  Four attempts at four possibilities to get the correct one?  That's obviously worse than chance.

I really didn't understand vector space.

Here's hoping things get better.

Saturday, August 3, 2013

On retrieving protein sequences from online databases.

Getting a FASTA file from the UniProt database is quite simple.  I'll need to review the user agreement before I go very far, but here's a simple Python 2.7.5 routine to get a set of proteins from UniProt based upon the accession IDs:

import urllib2

def getFromUniprot(filename,accessionList):
    """Get a set of protein sequences from the UniProt database"""
    FastaData=''
    for accession in accessionList:
        # Each entry is fetched as FASTA text from <accession>.fasta
        x=urllib2.urlopen('http://www.uniprot.org/uniprot/'+accession+'.fasta')
        FastaData+=x.read()
        x.close()
    # Write all of the downloaded records into a single file
    f=open(filename,'w')
    f.write(FastaData)
    f.close()   # the parens matter: a bare f.close never closes the file

getFromUniprot('mprt.fasta',['A2Z669','B5ZC00',
        'P07204_TRBM_HUMAN','P20840_SAG1_YEAST'])

While this could be part of a solution to Rosalind's Finding a Protein Motif there's a bit more to it than that.  I tried something similar with the NCBI database using a URL like this:
http://www.ncbi.nlm.nih.gov/protein/238064857?report=fasta&format=text but it doesn't work.  NCBI expects you to use their Entrez Utilities.  BioPython has its own interface on top of Entrez.
The two websites may have slightly different protein sequences.  The header to the sequence for B5ZC00 looks like this (I've broken the line so it is easier to read):

UniProt:
>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 
   (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
NCBI:  
>gi|238064857|sp|B5ZC00.1|SYG_UREU1 RecName: Full=Glycine--tRNA ligase;
    AltName: Full=Glycyl-tRNA synthetase; Short=GlyRS

The sp prefix indicates SWISS-PROT.  The gi prefix indicates GenInfo Integrated Database.  Here are the prefixes handled by NCBI.  The NCBI header seems to indicate a SWISS-PROT variant of the protein mentioned above.  Interestingly http://www.uniprot.org/uniprot/B5ZC00.1.fasta returns a different header with the prefix "UniProtKB/TrEMBL".  There's so much to learn and more of it every day.  I guess I could put all three into a single file then run a Hamming count on them to see if they differ.  Unfortunately I'm using the second field as the name when I build my python structures, so I'd have to fudge one of the ones retrieved from the UniProt DB so they would have different names/accession IDs.
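
A rough sketch of that comparison follows.  The function names parse_fasta and hamming are made up for illustration, the headers are shortened versions of the ones shown above, and the sequences are placeholders rather than the real B5ZC00 protein.  Hamming distance is only defined for sequences of the same length, so this assumes no insertions or deletions.

```python
def parse_fasta(text):
    """Return a dict of sequences keyed by the second '|' field of each header."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            fields = line[1:].split('|')
            # Fall back to the whole header when it has no '|' fields.
            name = fields[1] if len(fields) > 1 else line[1:]
            records[name] = ''
        elif name is not None:
            records[name] += line   # sequences may span several lines
    return records

def hamming(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError('sequences must be the same length')
    return sum(1 for x, y in zip(a, b) if x != y)

# Placeholder sequences, not the real protein.
sample = """>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase
MKNKF
STQEL
>gi|238064857|sp|B5ZC00.1|SYG_UREU1 RecName: Full=Glycine--tRNA ligase
MKNKFSTQEV
"""
records = parse_fasta(sample)
print(sorted(records))                                   # ['238064857', 'B5ZC00']
print(hamming(records['B5ZC00'], records['238064857']))  # 1
```

With the second field as the key, the NCBI record ends up named by its gi number, so it does not collide with the UniProt record.  Two records whose second field is the same would overwrite each other in the dictionary, which is the fudging problem mentioned above.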