Friday, January 31, 2014

A rant about raising the minimum wage

I don't discuss economics all that much on this blog; it isn't widely read and I don't want to turn off what few readers I have.  Nonetheless, I want to get this off my chest, and Usenet just isn't what it used to be.

I'll start with something that happened where I worked in the mid 1980s.  My employer was the defendant in a class-action lawsuit in which several women held that they weren't being paid the same as their male counterparts doing the same job.  The truth was they weren't doing the same job but a different job with some surface similarities.  They won, and a study was done to determine the appropriate pay based upon daily activities.  We were asked how much time we spent in front of the computer terminal entering code.  From this and other factors the new pay scale was set.  My employer retrained many of the keypunch operators for other fields; then we programmers quit filling out coding forms for the keypunch operators and just entered the code ourselves.  For some of the older programmers this was traumatic, but most of us felt the time spent printing in the boxes, properly slashing zeros, zeds, and sevens, underlining the number one, and so on, was better spent elsewhere.  So many of the women lost their jobs, while a few found new jobs actually doing what the men were doing to earn their pay.

Many extrapolate from cases like the one I experienced to the conclusion that when the minimum wage goes up, people lose their jobs, even though such extrapolation is unfounded.

People compete for jobs.  When more qualified people compete for the same jobs, employers can offer lower wages.  It's all supply and demand.  This even holds for jobs the employer would be more than willing to pay more to have done if the labor supply weren't so loose.  It might be hard to determine how many jobs fall into this category, but it isn't impossible.

One might wonder how the preferences of minimum wage labor would be affected by rising wages.  I'm certainly not willing to make a hard prediction on the matter.  If their consumption is primarily from their own production, then the effective prices will rise with their wage and no changes will be felt, except possibly by those marginally above the minimum, who have been able to buy superior goods while those at the minimum have had to settle for inferior ones.

If any of us who earn well above the minimum feel any pain at all when the minimum wage rises, it's because we have been taking advantage of others who, due to structural unemployment, cannot demand the wage we are evidently willing to pay, given that the raise causes us any pain at all.  Maybe there are a few things we would go without having done.  If so, those things are of marginal benefit to us anyway, so there isn't much pain.

I'm pretty sure aggregate demand will go up with the minimum wage, given our current structural problems.  Unfortunately the tax code favors the current lopsided distribution of wealth.  I fear that until the tax code is fixed, lifting the minimum wage will only have short-term benefits.

Thursday, January 30, 2014

Upon reflection. Entrez E-utilities, GenBank, Python, and XML parsing.

A few days ago I started coding my own access to NCBI.  Generating the URLs isn't that big of a deal.  Learning when to use history is more of a problem.  I haven't heard back from them after requesting that my "tool" and email be added for authorized access, so I've not done much work.
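
The URL building is plain enough that a small helper covers it.  Here's a sketch of what I have in mind; the base address and the parameter names (db, term, tool, email) are the standard E-utilities ones, while the tool and email values are just placeholders for whatever NCBI registers for me:

# Sketch: build an Entrez esearch URL by hand.
# The tool and email values are placeholders; substitute your own.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

EUTILS = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/'

def esearch_url(db, term, tool='mytool', email='me@example.com', **extra):
    """Return an esearch.fcgi URL for the given database and query term."""
    params = {'db': db, 'term': term, 'tool': tool, 'email': email}
    params.update(extra)
    return EUTILS + 'esearch.fcgi?' + urlencode(params)

print(esearch_url('nucleotide', '"Zea mays"[Organism] AND rbcL[Gene]'))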

While waiting I looked closer at the Biopython code.  It uses Python's xml.etree.ElementTree to build an element tree out of the XML the Entrez URL returns.  Since NCBI is a trusted source that isn't a problem, but the Python documentation makes me wary of unknown XML sources.  The documentation on defusedxml goes into some pretty gruesome details about some simple XML that can cause a lot of pain.  I learned the term "monkey patch".  Still, this "fix" doesn't fix everything people can do.
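
The guard I have in mind looks something like this.  It's only a sketch, and it assumes the defusedxml package is installed; the fallback to the standard library is fine for a trusted source like NCBI but defeats the purpose for untrusted input:

# Sketch: prefer defusedxml when parsing XML from sources you don't control.
try:
    import defusedxml.ElementTree as ET   # refuses entity-expansion tricks
except ImportError:
    import xml.etree.ElementTree as ET    # fine for trusted sources like NCBI

def parse_xml(text):
    """Parse an XML string and return the root element."""
    return ET.fromstring(text)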

I think everyone expects people to make errors.  Programmers can unintentionally cause computer problems.  I doubt many of us haven't accidentally created a never-ending loop.  Sometimes the problems one looks at are NP-complete, so one has little choice when traversing the search space but to let the program run for extended periods of time or just give up.  That being said, one can lament the intentional creation of pain for others.

I'm going to have to tread lightly in this area because I don't like thinking about protecting myself from others' intentions, even though I do it when I have to.  I like simple code that gets a lot of good done.  I fix issues when I happen upon them.  I don't like a bunker mentality; it wastes too much energy for little gain.

Saturday, January 25, 2014

Rosalind GenBank Introduction.

OK, the hint for Rosalind's GenBank Introduction says:

NCBI's databases, such as PubMed, GenBank, GEO, and many others, can be accessed via Entrez, a data retrieval system offered by NCBI. For direct access to Entrez, you can use Biopython’s Bio.Entrez module.

That's all well and good but I want to know how to access NCBI's databases directly.  I don't know why I missed this part:

Note that when you request Entrez databases you must obey NCBI's requirements:
  • For any series of more than 100 requests, access the database on the weekend or outside peak times in the US.
  • Make no more than three requests every second.
  • Fill in the Entrez.email field so that NCBI can contact you if there is a problem.
  • Be sensible with your usage levels; if you want to download whole mammalian genomes, use NCBI's FTP.
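
Of those rules, the three-requests-per-second limit is the easiest one to respect in code.  A minimal sketch of a throttled fetch (urllib2 here since I'm still on Python 2; urllib.request plays the same role in Python 3):

import time
import urllib2  # use urllib.request in Python 3

def polite_fetch(urls, delay=0.34):
    """Fetch each URL in turn, sleeping so we stay under about 3 requests per second."""
    results = []
    for url in urls:
        results.append(urllib2.urlopen(url).read())
        time.sleep(delay)
    return results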

I used the sample and built the URL directly, then pasted it into my web browser.  I got back this XML document:


<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>6</Count>
  <RetMax>6</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11994090</Id>
    <Id>168598</Id>
    <Id>30959082</Id>
    <Id>11990232</Id>
    <Id>12445</Id>
    <Id>18035</Id>
  </IdList>
  <TranslationSet>
    <Translation>
      <From>"Zea mays"[Organism]</From>
      <To>"Zea mays"[Organism]</To>
    </Translation>
  </TranslationSet>
  <TranslationStack>
    <TermSet>
      <Term>"Zea mays"[Organism]</Term>
      <Field>Organism</Field>
      <Count>408310</Count>
      <Explode>Y</Explode>
    </TermSet>
    <TermSet>
      <Term>rbcL[Gene]</Term>
      <Field>Gene</Field>
      <Count>111602</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>AND</OP>
  </TranslationStack>
  <QueryTranslation>"Zea mays"[Organism] AND rbcL[Gene]</QueryTranslation>
</eSearchResult>

That's great. But I forgot to include my email and "tool" parameters. I felt guilty.  I sent them an email to register my email address and "tool" so I can roll my own rather than use Biopython.  If you care to see the Biopython code you can look here.  Interestingly, the default tool name is biopython.  One can pass it a different tool name if one desires.  I don't know how tight they are on their requirements.  I've suggested they add a warning section to the XML document and use that to communicate their displeasure when one doesn't follow the rules, and to give one some hints as to how quickly they'll shut down the IP address.  I haven't heard back.
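
Pulling the Id list back out of an eSearchResult like the one above only takes a few lines with ElementTree.  A sketch, assuming the response text is already in hand:

import xml.etree.ElementTree as ET

def extract_ids(esearch_xml):
    """Return the Id values from an eSearchResult document as a list of strings."""
    root = ET.fromstring(esearch_xml)
    return [id_elem.text for id_elem in root.findall('IdList/Id')]

# e.g. extract_ids(response_text) would give ['11994090', '168598', ...]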

The Rosalind problem talks about selecting a publishing date range.  Here's the definition of the date parameters:

Optional Parameters – Dates

datetype

Type of date used to limit a search. The allowed values vary between Entrez databases, but common values are 'mdat' (modification date), 'pdat' (publication date) and 'edat' (Entrez date). Generally an Entrez database will have only two allowed values for datetype.

reldate

When reldate is set to an integer n, the search returns only those items that have a date specified by datetype within the last n days.

mindate, maxdate

Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range. The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.
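
Tacking those date limits onto the query string is straightforward.  A sketch, with illustrative dates in the allowed YYYY/MM/DD form:

# Sketch: the Zea mays query again, limited by publication date.
try:
    from urllib.parse import urlencode   # Python 3
except ImportError:
    from urllib import urlencode         # Python 2

params = {'db': 'nucleotide',
          'term': '"Zea mays"[Organism] AND rbcL[Gene]',
          'datetype': 'pdat',
          'mindate': '2003/07/25',
          'maxdate': '2005/12/27'}
url = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?'
       + urlencode(params))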
Well, good luck working on this Rosalind problem.  It's pretty simple.  I don't know how hard-nosed they are about their rules.
 

Tuesday, January 21, 2014

Computational Neuroscience, MatLab, and Python

OK, I have week two under my belt.  I'm not as quick as I used to be.  Still, I seem to be able to remember most of the material from week two.

MatLab has some interesting functions not in Python, or so I believe.

MatLab has a function "find" that, when passed a vector, returns a vector containing the indices of the non-zero values.  In Python one has to use a comprehension to do the same with a list.


MatLab:

  a=[10 20 0 30]
  find(a)

Python:

  a=[10, 20, 0, 30]
  [i for (i,v) in enumerate(a) if v!=0]

This begins to highlight MatLab's basic concept of vectors and matrices as opposed to Python's implementation of lists.  If you want to work more naturally with vectors in Python you need to import numpy (see the sketch after the Python block below).  Without numpy one has to do things like this:


MatLab:

  a=[1 2 3 4]
  a(a > 2)
  b=find(a > 2)
  x=1
  a(b-x)
  mean(a)
  sin(a)

Python:

  a=[1,2,3,4]
  [v for v in a if v > 2]
  b=[i for (i,v) in enumerate(a) if v > 2]
  x=1
  [a[i-x] for i in b]
  float(sum(a))/len(a)
  import math
  map(math.sin,a)
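
With numpy the same operations look much closer to the MatLab versions.  A sketch, assuming numpy is installed, and keeping in mind numpy indices are 0-based where MatLab's are 1-based:

  import numpy as np

  a = np.array([1, 2, 3, 4])
  a[a > 2]                   # boolean indexing, like a(a > 2)
  b = np.nonzero(a > 2)[0]   # indices of matches, like find(a > 2)
  x = 1
  a[b - x]                   # index arithmetic is element-wise
  np.mean(a)                 # like mean(a)
  np.sin(a)                  # element-wise, like sin(a)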

So when trying to find the Spike Triggered Average I was glad I was working in Octave rather than Python.  The Spike Triggered Average is the average of the stimulus, in this homework the Gaussian white noise current driving the neuron, over some arbitrarily sliced window of time prior to the neuron producing a spike.  By graphing the STA one can sometimes spot features that tell one something about what the neuron is doing.
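
In Python/numpy the calculation would look roughly like this.  It's a sketch with my own variable names, and it assumes the stimulus and a 0/1 spike train are sampled on the same time grid:

import numpy as np

def spike_triggered_average(stimulus, spikes, window):
    """Average the stimulus over the `window` samples preceding each spike.

    stimulus: 1-D array of stimulus values (e.g. Gaussian white noise)
    spikes:   1-D array of 0/1 values marking spikes, same length as stimulus
    window:   number of samples to look back before each spike
    """
    spike_times = np.nonzero(spikes)[0]
    spike_times = spike_times[spike_times >= window]   # skip spikes too early for a full window
    segments = [stimulus[t - window:t] for t in spike_times]
    return np.mean(segments, axis=0)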

Friday, January 10, 2014

Computational Neuroscience starts today.

Earlier this week I checked and saw I wasn't enrolled even though I thought I was.  I was a bit relieved.  Yesterday I found out my current fasting AC was 245.  I knew it was high but not that high.  I had a filling fall out a week before Christmas and set up an appointment to have the tooth removed.  It turned out it was just an exam.  The surgeon wasn't going to remove the tooth anyway but after the exam he said he didn't want to remove it until I got my blood sugar under control.  I was sort of hoping to not know and let nature take its course but I don't want to let the shell of a tooth just hang out because it's painful.  I have trouble dealing with lots of stress and dealing with this is stressful.

Anyway....

I'm enrolled and don't know why it didn't appear on my Coursera dashboard earlier this week.

The first week was pretty simple.  I'll go ahead and go through the homework because my MatLab isn't that good.  I've learned some new things in the lectures, but he also covered stuff I learned in "Brains, Neurons, and Synapses".  I'm glad I'm able to correlate what I learned in that course with stuff I get from this course.  Unfortunately it looks like the dendrite model will be collapsed into a point neuron, even though some of the delay caused by the length of the axon may be handled.  I'm glad our model won't be a point neuron with a value representing firing rate, as one of the suggested books describes its model.  I bought both books several years ago but misplaced the "Theoretical Neuroscience" textbook when I moved into my current house a year and a half ago.  It's a bit pricey on Amazon so I'll go without it for now.  I bought a copy of the other book again and was working through it, lacking enthusiasm because of the model.  I'll hope for the best.

One of the major differences between point neurons and cable theory is that in cable theory integration happens in the dendrites.  The ion flows into and out of a branch sum to zero.  The movement is generally towards the soma because the branches get bigger closer to the soma and the soma itself is a big collection pool.  I don't know, maybe there are action potentials at work all along the dendrite, not just at the axon hillock.  The rule is to suspend disbelief.

It's hard to think about the Hinton classifier, a Restricted Boltzmann Machine, while learning about more natural models of the brain.  I hope I can handle it.  There are places where functionality takes precedence over accuracy.

Tuesday, January 7, 2014

Trying to learn Hinton's Restricted Boltzmann Machines

OK, years ago I watched the YouTube video "The Next Generation of Neural Networks".  It's just under an hour and it piqued my interest.  I tried implementing the code he shared on his website using Octave.  I couldn't get it to work.  I ended up dropping it for the time being.  Now that home machines are faster and can handle more data I'm inching towards another attempt at learning this technology.

Hinton's approach is to learn using a Restricted Boltzmann Machine.  A quick search popped up Richard Weiss's Machine Learning and Programming blog, and the article Restricted Boltzmann Machines in Python in particular.  The follow-up article pointed to GitHub.  All of this looked promising so I downloaded it and ran it.  It printed several numbers then quit.  It wasn't quite done.

Richard Weiss took nearly a year off from his blog, I know the feeling, then last month posted one new article.  I like the clean presentation of his blog.  I'm not sure how to make Blogger do some of this stuff.  When I copy and paste the HTML I don't get the same results.  I don't know, Blogger must be doing or not doing something WordPress doesn't or does do.  I'm not ready to move to a better blog platform.

Anyway, the code may be useful.  I'll see what I can adapt from it and then report.  I'm not sure I want to create a project on either GitHub or SourceForge unless I can present something pretty interesting, especially when there are people out there who've produced and shared some pretty good code.

Richard's code is attempting to demonstrate the ability of a Restricted Boltzmann Machine (RBM) to group texts about matrices and movies into different two-dimensional vectors, as indicated in Hinton's video.  The routine go() tries to plot magnitude_mat but it's just a 7 by 4 array of zeros.  Something got lost in the translation.  Maybe his training data was too small.
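
For my own notes, the core of an RBM weight update (contrastive divergence with a single Gibbs step) is small enough to sketch in numpy.  This is just the textbook CD-1 rule with the biases left out for brevity, not Richard's code or Hinton's:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, visible, learning_rate=0.1):
    """One CD-1 update of W for a batch of binary visible vectors.

    W:       (num_visible, num_hidden) weight matrix (biases omitted)
    visible: (batch_size, num_visible) array of 0/1 data
    """
    # Positive phase: hidden probabilities and a binary sample driven by the data.
    h_prob = sigmoid(np.dot(visible, W))
    h_sample = (np.random.rand(*h_prob.shape) < h_prob).astype(float)
    # Negative phase: reconstruct the visible units, then the hidden probabilities again.
    v_recon = sigmoid(np.dot(h_sample, W.T))
    h_recon = sigmoid(np.dot(v_recon, W))
    # Move W toward data-driven correlations and away from reconstruction-driven ones.
    positive = np.dot(visible.T, h_prob)
    negative = np.dot(v_recon.T, h_recon)
    return W + learning_rate * (positive - negative) / visible.shape[0]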

Saturday, January 4, 2014

Still working on UniProt protein file parser

OK, I've decided to store the information extracted from the individual SwissProt and TrEMBL knowledge base files in a nested tree structure like this sample:


KB={'P31994': {
  'ID': {
    'ProtID':'FCG2B',
    'SpecID':'HUMAN',
    'Status':'Reviewed',
    'Length':310},
  'AC': ['P31994', 'A6H8N3', 'O95649', 'Q53X85', 'Q5VXA9', 'Q8NIA1'],
  'DT': {
    'Integrated': {
      'Date':'01-JUL-1993',
      'KB':'UniProtKB/Swiss-Prot'},
    'Sequenced': {
      'Date':'30-MAY-2000',
      'Version':2},
    'Entry': {
      'Date':'11-DEC-2013',
      'Version':158}},
...}

When a section can have multiple entries it will be in a list, like AC. This means one would access a particular item or set of items like this:

KB['P31994']['AC'][0]
KB['P31994']['DE']['AltName'][1]['Short']

I can provide the structure of the document and descriptions of the various fields in a similar way.  Much of this is documentation with IDs into other systems.  Things change over time, so some of the documentation is out of date or otherwise unavailable.  A relatively new line type, RX, has some issues.  I tracked down how to use the document IDs:


import webbrowser

#The UniProt referenced database names used by RX records
#append with publication ID
UniProtPubDB={'MEDLINE':'', # you need a cross reference to pubmed id.
              'PubMed':'http://www.ncbi.nlm.nih.gov/pubmed/',
              'DOI':'http://dx.doi.org/',
              'AGRICOLA': ''  # requires login
              }

webbrowser.open_new(UniProtPubDB['PubMed']+'2531080')

The MEDLINE UI search has been deprecated, so one would need a cross-reference. AGRICOLA requires a login ID. I've only looked at four proteins. None have either MEDLINE or AGRICOLA references. The sample is too small, but MEDLINE UI isn't of much use.

So much for today.

Thursday, January 2, 2014

Back to Rosalind, Python, and Biopython

My energies are quite unfocused. Work seems never-ending.  When I come home I just want to veg out in front of the computer or TV.  I feel the desire to do more, but when it actually comes down to doing more my desires dissipate.  Yesterday evening I "solved" the first Rosalind problem I've done in a while, "Introduction to Protein Databases", or DBPR for short.  I had started forgetting the Python language.

DBPR states the problem to be solved this way:
Given: The UniProt ID of a protein.
Return: A list of biological processes in which the protein is involved (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).

Programming Shortcut

ExPASy databases can be accessed automatically via Biopython’s Bio.ExPASy module. The function .get_sprot_raw will find a target protein by its ID.
We can obtain data from an entry by using the SwissProt module. The read() function will handle one SwissProt record and parse will allow you to read multiple records at a time.
Let's get the data for the B5ZC00 protein:

>>>from Bio import ExPASy
>>>from Bio import SwissProt
>>>handle = ExPASy.get_sprot_raw('B5ZC00') #you can give several IDs separated by commas
>>>record = SwissProt.read(handle) # use SwissProt.parse for multiple proteins
We now can check the list of attributes for the obtained Swiss-Prot record:

 >>> dir(record)
 [..., 'accessions', 'annotation_update', 'comments', 'created', 'cross_references',
 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'host_taxonomy_id', 'keywords',
 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence',
 'sequence_length', 'sequence_update', 'taxonomy_id']
To see the list of references to other databases, we can check the .cross_references attribute of our record:

>>>record.cross_references[0]
('EMBL', 'CP001184', 'ACI60310.1', '-', 'Genomic_DNA')
OK, I looked at Biopython.  It's quite impressive.  I didn't use it.  I wanted to do my own thing.  After solving the problem I looked at other people's published solutions.  Different people have different focus and preferences.  Most of the solutions were very specific to the problem at hand.  Mine wasn't much better.  So my unfocused energy went about solving yet another already solved problem.
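
The gist of my approach was to fetch the flat-text record straight from UniProt and pick out the GO biological-process cross-references, which are the DR GO lines tagged with "P:".  Here's a sketch along those lines; it isn't the code I actually submitted, and the URL pattern is the one UniProt publishes for flat-text records:

import urllib2  # use urllib.request in Python 3

def biological_processes(uniprot_id):
    """Return the GO biological process names from a UniProt flat-text record."""
    url = 'http://www.uniprot.org/uniprot/%s.txt' % uniprot_id
    record = urllib2.urlopen(url).read()
    processes = []
    for line in record.splitlines():
        # GO cross-references look like: DR   GO; GO:0006412; P:translation; IEA:...
        if line.startswith('DR   GO;') and '; P:' in line:
            processes.append(line.split('; P:')[1].split(';')[0])
    return processes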

From the ExPASy documentation I started building my own Swiss-Prot parser.  I'm working on the Description lines now.  I'm thinking I'll go back and embed dictionaries for some of this so access would be something like this:


B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['entry']['date']
B5ZC00['recommended_name']

That might be too wordy so maybe I should use codes and a dictionary to expand them.


SwissProtDict={'ID':'Identification','ID_P':'Protein ID','AC':'Accession'}
B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['DT_E']['DT']
B5ZC00['DE_RN']['FULL']


Here are the first three line types:

def parseSwissProt(protID):
    """Parse a Swiss-Prot flat file named <protID>.txt into a dictionary."""
    sp = dict()
    sp['accession'] = []
    sp['sequence'] = ''
    with open(protID + '.txt', 'r') as f:
        for line in f:
            lt = line[0:2]            # two-letter line type code
            ld = line[2:-1].lstrip()  # line data with the newline stripped
            if lt == 'ID':    # Identification
                l = ld.split()
                [sp['proteinID'], x, sp['speciesID']] = l[0].partition('_')
                sp['status'] = l[1][:-1]   # drop trailing ';'
                sp['length'] = int(l[2])
            elif lt == 'AC':  # Accession numbers
                sp['accession'].extend([x[:-1] for x in ld.split()])
            elif lt == 'DT':  # Date lines
                [dt, x, txt] = ld[:-1].partition(', ')
                dtt = txt.split()
                if dtt[0] == 'integrated':
                    sp['integrated_date'] = dt
                    sp['integrated_into'] = dtt[2].split('/')[1]
                elif dtt[0] == 'sequence':
                    sp['sequenced_date'] = dt
                    sp['sequence_version'] = int(dtt[2])
                elif dtt[0] == 'entry':
                    sp['entry_date'] = dt
                    sp['entry_version'] = int(dtt[2])
    return sp