Saturday, June 28, 2014

Overcoming DNA test errors.

I'm investigating my assumptions here and will follow up soon.

I recently learned sequencing machines have a 1% error rate.  That means out of the 700,000 or so SNPs identified in the Ancestry.com DNA tests, about 7,000 of them are incorrectly called unless the testers have taken steps to validate inconsistencies or the published error rates are out of date.  Even a 0.01% error rate would be 70 errors, or just under 2 per chromosome.
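
The arithmetic is simple enough to sanity-check; these are just the assumed rates from above, not measured figures:

snps_tested = 700000
for error_rate in (0.01, 0.001, 0.0001):        # 1%, 0.1%, 0.01% per call
    errors = snps_tested * error_rate
    print('%.2f%% error rate: ~%d bad calls, ~%.1f per chromosome' %
          (error_rate * 100, errors, errors / 46.0))   # spread over all 46 chromosomes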

Right now the tested people have no way to challenge obvious errors and get them corrected.  It calls into question my supposition about how the moderate, low, and very low confidence levels are determined.  I may have to actually look at my sister's and my parents' DNA results.

On some of the matches with hints, meaning both tested individuals are identified in trees and both trees name the same person as an ancestor of the tested individual, the confidence levels for my sister and me are higher than for my mother.  That doesn't make sense, since my relationship to my cousins passes through one of my parents, so once we're looking at a single chromosome to identify someone as a cousin, my mother's confidence level should never be lower than mine.

While there is a non-zero probability that my sister and I share exactly the same mutation, one that undoes a mutation away from the ancestor's sequence, it's much more likely our mom's DNA test has a sequencing error and the SNP is miscalled in her test.
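
If I do pull up the raw files, a quick consistency check is straightforward: at every autosomal site a child has to share at least one allele with each parent, so any site where that fails is either a fresh mutation or, more likely, a bad call.  A sketch, with placeholder filenames:

def load_genotypes(filename):
    # Read an AncestryDNA raw export into {rsid: (allele1, allele2)}, skipping no-calls.
    calls = {}
    with open(filename) as f:
        for line in f:
            if line.startswith('#') or line.startswith('rsid'):
                continue
            rsid, chrom, pos, a1, a2 = line.split()
            if a1 != '0' and a2 != '0':
                calls[rsid] = (a1, a2)
    return calls

child = load_genotypes('my_dna.txt')      # placeholder filenames
parent = load_genotypes('mom_dna.txt')
conflicts = [rs for rs in child
             if rs in parent and not (set(child[rs]) & set(parent[rs]))]
print('%d sites where child and parent share no allele' % len(conflicts))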

Thursday, June 26, 2014

On supporting ancestry trees with DNA tests.

The following has some errors.  I will try to explain in a later post.

It has been a while.  I grew tired of programming at work then coming home and programming some more.  I've been working on my genealogy for a while now.  I'd like to bounce some ideas off you.

When I had my DNA tested by Ancestry.com and then extracted the results as a text file, I found it contained some documentation followed by the start of the data, as indicated below:
#Genetic data is provided below as five TAB delimited columns.  Each line
#corresponds to a SNP.  Column one provides the SNP identifier (rsID where
#possible).  Columns two and three contain the chromosome and basepair position
#of the SNP using human reference build 37.1 coordinates.  Columns four and five
#contain the two alleles observed at this SNP (genotype).  The genotype is reported
#on the forward (+) strand with respect to the human reference.
rsid    chromosome    position    allele1    allele2
rs4477212    1    82154    T    T
rs3131972    1    752721    A    G
rs12562034    1    768448    G    G
...
rs6517463    21    39752673    0    0
On average each generation has a single mutation across the 23 pairs.  Most of the mutations will happen in the non-coding regions, so they will have no impact on the SNPs mentioned above.  Most mutations will be transitions, between A and G or between C and T, rather than transversions.  Deletions and insertions are very rare.  The allele position is based upon alignment against the reference, so deletions are indicated by a 0. I don't know how insertions are handled, maybe a repetition of rsID and position.
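
Here's a minimal sketch of how I'd read the raw export back in: skip the # documentation lines and the column header, split each tab-delimited line, and set aside the 0 calls.  The filename is just a placeholder.

def read_ancestry_raw(filename):
    # Parse an AncestryDNA raw export into {chromosome: [(rsid, position, allele1, allele2), ...]}
    snps = {}
    with open(filename) as f:
        for line in f:
            if line.startswith('#') or line.startswith('rsid'):
                continue  # skip the documentation header and the column names
            rsid, chrom, pos, a1, a2 = line.split()
            if a1 == '0' and a2 == '0':
                continue  # no usable call at this position
            snps.setdefault(chrom, []).append((rsid, int(pos), a1, a2))
    return snps

snps = read_ancestry_raw('dna_raw.txt')   # placeholder filename
print('%d called SNPs across %d chromosomes' %
      (sum(len(v) for v in snps.values()), len(snps)))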

There were a bit over 700,000 lines covering the 22 autosomes plus chromosomes 23, 24, and 25, which
represent the X-unique, Y-unique, and shared X/Y alleles, though not necessarily in that order.  That's about 30,000 identified bases per chromosome.  Ancestry selected the sites tested because they represent common variations within the human population.  4 to the 30,000th power seems large enough that no two individuals should ever have identical chromosomes, but that doesn't really make sense given the way we come by them.

Except for very rare cases one shares 23 chromosomes with one's father and 23 with one's mother.  Even if both parents had completely unique chromosomes I would share 50% with each.  My sister also shares 50% of her chromosomes with each parent, but not the same ones: at a minimum, I get a Y from my father's father where my sister gets my father's X, which came from his mother.  All sisters will share the same X chromosome from their paternal grandmother.

Most of the time siblings will share about 12 chromosomes with each other; the number can go up or down based on a normal distribution.  Cousins will share, again roughly normally distributed, around 5 chromosomes, and second cousins will share around 2.  It is highly unlikely there will be any point mutations within the documented sites within 4 generations.

Without knowing Ancestry's algorithm, I am assuming a 95% confidence level means one unmodified chromosome, Moderate means one mutation, Low means two mutations, and Very Low means three mutations.

I have 16 unique great-great-grandparents, but only 30 unique 3rd-great-grandparents out of the 32 slots, thanks to a little pedigree collapse.  It's likely I don't share any DNA with some of them, at least not through them.  Who knows, I may share 2 chromosomes with a few of them. There's no way I can share DNA through all of my 4th-great-grandparents because I only have 46 chromosomes to share amongst them.

I happen to have access to my sister's, both my parents', and my own DNA tests.  Just from the DNA hints on ancestry.com I can say that when a parent's confidence level for a relationship to an individual is higher than mine, and both are no more than 95%, then a point mutation at a tested site has happened.  When several cousins share the same confidence level for the very same portion of the tree, no mutation has happened and the same chromosome is being shared.  When the confidence drops, the mutation happened in the unshared region of the tree.

Friday, February 7, 2014

On simulating the firing rate of neurons and understanding neuronal function.

I don't know why I have a problem with higher math.  I'm a systems analyst/computer programmer by trade, and I believe I'm good at what I do.  Math is mostly symbol manipulation.  Nonetheless my mind shuts down when exposed to higher symbolic math.  Most of the material covered so far in the Coursera "Computational Neuroscience" class is symbolic math.  I am being given the formula for the firing rate of neurons, and what is being explained is how one can extract information from a trace of the firing rate of neurons.  I am left wanting.

This morning I realized what's going on.  It's a bit like explaining a radio by writing formulas for the firing rates of its transistors.  Sure, one can reconstruct the music by measuring the firing rate of the transistors, but it doesn't really explain how a radio works.  Something else is being explained.

I need to re-frame how I'm listening to the lectures.  I'll need to manipulate symbols to do the math even though I don't quite understand what the symbols are representing.  Understanding can come later.  Conditional probability is quite a powerful tool.  I've seen it applied to genetics and now to neurology.  Maybe understanding what the neurons are doing is enough for now and how they are doing it can wait.

Tuesday, February 4, 2014

I got my NCBI Entrez tool approved.

The more I look at this stuff the less value there appears to be in creating my own access tool.  I'll go ahead and develop it anyway so I can force myself to learn the interface better.  There are quite a few resources available on their web site.  I'll probably download the standalone BLAST program.

I don't get much time to decompress and play around with all my hobbies.  I'll need to spend several of the next few evenings dealing with the Computational Neuroscience class so I don't fall too far behind.

Friday, January 31, 2014

A rant about raising the minimum wage

I don't discuss economics all that much on my blog because it isn't well read and I don't want to turn off what few readers I have.  Nonetheless I want to get this off my chest, and Usenet just isn't what it used to be.

I'll start with something that happened where I worked in the mid 1980s.  My employer was the defendant in a class action lawsuit in which several women held they weren't being paid the same as their male counterparts doing the same job.  The truth was they weren't doing the same job but a different job with some surface similarities.  They won, and a study was done to determine the appropriate pay based upon daily activities.  We were asked how much time we spent in front of the computer terminal entering code, and from this and other factors the new pay scale was set.  My employer retrained many of the keypunch operators into other fields, and we programmers quit having to fill out coding forms for the keypunch operators and just entered the code ourselves.  For some of the older programmers this was traumatic, but most of us were happy to stop spending time printing in the boxes with the proper slashing of zeros, zeds, and sevens, underlining the number one, and so on.  So many of the women lost their jobs, while a few found new jobs actually doing what the men were doing to earn their pay.

Many extrapolate from cases like the one I experienced to the conclusion that when the minimum wage goes up people lose their jobs, even though such extrapolation is unfounded.

People compete for jobs.  When more qualified people compete for the same jobs the employers can offer lower wages.  It's all supply and demand.  This even holds for jobs the employer would be more than willing to pay more to have done if the labor supply wasn't so loose.  It might be hard to determine how many jobs fall into this category but it isn't impossible.

One might wonder how the preferences of minimum wage labor would be affected by rising wages.  I'm certainly not willing to make a hard prediction on the matter.  If their consumption is primarily from their own production, then effective prices will rise with their wage and no changes will be felt, except possibly by those earning marginally above the minimum who have been able to buy superior goods where those at the minimum have to settle for inferior goods.

If any of us who earn well above the minimum feel any pain at all when the minimum wage rises, it's because we have been taking advantage of others who, due to structural unemployment, cannot demand the wage we are really willing to pay, given that we feel any pain from its lifting.  Maybe there are a few things we would go without having done.  If so, those things are of marginal benefit to us anyway, so there isn't much pain.

I'm pretty sure aggregate demand will go up with minimum wage given our current structural problems.  Unfortunately the tax code favors the current lopsided distribution of wealth.  I fear that until the tax code is fixed lifting the minimum wage will only have short term benefits.

Thursday, January 30, 2014

upon reflection. entrez e-utilities, genbank, python, and xml parsing.

A few days ago I started coding my own access to NCBI.  Generating the URLs isn't that big of a deal; learning when to use the history feature is more of a problem.  I haven't heard back from them after requesting that my "tool" and email be added for authorized access, so I've not done much work.

While waiting I looked closer at the Biopython code.  It uses Python's xml.etree.ElementTree to build an element tree out of the XML the Entrez URL returns.  Since NCBI is a trusted source that isn't a problem, but the Python documentation makes me wary of unknown XML sources.  The documentation on defusedxml goes into some pretty gruesome details about some simple XML that can cause a lot of pain.  I learned the term "monkey patch".  Still, this "fix" doesn't fix everything people can do.
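
If I ever do parse XML from a source I don't trust, defusedxml is meant to be a drop-in replacement for xml.etree.ElementTree.  A minimal sketch, assuming the package is installed and using a stand-in document:

import defusedxml.ElementTree as ET

# defusedxml refuses the entity-expansion and external-entity tricks its
# documentation describes; otherwise it behaves like xml.etree.ElementTree.
xml_text = '<eSearchResult><Count>6</Count><RetMax>6</RetMax></eSearchResult>'  # stand-in
root = ET.fromstring(xml_text)
print(root.find('Count').text)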

I think everyone expects people to make errors.  Programmers can unintentionally cause computer problems; I doubt many of us haven't accidentally created a never-ending loop.  Sometimes the problems one looks at are NP-complete, so one has little choice when traversing the search space but to let the program run for extended periods of time or just give up.  That being said, one can lament the intentional creation of pain for others.

I'm going to have to tread lightly in this area because I don't like thinking about protecting myself from others' intentions, even though I do it when I have to.  I like simple code that gets a lot of good done.  I fix issues when I happen upon them.  I don't like a bunker mentality; it wastes too much energy for little gain.

Saturday, January 25, 2014

Rosalind GenBank Introduction.

OK, the hint for Rosalind's GenBank Introduction says:

NCBI's databases, such as PubMed, GenBank, GEO, and many others, can be accessed via Entrez, a data retrieval system offered by NCBI. For direct access to Entrez, you can use Biopython’s Bio.Entrez module.

That's all well and good, but I want to know how to access NCBI's databases directly.  I don't know why I missed this part:

Note that when you request Entrez databases you must obey NCBI's requirements:
  • For any series of more than 100 requests, access the database on the weekend or outside peak times in the US.
  • Make no more than three requests every second.
  • Fill in the Entrez.email field so that NCBI can contact you if there is a problem.
  • Be sensible with your usage levels; if you want to download whole mammalian genomes, use NCBI's FTP.

I used the sample query, built the URL directly, and pasted it into my web browser.  I got back this XML document:


<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>6</Count>
  <RetMax>6</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11994090</Id>
    <Id>168598</Id>
    <Id>30959082</Id>
    <Id>11990232</Id>
    <Id>12445</Id>
    <Id>18035</Id>
  </IdList>
  <TranslationSet>
    <Translation>
      <From>"Zea mays"[Organism]</From>
      <To>"Zea mays"[Organism]</To>
    </Translation>
  </TranslationSet>
  <TranslationStack>
    <TermSet>
      <Term>"Zea mays"[Organism]</Term>
      <Field>Organism</Field>
      <Count>408310</Count>
      <Explode>Y</Explode>
    </TermSet>
    <TermSet>
      <Term>rbcL[Gene]</Term>
      <Field>Gene</Field>
      <Count>111602</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>AND</OP>
  </TranslationStack>
  <QueryTranslation>"Zea mays"[Organism] AND rbcL[Gene]</QueryTranslation>
</eSearchResult>
That's great. But I forgot to include my email and "tool" parameters and felt guilty, so I sent them an email to register my email and "tool" so I can roll my own rather than use Biopython.  If you care to see the Biopython code you can look here.  Interestingly, the default tool name is biopython; one can pass it a different tool name if one desires.  I don't know how tight they are on their requirements.  I've suggested they add a warning section to the XML document and use that to communicate their displeasure when one doesn't follow the rules, and give some hints as to how quickly they'll shut down the offending IP address.  I haven't heard back.
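
For reference, adding the tool and email parameters is just two more entries in the query string.  A sketch of rolling the esearch URL by hand, with placeholder tool and email values:

try:                                      # Python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen
except ImportError:                       # Python 2
    from urllib import urlencode
    from urllib2 import urlopen

base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'
params = {'db': 'nucleotide',
          'term': '"Zea mays"[Organism] AND rbcL[Gene]',
          'tool': 'myentreztool',          # placeholder, not a registered name
          'email': 'me@example.com'}       # placeholder
url = base + '?' + urlencode(params)
xml_text = urlopen(url).read()             # the same eSearchResult document as above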

The Rosalind problem talks about selecting a publication date range.  Here's the definition of the date parameters:

Optional Parameters – Dates

datetype

Type of date used to limit a search. The allowed values vary between Entrez databases, but common values are 'mdat' (modification date), 'pdat' (publication date) and 'edat' (Entrez date). Generally an Entrez database will have only two allowed values for datetype.

reldate

When reldate is set to an integer n, the search returns only those items that have a date specified by datetype within the last n days.

mindate, maxdate

Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range. The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.
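
So limiting the search to a publication date range is just three more parameters on the same esearch URL.  A sketch, with placeholder dates, tool, and email:

try:
    from urllib.parse import urlencode    # Python 3
except ImportError:
    from urllib import urlencode          # Python 2

params = {'db': 'nucleotide',
          'term': '"Zea mays"[Organism] AND rbcL[Gene]',
          'datetype': 'pdat',              # limit by publication date
          'mindate': '2003/01/01',         # arbitrary example range
          'maxdate': '2007/12/31',
          'tool': 'myentreztool', 'email': 'me@example.com'}
url = ('http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?'
       + urlencode(params))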
Well, good luck working on this Rosalind problem.  It's pretty simple.  I don't know how hard-nosed they are about their rules.
 

Tuesday, January 21, 2014

Computational Neuroscience, MatLab, and python

OK, I have week two under my belt.  I'm not as quick as I used to be.  Still, I seem to be able to remember most of the material from week two.

MatLab has some interesting functions not in Python, or so I believe.

MatLab has a function "find" that when passed a vector returns a vector containing the indices of the non-zero values.  In Python one has to use a comprehension to do the same with a list.


MatLab:

  a=[10 20 0 30]
  find(a)

Python:

  a=[10, 20, 0, 30]
  [i for (i,v) in enumerate(a) if v!=0]

This begins to highlight MatLab's basic concept of vectors and matrices as opposed to Python's implementation of lists.  If you want to work more naturally with vectors in Python you need to import numpy; there's a sketch of that after the examples below.  Without numpy one has to do things like this:


MatLab:

  a=[1 2 3 4]
  a(a > 2)
  b=find(a > 2)
  x=1
  a(b-x)
  mean(a)
  sin(a)

Python:

  a=[1,2,3,4]
  [v for v in a if v > 2]
  b=[i for (i,v) in enumerate(a) if v > 2]
  x=1
  [a[i-x] for i in b]
  float(sum(a))/len(a)
  import math
  map(math.sin,a)
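
With numpy, the Python side reads a lot more like the MatLab side.  This is a sketch assuming numpy is installed; the indexing is zero-based, unlike MatLab's.

Python with numpy:

  import numpy as np
  a=np.array([1,2,3,4])
  a[a > 2]                  # boolean indexing, like a(a > 2)
  b=np.nonzero(a > 2)[0]    # indices, like find(a > 2) but zero-based
  x=1
  a[b-x]
  a.mean()                  # like mean(a)
  np.sin(a)                 # elementwise, like sin(a)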

So when trying to find the Spike Triggered Average I was glad I was working in Octave rather than plain Python.  The spike-triggered average is the average of the stimulus over some chosen window of time preceding each spike the neuron produces while exposed to Gaussian white noise.  By graphing the STA one can sometimes spot features that tell one something about what the neuron is doing.
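
Out of curiosity I tried the same computation in Python with numpy.  This is only a sketch with made-up stand-in data, not the course's stimulus file:

  import numpy as np

  def spike_triggered_average(stim, spike_times, window):
      # average the stimulus over the 'window' samples preceding each spike
      snippets = [stim[t - window:t] for t in spike_times if t >= window]
      return np.mean(snippets, axis=0)

  rng = np.random.RandomState(0)
  stim = rng.randn(10000)                              # Gaussian white noise stimulus
  spike_times = np.nonzero(rng.rand(10000) < 0.05)[0]  # fake spike train
  sta = spike_triggered_average(stim, spike_times, 150)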

Friday, January 10, 2014

Computational Neuroscience starts today.

Earlier this week I checked and saw I wasn't enrolled even though I thought I was.  I was a bit relieved.  Yesterday I found out my current fasting AC was 245.  I knew it was high but not that high.  I had a filling fall out a week before Christmas and set up an appointment to have the tooth removed.  It turned out it was just an exam.  The surgeon wasn't going to remove the tooth anyway but after the exam he said he didn't want to remove it until I got my blood sugar under control.  I was sort of hoping to not know and let nature take its course but I don't want to let the shell of a tooth just hang out because it's painful.  I have trouble dealing with lots of stress and dealing with this is stressful.

Anyway....

I'm enrolled and don't know why it didn't appear on my Coursera dashboard earlier this week.

The first week was pretty simple.  I'll go ahead and do the homework anyway because my MatLab isn't that good.  I've learned some new things in the lectures, but he also covered material I learned in "Brains, Neurons, and Synapses".  I'm glad I'm able to correlate what I learned in that course with what I get from this one.  Unfortunately it looks like the dendrite model will be collapsed into a point neuron, even though some of the delay caused by the length of the axon may be handled.  I'm glad our model won't be a point neuron with a value representing firing rate, as one of the suggested books describes its model.  I bought both books several years ago but misplaced the "Theoretical Neuroscience" textbook when I moved into my current house a year and a half ago.  It's a bit pricey on Amazon so I'll go without it for now.  I bought a copy of the other book again and was working through it, lacking enthusiasm because of the model.  I'll hope for the best.

One of the major differences between point neurons and cable theory is that in cable theory integration happens in the dendrites.  The ion flows into and out of the branches sum to zero. The movement is generally toward the soma because the branches are bigger closer to the soma and the soma itself is a big collection pool.  I don't know, maybe there are action potentials at work all along the dendrite, not just at the axon hillock.  The rule is: suspend disbelief.
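
Just to have something concrete in mind, here's about the simplest point-neuron model there is, a leaky integrate-and-fire cell.  The parameter values are arbitrary illustrations, not the course's model:

dt, T = 0.1, 200.0                    # time step and duration in ms
tau_m, R = 10.0, 10.0                 # membrane time constant (ms) and resistance
v_rest, v_thresh, v_reset = -65.0, -50.0, -65.0
I = 2.0                               # constant injected current

v = v_rest
spike_times = []
for step in range(int(T / dt)):
    v += (-(v - v_rest) + R * I) * dt / tau_m   # leaky integration toward rest
    if v >= v_thresh:                           # threshold crossing: spike and reset
        spike_times.append(step * dt)
        v = v_reset
print('%d spikes in %g ms' % (len(spike_times), T))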

It's hard to think about the Hinton classifier, a Restricted Boltzmann Machine, while learning about more natural models of the brain.  I hope I can handle it.  There are places where functionality takes precedence over accuracy.

Tuesday, January 7, 2014

Trying to learn Hinton's Restricted Boltzmann Machines

OK, years ago I watched the YouTube video "The Next Generation of Neural Networks".  It's just under an hour and it piqued my interest.  I tried implementing the code he shared on his website using Octave.  I couldn't get it to work, so I ended up dropping it for the time being.  Now that home machines are faster and can handle more data I'm inching towards another attempt at learning this technology.

Hinton's approach is to learn using a Restricted Boltzmann Machine.  A quick search popped up Richard Weiss's Machine Learning and Programming blog and the article Restricted Boltzmann Machines in Python in particular.  The follow-up article pointed to GitHub.  All of this looked promising so I downloaded it and ran it.  It printed several numbers then quit.  It wasn't quite done.

Richard Weiss took nearly a year off from his blog, I know the feeling, then last month posted one new article.  I like the clean presentation of his blog.  I'm not sure how to make Blogger do some of this stuff; when I copy and paste the HTML I don't get the same results.  I don't know, Blogger must be doing or not doing something WordPress does or doesn't.  I'm not ready to move to a better blog platform.

Anyway, the code may be useful.  I'll see what I can adapt from it and then report.  I'm not sure I want to create a project on either GitHub or SourceForge unless I can present something pretty interesting, especially when there are people out there who've produced and shared some pretty good code.

Richard's code attempts to demonstrate the ability of a Restricted Boltzmann Machine, RBM, to group texts about matrices and movies into distinct two-dimensional vectors as indicated in Hinton's video.  The routine go() tries to plot magnitude_mat but it's just a 7 by 4 array of zeros.  Something got lost in the translation.  Maybe his training data was too small.
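
To keep my own bearings, here's what the core learning step actually is.  A minimal sketch of one-step contrastive divergence (CD-1) for a binary RBM in numpy, with random toy data standing in for the word-count vectors, so it demonstrates the update rule rather than reproducing Richard's experiment:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.RandomState(0)
n_visible, n_hidden, lr = 6, 2, 0.1
W = 0.1 * rng.randn(n_visible, n_hidden)               # weights
b_v = np.zeros(n_visible)                              # visible biases
b_h = np.zeros(n_hidden)                               # hidden biases
data = (rng.rand(10, n_visible) < 0.5).astype(float)   # toy binary training vectors

for epoch in range(100):
    # positive phase: hidden activations driven by the data
    h_prob = sigmoid(data.dot(W) + b_h)
    h_state = (rng.rand(*h_prob.shape) < h_prob).astype(float)
    # negative phase: reconstruct the visibles once, then the hiddens again
    v_prob = sigmoid(h_state.dot(W.T) + b_v)
    h_neg = sigmoid(v_prob.dot(W) + b_h)
    # CD-1 updates: data-driven statistics minus reconstruction-driven statistics
    W += lr * (data.T.dot(h_prob) - v_prob.T.dot(h_neg)) / len(data)
    b_v += lr * (data - v_prob).mean(axis=0)
    b_h += lr * (h_prob - h_neg).mean(axis=0)

hidden_codes = sigmoid(data.dot(W) + b_h)   # the two-dimensional code for each vector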

Saturday, January 4, 2014

Still working on UniProt protein file parser

OK, I've decided to store the information extracted from the individual SwissProt and TrEMBL knowledge base files in a nested dictionary structure like this sample:


KB={'P31994': {
  'ID': {
    'ProtID':'FCG2B',
    'SpecID':'HUMAN',
    'Status':'Reviewed',
    'Length':310},
  'AC': ['P31994', 'A6H8N3', 'O95649', 'Q53X85', 'Q5VXA9', 'Q8NIA1'],
  'DT': {
    'Integrated': {
      'Date':'01-JUL-1993',
      'KB':'UniProtKB/Swiss-Prot'},
    'Sequenced': {
      'Date':'30-MAY-2000',
      'Version':2},
    'Entry': {
      'Date':'11-DEC-2013',
      'Version':158}},
...}

When a section can have multiple entries it will be in a list, like AC. This means one would access a particular item or set of items like this:

KB['P31994']['AC'][0]
KB['P31994']['DE']['AltName'][1]['Short']

I can provide the structure of the document and descriptions of the various fields in a similar way.  Much of this is documentation with IDs into other systems.  Things change over time, so some of the documentation is out of date or otherwise unavailable.  A relatively new line type, RX, has some issues.  I tracked down how to use the document IDs:


import webbrowser

#The UniProt referenced database names used by RX records
#append with publication ID
UniProtPubDB={'MEDLINE':'', # you need a cross reference to pubmed id.
              'PubMed':'http://www.ncbi.nlm.nih.gov/pubmed/',
              'DOI':'http://dx.doi.org/',
              'AGRICOLA':'' # requires login
              }

webbrowser.open_new(UniProtPubDB['PubMed']+'2531080')

The MEDLINE UI search has been deprecated, so one would need a cross reference. AGRICOLA requires a login ID. I've only looked at four proteins; none have either MEDLINE or AGRICOLA references. The sample is too small, but MEDLINE UI isn't of much use.

So much for today.

Thursday, January 2, 2014

Back to Rosalind, Python, and Biopython

My energies are quite unfocused. Work seems never-ending.  When I come home I just want to veg out in front of the computer or TV.  I feel the desire to do more, but when it actually comes down to doing more my desire dissipates.  Yesterday evening I "solved" the first Rosalind problem I've done in a while, "Introduction to Protein Databases", or DBPR for short.  I had started forgetting the Python language.

DBPR states the problem to be solved this way:
Given: The UniProt ID of a protein.
Return: A list of biological processes in which the protein is involved (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).

Programming Shortcut

ExPASy databases can be accessed automatically via Biopython’s Bio.ExPASy module. The function .get_sprot_raw will find a target protein by its ID.
We can obtain data from an entry by using the SwissProt module. The read() function will handle one SwissProt record and parse will allow you to read multiple records at a time.
Let's get the data for the B5ZC00 protein:

>>>from Bio import ExPASy
>>>from Bio import SwissProt
>>>handle = ExPASy.get_sprot_raw('B5ZC00') #you can give several IDs separated by commas
>>>record = SwissProt.read(handle) # use SwissProt.parse for multiple proteins
We now can check the list of attributes for the obtained Swiss-Prot record:

 >>> dir(record)
 [..., 'accessions', 'annotation_update', 'comments', 'created', 'cross_references',
 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'host_taxonomy_id', 'keywords',
 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence',
 'sequence_length', 'sequence_update', 'taxonomy_id']
To see the list of references to other databases, we can check the .cross_references attribute of our record:

>>>record.cross_references[0]
('EMBL', 'CP001184', 'ACI60310.1', '-', 'Genomic_DNA')
OK, I looked at Biopython.  It's quite impressive, but I didn't use it; I wanted to do my own thing.  After solving the problem I looked at other people's published solutions.  Different people have different focus and preferences.  Most of the solutions were very specific to the problem at hand.  Mine wasn't much better.  So my unfocused energy went about solving yet another already-solved problem.
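
For what it's worth, getting the DBPR answer out of Biopython is only a couple of lines on top of the snippet above.  This sketch assumes GO cross references come back as tuples like ('GO', 'GO:0006412', 'P:translation', ...), where the 'P:' entries are the biological processes:

from Bio import ExPASy
from Bio import SwissProt

handle = ExPASy.get_sprot_raw('B5ZC00')
record = SwissProt.read(handle)
# keep only the GO cross references whose term is a biological process ('P:')
processes = [ref[2][2:] for ref in record.cross_references
             if ref[0] == 'GO' and ref[2].startswith('P:')]
print('\n'.join(processes))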

From the ExPASy documentation I started building my own Swiss-Prot parser.  I'm working on the Description lines now.  I'm thinking I'll go back and embed dictionaries for some of this so access would be something like:


B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['entry']['date']
B5ZC00['recommended_name']

That might be too wordy so maybe I should use codes and a dictionary to expand them.


SwissProtDict={'ID':'Identification','ID_P':'Protein ID','AC':'Accession'}
B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['DT_E']['DT']
B5ZC00['DE_RN']['FULL']


Here are the first three line types:

def parseSwissProt(protID) :
    sp=dict()
    sp['accession']=[]
    sp['sequence']=''
    f=open(protID+'.txt','r')
    for line in f:
        lt=line[0:2]
        ld=line[2:-1].lstrip()
        if lt=='ID': # Identification
            l=ld.split()
            [sp['proteinID'],x,sp['speciesID']]=l[0].partition('_')
            sp['status']=l[1][:-1]
            sp['length']=int(l[2])
        elif lt=='AC': # Accession
            sp['accession'].extend([x[:-1] for x in ld.split()])
        elif lt=='DT': # Date
            [dt,x,txt]=ld[:-1].partition(', ')
            dtt=txt.split()
            if dtt[0]=='integrated':
                sp['integrated_date']=dt
                sp['integrated_into']=dtt[2].split('/')[1]
            elif dtt[0]=='sequence':
                sp['sequenced_date']=dt
                sp['sequence_version']=int(dtt[2])
            elif dtt[0]=='entry':
                sp['entry_date']=dt
                sp['entry_version']=int(dtt[2])
    return sp
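
Assuming the raw Swiss-Prot entry has been saved next to the script as B5ZC00.txt, using it looks like this:

sp = parseSwissProt('B5ZC00')    # reads the local file B5ZC00.txt
print('%s %s %d' % (sp['proteinID'], sp['speciesID'], sp['length']))
print(', '.join(sp['accession']))
print('%s version %d' % (sp['entry_date'], sp['entry_version']))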