Sunday, December 29, 2013

Genealogy and Information Overload

My sister became interested in genealogy in the early 1960s.  Our paternal grandmother was up in years and yet had kept in touch with many of her relatives.  My sister had the foresight to record much of this.  She approached our maternal grandparents and asked them for as much as they remembered and wrote this down as well.

For a long time she kept her records in boxes.  She kept doing research and adding more information.  From time to time I would help her at the libraries around Seattle.

She was an early adopter of Personal Ancestral File and has moved through several programs.  Each program and version comes with its own set of problems.  My sister's problem has been how to store the information she has collected and document her sources.  The GEDCOM file format let her share information with the contacts she has made over the years.

My interests lie more with integrating massive stores of data and making sense of it.

There are many genealogy books all over the world.  Unfortunately much of the information conflicts, and lots of errors have been introduced.

I tried my hand at integrating genealogy with a couple of early versions of World Family Tree.  The first edition I bought had maybe three disks of family trees people had submitted.  A later edition had, I think, fourteen disks of family trees.  Many people didn't know birth dates.  World Family Tree took to adding "WFT Est" followed by a date range when no date was given.  I spent some time trying to make sense of all of this.  The noise was too much.  When that computer died, so did my interest, at least to the extent of working on it.

Earlier this year, 2013, I decided to try again.  I chose Ancestry.com.  While it still has the familiar Ancestry Family Trees one can use to acquire information, it also has US Census records, state and county marriage and death indexes, and much more.  I started by entering my name, my parents' names, and my grandparents' names.  From that I could find Family Trees, so I started collecting as much as I could.  Again I encountered inconsistencies, so I backed off and started adding documentation as I went, well, at least sometimes.

It seems people make mistakes and intentionally introduce errors.  I understand mistakes.  We are humans and humans make mistakes.  Intentional errors are another matter.  People have changed census records so people's names appear as curse words.  Ancestry Family Trees include fake ancestors.  (Unfortunately I've copied some of that into my tree.)  Religion has pushed people to document several generations and bind parents together even if it's not the reality of the matter.

Dealing with this raises my stress level.  The work is daunting and I've only collected about 6,200 names into my tree.  My sister snoops around the web.  After finding errors in my tree she shared a backup of her tree with me.  It has over 100,000 names in it.  She has much more on hardcopy she has yet to enter.  Everything I enter comes from the web but in some cases I have selected between alternates based upon her file.  I've left some of the stuff that doesn't appear in her file even though some of it may be fake and I may have mistakenly merged unrelated individuals.

It's too much.  I have to back off for days and weeks after encountering certain issues in the records.  I want to be as accurate as I can and as inclusive as I can.  I have other interests and this could easily suck up all my free time.

Tuesday, December 24, 2013

On translating MatLab syntax to Python syntax

OK, I used to own a copy of MatLab.  It's quite expensive.  I didn't use it very much.  When my computer died, that was that.  I've used Octave before and for my purposes it's good enough.  I like having an IDE, so Domain Math IDE works for me as well.  See my prior articles for more details.

As far as I'm concerned Python is a much better language.  It's more of a full blown programming language.  Unfortunately one has to use some extensions to get many of the features built into MatLab or Octave.  Fortunately SciPy has done a good job.  Rather than rehash stuff provided elsewhere and more comprehensively, I will point you to the NumPy package in general and the NumPy for MatLab Users page in particular.  The NumPy syntax is a bit more clunky in a few places, but that's more than made up for by the speed and scope it offers over Octave.  I'll be switching back and forth between the two as I write code for simulated neural networks.

Enough for now.  Review NumPy.  It extends what are otherwise scalar functions to apply to arrays and matrices.  I'll download the CUDA library for Python and test it out this week.
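To give a quick taste of that elementwise behavior, here is a minimal sketch (the values are just made-up illustrations, not anything from the book):

```python
import numpy as np

# NumPy "ufuncs" apply elementwise to whole arrays,
# much like MatLab/Octave built-ins do on matrices.
x = np.array([0.0, 1.0, 2.0])
print(np.exp(x))        # elementwise exponential
print(np.sqrt(x + 1))   # whole expressions broadcast elementwise too

# One gotcha when translating: * is elementwise on NumPy arrays,
# so matrix multiplication uses np.dot instead.
W = np.array([[0.0, 1.0],
              [1.0, 0.0]])
v = np.array([2.0, 3.0])
print(np.dot(W, v))     # -> [3. 2.]
```

The `np.dot` point matters most when moving MatLab matrix code over, since MatLab's `*` is a matrix multiply.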

Friday, December 20, 2013

Getting the basics of simple Artificial Neural Networks

In my last article I mentioned I've started playing with Artificial Neural Networks.  I bought the book "Tutorial on Neural Systems Modeling" by Thomas J. Anastasio.  I've been a bit slow about reading it and applying the lessons learned.  I tend to have trouble with mathematics expressed in Greek letters.  Likewise, I nod off when dealing with calculus.  That being said, I can begin to lay out the framework for the basic computational model he presents.

Anastasio goes on about how floating point numbers represent firing rate in the early models he presents.  I don't really care about all that.  As far as I'm concerned the models work equally as well no matter what we believe they represent.

So here are the basics:  Suppose you want to keep a trace of the activities of your Artificial Neural Network as it changes over time.  One could use a matrix where every row represents one neuron's value as it changes over time and every column represents a particular time.  One can model a fully connected network by using a square matrix where each cell tells one how strong the connection is between a neuron at time t and either itself or a different neuron at time t+1.

V1, V2, V3, and V4 represent a vertical slice through the trace of the neurons.  (t) represents the values at time t and (t+1) represents the values at the next time slice.  W represents the weights, the multipliers between the neurons.  W11 is the multiplier connecting V1 to itself.  W12 is the multiplier connecting V2 to V1.  Etc.

V1(t+1)=W11*V1(t)+W12*V2(t)+W13*V3(t)+W14*V4(t).
V2(t+1)=W21*V1(t)+W22*V2(t)+W23*V3(t)+W24*V4(t).
V3(t+1)=W31*V1(t)+W32*V2(t)+W33*V3(t)+W34*V4(t).
V4(t+1)=W41*V1(t)+W42*V2(t)+W43*V3(t)+W44*V4(t).

The trace of the neurons can be expressed as the matrix V, where the number after the V is converted to an index into an array of simulated neurons.  V(1,t) would contain the value of neuron 1 at time t.  That is V1(t).  The weights can also be expressed as the matrix W.  W(1,2) would be the matrix version of W12.  The set of calculations mentioned above is written in shorthand as V(:,t+1)=W*V(:,t) in MatLab.

MatLab starts indices at 1 where Python starts them at 0.  For now I'll start at 1.

An Example using something like MatLab syntax inside English text.

Suppose all of W is 0 except W(2,1)=1, W(3,2)=1, and W(4,3)=1.  Suppose all of V is 0 except V(1,1)=1.  By iterating t from 1 to 4 and setting V(:,t+1) = W*V(:,t), one finds the following cells have been set to 1: V(2,2), V(3,3), V(4,4).  All else remains zero.  This "transmission line" behavior will be useful later on.
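The same example translates almost directly into NumPy, with the indices shifted down by one for Python's 0-based arrays.  A minimal sketch:

```python
import numpy as np

T = 5          # number of time steps kept in the trace
n = 4          # number of simulated neurons

# Weight matrix: W[i, j] connects neuron j at time t to neuron i at t+1.
# MatLab's W(2,1)=1, W(3,2)=1, W(4,3)=1 become, 0-based:
W = np.zeros((n, n))
W[1, 0] = 1.0
W[2, 1] = 1.0
W[3, 2] = 1.0

# Trace matrix: rows are neurons, columns are time steps.
V = np.zeros((n, T))
V[0, 0] = 1.0          # MatLab's V(1,1) = 1

for t in range(T - 1):
    V[:, t + 1] = np.dot(W, V[:, t])   # MatLab's V(:,t+1) = W*V(:,t)

print(V)
# The single 1 marches down the diagonal: V[1,1], V[2,2], V[3,3].
```

Running it shows the "transmission line" in action: the initial 1 steps through one neuron per time slice and then falls off the end.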

Input units are clamped to their source, so if V(1,t) was an input unit's value at time t then we wouldn't want V(1,t+1) calculated from the weight matrix.  We could keep a separate matrix for the weights from the input cells.  That would probably be more efficient.  For my purposes here I'll leave the weight matrix alone and change the matrix-vector multiply.  If V(1,:) represents an input cell then rather than calculating V(:,t+1)=W*V(:,t) the calculation would be V(2:end,t+1)=W(2:end,:)*V(:,t).  If V(1,:) represents an input cell and V(4,:) represents an output cell then the calculation would be V(2:end,t+1)=W(2:end,1:end-1)*V(1:end-1,t).
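The slice juggling carries over to NumPy's 0-based slices.  Here is a minimal sketch reusing the transmission-line weights, with neuron 0 clamped as an input (the clamp value of 1 at every step is just an illustration):

```python
import numpy as np

T = 5
n = 4
W = np.zeros((n, n))
W[1, 0] = 1.0
W[2, 1] = 1.0
W[3, 2] = 1.0

V = np.zeros((n, T))
V[0, :] = 1.0     # neuron 0 is an input unit, clamped to 1 at every step

# Neuron 0 is input, neuron n-1 is output.  MatLab's
#   V(2:end,t+1) = W(2:end,1:end-1) * V(1:end-1,t)
# becomes, with 0-based Python slices:
for t in range(T - 1):
    V[1:, t + 1] = np.dot(W[1:, :-1], V[:-1, t])

print(V)
# Row 0 stays clamped at 1; the signal ripples through rows 1 to 3.
```

Skipping row 0 in the assignment is what keeps the clamped input from being overwritten by the update.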

So there you have it.  This is the basic layout for a simple artificial neural net.  The weights can be any numbers you want, including imaginary numbers.  If you have many input or output neurons then you probably want to split the weight matrix into two matrices: one for the input neurons and another for the internal neurons.  Adjust the formulas as needed.

You may wonder why this describes an artificial neural net.  I'll leave that until next time.

Tuesday, December 17, 2013

Haven't posted anything in a while. Trying to play with Artificial Neural Nets.

OK, I mostly skim textbooks.  This isn't always the best approach.  I kept looking for the program code in the book I'm reading.  It appears that after chapter 2 the reader is supposed to create the program and write the routines.  I'll have to do that now to prove to myself I know how to build a Hopfield Network and train it using a single pass through all the patterns.  I'll present the code and a sample walk-through when I do.  I scanned the material years ago but my machine wasn't up to the challenge of running the code presented.  Times have changed.  Maybe I can finally get something running and be able to test the limits of current home technology.  Learning the CUDA routines for Python should help exploit NVIDIA video cards for other purposes.  I can get a GeForce GTX 690 for about $1000.  Prices should drop.
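In the meantime, here is my own rough sketch of the standard one-pass training idea, the Hebbian outer-product rule, using toy patterns I made up.  This is not the book's code, just the general technique:

```python
import numpy as np

# Two toy patterns of +1/-1 values to store.  (Made-up example data.)
patterns = np.array([[ 1, -1,  1, -1],
                     [ 1,  1, -1, -1]], dtype=float)

n = patterns.shape[1]

# One-pass Hebbian training: sum the outer product of each pattern
# with itself, then zero the diagonal (no self-connections).
W = np.zeros((n, n))
for p in patterns:
    W += np.outer(p, p)
np.fill_diagonal(W, 0)

def recall(state, sweeps=5):
    """Asynchronously update neurons until the state settles."""
    state = np.array(state, dtype=float)
    for _ in range(sweeps):
        for i in range(n):
            h = np.dot(W[i], state)
            if h != 0:
                state[i] = np.sign(h)
    return state

# Flip one bit of the first stored pattern and watch it get repaired.
noisy = [-1, -1, 1, -1]
print(recall(noisy))   # recovers the first stored pattern
```

The single pass through `patterns` is the whole of training; recall is just repeated thresholded updates against the learned weights.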