Saturday, January 4, 2014

Still working on UniProt protein file parser

OK, I've decided to store the information extracted from the individual Swiss-Prot and TrEMBL knowledge base files in a nested tree structure (dictionaries within dictionaries), as in this sample:


KB={'P31994': {
  'ID': {
    'ProtID':'FCG2B',
    'SpecID':'HUMAN',
    'Status':'Reviewed',
    'Length':310},
  'AC': ['P31994', 'A6H8N3', 'O95649', 'Q53X85', 'Q5VXA9', 'Q8NIA1'],
  'DT': {
    'Integrated': {
      'Date':'01-JUL-1993',
      'KB':'UniProtKB/Swiss-Prot'},
    'Sequenced': {
      'Date':'30-MAY-2000',
      'Version':2},
    'Entry': {
      'Date':'11-DEC-2013',
      'Version':158}},
...}

When a section can have multiple entries it will be in a list, like AC. This means one would access a particular item or set of items like this:

KB['P31994']['AC'][0]
KB['P31994']['DE']['AltName'][1]['Short']
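
As a rough sketch of how two of these sections might be filled in, the snippet below parses an ID line and an AC line into the layout shown above. The function names parse_id_line and parse_ac_lines are hypothetical, the input lines are reconstructed from the values in the sample structure, and the code assumes the flat-file layout of a two-letter line code followed by three spaces. It is not the finished parser.

```python
def parse_id_line(line):
    """Parse an ID line into the 'ID' dictionary (hypothetical helper)."""
    # Assumption: the first five characters are the line code plus padding.
    fields = line[5:].split()
    prot_id, spec_id = fields[0].split('_', 1)
    return {'ProtID': prot_id,
            'SpecID': spec_id,
            'Status': fields[1].rstrip(';'),
            'Length': int(fields[2])}


def parse_ac_lines(lines):
    """Collect accession numbers from one or more AC lines into a list."""
    accessions = []
    for line in lines:
        for item in line[5:].split(';'):
            item = item.strip()
            if item:
                accessions.append(item)
    return accessions


# Sample lines reconstructed from the values in the structure above.
sample_id = 'ID   FCG2B_HUMAN             Reviewed;         310 AA.'
sample_ac = ['AC   P31994; A6H8N3; O95649; Q53X85; Q5VXA9; Q8NIA1;']

accessions = parse_ac_lines(sample_ac)
# Assumption: the first accession number is used as the top-level key.
KB = {accessions[0]: {'ID': parse_id_line(sample_id), 'AC': accessions}}

print(KB['P31994']['ID']['Length'])
print(KB['P31994']['AC'][0])
```

Keeping AC as a plain list means secondary accession numbers keep their file order, so index 0 is always the first one listed.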

I can provide the structure of the document and a description of the various fields in a similar way. Much of this is documentation with IDs into other systems. Things change over time, so some of that documentation is out of date or otherwise unavailable. A relatively new line type, RX, has some issues. I tracked down how to turn the document ID in an RX record into a link:


import webbrowser

# The database names referenced by UniProt RX records, mapped to URL prefixes.
# Append the publication ID to the prefix to get a link.
UniProtPubDB={'MEDLINE':'', # no direct URL; needs a cross-reference to a PubMed ID
              'PubMed':'http://www.ncbi.nlm.nih.gov/pubmed/',
              'DOI':'http://dx.doi.org/',
              'AGRICOLA':'' # requires login
              }

webbrowser.open_new(UniProtPubDB['PubMed']+'2531080')

The MEDLINE UI search has been deprecated, so one would need a cross-reference to a PubMed ID. AGRICOLA requires a login ID. I've only looked at four proteins, and none has either MEDLINE or AGRICOLA references. The sample is too small to draw conclusions, but the MEDLINE UI isn't of much use.
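
Here is a minimal sketch of how a whole RX line could be turned into links, skipping the databases that have no usable URL prefix. The helper name rx_to_urls and the table name PUB_URL_PREFIX are hypothetical, the prefix table is repeated so the snippet runs on its own, and the code assumes RX items look like "Name=ID;" separated by semicolons. The MEDLINE number in the sample line is a placeholder, not a real identifier.

```python
# URL prefixes per database; an empty string means "no direct link available".
PUB_URL_PREFIX = {'MEDLINE': '',
                  'PubMed': 'http://www.ncbi.nlm.nih.gov/pubmed/',
                  'DOI': 'http://dx.doi.org/',
                  'AGRICOLA': ''}


def rx_to_urls(rx_line):
    """Return {database: url} for every RX item that has a URL prefix."""
    urls = {}
    # Assumption: the first five characters are the line code plus padding.
    for item in rx_line[5:].split(';'):
        db, sep, pub_id = item.strip().partition('=')
        prefix = PUB_URL_PREFIX.get(db, '')
        if sep and pub_id and prefix:
            urls[db] = prefix + pub_id
    return urls


# The MEDLINE value below is a made-up placeholder for illustration.
sample_rx = 'RX   MEDLINE=00000000; PubMed=2531080;'
print(rx_to_urls(sample_rx))
```

Any of the returned URLs can then be passed to webbrowser.open_new as above.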

So much for today.
