Thursday, January 2, 2014

Back to Rosalind, Python, and Biopython

My energies are quite unfocused. Work seems never ending.  When I come come I just want to veg out in front of the computer or TV.  I feel the desire to do more but when it actually comes down to doing more my desires dissipate.  Yesterday evening I "solved" the first Rosalind problem I've done in awhile, "Introduction to Protein Databases", or DBPR for short.  I had started forgetting the python language.

DBPR states the problem to be solved this way:
Given: The UniProt ID of a protein.
Return: A list of biological processes in which the protein is involved (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).

Programming Shortcut

ExPASy databases can be accessed automatically via Biopython’s Bio.ExPASy module. The function .get_sprot_raw will find a target protein by its ID.
We can obtain data from an entry by using the SwissProt module. The read() function will handle one SwissProt record and parse will allow you to read multiple records at a time.
Let's get the data for the B5ZC00 protein:

>>>from Bio import ExPASy
>>>from Bio import SwissProt
>>>handle = ExPASy.get_sprot_raw('B5ZC00') #you can give several IDs separated by commas
>>>record = SwissProt.read(handle) # use SwissProt.parse for multiple proteins
We now can check the list of attributes for the obtained Swiss-Prot record:

 >>> dir(record)
 [..., 'accessions', 'annotation_update', 'comments', 'created', 'cross_references',
 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'host_taxonomy_id', 'keywords',
 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence',
 'sequence_length', 'sequence_update', 'taxonomy_id']
To see the list of references to other databases, we can check the .cross_references attribute of our record:

>>>record.cross_references[0]
('EMBL', 'CP001184', 'ACI60310.1', '-', 'Genomic_DNA')
OK, I looked at Biopython.  It's quite impressive.  I didn't use it.  I wanted to do my own thing.  After solving the problem I looked at other people's published solutions.  Different people have different focus and preferences.  Most of the solutions were very specific to the problem at hand.  Mine wasn't much better.  So my unfocused energy went about solving yet another already solved problem.

From the ExPASy documentation I started building my own Swiss-Prot parser.  I'm working on the Description lines now.  I'm thinking I'll go back and embed dictionaries for some of this so access would be something like


B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['entry']['date']
B5ZC00['recommended_name']

That might be too wordy so maybe I should use codes and a dictionary to expand them.


SwissProtDict={'ID':'Identification','ID_P':'Protein ID','AC':'Accession'}
B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['DT_E']['DT']
B5ZC00['DE_RN']['FULL']


Here's the first three line types:

def parseSwissProt(protID) :
    sp=dict()
    sp['accession']=[]
    sp['sequence']=''
    f=open(protID+'.txt','r')
    for line in f:
        lt=line[0:2]
        ld=line[2:-1].lstrip()
        if lt=='ID': # Identification
            l=ld.split()
            [sp['proteinID'],x,sp['speciesID']]=l[0].partition('_')
            sp['status']=l[1][:-1]
            sp['length']=int(l[2])
        elif lt=='AC': # Accession
            sp['accession'].extend([x[:-1] for x in ld.split()])
        elif lt=='DT': # Date
            [dt,x,txt]=ld[:-1].partition(', ')
            dtt=txt.split()
            if dtt[0]=='integrated':
                sp['integrated_date']=dt
                sp['integrated_into']=dtt[2].split('/')[1]
            elif dtt[0]=='sequence':
                sp['sequenced_date']=dt
                sp['sequence_version']=int(dtt[2])
            elif dtt[0]=='entry':
                sp['entry_date']=dt
                sp['entry_version']=int(dtt[2])
    return sp

No comments:

Post a Comment