DBPR states the problem to be solved this way:
Given: The UniProt ID of a protein.
Return: A list of biological processes in which the protein is involved (biological processes are found in a subsection of the protein's "Gene Ontology" (GO) section).
OK, I looked at Biopython. It's quite impressive. I didn't use it. I wanted to do my own thing. After solving the problem I looked at other people's published solutions. Different people have different focus and preferences. Most of the solutions were very specific to the problem at hand. Mine wasn't much better. So my unfocused energy went about solving yet another already solved problem.Programming Shortcut
ExPASy databases can be accessed automatically via Biopython’sBio.ExPASy
module. The function.get_sprot_raw
will find a target protein by its ID.
We can obtain data from an entry by using theSwissProt
module. Theread()
function will handle one SwissProt record andparse
will allow you to read multiple records at a time.
Let's get the data for the B5ZC00 protein:
>>>from Bio import ExPASy >>>from Bio import SwissProt >>>handle = ExPASy.get_sprot_raw('B5ZC00') #you can give several IDs separated by commas >>>record = SwissProt.read(handle) # use SwissProt.parse for multiple proteins
We now can check the list of attributes for the obtained Swiss-Prot record:
>>> dir(record) [..., 'accessions', 'annotation_update', 'comments', 'created', 'cross_references', 'data_class', 'description', 'entry_name', 'features', 'gene_name', 'host_organism', 'host_taxonomy_id', 'keywords', 'molecule_type', 'organelle', 'organism', 'organism_classification', 'references', 'seqinfo', 'sequence', 'sequence_length', 'sequence_update', 'taxonomy_id']
To see the list of references to other databases, we can check the
.cross_references
attribute of our record:
>>>record.cross_references[0] ('EMBL', 'CP001184', 'ACI60310.1', '-', 'Genomic_DNA')
From the ExPASy documentation I started building my own Swiss-Prot parser. I'm working on the Description lines now. I'm thinking I'll go back and embed dictionaries for some of this so access would be something like
B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['entry']['date']
B5ZC00['recommended_name']
That might be too wordy so maybe I should use codes and a dictionary to expand them.
SwissProtDict={'ID':'Identification','ID_P':'Protein ID','AC':'Accession'}
B5ZC00=parseSwissProt('B5ZC00')
B5ZC00['DT_E']['DT']
B5ZC00['DE_RN']['FULL']
Here's the first three line types:
def parseSwissProt(protID) :
sp=dict()
sp['accession']=[]
sp['sequence']=''
f=open(protID+'.txt','r')
for line in f:
lt=line[0:2]
ld=line[2:-1].lstrip()
if lt=='ID': # Identification
l=ld.split()
[sp['proteinID'],x,sp['speciesID']]=l[0].partition('_')
sp['status']=l[1][:-1]
sp['length']=int(l[2])
elif lt=='AC': # Accession
sp['accession'].extend([x[:-1] for x in ld.split()])
elif lt=='DT': # Date
[dt,x,txt]=ld[:-1].partition(', ')
dtt=txt.split()
if dtt[0]=='integrated':
sp['integrated_date']=dt
sp['integrated_into']=dtt[2].split('/')[1]
elif dtt[0]=='sequence':
sp['sequenced_date']=dt
sp['sequence_version']=int(dtt[2])
elif dtt[0]=='entry':
sp['entry_date']=dt
sp['entry_version']=int(dtt[2])
return sp
No comments:
Post a Comment