Saturday, August 3, 2013

On retrieving protein sequences from online databases.

Getting a FASTA file from the UniProt database is quite simple.  I'll need to review the user agreement before I go very far, but here's a simple Python 2.7.5 routine to get a set of proteins from UniProt based upon their accession IDs:

import urllib2

def getFromUniprot(filename,accessionList):
    """Get a set of protein sequences from the UniProt database"""
    FastaData=''
    for accession in accessionList:
        x=urllib2.urlopen('http://www.uniprot.org/uniprot/'+accession+'.fasta')
        FastaData+=x.read()
        x.close()
    f=open(filename,'w')
    f.write(FastaData)
    f.close()

getFromUniprot('mprt.fasta',['A2Z669','B5ZC00',
        'P07204_TRBM_HUMAN','P20840_SAG1_YEAST'])

While this could be part of a solution to Rosalind's Finding a Protein Motif problem, there's a bit more to it than that.  I tried something similar with the NCBI database using a URL like this:
http://www.ncbi.nlm.nih.gov/protein/238064857?report=fasta&format=text but it doesn't work.  NCBI expects you to use their Entrez Utilities.  BioPython has its own interface on top of Entrez.
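As a rough sketch of what going through the Entrez Utilities might look like, the snippet below builds an efetch request with the standard library only. The function names buildEfetchUrl and getFromNcbi are made up for illustration. The endpoint and the db, id, rettype and retmode parameters are the ones NCBI documents for efetch, but I'm assuming the number at the end of the URL above is still accepted as an id, and I haven't checked NCBI's usage guidelines. The import fallback is only there so the same sketch runs on Python 3 as well as 2.7.

```python
try:
    from urllib.parse import urlencode   # Python 3
    from urllib.request import urlopen
except ImportError:
    from urllib import urlencode         # Python 2.7
    from urllib2 import urlopen

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

def buildEfetchUrl(recordId):
    """Build an efetch URL asking for one protein record as FASTA text."""
    # A list of pairs (rather than a dict) keeps the parameter order fixed.
    params = [('db', 'protein'), ('id', str(recordId)),
              ('rettype', 'fasta'), ('retmode', 'text')]
    return EFETCH_URL + '?' + urlencode(params)

def getFromNcbi(recordId):
    """Fetch the FASTA text for one record (hypothetical helper, needs network)."""
    response = urlopen(buildEfetchUrl(recordId))
    try:
        return response.read()
    finally:
        response.close()

# Only the URL is built here; nothing is downloaded until getFromNcbi is called.
print(buildEfetchUrl(238064857))
```

Keeping the URL construction separate from the download makes it easy to eyeball the request before sending anything to NCBI.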
The two websites may have slightly different protein sequences.  The header to the sequence for B5ZC00 looks like this on each site (I've broken the lines so they are easier to read):

UniProt:
>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase OS=Ureaplasma urealyticum serovar 10 
   (strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1
NCBI:  
>gi|238064857|sp|B5ZC00.1|SYG_UREU1 RecName: Full=Glycine--tRNA ligase;
    AltName: Full=Glycyl-tRNA synthetase; Short=GlyRS
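
Headers like these are pipe-delimited, so a quick way to compare them is to split on the '|' character. This is only a sketch: splitHeader is a name I made up, and it assumes the description text never contains a pipe of its own, which holds for the two headers above but may not hold in general.

```python
def splitHeader(header):
    """Split a FASTA header line into its pipe-delimited fields."""
    # Drop the leading '>' and split on '|'.  The last field holds the
    # entry name followed by the free-text description.
    return header.lstrip('>').strip().split('|')

# The two headers shown above, rejoined onto single lines.
uniprotHeader = ('>sp|B5ZC00|SYG_UREU1 Glycine--tRNA ligase '
                 'OS=Ureaplasma urealyticum serovar 10 '
                 '(strain ATCC 33699 / Western) GN=glyQS PE=3 SV=1')
ncbiHeader = ('>gi|238064857|sp|B5ZC00.1|SYG_UREU1 RecName: '
              'Full=Glycine--tRNA ligase; '
              'AltName: Full=Glycyl-tRNA synthetase; Short=GlyRS')

for header in (uniprotHeader, ncbiHeader):
    fields = splitHeader(header)
    # Show the database prefix and the field that follows it.
    print(fields[0] + ' ' + fields[1])
```

For the UniProt header the second field is the accession, while for the NCBI header it is the number after gi and the accession only turns up in the fourth field.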

The sp prefix indicates SWISS-PROT.  The gi prefix indicates GenInfo Integrated Database.  Here are the prefixes handled by NCBI.  The NCBI header seems to indicate a SWISS-PROT variant of the protein mentioned above.  Interestingly, http://www.uniprot.org/uniprot/B5ZC00.1.fasta returns a different header with the prefix "UniProtKB/TrEMBL".  There's so much to learn and more of it every day.  I guess I could put all three into a single file, then run a Hamming count on them to see if they differ.  Unfortunately, I'm using the second field as the name when I build my Python structures, so I'd have to fudge one of the ones retrieved from the UniProt DB so that they would have different names/accession IDs.
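
One way around the name clash might be to keep the records in a list of (header, sequence) pairs instead of keying them by the second field, and then run the Hamming count over every pair. The sketch below is an assumption about how that could look, not what my existing code does. parseFasta and hammingCount are hypothetical names, and the two records are made-up toy data, not real UniProt or NCBI sequences.

```python
def parseFasta(text):
    """Return a list of (header, sequence) pairs from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # Start a new record; a list keeps duplicate names apart.
            records.append([line[1:], ''])
        elif records:
            # Sequence lines are appended to the most recent record.
            records[-1][1] += line
    return [(header, sequence) for header, sequence in records]

def hammingCount(a, b):
    """Count mismatched positions, or return None if the lengths differ."""
    if len(a) != len(b):
        return None
    return sum(1 for x, y in zip(a, b) if x != y)

# Toy data for illustration only: both records share the same second field.
toyFasta = """>sp|X00001|TOY_A made-up record
MKTAYIAK
QR
>sp|X00001|TOY_B made-up record with the same second field
MKTAYIAR
QR
"""

records = parseFasta(toyFasta)
for i in range(len(records)):
    for j in range(i + 1, len(records)):
        count = hammingCount(records[i][1], records[j][1])
        print(records[i][0] + ' vs ' + records[j][0] + ': ' + str(count))
```

Returning None for sequences of different lengths is a deliberate choice here, since a Hamming count is only defined for strings of equal length and a length difference is itself a sign that the records differ.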
