Saturday, January 25, 2014

Rosalind GenBank Introduction.

OK, the hint for Rosalind's GenBank Introduction says:

NCBI's databases, such as PubMed, GenBank, GEO, and many others, can be accessed via Entrez, a data retrieval system offered by NCBI. For direct access to Entrez, you can use Biopython’s Bio.Entrez module.

That's all well and good, but I want to know how to access NCBI's databases directly. I don't know why I missed this part:

Note that when you request Entrez databases you must obey NCBI's requirements:
  • For any series of more than 100 requests, access the database on the weekend or outside peak times in the US.
  • Make no more than three requests every second.
  • Fill in the Entrez.email field so that NCBI can contact you if there is a problem.
  • Be sensible with your usage levels; if you want to download whole mammalian genomes, use NCBI's FTP.
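If you roll your own client, the "no more than three requests every second" rule is the easiest one to enforce in code. Here's a minimal sketch of a throttle, assuming you call wait() before every request. The Throttle class and its parameter names are my own invention for illustration, not anything from NCBI or Biopython.

```python
import time


class Throttle:
    """Space calls at least 1/max_per_second seconds apart (hypothetical helper)."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock      # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None        # time of the previous call, None before the first one

    def wait(self):
        """Sleep just long enough to respect the minimum spacing, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Spacing requests a third of a second apart is a bit stricter than the rule requires, but it is simple and errs on the polite side.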

I used the sample query, built the URL by hand, and pasted it into my web browser. I got back this XML document:


<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
 <Count>6</Count>
 <RetMax>6</RetMax>
 <RetStart>0</RetStart> 
<IdList>
<Id>11994090</Id>
<Id>168598</Id>
<Id>30959082</Id>
<Id>11990232</Id>
<Id>12445</Id>
<Id>18035</Id>
</IdList> 
<TranslationSet><Translation> 
 <From>"Zea mays"[Organism]</From> 
 <To>"Zea mays"[Organism]</To> 
 </Translation></TranslationSet>
<TranslationStack>
<TermSet> 
 <Term>"Zea mays"[Organism]</Term> 
 <Field>Organism</Field> 
 <Count>408310</Count> 
 <Explode>Y</Explode> 
</TermSet> 
<TermSet> 
 <Term>rbcL[Gene]</Term> 
 <Field>Gene</Field> 
 <Count>111602</Count> 
 <Explode>N</Explode> 
</TermSet> 
<OP>AND</OP> 
</TranslationStack>
<QueryTranslation>"Zea mays"[Organism] AND rbcL[Gene]</QueryTranslation> 
</eSearchResult>
That's great. But I forgot to include my email and "tool" parameters, and I felt guilty. I sent NCBI an email to register my email address and "tool" name so I can roll my own code rather than use Biopython. If you care to see the Biopython code you can look here. Interestingly, the default tool name is biopython, though you can pass a different tool name if you want. I don't know how strict NCBI is about its requirements. I've suggested they add a warning section to the XML document, use it to communicate their displeasure when someone doesn't follow the rules, and give some hint as to how quickly they'll shut down the offending IP address. I haven't heard back.
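For anyone else rolling their own, here's a rough sketch of building the ESearch URL with the email and tool parameters included, plus pulling the top-level Count out of a response like the one above. The function names, the placeholder email and tool values, and the choice of db=nucleotide are my assumptions for illustration; substitute your own.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base address of NCBI's ESearch utility.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="nucleotide",
                      email="you@example.com", tool="my_rosalind_script"):
    """Return an ESearch URL; email and tool values here are placeholders."""
    params = {"db": db, "term": term, "email": email, "tool": tool}
    # urlencode percent-escapes quotes, brackets, and spaces in the term.
    return ESEARCH_URL + "?" + urlencode(params)


def parse_count(xml_text):
    """Return the top-level <Count> of an eSearchResult document as an int."""
    root = ET.fromstring(xml_text)
    # findtext("Count") only looks at direct children of the root, so the
    # per-term counts nested inside <TranslationStack> are ignored.
    return int(root.findtext("Count"))


url = build_esearch_url('"Zea mays"[Organism] AND rbcL[Gene]')
```

Pasting the resulting URL into a browser should give the same kind of XML document, and parse_count would return the number of matching records from it.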

The Rosalind problem talks about selecting a publishing date range.  Here's the definition of the date parameters:

Optional Parameters – Dates

datetype

Type of date used to limit a search. The allowed values vary between Entrez databases, but common values are 'mdat' (modification date), 'pdat' (publication date) and 'edat' (Entrez date). Generally an Entrez database will have only two allowed values for datetype.

reldate

When reldate is set to an integer n, the search returns only those items that have a date specified by datetype within the last n days.

mindate, maxdate

Date range used to limit a search result by the date specified by datetype. These two parameters (mindate, maxdate) must be used together to specify an arbitrary date range. The general date format is YYYY/MM/DD, and these variants are also allowed: YYYY, YYYY/MM.
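To make the date parameters concrete, here's a small sketch that adds a publication-date range to an ESearch query string. The helper name date_range_params, the example dates, and db=nucleotide are assumptions of mine; the parameter names datetype, mindate, and maxdate come from the definitions above.

```python
from urllib.parse import urlencode


def date_range_params(mindate, maxdate, datetype="pdat"):
    """Return the date parameters for a range; both ends are required together."""
    if not (mindate and maxdate):
        raise ValueError("mindate and maxdate must be given together")
    return {"datetype": datetype, "mindate": mindate, "maxdate": maxdate}


# Example query limited to records published in an (arbitrary) date range.
params = {"db": "nucleotide", "term": '"Zea mays"[Organism] AND rbcL[Gene]'}
params.update(date_range_params("2000/01/01", "2014/01/25"))
query_string = urlencode(params)
```

The query string can then be appended to the ESearch address after a "?" to form the full URL.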
Well, good luck working on this Rosalind problem. It's pretty simple. I don't know how hard-nosed they are about their rules.
 
