Jan. 25, 2013, 7:04 p.m. by Rosalind Team
Topics: Bioinformatics Tools, Sequence Analysis
The World's Biggest Genetic Database
The most comprehensive database available to molecular biologists is GenBank, an open-access resource containing an annotated collection of all publicly available DNA sequences and their protein translations. GenBank was established in 1982 and is maintained by NCBI; over the last three decades, the data it houses has grown exponentially, doubling roughly every 18 months. As of August 2012, GenBank contained over 143 billion nucleobases.
Every sequence has a unique GenBank identifier that will directly retrieve its full sequence record. Here are some examples of database IDs:
CAA79696 NP_778203 263191547 BC043443 NM_002020
You can also search by submitter/author name in the format last_name first_initials (e.g., Smith JR).
To search for an exact match, enclose the phrase in quotation marks:
"contactin associated protein" "duchenne muscular dystrophy"
You can also restrict your search by using Boolean operators (AND, OR, NOT) or by searching only specific subsets of records.
GenBank comprises several subdivisions:
Searching the Nucleotide database with general text queries will produce the most relevant results. You can also use a simple query based on protein name, gene name or gene symbol.
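The exact-match quoting and Boolean operators described above can also be assembled programmatically. The following is a minimal sketch; the helper names `exact` and `combine` are hypothetical and are not part of GenBank or any library. They only build the query string that you would paste into the search box.

```python
def exact(phrase):
    """Wrap a phrase in double quotes so that it is matched exactly."""
    return f'"{phrase}"'


def combine(operator, *terms):
    """Join terms with a Boolean operator (AND, OR, NOT) inside parentheses."""
    operator = operator.upper()
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {operator}")
    return "(" + f" {operator} ".join(terms) + ")"


# Hypothetical example: match either of the two exact phrases.
query = combine(
    "OR",
    exact("contactin associated protein"),
    exact("duchenne muscular dystrophy"),
)
print(query)  # ("contactin associated protein" OR "duchenne muscular dystrophy")
```

Keeping the quoting in one helper makes it harder to forget the quotation marks around a multi-word phrase.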
To limit your search to only certain kinds of records, you can search using GenBank's Limits page or alternatively use the Filter your results field to select categories of records after a search.
If you cannot find what you are searching for, check how the database interpreted your query by investigating the Search details field on the right side of the page. This field automatically translates your search into standard keywords.
For example, if you search for Drosophila, the Search details field will contain (Drosophila[All Fields]), and you will obtain all entries that mention Drosophila (including all its endosymbionts). You can restrict your search to only organisms belonging to the Drosophila genus by using a search tag and searching for Drosophila[Organism].
Given: A genus name, followed by two dates in YYYY/M/D format.
Return: The number of Nucleotide GenBank entries for the given genus that were published between the dates specified.
Sample Dataset
Anthoxanthum 2003/7/25 2005/12/27
Sample Output
7
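One possible way to approach this problem is sketched below, assuming Biopython is installed and its Bio.Entrez module is used to query the Nucleotide database. The function names `build_term` and `count_entries` are hypothetical, and the `[PDAT]` (publication date) search tag with the `start:end` range syntax is an assumption that does not appear in the text above.

```python
def build_term(genus, start, end):
    """Compose the query: genus limited to [Organism], dates to a [PDAT] range.

    Assumption: publication dates are searched with the [PDAT] tag.
    """
    return f"{genus}[Organism] AND {start}:{end}[PDAT]"


def count_entries(genus, start, end, email):
    """Return the number of matching Nucleotide records (needs network access)."""
    # Imported here so that build_term can be used without Biopython installed.
    from Bio import Entrez

    Entrez.email = email  # lets NCBI contact you if there is a problem
    handle = Entrez.esearch(db="nucleotide", term=build_term(genus, start, end))
    record = Entrez.read(handle)
    handle.close()
    return int(record["Count"])


# The query string that would be sent for the sample dataset.
print(build_term("Anthoxanthum", "2003/7/25", "2005/12/27"))
```

Calling `count_entries("Anthoxanthum", "2003/7/25", "2005/12/27", "your_name@your_mail_server.com")` would then run the search. Because GenBank is constantly updated, the live count may not match the sample output.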
Programming Shortcut
NCBI's databases, such as PubMed, GenBank, GEO, and many others, can be accessed via Entrez, a data retrieval system offered by NCBI. For direct access to Entrez, you can use Biopython’s Bio.Entrez module.
The Bio.Entrez.esearch() function will search any of the NCBI databases. This function takes the following arguments:
- db: The database to search. For example, this field can be nucleotide for GenBank or pubmed for PubMed.
- term: The search term for the "Query" field. You can use search tags here.
We will now demonstrate a quick search for the rbcL gene in corn (Zea mays):
>>> from Bio import Entrez
>>> Entrez.email = "your_name@your_mail_server.com"
>>> handle = Entrez.esearch(db="nucleotide", term='"Zea mays"[Organism] AND rbcL[Gene]')
>>> record = Entrez.read(handle)
>>> record["Count"]
'6'  # Surely this value will change over time because GenBank is constantly updated
Note that when you send requests to Entrez databases, you must obey NCBI's requirements:
- For any series of more than 100 requests, access the database on the weekend or outside peak times in the US.
- Make no more than three requests every second.
- Fill in the Entrez.email field so that NCBI can contact you if there is a problem.
- Be sensible with your usage levels; if you want to download whole mammalian genomes, use NCBI's FTP.
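To illustrate the limit of three requests per second, here is a minimal sketch of a client-side throttle. The `Throttle` class is hypothetical and not part of Biopython or NCBI's tooling. It simply spaces calls at least 1/3 of a second apart, which is one conservative way to stay near that limit.

```python
import time


class Throttle:
    """Space out calls so that they are at least 1/max_per_second apart."""

    def __init__(self, max_per_second=3):
        self.min_interval = 1.0 / max_per_second
        self._last = None  # time of the previous call, if any

    def wait(self):
        """Sleep just long enough to respect the minimum interval."""
        now = time.monotonic()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                time.sleep(remaining)
        self._last = time.monotonic()
```

In use, you would create one `Throttle()` and call its `wait()` method immediately before each Entrez request in a loop.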