A biological
database is a collection of data that is organized so that its contents can
easily be accessed, managed, and updated. There are two main functions of
biological databases:
- Store, organize, and manage biological data in computer-readable formats
- Make biological data available to scientists
Biological databases are therefore libraries of life sciences information, collected from scientific
experiments, published literature, high-throughput experimental technologies, and
computational analyses. They often contain information from research areas
including genomics, proteomics, metabolomics, microarray gene expression, and
phylogenetics.
The content of a biological database:
Because biological databases are an important tool that helps scientists
understand and analyze biological information, and thereby makes research
easier, their content has to be maintained at high accuracy. To begin with,
the types of data generated by molecular biology research include:
- Nucleotide sequences (DNA and mRNA)
- Protein sequences
- 3D protein structures
- Complete genomes and maps
- Gene expression
- Genetic variation (polymorphisms)
Metabolic pathways, molecular interactions, and mutations in molecular sequences
are also among the contents of some types of databases.
Two-dimensional DNA chip images of mRNA expression and two-dimensional gel
electrophoresis images of protein expression are other types of data that can
be used in developing databases.
When the well-known Protein Data Bank (PDB) was developed as the first protein
structure database, it contained only 10 entries in 1972. It has since grown into
a large database with over 10,000 entries.
While the initial databases of protein sequences were maintained at individual
laboratories, the development of a consolidated formal database, the
SWISS-PROT protein sequence database, was initiated in 1986. It now has about
70,000 protein sequences from more than 5000 model organisms, a small fraction
of all known organisms. This wide variety of data resources is now available
for study and research by both academic institutions and industry. The data
are made available as public domain information, in the larger interest of the
research community, over the Internet (www.ncbi.nlm.nih.gov) and on CD-ROMs
(on request from www.rcsb.org). These databases are constantly updated with
additional entries.
Annotation:
Definition: an annotation is a note added to a text or diagram by way of
explanation or comment.
The expert who appends meaning to data, making it informative for the user, is
called an annotator. According to A. Bairoch, the main aim of annotation is to
add information that is as complete and accurate as possible to each sequence
entry.
According to him, protein sequence information can be divided into inflexible
data, called core data, and data that characterize the sequences as fully as
possible, called annotations.
Genetic sequence annotation is of basic importance: it is a first step in
sequence analysis and a prerequisite for further analysis and comparison
between sequences.
Therefore, the core data do not greatly depend on the annotator’s point of view;
the annotations, on the other hand, are interpretive information and depend
completely on the experience, vocabulary, and interests of the annotator.
For instance, in the case of genetic sequences:
Core data:
- The sequence data
- Information citation (bibliographical references)
- Taxonomic data (description of the biological source of the nucleic acid)
Annotation:
- The constituent domains and sub-regions of the sequence
- The product of the gene (protein)
- The similarities to other nucleic acid sequences
- The sequence conflicts
- The variants
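The split between core data and annotation can be pictured as a small data structure. The following is only a minimal sketch: the class names, field names, and toy values are assumptions made for illustration and do not correspond to the schema of any real database.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class CoreData:
    """Data that should not depend on who curates the entry (hypothetical layout)."""
    sequence: str
    citations: List[str]   # bibliographical references
    taxonomy: str          # biological source of the nucleic acid


@dataclass
class Annotation:
    """Interpretive information added, and later revised, by an annotator."""
    domains: List[str] = field(default_factory=list)
    gene_product: str = ""
    similar_sequences: List[str] = field(default_factory=list)
    conflicts: List[str] = field(default_factory=list)
    variants: List[str] = field(default_factory=list)


@dataclass
class SequenceEntry:
    core: CoreData
    annotation: Annotation = field(default_factory=Annotation)


# Toy entry: every value here is made up for the example.
entry = SequenceEntry(
    core=CoreData(
        sequence="ATGGCCTAA",
        citations=["(hypothetical reference)"],
        taxonomy="Homo sapiens",
    )
)
entry.annotation.gene_product = "hypothetical protein"
```

Making the core record frozen while leaving the annotation mutable mirrors the idea that core data are inflexible, whereas annotations reflect the annotator's interpretation and may change.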
Types of biological databases:
Databases in
general can be classified into: 1) Primary; 2) Secondary; 3) Composite databases.
A primary database contains information on the sequence or structure alone.
Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for
nucleotide sequences, and the Protein Data Bank for protein structures.
A secondary database contains information derived from the primary databases.
A secondary sequence database contains information such as the conserved
sequences, signature sequences, and active site residues of protein families,
arrived at by multiple sequence alignment of a set of related proteins. A
secondary structure database contains PDB entries in an organized form: the
entries are classified according to their structure, such as all-alpha
proteins, all-beta proteins, etc., and also carry information on the conserved
secondary structure motifs of a particular protein. Secondary databases created
and hosted by researchers at their individual laboratories include SCOP,
developed at Cambridge University; CATH, developed at University College
London; PROSITE, from the Swiss Institute of Bioinformatics; and eMOTIF, at
Stanford.
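To make the idea of deriving a conserved or signature sequence from a multiple sequence alignment concrete, here is a minimal sketch. The function name, the `threshold` parameter, the use of `x` for variable positions, and the toy alignment are all assumptions for illustration; real secondary databases use their own, more elaborate methods.

```python
from collections import Counter


def consensus(alignment, threshold=1.0):
    """Return a consensus string for equal-length aligned sequences.

    A column is reported as its most common residue when that residue is
    not a gap and its frequency is at least `threshold`; otherwise the
    column is reported as 'x' (variable position).
    """
    if not alignment:
        raise ValueError("alignment is empty")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("aligned sequences must have equal length")

    result = []
    for column in zip(*alignment):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != "-" and count / len(column) >= threshold
        result.append(residue if conserved else "x")
    return "".join(result)


# Toy alignment of four short, already-aligned protein fragments.
toy_alignment = [
    "GDSGGP",
    "GDSGGS",
    "GDSGGP",
    "GNSGGP",
]
pattern = consensus(toy_alignment)                    # strict: fully conserved columns only
relaxed = consensus(toy_alignment, threshold=0.75)    # tolerate one mismatch in four
```

With the strict default, only columns that are identical in every sequence survive; lowering the threshold trades specificity for coverage, which is the kind of choice a curator of signature patterns has to make.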
A composite database amalgamates a variety of different primary database
sources, which obviates the need to search multiple resources. Different
composite databases use different primary databases and different criteria in
their search algorithms, and they offer various search options. The National
Center for Biotechnology Information (NCBI), which hosts these nucleotide and
protein databases on its large, highly available, redundant array of computer
servers, provides free access to researchers. It also links to OMIM (Online
Mendelian Inheritance in Man), which contains information about the proteins
involved in genetic diseases.
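As a rough illustration of how a composite database can merge several primary sources into one searchable set, the sketch below combines two toy sources and keeps a single record per distinct sequence. The source names, identifiers, sequences, and the exact-match redundancy criterion are all assumptions; actual composite databases apply their own criteria.

```python
def build_composite(sources):
    """Merge entries from several sources, keeping one record per sequence.

    `sources` maps a source name to a dict of {identifier: sequence}.
    Identical sequences are collapsed into one record that remembers
    every source identifier it came from.
    """
    composite = {}
    for source_name, entries in sources.items():
        for identifier, sequence in entries.items():
            record = composite.setdefault(
                sequence, {"sequence": sequence, "xrefs": []}
            )
            record["xrefs"].append(f"{source_name}:{identifier}")
    return list(composite.values())


# Toy data: two hypothetical primary sources that share one sequence.
toy_sources = {
    "source_a": {"A001": "MKTAYIAK", "A002": "MSLLTEVE"},
    "source_b": {"B17": "MKTAYIAK", "B18": "MGDVEKGK"},
}
merged = build_composite(toy_sources)
```

A user then searches `merged` once instead of querying each source separately, and the cross-references show where each sequence originally came from.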
-------------------------------------------------------------------------------------------------------------
Primary Nucleotide Sequence Repository – GenBank, EMBL, DDBJ
These are the three chief databases that store and make available raw nucleic
acid sequences. GenBank is physically located in the USA and is accessible
through the NCBI portal over the Internet. EMBL (European Molecular Biology
Laboratory) is in the UK, and DDBJ (DNA Data Bank of Japan) is in Japan. They
use similar (but not identical) data formats and exchange data on a daily
basis. Access to GenBank, as to all databases at NCBI, is through the Entrez
search program. This front-end search interface allows a great variety of
search options.
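Besides the web interface, NCBI exposes Entrez programmatically through its E-utilities. The sketch below only builds request URLs and does not contact the server. The helper names and the placeholder accession are assumptions introduced for this example and are not part of the original text; NCBI's usage guidelines should be checked before sending real requests.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def nucleotide_search_url(term, max_results=20):
    """Build an ESearch URL that looks up nucleotide records matching `term`."""
    params = {"db": "nuccore", "term": term, "retmax": max_results}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def genbank_fetch_url(accession):
    """Build an EFetch URL that returns one record as GenBank flat-file text."""
    params = {"db": "nuccore", "id": accession, "rettype": "gb", "retmode": "text"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# "XX000000" is a placeholder, not a real accession number.
search_url = nucleotide_search_url("Homo sapiens[Organism]", max_results=5)
fetch_url = genbank_fetch_url("XX000000")
```

Keeping URL construction separate from the network call makes the query easy to inspect and test before any request is sent.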
Figure: A sample GenBank Entry
The term accession number refers to a field containing a unique identification
number. The sequence and the other information can be retrieved from the
database simply by searching for a given accession number. Taking the field
names in order, we have first of all the word ‘LOCUS’. This is a GenBank title
that names the sequence entry. Apart from the accession number, it also
specifies the number of bases in the entry, the nucleic acid type, a code word,
PRI, which indicates that the sequence is from a primate, and the date on which
the entry was made. PRI is one of the 17 keywords that are used to classify the
data. The next line of the file contains the definition of the entry, giving
the name of the sequence. The unique accession number comes next, followed by a
version number in case the entry has gone through more than one version.
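The fields of the LOCUS line described above can be pulled apart with a few lines of code. This is a minimal sketch that splits on whitespace rather than relying on fixed column positions. The function name and the sample line are made up for illustration, and the sample is not a real GenBank record.

```python
def parse_locus_line(line):
    """Split a GenBank LOCUS line into its main fields.

    Assumes whitespace-separated tokens in the order:
    LOCUS, name, length, 'bp', molecule type, [topology], division, date.
    The topology token is optional, so division and date are read from the end.
    """
    tokens = line.split()
    if len(tokens) < 7 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("not a recognizable LOCUS line")
    return {
        "name": tokens[1],
        "length": int(tokens[2]),      # number of bases in the entry
        "molecule_type": tokens[4],    # nucleic acid type, e.g. DNA or mRNA
        "division": tokens[-2],        # classification code such as PRI
        "date": tokens[-1],            # date the entry was made or last modified
    }


# Made-up sample line, for illustration only.
sample = "LOCUS       EXAMPLE01    1234 bp    mRNA    linear   PRI 01-JAN-2000"
fields = parse_locus_line(sample)
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-rolled splitting, since flat-file layouts have changed over the years.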