COMPUTATIONAL BIOLOGY: Biological Databases

A biological database is a collection of data that is organized so that its contents can easily be accessed, managed, and updated. There are two main functions of biological databases:

· Store, organize & manage biological data in computer readable formats
· Make biological data available to scientists

Biological databases are therefore libraries of life-science information, collected from scientific experiments, published literature, high-throughput experimental technologies, and computational analyses. They often contain information from research areas including genomics, proteomics, metabolomics, microarray gene expression, and phylogenetics.

The content of a biological database:

Because biological databases are an important tool that helps scientists understand and analyze biological information, and so makes research easier, their content has to be maintained at high accuracy. To begin with, molecular biology research generates the following types of data:

· Nucleotide sequences (DNA and mRNA)
· Protein sequences
· 3D protein structures
· Complete genomes and maps
· Gene expression
· Genetic variation (polymorphisms)

Metabolic pathways, molecular interactions, and mutations in molecular sequences are also among the contents of some types of databases. Two-dimensional DNA chip images of mRNA expression and two-dimensional gel electrophoresis images of protein expression are other types of data that can be used in developing databases. When the well-known Protein Data Bank (PDB) was developed as the first protein structure database, it contained only 10 entries in 1972. It has since grown into a large database with over 10,000 entries.

While the earliest protein sequence databases were maintained at individual laboratories, development of a consolidated formal database, the SWISS-PROT protein sequence database, was initiated in 1986. It now holds about 70,000 protein sequences from more than 5,000 model organisms, a small fraction of all known organisms. This wide variety of data resources is now available for study and research by both academic institutions and industry. The data are made available as public-domain information, in the larger interest of the research community, over the Internet (www.ncbi.nlm.nih.gov) and on CD-ROM (on request from www.rcsb.org). These databases are constantly updated with additional entries.

Annotation:

Definition: an annotation is a note added to a text or diagram by way of explanation or comment.

The expert who appends meaning to data, making it informative for the user, is called an annotator. According to A. Bairoch, the main aim of annotation is to add information that is as complete and accurate as possible to each sequence entry.

According to him, protein sequence information can be divided into inflexible data, called core data, and data that characterize the sequences as well as possible, called annotations. Genetic sequence annotation is of basic importance: the annotation itself is a first step in sequence analysis, and it is also a prerequisite for further analysis and comparison between sequences.

Therefore, the core data do not depend greatly on the annotator’s point of view; the annotations, on the other hand, are interpretive information and depend on the experience, the vocabulary, and the interests of the annotator.

For instance, in the case of genetic sequences:

Core data:

· The sequence data
· Information citation (bibliographical references)
· Taxonomic data (description of the biological source of the nucleic acid)

Annotation:

· The constituent domains and sub-regions of the sequence
· The product of the gene (protein)
· The similarities to other nucleic acid sequences
· The sequence conflicts
· The variants
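
The split between core data and annotation can be pictured as a small data structure. The sketch below is only an illustration, not the schema of any real database: the class and field names are assumptions chosen to mirror the two lists above, and the example entry is invented. The core data are frozen (they should not change with the annotator), while the annotation fields stay editable.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class CoreData:
    """Fixed facts about an entry; independent of the annotator."""
    sequence: str
    citations: Tuple[str, ...]
    taxonomy: str


@dataclass
class Annotation:
    """Interpretive information; depends on who annotates the entry."""
    domains: List[str] = field(default_factory=list)
    gene_product: str = ""
    similar_sequences: List[str] = field(default_factory=list)
    conflicts: List[str] = field(default_factory=list)
    variants: List[str] = field(default_factory=list)


@dataclass
class SequenceEntry:
    core: CoreData
    annotation: Annotation = field(default_factory=Annotation)


# Hypothetical entry, invented for illustration only.
entry = SequenceEntry(
    core=CoreData(
        sequence="ATGGCCATTGTAATGGGCCGC",
        citations=("Hypothetical et al., unpublished",),
        taxonomy="Homo sapiens",
    )
)

# An annotator adds interpretation without touching the core data.
entry.annotation.gene_product = "hypothetical protein"
entry.annotation.variants.append("G>A at position 4")
```

Trying to assign to `entry.core.sequence` raises an error, which reflects the idea that core data are not a matter of interpretation.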

------------------------------------------------------------------------------------------------------------------------------------

Types of biological databases:

Databases in general can be classified into: 1) Primary; 2) Secondary; 3) Composite databases.

A primary database contains information on the sequence or structure alone. Examples include Swiss-Prot and PIR for protein sequences, GenBank and DDBJ for nucleotide sequences, and the Protein Data Bank for protein structures.

A secondary database contains information derived from the primary databases. A secondary sequence database contains information such as the conserved sequences, signature sequences, and active-site residues of protein families, arrived at by multiple sequence alignment of a set of related proteins. A secondary structure database organizes the entries of the PDB, classifying them according to their structure (all-alpha proteins, all-beta proteins, and so on), and also contains information on the conserved secondary-structure motifs of a particular protein. Secondary databases created and hosted by researchers at their individual laboratories include SCOP, developed at Cambridge University; CATH, developed at University College London; PROSITE, of the Swiss Institute of Bioinformatics; and eMOTIF, at Stanford.
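
To make the idea of "derived information" concrete, here is a minimal sketch of how conserved positions might be read off a multiple sequence alignment. It is not the method used by PROSITE or any other database named above; the function name `consensus`, the `threshold` parameter, and the toy alignment are all assumptions made for illustration.

```python
from collections import Counter


def consensus(alignment, threshold=1.0, gap="-"):
    """Return a consensus string for a list of equal-length aligned sequences.

    A column is reported as its most common residue when that residue is
    not a gap and its frequency is at least `threshold`; otherwise the
    column is written as 'x' (not conserved).
    """
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    out = []
    for column in zip(*alignment):
        residue, count = Counter(column).most_common(1)[0]
        conserved = residue != gap and count / len(column) >= threshold
        out.append(residue if conserved else "x")
    return "".join(out)


# Toy alignment, invented for illustration only.
alignment = [
    "GHSCQAKE",
    "GHSCMTKD",
    "GHTCQ-KE",
]

print(consensus(alignment))                 # strictly conserved columns
print(consensus(alignment, threshold=0.6))  # majority-conserved columns
```

With the default threshold only columns where every sequence agrees are kept, which gives `GHxCxxKx` for the toy alignment; lowering the threshold to 0.6 gives `GHSCQxKE`.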

A composite database amalgamates a variety of different primary database sources, which obviates the need to search multiple resources. Different composite databases use different primary databases and different criteria in their search algorithms, and they offer various search options. The National Center for Biotechnology Information (NCBI), which hosts these nucleotide and protein databases on a large, highly available, redundant array of computer servers, provides free access to researchers. It also links to OMIM (Online Mendelian Inheritance in Man), which contains information about the proteins involved in genetic diseases.
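
A rough sketch of the amalgamation step follows. It assumes, purely for illustration, that a composite database keeps one entry per distinct sequence and records which sources contributed it; real composite databases apply their own criteria. The function name `build_composite`, the source names, and the records are hypothetical.

```python
def build_composite(sources):
    """Merge records from several sources, one entry per distinct sequence.

    `sources` maps a source name to a list of (identifier, sequence) pairs.
    Sources are visited in the order given, and every contributing
    identifier is kept as a cross-reference on the merged entry.
    """
    composite = {}
    for source_name, records in sources.items():
        for identifier, sequence in records:
            key = sequence.upper()  # treat case differences as the same sequence
            entry = composite.setdefault(key, {"sequence": key, "xrefs": []})
            entry["xrefs"].append(f"{source_name}:{identifier}")
    return list(composite.values())


# Hypothetical sources and records, invented for illustration only.
sources = {
    "source_a": [("a1", "MKVLAA"), ("a2", "MAAGTT")],
    "source_b": [("b1", "mkvlaa"), ("b2", "MTTPQR")],
}

for entry in build_composite(sources):
    print(entry["sequence"], entry["xrefs"])
```

Here the sequence shared by both sources appears once, with two cross-references, so a single search covers both sources.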

-------------------------------------------------------------------------------------------------------------

Primary Nucleotide Sequence Repository – GenBank, EMBL, DDBJ

These are the three chief databases that store and make available raw nucleic acid sequences. GenBank is physically located in the USA and is accessible through the NCBI portal over the Internet. EMBL (European Molecular Biology Laboratory) is in the UK, and DDBJ (DNA Data Bank of Japan) is in Japan. They use similar (but not identical) data formats and exchange data on a daily basis. Access to GenBank, as to all databases at NCBI, is through the Entrez search program. This front-end search interface allows a great variety of search options.
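
Besides the web interface, Entrez can also be queried programmatically through NCBI's E-utilities. The sketch below only builds the request URLs and does not contact the server; the helper names `esearch_url` and `efetch_url` are hypothetical, the accession `XX000000` is a placeholder rather than a real record, and the parameter choices (`nuccore`, `rettype=gb`) are one reasonable assumption among several.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nuccore", retmax=5):
    """Build a search URL that would return up to `retmax` matching record IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def efetch_url(accession, db="nuccore"):
    """Build a fetch URL that would return one record in GenBank flat-file format."""
    query = urlencode(
        {"db": db, "id": accession, "rettype": "gb", "retmode": "text"}
    )
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"


# Example search term using Entrez field tags; placeholder accession.
print(esearch_url("insulin[Title] AND primates[Organism]"))
print(efetch_url("XX000000"))
```

Opening either URL in a browser, or fetching it with any HTTP client, would return the search results or the flat-file record, subject to NCBI's usage guidelines.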

Figure: A sample GenBank Entry

The term accession number refers to a field containing a unique identification number. The sequence and the other information can be retrieved from the database simply by searching for a given accession number. Taking the field names in order, we first have the word ‘LOCUS’. This is a GenBank title that names the sequence entry. Apart from the accession number, this line also specifies the number of bases in the entry, the nucleic acid type, a code word, PRI, that indicates the sequence is from a primate, and the date on which the entry was made. PRI is one of the 17 keywords that are used to classify the data. The next line of the file contains the definition of the entry, giving the name of the sequence. The unique accession number comes next, followed by a version number in case the entry has gone through more than one version.
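
The fields described above can be pulled out of the top of an entry with a few lines of code. This is a simplified sketch, not a full GenBank parser: the function name `parse_genbank_header` is hypothetical, the header text is invented (it is not a real record), and the code assumes the whitespace-separated layout shown in the sample.

```python
def parse_genbank_header(text):
    """Extract LOCUS, DEFINITION, ACCESSION and VERSION fields from entry text."""
    fields = {}
    for line in text.splitlines():
        # The keyword is the first word on the line; continuation lines
        # start with spaces, give an empty keyword, and are skipped.
        keyword, _, value = line.partition(" ")
        value = value.strip()
        if keyword == "LOCUS":
            tokens = value.split()
            fields["locus_name"] = tokens[0]
            fields["length_bp"] = int(tokens[1])   # number of bases
            fields["molecule_type"] = tokens[3]    # nucleic acid type
            fields["division"] = tokens[-2]        # e.g. PRI for primate
            fields["date"] = tokens[-1]            # date of the entry
        elif keyword == "DEFINITION":
            fields["definition"] = value
        elif keyword == "ACCESSION":
            fields["accession"] = value.split()[0]
        elif keyword == "VERSION":
            fields["version"] = value.split()[0]
    return fields


# Invented header for illustration only; not a real GenBank record.
sample = """\
LOCUS       HYPO0001     1200 bp    mRNA    linear   PRI 01-JAN-2000
DEFINITION  Hypothetical primate mRNA, invented for illustration.
ACCESSION   XX000000
VERSION     XX000000.1
"""

header = parse_genbank_header(sample)
for name, value in header.items():
    print(f"{name}: {value}")
```

For real work, an established parser such as the one in Biopython is a safer choice, since actual entries contain multi-line fields and format variations this sketch ignores.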
