Introduction:
Biological databases are libraries
of life sciences information, collected from scientific experiments, published
literature, high-throughput experiment technology, and computational analyses.
They contain information from
research areas including:
- Genomics
- Proteomics
- Metabolomics
- Phylogenetics
- Microarray gene expression
The information contained in
biological databases provides insight into areas such as gene function,
structure, and localization, the clinical effects of mutations, and
similarities among biological sequences and structures. However, because these
databases hold such large amounts of data, researchers need assistance in the
form of data mining tools and techniques.
Data mining:
Data mining refers to extracting
or “mining” knowledge from large amounts of data. Data Mining (DM) is the science
of finding new, interesting patterns and relationships in huge amounts of data.
In the database context, it is the process of retrieving relevant information
from databases with the aid of software tools and techniques.
Data mining is also sometimes
called Knowledge Discovery in Databases (KDD). It can be defined as “the
process of discovering meaningful new correlations, patterns, and trends by
digging into large amounts of data stored in warehouses”.
Technically, it is an interdisciplinary
field within computer science that draws on statistical methods, artificial
intelligence, and database management. It therefore requires intelligent
technologies and a willingness to explore the hidden knowledge
that may reside in the data.
Applications:
Today, several software tools
are commonly used in bioinformatics research to retrieve information from
large and ever-expanding databases. Some of these tools and techniques are
briefly described below:
Hidden Markov Models:
HMMs (Hidden Markov Models) are
often applied in bioinformatics to identify patterns in sequence
databases, for example to find genes, conserved sequences, or protein functions.
HMMs are also used
to find conserved protein profiles by exploiting the known probabilities of
amino acids occurring at any given position in a sequence. The probability of
each amino acid residue at a particular position of the sequence is
obtained from multiple sequence alignments performed over known
databases. These probabilities are in turn used to identify protein profiles,
which play a major role in functional genomics research.
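As a rough illustration of the position-specific probabilities described above, the following sketch estimates per-column residue frequencies from a small alignment and scores a candidate sequence against them. It is a simplified profile with match positions only (no insert or delete states and no transition probabilities), not a full HMM. The toy alignment, the pseudocount value, the uniform background, and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_probabilities(alignment, pseudocount=1.0):
    """Estimate residue probabilities for each column of an alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for pos in range(length):
        # Count only standard residues; gap characters such as '-' are skipped.
        counts = Counter(seq[pos] for seq in alignment if seq[pos] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        # The pseudocount keeps unseen residues from getting probability zero.
        profile.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return profile


def log_odds_score(sequence, profile, background=1.0 / len(AMINO_ACIDS)):
    """Sum log2(position probability / background) over an ungapped sequence."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(
        math.log2(profile[i][aa] / background) for i, aa in enumerate(sequence)
    )


# Made-up alignment of four short sequences (illustrative data only).
toy_alignment = ["ACDK", "ACEK", "ACDR", "GCDK"]
profile = column_probabilities(toy_alignment)

conserved = log_odds_score("ACDK", profile)
unrelated = log_odds_score("WWWW", profile)
print(conserved > unrelated)
```

A sequence that matches the conserved columns scores higher than one built from residues never seen in the alignment, which is the basic idea behind using such probabilities to recognize members of a protein family.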
BLAST:
The Basic Local Alignment Search
Tool (BLAST), which compares gene and protein sequences against others in
public databases, now comes in several types, including PSI-BLAST, PHI-BLAST,
and BLAST 2 Sequences. Specialized BLASTs are also available for human,
microbial, and other genomes.
Microarrays:
Microarray technologies are used
to predict a patient’s outcome. On the basis of patients’ genotype microarray
data, their survival time and risk of tumor metastasis or recurrence can be estimated.
This requires comparative analysis of large genomes and the ability
to look for patterns. Software tools trained on microarray data carry out this task.
Entrez:
Entrez is an integrated database
retrieval system that enables text searching, using simple Boolean queries, of
a diverse set of 30 databases. Global Query, the default search on the NCBI
homepage, searches across all the Entrez databases and rapidly returns the
counts of matching records in each database.
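To make the idea of a simple Boolean query concrete, the sketch below builds a search URL for NCBI's E-utilities ESearch service, which offers programmatic access to Entrez. The text above does not mention E-utilities, so treating it as the access route is an assumption, as are the helper name `build_esearch_url`, the chosen database, and the field tags in the example query. The code only constructs the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities services.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, terms, operator="AND", max_records=20):
    """Build an ESearch URL that joins search terms with one Boolean operator."""
    if operator not in {"AND", "OR", "NOT"}:
        raise ValueError("operator must be AND, OR, or NOT")
    query = f" {operator} ".join(terms)
    params = {"db": database, "term": query, "retmax": max_records}
    # urlencode percent-escapes brackets and turns spaces into '+'.
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Illustrative query: the field tags shown here are examples only.
url = build_esearch_url("protein", ["kinase[Title]", "Homo sapiens[Organism]"])
print(url)
```

Fetching the resulting URL with any HTTP client would return the matching record identifiers. Keeping URL construction separate from the network call makes the query logic easy to inspect and test.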
The Entrez databases include
almost 70 million DNA and protein sequences derived from several sources such
as the NCBI taxonomy, genomes, population sets, gene expression data, UniGene,
UniSTS, dbSNP, Molecular Modeling Database (MMDB), protein domains and the
biomedical literature via PubMed, PubMed Central, Online Mendelian Inheritance
in Man (OMIM), and online books.
Entrez provides extensive links
within and between database records. In their simplest form, these links may be
simple cross-references between a sequence and the abstract of the paper in
which it is reported, or between a protein sequence and its coding DNA sequence
or its 3D structure. A service called LinkOut expands the range of links to
include external services, such as organism-specific genome databases.
The records retrieved in Entrez
can be displayed in many formats and downloaded singly or in batches. Formats
available for GenBank records include the GenBank flat file, FASTA, and XML, among others.
Graphical display formats are offered for some types of records, including
genomic records.
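Once records have been downloaded in FASTA format, they can be read with a few lines of code. The following is a minimal sketch of a FASTA reader, assuming the common layout of a `>` header line followed by one or more sequence lines. The function name `parse_fasta` and the two example records are made up for illustration and are not real GenBank entries.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up records for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TTAGGA
>seq2 another made-up record
GGGTTTAAA
"""

records = parse_fasta(example)
for header, sequence in records:
    print(header, len(sequence))
```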