COMPUTATIONAL BIOLOGY: DataMining and applications

Introduction:

Biological databases are libraries of life sciences information, collected from scientific experiments, published literature, high-throughput experiment technology, and computational analyses.

They contain information from research areas including:

Genomics
Proteomics
Metabolomics
Phylogenetics
Microarray gene expression

The information in biological databases provides insight into areas such as gene function, structure, and localization, the clinical effects of mutations, and similarities between biological sequences and structures. However, because these databases hold such large amounts of data, researchers need help exploring them, in the form of data mining tools and techniques.

Data mining:

Data mining refers to extracting or “mining” knowledge from large amounts of data. Data mining (DM) is the science of finding new, interesting patterns and relationships in huge amounts of data. In practice, it is the process of retrieving relevant information from databases with the aid of software tools and techniques.

Data mining is also sometimes called Knowledge Discovery in Databases (KDD). It can be defined as “the process of discovering meaningful new correlations, patterns, and trends by digging into large amounts of data stored in warehouses”.

Technically, it is an interdisciplinary field of computer science that involves the application of statistical methods, artificial intelligence, and database management. It therefore requires intelligent technologies and a willingness to explore the hidden knowledge that may reside in the data.

Applications:

Today, several software tools are commonly used in bioinformatics research to retrieve information from large and ever-expanding databases. Some of these tools and techniques are briefly described below:

Hidden Markov Models:

HMMs (Hidden Markov Models) are often applied in bioinformatics to identify patterns in sequence databases, in order to find genes, conserved sequences, or protein functions.

HMMs are also used to find conserved protein profiles by exploiting the known probabilities of amino acids occurring at any given position. The probability of each amino acid residue at a particular position of the sequence is obtained from multiple sequence alignments performed over known databases. These probabilities are in turn used to identify protein profiles, which play a major role in functional genomics research.
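The position-specific probabilities described above can be sketched in a few lines of Python. This is only a simplified illustration: it estimates the per-column residue probabilities (the match-state emissions of a profile HMM) from a toy alignment and ignores insert and delete states entirely. The alignment, the pseudocount, and the uniform background frequency are all assumptions made for the example.

```python
import math
from collections import Counter

# Toy alignment (hypothetical data): each row is one aligned sequence.
ALIGNMENT = ["ACDE", "ACDE", "ACEE", "GCDE"]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def column_probabilities(alignment, pseudocount=1.0):
    """Estimate P(residue | column) for every alignment column."""
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        # The pseudocount keeps unseen residues from getting probability zero.
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile


def log_odds_score(sequence, profile, background=1.0 / len(ALPHABET)):
    """Sum of log2(P(residue | column) / background) over all positions."""
    return sum(math.log2(profile[i][aa] / background) for i, aa in enumerate(sequence))


profile = column_probabilities(ALIGNMENT)
print(log_odds_score("ACDE", profile))  # sequence resembling the alignment
print(log_odds_score("WWWW", profile))  # unrelated sequence
```

A sequence that resembles the alignment scores above zero, while an unrelated one scores below zero. A real profile HMM adds transition probabilities between match, insert, and delete states on top of these emissions.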

BLAST:

The Basic Local Alignment Search Tool (BLAST) compares gene and protein sequences against others in public databases. It now comes in several variants, including PSI-BLAST, PHI-BLAST, and BLAST 2 Sequences. Specialized BLASTs are also available for human, microbial, and other genomes.
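To give a feel for what "local alignment" means, here is a minimal sketch that ranks database sequences by their Smith-Waterman local alignment score against a query. Note that this is not how BLAST itself works: BLAST uses a faster seed-and-extend heuristic, while Smith-Waterman is the exact dynamic-programming method it approximates. The scoring values and the two database entries are assumptions chosen for the example.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            current[j] = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            best = max(best, current[j])
        previous = current
    return best


query = "GATTACA"
database = {"seq1": "TTGATTACATT", "seq2": "CCCCCCCC"}  # hypothetical entries
hits = sorted(
    database,
    key=lambda name: local_alignment_score(query, database[name]),
    reverse=True,
)
print(hits)  # best-matching entry first
```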

Microarrays:

Microarray technologies can be used to predict a patient’s outcome. On the basis of a patient’s genotype microarray data, survival time and the risk of tumor metastasis or recurrence can be estimated. This requires the comparative analysis of large genomes and the ability to look for patterns. Software tools trained on microarray data carry out this task.
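As a rough illustration of a tool "trained" on microarray data, the sketch below uses a nearest-centroid classifier: it averages the profiles of patients with a known outcome and assigns a new patient to the closest average. The numbers, the three-gene profiles, and the outcome labels are all made up, and the choice of classifier is an assumption rather than the method of any particular tool.

```python
import math

# Hypothetical profiles (one value per gene) for patients with known outcomes.
TRAINING = {
    "recurrence": [[8.1, 2.0, 5.5], [7.9, 2.4, 5.9]],
    "no_recurrence": [[3.2, 6.8, 5.1], [2.9, 7.1, 4.8]],
}


def centroid(samples):
    """Mean value of each gene across the given samples."""
    return [sum(values) / len(values) for values in zip(*samples)]


def predict(sample, centroids):
    """Return the outcome label whose centroid is closest (Euclidean distance)."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


centroids = {label: centroid(samples) for label, samples in TRAINING.items()}
print(predict([7.5, 2.2, 5.0], centroids))
```

Real studies involve thousands of genes and need normalization, feature selection, and validation on independent patients before any prediction can be trusted.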

Entrez:

Entrez is an integrated database retrieval system that enables text searching, using simple Boolean queries, of a diverse set of 30 databases. Global Query, the default search on the NCBI homepage, searches across all the Entrez databases and rapidly returns the counts of matching records in each database.
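A Boolean query of this kind can also be sent programmatically. The sketch below only builds the request URL for ESearch, part of NCBI's E-utilities interface to Entrez, and does not contact the server. The helper name `build_search_url`, the example terms, and the choice of the protein database are assumptions for illustration.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities ESearch endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_search_url(database, terms, operator="AND", retmax=20):
    """Join search terms with a Boolean operator and build an ESearch URL."""
    query = f" {operator} ".join(terms)
    params = {"db": database, "term": query, "retmax": retmax}
    return f"{ESEARCH}?{urlencode(params)}"


# Illustrative query: field tags in square brackets restrict each term.
url = build_search_url("protein", ["BRCA1[Gene Name]", "Homo sapiens[Organism]"])
print(url)
```

Opening the printed URL in a browser, or fetching it with any HTTP client, should return the matching record identifiers. Swapping `operator` for `"OR"` or `"NOT"` changes the Boolean logic.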

The Entrez databases include almost 70 million DNA and protein sequences, together with data from sources such as the NCBI taxonomy, genomes, population sets, gene expression data, UniGene, UniSTS, dbSNP, the Molecular Modeling Database (MMDB), protein domains, and the biomedical literature via PubMed, PubMed Central, Online Mendelian Inheritance in Man (OMIM), and online Books.

Entrez provides extensive links within and between database records. In their simplest form, these links may be simple cross-references between a sequence and the abstract of the paper in which it is reported, or between a protein sequence and its coding DNA sequence or its 3D structure. A service called LinkOut expands the range of links to include external services, such as organism-specific genome databases.

The records retrieved in Entrez can be displayed in many formats and downloaded singly or in batches. Formats available for GenBank records include the GenBank flat file, FASTA, and XML, among others. Graphical display formats are offered for some types of records, including genomic records.
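Once a batch of records has been downloaded in FASTA format, it is easy to read with a few lines of code. The following is a minimal sketch of a FASTA parser. The sample records and their identifiers are made up, and real projects would more likely rely on an established library.

```python
def parse_fasta(text):
    """Return a dict mapping each header line (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            # Sequence lines may be wrapped, so join them back together.
            records[header] += line
    return records


# Hypothetical two-record download; the identifiers are made up.
SAMPLE = """>example_1 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>example_2 hypothetical protein
MADEEKLPPG
"""

for header, sequence in parse_fasta(SAMPLE).items():
    print(header, len(sequence))
```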
