COMPUTATIONAL BIOLOGY: Protein Sequence analyses

PREDICTION OF 1ZBM, A PROTEIN OF UNKNOWN FUNCTION, AS A SOLUTE BINDING PROTEIN USING VARIOUS BIOINFORMATICS TOOLS

ABSTRACT
1ZBM, a hypothetical protein from Archaeoglobus fulgidus, was selected for function prediction based on a similarity search with the BlastP server against the non-redundant databases of NCBI. The similarity analysis revealed matches with solute binding protein (45%), S-ribosylhomocysteine lyase (34%) and succinyl-CoA ligase (33%). Multiple alignments were then constructed to recognize regions of conserved residues, followed by cavity detection using the CASTp server. The top-ranked pocket contained 23 identical or conserved residues out of a total of 44 residues within the cavity. Finally, docking studies carried out using AutoDock revealed similar binding energies of -3.28 and -3.99 kcal/mol for 1ZBM and 2I48, respectively.

METHODOLOGY
The PDB (Protein Data Bank) database was searched for hypothetical proteins of unknown function using two criteria, an X-ray resolution between 2.0 and 2.5 Å and a chain length of around 250 to 300 amino acids, which resulted in 92 hits. From these, the 1ZBM protein (280 residues) of Archaeoglobus fulgidus AF1704 was selected [http://www.rcsb.org/pdb]. The structure is composed of 42% helical and 18% beta sheet secondary structural elements, without any breaks in the protein chain. Breaks, if present in a protein structure, increase the complexity of the analysis depending on the region where they are located. Because breaks represent regions that could not be resolved by the X-ray technique, there are two ways to handle them: one is to build a 3-D model of the missing segment, and the other is to determine by visual inspection whether the break lies in the core, in the active site region or on the surface.
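
The selection criteria described above can be sketched as a simple filter. This is only an illustration: the record type, the function name and the three sample entries (with made-up identifiers and values) are hypothetical and are not taken from the actual PDB query.

```python
from dataclasses import dataclass


@dataclass
class PdbEntry:
    pdb_id: str
    resolution: float   # X-ray resolution in angstroms
    chain_length: int   # number of residues in the chain
    annotation: str     # free-text description of the entry


def select_candidates(entries, res_min=2.0, res_max=2.5, len_min=250, len_max=300):
    """Keep hypothetical proteins inside the resolution and length windows."""
    return [
        e for e in entries
        if res_min <= e.resolution <= res_max
        and len_min <= e.chain_length <= len_max
        and "hypothetical" in e.annotation.lower()
    ]


# Made-up records for illustration only; they are not real PDB entries.
entries = [
    PdbEntry("TST1", 2.2, 280, "Hypothetical protein"),
    PdbEntry("TST2", 1.8, 275, "Hypothetical protein"),   # resolution outside window
    PdbEntry("TST3", 2.4, 260, "Lysozyme"),               # function already known
]

selected = select_candidates(entries)
print([e.pdb_id for e in selected])
```

Only the first record passes all three conditions, which mirrors how a hit list is narrowed before one entry is chosen by hand.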

Pairwise Alignment
To predict the function of the 1ZBM protein, its sequence was extracted from PDB in FASTA format. A similarity search was carried out by scanning the sequence against the non-redundant database of NCBI (National Center for Biotechnology Information) using the protein-protein BLAST (Basic Local Alignment Search Tool) pairwise alignment tool, Blastp [http://www.ncbi.nlm.nih.gov; Altschul, et al., 2002].

Multiple Alignments
Multiple sequence alignments of the proteins returned by the Blast analysis were performed with ClustalW2 (cluster analysis) using default parameters [http://www.ebi.ac.uk/clustalw]. Multiple alignments are carried out to identify the regions of residue conservation among homologous proteins [Yamada, et al., 2006]. This information was further used to identify the probable active site region among the cavities detected within the protein by the CASTp server [http://sts-fw.bioengr.uic.edu/castp].
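
To illustrate what "regions of residue conservation" means in practice, the following minimal sketch marks each column of an already aligned set of sequences as identical or strongly similar. It is not the ClustalW2 implementation; the strong-similarity groups are assumed to match those used for the ':' symbol in Clustal output, and the three toy sequences are invented.

```python
# Residue groups assumed to correspond to Clustal's "strong" similarity classes.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def conservation_line(aligned):
    """Return one symbol per column: '*' identical, ':' strongly similar, ' ' otherwise."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")          # a gap breaks conservation in this column
        elif len(residues) == 1:
            symbols.append("*")          # every sequence has the same residue
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")          # all residues fall in one strong group
        else:
            symbols.append(" ")
    return "".join(symbols)


# Toy alignment, invented for illustration.
toy_alignment = ["MKLSA", "MRLTA", "MKI-A"]
print(conservation_line(toy_alignment))
```

Columns marked '*' or ':' are the ones that would later be compared against the residues lining a predicted cavity.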

Binding Studies
The solute binding protein 2I48, with its bound ligand, was extracted from PDB and selected for docking studies. Visual inspection of molecular structures was carried out using DS Viewer Pro [www.accelrys.com] and docking studies were performed with AutoDock 3.0.5 [www.scripps.edu]. All water molecules were removed from the original Protein Data Bank file, polar hydrogen atoms were added, and Kollman charges [Weiner, et al., 1984], atomic solvation parameters and fragmental volumes were assigned to the protein using AutoDock Tools (ADT). To validate the docking protocol, the bound ligand was removed from the crystal complex (2I48), its bond orders were checked, and all torsions were allowed to rotate during docking.
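
The water-removal step can be sketched without any docking software. The snippet below is a minimal illustration, not the AutoDock Tools procedure: it assumes the fixed-column PDB layout in which the residue name occupies columns 18 to 20, and the four sample lines are shortened toy records with a placeholder ligand name (LIG).

```python
def strip_waters(pdb_text):
    """Drop ATOM/HETATM records whose residue name (columns 18-20) is HOH."""
    kept = []
    for line in pdb_text.splitlines():
        is_atom_record = line.startswith(("ATOM", "HETATM"))
        if is_atom_record and line[17:20] == "HOH":
            continue  # skip crystallographic water
        kept.append(line)
    return "\n".join(kept)


# Shortened toy records; real PDB lines also carry coordinates and other fields.
sample = "\n".join([
    "ATOM      1  N   MET A   1",
    "HETATM  201  C   LIG A 301",
    "HETATM  202  O   HOH A 401",
    "END",
])

cleaned = strip_waters(sample)
print(cleaned)
```

The ligand record is kept on purpose, since the bound ligand is handled in a separate step of the protocol.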

The grid maps were generated by the program AutoGrid and each grid was centered at the crystal structure of the corresponding 1ZBM bound ligand. The grid dimensions were 60 × 60 × 60 Å³ with points separated by 0.375 Å. For the ligand, random starting positions, random orientations and random torsions were used, and the translation, quaternion and torsion steps were taken from the default values in AutoDock. The Lamarckian genetic algorithm was applied for minimization using default parameters. The standard docking protocol for rigid and flexible ligand docking consisted of 10 independent runs per ligand, using an initial population of 50 randomly placed individuals, with 2.5 × 10^6 energy evaluations, a maximum of 27000 iterations, a mutation rate of 0.02, a crossover rate of 0.80 and an elitism value of 1. The probability of performing a local search on an individual in the population was 0.06, using a maximum of 300 iterations per local search.
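
As a rough check on the scale of this search, the settings above can be collected in one place and a few derived quantities computed. This is a sketch under stated assumptions: the dictionary keys are descriptive labels chosen here, not AutoDock keywords, and reading "60" as the number of grid intervals per axis (rather than a length in Å) is an interpretation of the text.

```python
GRID_INTERVALS = 60     # per axis; assumed reading of "60 x 60 x 60"
GRID_SPACING = 0.375    # angstroms between neighbouring grid points

# Values as stated in the text; key names are illustrative labels only.
SEARCH_SETTINGS = {
    "runs": 10,
    "population_size": 50,
    "energy_evaluations_per_run": 2_500_000,
    "max_iterations": 27_000,
    "mutation_rate": 0.02,
    "crossover_rate": 0.80,
    "elitism": 1,
    "local_search_probability": 0.06,
    "local_search_max_iterations": 300,
}


def box_edge(intervals, spacing):
    """Edge length of the grid box in angstroms."""
    return intervals * spacing


def box_volume(intervals, spacing):
    """Volume of the cubic grid box in cubic angstroms."""
    return box_edge(intervals, spacing) ** 3


def total_energy_evaluations(settings):
    """Upper bound on energy evaluations summed over all independent runs."""
    return settings["runs"] * settings["energy_evaluations_per_run"]


print(box_edge(GRID_INTERVALS, GRID_SPACING))        # edge in angstroms
print(box_volume(GRID_INTERVALS, GRID_SPACING))      # volume in cubic angstroms
print(total_energy_evaluations(SEARCH_SETTINGS))     # evaluations over 10 runs
```

Under this reading the box spans 22.5 Å per side, which is ample for a ligand as small as carbonate.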

RESULTS AND DISCUSSION
A Blastp search of the 1ZBM protein sequence against the non-redundant database resulted in many hits with varying percentages of identity and similarity. The results are reported in Table 1, where it can be observed that the predicted (NP_110800.1) and periplasmic (ABK77551.1, ZP_02171500.1) solute-binding proteins, S-ribosylhomocysteine lyase (YP_001467584.1) and succinyl-CoA ligase (YP_001409022.1) exhibit the highest similarity (31-45%), while the remaining hits are other hypothetical proteins. Table 1 therefore suggests that 1ZBM might belong to the predicted or periplasmic solute-binding protein, S-ribosylhomocysteine lyase or succinyl-CoA ligase family. Multiple alignments were therefore constructed between each protein family and 1ZBM to clarify the probable function of 1ZBM.

Table 1. Results of Blastp analysis showing biologically similar sequences with 1ZBM.

From the table it is evident that the number of residues identical with 1ZBM varied considerably, with the predicted solute binding protein matching more residues (124/270, 116/258) than the others. Fewer residue-residue identities were observed for S-ribosylhomocysteine lyase (Autoinducer-2 production protein LuxS) (104/280) and succinyl-CoA ligase [ADP-forming] subunit alpha (96/269), whereas a high number of residue-residue similarities with 1ZBM was observed in all cases. Another interesting feature of the alignment table is the number of gaps introduced in each alignment: 22 and 10 gaps for S-ribosylhomocysteine lyase and succinyl-CoA ligase, compared with 2 and 6 gaps for the predicted and periplasmic solute-binding proteins, respectively. These features partially indicate that the function of 1ZBM can be attributed to solute binding proteins. Because S-ribosylhomocysteine lyase and succinyl-CoA ligase still showed weak homology with 1ZBM (51 and 53% similarity), multiple alignments were constructed for the individual family members to confirm whether 1ZBM functions as a solute binding protein or as an S-ribosylhomocysteine lyase or succinyl-CoA ligase.
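
The identity and gap counts discussed above are tallied column by column over a pairwise alignment. The following minimal sketch shows that bookkeeping on an invented pair of aligned strings; it does not reproduce the Blastp scoring itself, and the function name and toy sequences are hypothetical.

```python
def alignment_stats(query, subject):
    """Count identities and gapped columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = 0
    gaps = 0
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1            # column contains a gap in either sequence
        elif q == s:
            identities += 1      # same residue in both sequences
    length = len(query)
    return {
        "identities": identities,
        "gaps": gaps,
        "length": length,
        "percent_identity": 100.0 * identities / length,
    }


# Toy aligned pair, invented for illustration.
stats = alignment_stats("MKT-LLVAG", "MRTALL-AG")
print(stats)
```

A ratio such as 124/270 in the text is read the same way: identical columns over the aligned length.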

Multiple Alignments
The ClustalW2 tool [http://www.ebi.ac.uk/clustalw] was used for multiple alignments. To confirm the probable homologues among solute binding protein, S-ribosylhomocysteine lyase and succinyl-CoA ligase, multiple sequence analyses were carried out separately for each family using related proteins from the Swiss-Prot database [http://www.expasy.org].

1ZBM vs S-ribosylhomocysteine lyase
To test whether 1ZBM belongs to this family, two S-ribosylhomocysteine lyase sequences, A0KG57 and A8FGJ6, were extracted from the Swiss-Prot/TrEMBL database and aligned with YP_001467584.1 and 1ZBM. The multiple alignment (Figure 1) clearly demonstrated that none of the proteins exhibited a considerable match with 1ZBM except YP_001467584.1, the S-ribosylhomocysteine lyase of Campylobacter concisus 13826, a match that might have arisen by chance.

Figure 1. Multiple sequence alignments of 1ZBM and other S-ribosylhomocysteine lyases.

1ZBM vs succinyl-CoA ligase
A similar analysis was conducted for this family by aligning two extracted proteins, P53598 and O28098, with YP_001409022.1 and 1ZBM. It resulted in weak similarities, and the multiple alignment (Figure 2) confirms that the similarity of the 1ZBM sequence to succinyl-CoA ligase is by chance.

Figure 2. Multiple sequence alignments of 1ZBM and other succinyl-CoA ligases.

Following these studies, the data from Table 1 and the multiple alignments of solute binding proteins with 1ZBM (Figure 3) show that the solute binding proteins have a high percentage of residue identity and similarity, and it can be stated with good confidence that 1ZBM belongs to the solute binding family of proteins. To provide a proof of concept that the correlation of 1ZBM with solute binding protein was not by chance, active site detection followed by docking studies was carried out.

1ZBM vs solute binding protein

Figure 3. Multiple sequence alignments of 1ZBM and other solute binding proteins.

Active Site Identification
Active site prediction was then carried out using the CASTp server with a probe radius of 1.4 Å, which revealed that the protein 1ZBM has a total of 39 cavities; the active site was taken to be the cavity containing the largest number of conserved residues [http://www.rcsb.org/pdb]. The CASTp server was first validated by performing cavity detection on 2I48 with its bound ligand, downloaded from the Protein Data Bank [http://sts-fw.bioengr.uic.edu/castp]. As shown in Figure 4, the first cavity detected by the server contained the bound carbonate ligand, which validates the approach. The cavities in the hypothetical protein were therefore scanned using the same probe radius, and the residues in the first cavity are listed in Table 2.

Figure 4. First cavity detected by CASTp server. Left: 2I48 bound carbonate ligand. Right: 1ZBM cavity

Table 2. Amino acid residues of 2I48 and 1ZBM detected from CASTp server.
(* Bold represents identical residues, bold-italics represents conserved residues and underlining represents semi-conserved residues within the proteins used in the multiple alignment of 1ZBM vs. solute binding proteins.)

To confirm that the first pocket detected by CASTp probably represents the active site region of 1ZBM, its residues were compared with the conserved residues identified in the multiple alignments. 1ZBM and the other solute binding proteins share 13 identical residues, 10 conserved substitutions and 5 semi-conserved substitutions (Table 2) within the first cavity, which is populated by a total of 44 residues. Owing to this high similarity between the two proteins in identical and conserved regions, docking studies were initiated by extracting the carbonate ligand from the solute-binding protein 2I48 [Xie, et al., 2007].
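
The pocket counts quoted above translate into fractions of the cavity as follows. This short sketch only restates the arithmetic on the numbers given in the text; the variable and function names are chosen here for illustration.

```python
POCKET_TOTAL = 44  # residues lining the first cavity

# Counts reported for the first cavity (Table 2).
POCKET_COUNTS = {"identical": 13, "conserved": 10, "semi_conserved": 5}


def percent_of_pocket(categories, counts=POCKET_COUNTS, total=POCKET_TOTAL):
    """Percentage of pocket residues falling in the given categories."""
    return 100.0 * sum(counts[name] for name in categories) / total


identical_or_conserved = POCKET_COUNTS["identical"] + POCKET_COUNTS["conserved"]
print(identical_or_conserved)                                     # 23 residues
print(round(percent_of_pocket(["identical", "conserved"]), 1))   # share of the pocket
print(round(percent_of_pocket(POCKET_COUNTS), 1))                # including semi-conserved
```

The 23 residues cited in the abstract are therefore the identical and conserved classes combined, a little over half of the cavity.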

Before docking the carbonate ligand into the 1ZBM binding site, the docking protocol was validated with 2I48. The bound ligand of 2I48 was removed from the active site and docked back into the binding pocket. The binding conformation predicted by AutoDock was compared with the X-ray crystallographic conformation by superposition [Richardson, et al., 2007]. The RMSD over all atoms between these two conformations is 1.07 Å, indicating that the docking parameters are reasonable for reproducing the X-ray crystal structure and can be extended to search for the binding conformations of other ligands.
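
The RMSD used for this validation is the root of the mean squared distance between matching atoms of the two ligand poses. The sketch below assumes both poses are already in the same coordinate frame and that atoms are listed in the same order; the four-atom coordinates are invented and do not come from 2I48.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared_sum = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared_sum += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared_sum / len(coords_a))


# Invented four-atom poses: the "docked" pose is the "crystal" pose shifted by 1 Å along x.
crystal_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
docked_pose = [(x + 1.0, y, z) for (x, y, z) in crystal_pose]

print(rmsd(crystal_pose, docked_pose))
```

No fitting is applied before the comparison, so the value reflects both the placement and the orientation of the redocked ligand.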

Table 3. AutoDock binding energies and interacting residues of 2I48 and 1ZBM.

The binding energies and H-bond interacting residues in Table 3 suggest that the protein 1ZBM belongs to the solute-binding protein family. The CASTp analysis and the multiple alignments of solute-binding proteins also indicate that a majority of the residues within the active site region are conserved in this family of proteins.

CONCLUSION
Predicting the family of a protein of unknown function with bioinformatics tools and other software has been practiced since the era of databases began. The proper use of data resources, together with the advent of analysis tools, has made it possible to analyze biological data computationally. In an attempt to predict the function of the hypothetical protein 1ZBM as a solute binding protein, our first analysis, a Blastp scan against non-redundant databases, revealed homology with solute binding protein (45%), S-ribosylhomocysteine lyase (34%) and succinyl-CoA ligase (33%), and multiple alignments conducted on all similar protein sequences showed the maximum number of residue-residue identities with solute-binding protein. In addition, identification of the active site of 1ZBM using the CASTp server revealed 23 identical or conserved residues out of a total of 44 residues in the first cavity. Docking studies further supported the assignment of 1ZBM to the solute-binding protein family, as it displayed a binding energy of -3.28 kcal/mol compared with -3.99 kcal/mol for 2I48. Hence, the computational analysis and the methodology reported in this work suggest that the unknown protein 1ZBM belongs to the solute binding protein family.

REFERENCES

Richardson, C.M., Nunns, C.L., Williamson, D.S., Parratt, M.J., Dokurno, P., Howes, R., Borgognoni, J., Drysdale, M.J., Finch, H., Hubbard, R.E., Jackson, P.S., Kierstan, P., Lentzen, G., Moore, J.D., Murray, J.B., Simmonite, H., Surgenor, A.E. and Torrance, C.J. (2007).

Weiner, S.J., Kollman, P.A., Case, D.A., Singh, U.C., Ghio, C., Alagona, G., Profeta, S. and Weiner, P. (1984). A new force field for molecular mechanical simulation of nucleic acids and proteins. Journal of the American Chemical Society. 106, 765-784.

Whisstock, J.C. and Lesk, A.M. (2003). Prediction of protein function from protein sequence and structure. Quarterly Reviews of Biophysics. 36, 307-340.

Xie, L. and Bourne, P.E. (2007). A robust and efficient algorithm for the shape description of protein structures and its application in predicting ligand binding sites. BMC Bioinformatics. 8, S4-S9.

Yamada, S., Gotoh, O. and Yamana, H. (2006). Improvement in accuracy of multiple sequence alignment using novel group-to-group sequence alignment algorithm with piecewise linear gap cost. BMC Bioinformatics. 7, 524-541.
