SMAP version 1.0

The SMAP software package is designed for the comparison and similarity search of protein three-dimensional motifs, independent of sequence order. It is based on the work published in:

 1. L. Xie and P.E. Bourne 2008 "Detecting Evolutionary Relationships Across Existing Fold Space, Using Sequence Order Independent Profile-profile Alignments". PNAS, 105(14):5441
 2. L. Xie and P.E. Bourne 2007 "A Robust and Efficient Algorithm for the Shape Description of Protein Structures and Its Application in Predicting Ligand Binding Sites". BMC Bioinformatics, 8(Suppl 4):S9

Version 1.0 includes several improvements over the original algorithms, in both the scoring functions and the estimation of the statistical significance of similarity measurements.
With the current implementation, the software is best suited to concave patches in a protein structure or model, such as ligand binding pockets.

  A) System requirements  
      
     SMAP can be executed on both Windows and Linux operating systems. For Windows users, it is strongly recommended to install Cygwin.
     For most PDB chains, at least 1 GB of RAM should be allocated to the software. Some comparisons need more than 1 GB.
     Java 1.6 or higher is required to run SMAP.


  B) Installation 
 
     Download SMAP V1.0 and uncompress it.
     
     For SMAP_v1_0.zip, use the command:
     >unzip SMAP_v1_0.zip
     
     For SMAP_v1_0.tar.gz, use the command:
     >gunzip -c SMAP_v1_0.tar.gz | tar -xvf -
     
     A directory named SMAP_v1_0 will be created in the installation directory. It contains the following directories and files:
     
     README: this file
     License.txt: the academic license for the SMAP software
     classes: the directory of Java class files for the software
     lib: the directory of external Java libraries imported by the software
     external: the directory containing binaries of external software (PSI-BLAST and qhull) for both Windows and Linux
     conformerUnit: the directory in which the serialized object files of conformer units, needed to run SMAP, are saved. It is empty when downloaded.
     *.bat, *.csh, and *.sh: shell scripts that facilitate running the software
     
  C) How to start? 

     1) Go to the SMAP_v1_0 directory and set the environment variables in the shell script (.csh, .sh, or .bat)

        a. Set SMAPROOT to the path of the directory in which SMAP is installed
          e.g. in smap_comp.bat
	        set SMAPROOT=C:\user\smap 
	      in smap_comp.sh 
	        export SMAPROOT=/home/user/smap

        b. Add the paths of the required libraries to CLASSPATH
          e.g. in smap_comp.bat
	        set CLASSPATH=%CLASSPATH%;%SMAPROOT%\classes;%SMAPROOT%\lib\pdblibs.jar;%SMAPROOT%\lib\pdbormapping.jar;%SMAPROOT%\lib\mbt.jar;%SMAPROOT%\lib\siteormapping.jar
	      in smap_comp.sh 
	        export CLASSPATH=${CLASSPATH}:${SMAPROOT}/classes:${SMAPROOT}/lib/pdblibs.jar:${SMAPROOT}/lib/pdbormapping.jar:${SMAPROOT}/lib/mbt.jar:${SMAPROOT}/lib/siteormapping.jar

        c. Set the maximum Java heap size (-Xmx) on the command line. If the heap is too small, out-of-memory errors may occur.
           e.g. in smap_comp.bat
	         java -Xmx1200M -classpath %CLASSPATH% org.interactome.siteengine.sitesearch.SMAP -templateChain %templChain% -queryChain %queryChain% -output %output% 
	       in smap_comp.sh  
	         java -Xmx1000M -cp ${CLASSPATH} org.interactome.siteengine.sitesearch.SMAP -templateChain $templChain -queryChain $queryChain -output $output
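
        Steps a to c can be combined in one wrapper script. The following is a minimal sketch, not the smap_comp.sh shipped with SMAP: the function names smap_classpath and smap_command are illustrative, and the command is printed rather than run so that it can be checked first. Unlike the example in step b, the sketch does not prepend an existing CLASSPATH.

```sh
#!/bin/sh
# Sketch only: mirrors steps a-c above; the shipped smap_comp.sh may differ.

# Step a: installation directory (example path; adjust to your system).
SMAPROOT=${SMAPROOT:-/home/user/smap}

# Step b: print the classes directory and the four jars, joined with ':'.
smap_classpath() {
    entries="$SMAPROOT/classes"
    for jar in pdblibs pdbormapping mbt siteormapping; do
        entries="$entries:$SMAPROOT/lib/$jar.jar"
    done
    printf '%s\n' "$entries"
}

# Step c: print (do not run) the java command line.
# $1 = maximum heap in MB, $2 = template chain, $3 = query chain, $4 = output file
smap_command() {
    printf 'java -Xmx%sM -cp %s %s -templateChain %s -queryChain %s -output %s\n' \
        "$1" "$(smap_classpath)" org.interactome.siteengine.sitesearch.SMAP \
        "$2" "$3" "$4"
}
```

        For example, "smap_command 1000 1abc_A 2xyz_B result.txt" prints the command for a 1000 MB heap. The chain names 1abc_A and 2xyz_B are placeholders, not real entries.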
		
     2) Modify pdbdefault.properties in the classes directory 
 
        a. Change CONFORMER_UNIT_DIR to the directory in which the serialized objects of conformer units are saved.
           e.g. CONFORMER_UNIT_DIR=/home/user/conformerUnit

        b. Change LOCAL_PDB_DIR to the directory in which PDB XML files are stored.
           e.g. LOCAL_PDB_DIR=/ExternalData/pdb/XML
           If this directory does not exist, SMAP will retrieve the file from the PDB online.
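
        Before the first run it can help to confirm that both directories exist. The following is a minimal sketch and not part of the SMAP distribution: get_property and check_dir_property are hypothetical helper names, and the sketch assumes one KEY=value pair per line.

```sh
# Sketch only: report whether the directories named in a properties file exist.

# get_property FILE KEY: print the value of the first "KEY=value" line,
# with trailing whitespace removed.
get_property() {
    sed -n "s/^$2=//p" "$1" | sed 's/[[:space:]]*$//' | head -n 1
}

# check_dir_property FILE KEY: succeed if the value of KEY is an existing directory.
check_dir_property() {
    dir=$(get_property "$1" "$2")
    if [ -n "$dir" ] && [ -d "$dir" ]; then
        echo "$2 ok: $dir"
    else
        echo "$2 not found: $dir"
        return 1
    fi
}
```

        A typical call would be: check_dir_property "$SMAPROOT/classes/pdbdefault.properties" LOCAL_PDB_DIR. A "not found" result for LOCAL_PDB_DIR is not fatal, since SMAP then retrieves the file online.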
 
     3) Copy the file smapdefault.properties from the classes directory to a file named smap.properties in the directory where you will run the program. Change parameters in smap.properties as needed.
        The default settings can also be used unchanged.

        a. Parameters to segment the structure

           MIN_PL_ATOM_SPHERE_SIZE=20
           This parameter represents the minimum number of atoms involved in one virtual ligand. The default value is 20.
 
           MIN_ATOM_SPHERE_DISTANCE=3.0 
           This parameter represents the minimum distance between two virtual ligands. If the distance between any pair of atoms, one from
           ligand i and one from ligand j, is smaller than MIN_ATOM_SPHERE_DISTANCE, the two virtual ligands are considered to overlap and are merged
           into a single virtual ligand. The default value is 3.0 angstroms.

           MAX_ATOM_SPHERE_RADIUS=5.0 
           This parameter represents the maximum radius of the circumscribed spheres that lie outside the protein boundary but inside the environmental boundary.
           Any sphere with a radius larger than MAX_ATOM_SPHERE_RADIUS is ignored. The default value is 5.0 angstroms when the protein is represented
           by all atoms.

           MIN_PL_CA_SPHERE_SIZE=5 
           This parameter represents the minimum number of CA atoms involved in one virtual ligand. The default value is 5.

           MIN_CA_SPHERE_DISTANCE=5.0 
           This parameter represents the minimum distance between two virtual ligands when the protein is represented only by CA atoms.
           If the distance between any pair of CA atoms, one from ligand i and one from ligand j, is smaller than MIN_CA_SPHERE_DISTANCE, the two virtual
           ligands are considered to overlap and are merged into a single virtual ligand. The default value is 5.0 angstroms.

           MAX_CA_SPHERE_RADIUS=7.5 
           This parameter represents the maximum radius of the circumscribed spheres that lie outside the protein boundary but inside the environmental boundary.
           Any sphere with a radius larger than MAX_CA_SPHERE_RADIUS is ignored. The default value is 7.5 angstroms when the protein is represented
           only by CA atoms.

           MAX_NUM_PL=5 
           This parameter represents the maximum number of virtual ligands in each protein.
                 
        b. Parameters for determination of ligand binding sites
        
           LIGAND_CONTACT_DISTANCE_CUTOFF=5.0
           If the distance between a protein atom and a ligand atom is less than LIGAND_CONTACT_DISTANCE_CUTOFF (in angstroms),
           and the two atoms are not obstructed by other atoms, the protein atom and its residue are considered part of the ligand binding site.
           The default value is 5.0.
                     
        c. Parameters for comparison of two proteins   
    
           LOCAL_SCORE=true
           If this parameter is set to true, SMAP compares the local structural similarity of the query and template structures.

           PRINT_PDB=false
           If this parameter is set to false, SMAP does not write the coordinate file of the template structure superposed on the query structure.

           SUPER_PDB_OUTPUT_DIR=/home/user/SMAP/v2_2/pdb
           This parameter specifies the directory to which the superposed structure is written.

           MATCH_SECONDARY_STRUCTURE=true
           If this parameter is set to true, secondary structure is matched first during alignment.

           TEMPLATE_LIGAND_SITE_ONLY=true
           If this parameter is set to true, only templates with a known ligand are compared. For a template with multiple binding pockets,
           only the pockets with a ligand present in the PDB file are compared.
  
           ASSOCIATE_GRAPH_NODE_FILTER=0.5
           This parameter controls the fraction of nodes removed from the associated graph.
           When the associated graph is built, each node is given a score according to the similarity of the residue pair in that node.
           To save time in the subsequent alignment, some nodes are removed according to their scores.
           0 means all nodes are kept: the alignment is slower but more accurate. 0.5 means about half of the nodes are removed.
           1.0 would mean all nodes are removed, which cannot happen during calculation.
  
           TIMES_RANDOM_SHUFFLE=0
           This parameter sets the number of random shuffles performed to determine the statistical significance of a similarity score between two structures.
           If this parameter is set to 0, a background distribution is used to estimate the significance of the score.
    
           SCORE_MATRIX=McLACHLAN
           This parameter selects the scoring matrix used during alignment. Available matrices are McLACHLAN, AAGroup, BLOSUM45, and MIYATA.
           The default value is McLACHLAN.
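
        Step 3 can be scripted. The following is a minimal sketch under the assumption that each parameter sits on its own KEY=value line. The helper names init_smap_properties and set_property are hypothetical and not part of SMAP.

```sh
# Sketch only: create smap.properties from the defaults and override one key.

# set_property FILE KEY VALUE: rewrite the line "KEY=..." in FILE as "KEY=VALUE".
# A key that is absent from FILE stays absent. VALUE must not contain '|' or '&'.
set_property() {
    sed "s|^$2=.*|$2=$3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# init_smap_properties DEFAULTS_FILE: copy the defaults to ./smap.properties
# unless that file already exists.
init_smap_properties() {
    [ -f ./smap.properties ] || cp "$1" ./smap.properties
}
```

        For example, run init_smap_properties "$SMAPROOT/classes/smapdefault.properties" in the working directory, then set_property ./smap.properties PRINT_PDB true to have the superposed template written out.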
   
     
     4) Run the program
    
        You can now run the program to compare the ligand binding sites of the query and template structures.
        ./smap_comp.sh template_chain query_chain output

        Here template_chain and query_chain are two PDB chains, each specified as [PDB ID]_[Chain ID], and "output" is the name of the file to which the results are written.
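
        To compare one query chain against several templates, the command can be wrapped in a loop. The following is a minimal sketch: the function compare_all, the variable SMAP_CMD, and the output file naming are assumptions made for illustration.

```sh
# Sketch only: compare one query chain against a list of template chains.
SMAP_CMD=${SMAP_CMD:-./smap_comp.sh}

# compare_all QUERY_CHAIN TEMPLATE_LIST OUTPUT_DIR
# TEMPLATE_LIST is a text file with one [PDB ID]_[Chain ID] per line.
compare_all() {
    query=$1
    mkdir -p "$3"
    while read -r template; do
        [ -n "$template" ] || continue            # skip blank lines
        # Argument order follows the usage line above: template, query, output.
        "$SMAP_CMD" "$template" "$query" "$3/${template}_vs_${query}.txt" < /dev/null
    done < "$2"
}
```

        Each comparison writes its own output file, so a failed run does not overwrite earlier results. Standard input is redirected from /dev/null so that the command cannot consume the template list.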
		   


     5) Interpret the results

        SMAP reports the local structural alignment between the detected ligand binding sites of the query and template proteins, together with the p-value, raw score, Tanimoto coefficient, and RMSD.
        It also provides the transformation matrices for the query and template proteins that are used to superimpose the two structures.
         
        The raw score is the profile-profile alignment score between the binding pockets of the two proteins.
        It measures the evolutionary and geometric similarity of the two binding pockets.
         
        The p-value estimates the statistical significance of the raw score by considering the background probability distribution of binding site alignment scores.
         
        The Tanimoto coefficient is a similarity coefficient: the ratio of the overlapping binding sites to the sum of the binding sites of the two proteins.
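
        As a worked illustration, the coefficient can be computed from residue counts. This sketch assumes the standard Tanimoto form, in which shared residues are counted once in the denominator. How SMAP counts binding-site residues is not specified here, so the function tanimoto and its arguments are illustrative only.

```sh
# Sketch only: Tanimoto coefficient from residue counts.
# tanimoto OVERLAP N_QUERY N_TEMPLATE
#   OVERLAP    = number of aligned (shared) binding-site residues
#   N_QUERY    = number of residues in the query binding site
#   N_TEMPLATE = number of residues in the template binding site
# Assumption: the denominator counts the shared residues once
# (N_QUERY + N_TEMPLATE - OVERLAP); SMAP's own counting may differ.
tanimoto() {
    awk -v c="$1" -v a="$2" -v b="$3" 'BEGIN { printf "%.3f\n", c / (a + b - c) }'
}
```

        With 6 shared residues between sites of 10 and 8 residues, "tanimoto 6 10 8" prints 0.500, that is 6 / (10 + 8 - 6).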
         
        RMSD is the root-mean-square deviation between the binding sites of the two proteins.
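
        For readers unfamiliar with the measure, the following sketch computes an RMSD from atom pairs that are already superposed. The input layout (six coordinates per line) and the function name rmsd are assumptions. Which atoms SMAP pairs is not stated here.

```sh
# Sketch only: RMSD over paired atoms that are already superposed.
# Input (stdin): one atom pair per line as "x1 y1 z1 x2 y2 z2" in angstroms.
rmsd() {
    awk 'NF == 6 {
             dx = $1 - $4; dy = $2 - $5; dz = $3 - $6
             sum += dx * dx + dy * dy + dz * dz     # squared distance of this pair
             n++
         }
         END { if (n > 0) printf "%.4f\n", sqrt(sum / n) }'
}
```

        Two pairs that are each 5 angstroms apart, for instance "0 0 0 3 4 0" and "0 0 0 0 0 5", give an RMSD of 5.0000.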
         
        True and false positive matches are best distinguished by the p-value and the Tanimoto coefficient. A low p-value (< 1.0e-3) together with a high Tanimoto coefficient (> 0.5) usually indicates a good chance of biologically meaningful similarity.
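
        The two cutoffs can be applied in a script once the p-value and Tanimoto coefficient have been read from the output file. The following is a minimal sketch. The function name is_likely_match is hypothetical, and extracting the two values is left out because the output layout is not described here.

```sh
# Sketch only: apply the two rule-of-thumb cutoffs given above.
# is_likely_match P_VALUE TANIMOTO: succeeds (exit status 0) when
# P_VALUE < 1.0e-3 and TANIMOTO > 0.5.
is_likely_match() {
    awk -v p="$1" -v t="$2" 'BEGIN { exit !((p + 0) < 1.0e-3 && (t + 0) > 0.5) }'
}
```

        For example, "is_likely_match 2.5e-5 0.62 && echo candidate" prints "candidate", while a p-value of 0.02 with the same coefficient prints nothing. The cutoffs are guidelines, so borderline hits still deserve manual inspection.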
        

      

