Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings

Proceedings of the Pacific Symposium on Biocomputing 2002, Altman, R.B., Dunker, A.K., Hunter, L., Lauderdale, K. and Klein, T.E. (Eds.), World Scientific, pp. 649-660, 2002.

Jean-Philippe Vert
Bioinformatics Center, Institute for Chemical Research, Kyoto University, Uji, Kyoto 611-0011, Japan


A new class of kernels for strings is introduced. These kernels can be used by any kernel-based data analysis method, including support vector machines (SVM). They are derived from probabilistic models to integrate biologically relevant information. We show how to compute the kernels corresponding to several classical probabilistic models, and illustrate their use by building an SVM for the problem of predicting the cleavage site of signal peptides from the amino-acid sequence of a protein. At a given false positive rate, this method retrieves up to 47% more true positives than the classical weight matrix method.

You can download a copy of this paper in various formats:
compressed postscript or PDF

You can also download the slides I used at PSB 2002.

Data used

We downloaded from the site three files containing the signal peptide and the first 30 amino acids of mature proteins from various organisms:

After concatenation of these three files we obtain the following file:

which contains 1,418 non-redundant sequences. Each protein in this file is described by exactly three lines (we removed a couple of carriage returns from the original files):

Following is an example of a protein description:


SVM software

We used the support vector machine software mySVM (Version 2.0), publicly available from the page

Kernel implementation

With this software it is possible to implement the kernel defined in the paper as follows: copy the content of the file user_kernel.cpp to the appropriate location in the file kernel.cpp, which comes with the source of mySVM (the appropriate location is the place reserved for a user-defined kernel), and compile the program with that modification.

Parameter file

The results presented in the paper were obtained with the following parameter file (used by the modified mySVM software):

Overall experiment

You should be able to reproduce the experiment presented in the paper by running the following Perl script: This script executes a series of loops consisting of the following operations:

  1. creation of a random training set and test set;
  2. estimation of the weight matrix on the training set;
  3. computation of the weight matrix scores on the test set;
  4. creation of the training and test sets to be used by mySVM;
  5. training of the SVM on the training set;
  6. test of the SVM on the test set;
  7. computation of ROC curves for the weight matrix and the SVM

and averages the ROC curves over a number of iterations. The results presented in the paper were obtained with the following command:

./ 8 2 100 0.8

meaning that we consider windows made of 8 amino-acids before and 2 amino-acids after the cleavage site, that we iterate the main loop 100 times, and that the training set is made of 80% of the windows. The resulting ROC curves look like the following curves:

[Figures: ROC curve; ROC curve (zoom)]
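
As a rough illustration of steps 1 to 3 and 7 of the loop described above, here is a minimal Python sketch. It is not the Perl script used for the paper. The function names, the pseudocount, the use of a plain log-probability as the weight matrix score, and the restriction to the 20 standard amino-acid letters are all assumptions made for this example; the defaults (8, 2, 0.8) mirror the command-line arguments quoted above.

```python
import math
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumption: only the 20 standard letters occur


def extract_windows(sequence, cleavage_index, before=8, after=2):
    """Return (window, is_true_site) pairs for every position where a full window fits.

    cleavage_index is the index of the first residue of the mature protein.
    """
    windows = []
    for pos in range(before, len(sequence) - after + 1):
        window = sequence[pos - before:pos + after]
        windows.append((window, pos == cleavage_index))
    return windows


def split_windows(windows, train_fraction=0.8, rng=random):
    """Shuffle the windows and split them into a training and a test set."""
    shuffled = list(windows)
    rng.shuffle(shuffled)
    cut = int(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


def estimate_weight_matrix(positive_windows, pseudocount=1.0):
    """Estimate per-position log-frequencies from windows around true cleavage sites."""
    length = len(positive_windows[0])
    total = len(positive_windows) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for i in range(length):
        counts = Counter(w[i] for w in positive_windows)
        matrix.append({a: math.log((counts[a] + pseudocount) / total)
                       for a in AMINO_ACIDS})
    return matrix


def score_window(matrix, window):
    """Sum the per-position log-frequencies of the residues in the window."""
    return sum(column[residue] for column, residue in zip(matrix, window))


def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points, best score first.

    Ties between scores are not given special treatment in this sketch.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points
```

In one iteration of the loop, the windows of all proteins would be pooled and split with `split_windows`, the matrix estimated from the true-site windows of the training part, and `roc_points` applied to the scores of the test part. Averaging the resulting curves over the iterations, and the SVM steps 4 to 6, are left out of this sketch.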

Contact information

To contact the author please send a mail to

Last modified: September 5, 2001