Kernel for strings

Abstract:

A new class of kernels for strings is introduced. These kernels can be used by any kernel-based data analysis method, including support vector machines (SVM). They are derived from probabilistic models to integrate biologically relevant information. We show how to compute the kernels corresponding to several classical probabilistic models, and illustrate their use by building an SVM for the problem of predicting the cleavage site of signal peptides from the amino-acid sequence of a protein. At a given false positive rate, this method retrieves up to 47% more true positives than the classical weight matrix method.

Data used

We downloaded from the site ftp://virus.cbs.dtu.dk/pub/signalp three files containing the signal peptide and the first 30 amino acids of mature proteins from various organisms:

EUKSIG.red contains 1011 proteins from eukaryotes,
GRAM-SIG.red contains 266 proteins from Gram-negative prokaryotes,
GRAM+SIG.red contains 141 proteins from Gram-positive prokaryotes.

After concatenation of these three files we obtain the following file:

SIG.red

which contains 1,418 non-redundant sequences. Each protein in this file is described by exactly three lines (we removed a couple of carriage returns from the original files):

The first line contains a description of the protein.
The second line contains the amino-acid sequence (signal peptide + 30 amino acids of the mature protein).
The third line contains one letter per amino acid, indicating whether that amino acid belongs to the signal peptide (S) or to the mature protein (M). The first amino acid of the mature protein is marked with the letter C.

Following is an example of a protein description:

   54 CRFB_HUMAN     24 CORTICOTROPIN-RELEASING FACTOR BINDING PROTEIN PRECURSOR
MSPNFKLQCHFILIFLTALRGESRYLELREAADYDPFLLFSANLKRELAGEQPY
SSSSSSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
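
As an illustration of the three-line format described above, here is a minimal Python sketch that reads such records and locates the cleavage site. It is not part of the distributed scripts; the names Record and parse_records are hypothetical, and the sketch assumes that blank lines can be ignored and that the letter C appears exactly once in the third line.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    """One protein entry (hypothetical helper, not from the original scripts)."""

    description: str
    sequence: str
    annotation: str

    @property
    def cleavage_index(self) -> int:
        """0-based index of the first mature-protein residue (marked 'C')."""
        return self.annotation.index("C")


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per group of three non-empty lines."""
    # Assumption: blank lines carry no information and can be skipped.
    rows = [line.strip() for line in lines if line.strip()]
    if len(rows) % 3 != 0:
        raise ValueError("expected exactly three lines per protein")
    for start in range(0, len(rows), 3):
        description, sequence, annotation = rows[start : start + 3]
        # The annotation must label every amino acid of the sequence.
        if len(sequence) != len(annotation):
            raise ValueError(f"length mismatch in record: {description}")
        yield Record(description, sequence, annotation)
```

Under these assumptions, passing an open file handle on SIG.red to parse_records would yield one Record per protein, and record.sequence[:record.cleavage_index] would be the signal peptide.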

SVM software

We used the support vector machine software mySVM (Version 2.0), publicly available from the page http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM.

Kernel implementation

With this software it is possible to implement the kernel defined in the paper as follows: copy the content of the file user_kernel.cpp to the appropriate location in the file kernel.cpp, which comes with the source of mySVM (the appropriate location is where there is room for a user-defined kernel), and compile the program with that modification.

Parameter file

The results presented in the paper were obtained with the following parameter file (used by the modified software mySVM): param.data

Overall experiment

You should be able to reproduce the experiment presented in the paper by running the following Perl script: kernel.pl. This script executes a series of loops consisting of the following operations:

creation of a random training set and test set;
estimation of the weight matrix on the training set;
computation of the weight matrix scores on the test set;
creation of the training and test sets to be used by mySVM;
training of the SVM on the training set;
test of the SVM on the test set;
computation of ROC curves for the weight matrix and the SVM

and averages the ROC curves over a number of iterations. The results presented in the paper were obtained with the following command:

./kernel.pl 8 2 100 0.8

meaning that we consider windows made of 8 amino acids before and 2 amino acids after the cleavage site, that we iterate the main loop 100 times, and that the training set is made of 80% of the windows. The resulting ROC curves look like the following curves:
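
To make three of the steps in the loop more concrete (window extraction, weight matrix estimation and scoring, and ROC computation), here is a rough Python sketch. It is not a translation of kernel.pl: all function names are hypothetical, and the weight matrix is assumed to be a per-position log-odds score against a uniform background with additive smoothing, which may differ from what the script actually does.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_windows(sequence, cleavage_index, before=8, after=2):
    """Return (window, is_cleavage_site) pairs for every position where a full window fits."""
    pairs = []
    for i in range(before, len(sequence) - after + 1):
        # Position i is the first residue after the candidate cleavage point.
        window = sequence[i - before : i + after]
        pairs.append((window, i == cleavage_index))
    return pairs


def estimate_weight_matrix(positive_windows, pseudocount=1.0):
    """Per-position log-odds scores against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(positive_windows) + pseudocount * len(AMINO_ACIDS)
    matrix = []
    for position in range(len(positive_windows[0])):
        counts = Counter(window[position] for window in positive_windows)
        matrix.append({
            aa: math.log((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return matrix


def score(matrix, window):
    """Sum of per-position scores; letters outside AMINO_ACIDS contribute 0."""
    return sum(column.get(aa, 0.0) for column, aa in zip(matrix, window))


def roc_points(scores, labels):
    """(false positive rate, true positive rate) after each example, best score first."""
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = false_pos = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            true_pos += 1
        else:
            false_pos += 1
        points.append((false_pos / n_neg, true_pos / n_pos))
    return points
```

With before=8 and after=2, each window has 10 amino acids and each protein contributes exactly one positive window. This sketch does not group tied scores when building the ROC curve, and it does not average curves over iterations.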

Contact information

To contact the author, please send an email to Jean-Philippe.Vert@mines.org.

Support vector machine prediction of signal peptide cleavage site using a new class of kernels for strings