Data used
We downloaded three files from the site ftp://virus.cbs.dtu.dk/pub/signalp.
Each file contains, for proteins from various organisms, the signal peptide
and the first 30 amino acids of the mature protein:
- EUKSIG.red contains 1011 proteins from eukaryotes,
- GRAM-SIG.red contains 266 proteins from Gram-negative prokaryotes,
- GRAM+SIG.red contains 141 proteins from Gram-positive prokaryotes.
After concatenation of these three files we obtain a single file which contains 1,418 non-redundant sequences. Each protein in this file is described by exactly three lines (we removed a couple of carriage returns from the original files):
- The first line contains a description of the protein.
- The second line contains the amino-acid sequence (signal peptide + 30 amino acids of the mature protein).
- The third line contains one letter per amino acid, indicating whether that amino acid belongs to the signal peptide (S) or to the mature protein (M). The first amino acid of the mature protein is marked with the letter C.
Following is an example of a protein description:
54 CRFB_HUMAN 24 CORTICOTROPIN-RELEASING FACTOR BINDING PROTEIN PRECURSOR
MSPNFKLQCHFILIFLTALRGESRYLELREAADYDPFLLFSANLKRELAGEQPY
SSSSSSSSSSSSSSSSSSSSSSSSCMMMMMMMMMMMMMMMMMMMMMMMMMMMMM
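The three-line record layout described above can be read with a few lines of code. The following Python sketch is only an illustration and is not part of the original distribution: the names `Record`, `parse_records` and `cleavage_index` are hypothetical, and the sketch assumes that every record occupies exactly three non-empty lines, as stated above.

```python
from typing import Iterable, Iterator, NamedTuple


class Record(NamedTuple):
    description: str  # first line: description of the protein
    sequence: str     # second line: signal peptide + 30 mature residues
    annotation: str   # third line: one of S, C, M per residue


def parse_records(lines: Iterable[str]) -> Iterator[Record]:
    """Yield one Record per group of three non-empty lines."""
    # Strip line endings and surrounding blanks, then drop empty lines.
    cleaned = [line.strip() for line in lines]
    cleaned = [line for line in cleaned if line]
    if len(cleaned) % 3 != 0:
        raise ValueError("expected a multiple of three non-empty lines")
    for i in range(0, len(cleaned), 3):
        description, sequence, annotation = cleaned[i:i + 3]
        if len(sequence) != len(annotation):
            raise ValueError(f"length mismatch in record: {description}")
        yield Record(description, sequence, annotation)


def cleavage_index(record: Record) -> int:
    """0-based index of the first mature residue (the one marked C)."""
    return record.annotation.index("C")


if __name__ == "__main__":
    # The example record from the text; the annotation is written
    # compactly here as 24 S, one C and 29 M.
    example = [
        "54 CRFB_HUMAN 24 CORTICOTROPIN-RELEASING FACTOR BINDING PROTEIN PRECURSOR",
        "MSPNFKLQCHFILIFLTALRGESRYLELREAADYDPFLLFSANLKRELAGEQPY",
        "S" * 24 + "C" + "M" * 29,
    ]
    for rec in parse_records(example):
        print(rec.description, cleavage_index(rec))
```

To process the concatenated file, one would pass an open file handle to `parse_records` in place of the list used here.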
SVM software
We used the support vector machine software mySVM (Version 2.0), publicly available from the page http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM.
Kernel implementation
With this software it is possible to implement the kernel defined in the
paper as follows: copy the content of the file user_kernel.cpp to the
appropriate location in the file kernel.cpp, which comes with the source of
mySVM (the appropriate location is where there is room for a user-defined
kernel), and compile the program with that modification.
Parameter file
The results presented in the paper were obtained with the following
parameter file (used by the modified mySVM software): param.data
Overall experiment
You should be able to reproduce the experiment presented in the paper
by running the following Perl script: kernel.pl. This script executes a
series of loops, each consisting of the following operations:
- creation of a random training set and test set;
- estimation of the weight matrix on the training set;
- computation of the weight matrix scores on the test set;
- creation of the training and test sets to be used by mySVM;
- training of the SVM on the training set;
- test of the SVM on the test set;
- computation of ROC curves for the weight matrix and the SVM.
The script then averages the ROC curves over all iterations. The results presented in the paper were obtained with the following command:
./kernel.pl 8 2 100 0.8
This means that each window is made of 8 amino acids before and 2 amino acids after the cleavage site, that the main loop is iterated 100 times, and that the training set is made of 80% of the windows. The resulting ROC curves look like the following curves:
[Figure: averaged ROC curves for the weight matrix and the SVM]
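To make the window and split parameters concrete, here is a minimal Python sketch of how windows of 8 amino acids before and 2 after a candidate position could be extracted and then split 80%/20% at random. It is not the code of kernel.pl: the function names are hypothetical, and the labelling rule (a window is positive when it is centered on the position marked C, negative otherwise) is an assumption made for illustration.

```python
import random


def extract_windows(sequence, annotation, before=8, after=2):
    """Return (window, label) pairs for every position where a full window fits.

    The window at position i is sequence[i - before : i + after].
    Assumption: the label is 1 when position i is the first mature residue
    (annotation[i] == "C") and 0 otherwise.
    """
    windows = []
    for i in range(before, len(sequence) - after + 1):
        window = sequence[i - before:i + after]
        is_site = i < len(annotation) and annotation[i] == "C"
        windows.append((window, 1 if is_site else 0))
    return windows


def split_windows(windows, train_fraction=0.8, seed=None):
    """Shuffle the windows and split them into (training, test) lists."""
    rng = random.Random(seed)
    shuffled = list(windows)  # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    cut = round(train_fraction * len(shuffled))
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Toy sequence with the cleavage site at index 9 (hypothetical data).
    windows = extract_windows("ABCDEFGHIJKLMNO", "SSSSSSSSSCMMMMM")
    train, test = split_windows(windows, train_fraction=0.8, seed=0)
    print(len(windows), len(train), len(test))
```

Under these assumptions each protein yields one positive window and many negative ones, so the two classes are strongly unbalanced, which is one reason ROC curves are a natural way to compare the two methods.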
Contact information
To contact the author, please send an e-mail to Jean-Philippe.Vert@mines.org.