Seminar "Introduction to Support Vector Machines (SVM) and applications to bioinformatics"
Protein cellular localization contest
Data
The following files are the data used in the paper by Hua and Sun, and were downloaded from http://www.doe-mbi.ucla.edu/~astrid/astrid.html:
- Cytoplasmic proteins of eukaryotic origin
- Extracellular proteins of eukaryotic origin
- Nuclear proteins of eukaryotic origin
- Mitochondrial proteins of eukaryotic origin
- Cytoplasmic proteins of prokaryotic origin
- Extracellular proteins of prokaryotic origin
- Periplasmic proteins of prokaryotic origin
From these files Park-san built the following example files to train various support vector machines. For each learning problem, a file with all examples was created (*.dat) and randomly split into a training set (*.train) and a test set (*.test) with the Perl script split.pl. The training sets contain 80% of the data; the test sets contain the remaining 20%:
- Eukaryote, cytoplasmic (positive = 1, negative = 2,3,4):
  - eucy.dat (2427 examples)
  - eucy.train (1941 examples)
  - eucy.test (486 examples)
- Eukaryote, extracellular (positive = 2, negative = 1,3,4):
  - euex.dat (2427 examples)
  - euex.train (1941 examples)
  - euex.test (486 examples)
- Eukaryote, nuclear (positive = 3, negative = 1,2,4):
  - eunu.dat (2427 examples)
  - eunu.train (1941 examples)
  - eunu.test (486 examples)
- Eukaryote, mitochondrial (positive = 4, negative = 1,2,3):
  - eumi.dat (2427 examples)
  - eumi.train (1941 examples)
  - eumi.test (486 examples)
- Prokaryote, cytoplasmic (positive = 5, negative = 6,7):
  - procy.dat (997 examples)
  - procy.train (797 examples)
  - procy.test (200 examples)
- Prokaryote, extracellular (positive = 6, negative = 5,7):
  - proex.dat (997 examples)
  - proex.train (797 examples)
  - proex.test (200 examples)
- Prokaryote, periplasmic (positive = 7, negative = 5,6):
  - prope.dat (997 examples)
  - prope.train (797 examples)
  - prope.test (200 examples)
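The one-against-the-rest labeling and the random 80/20 split described above can be sketched in Python as follows. This is only an illustration, not the split.pl script itself, whose source is not shown here. The function names, the fixed seed and the +1/-1 label encoding are assumptions made for the example.

```python
import random

def one_vs_rest(labels, positive):
    """Map class codes to +1 (the positive class) or -1 (all other classes).

    The +1/-1 encoding is an assumption for this sketch.
    """
    return [1 if label == positive else -1 for label in labels]

def split_examples(examples, train_fraction=0.8, seed=0):
    """Shuffle a copy of the examples and cut it into (train, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)  # rounds down
    return shuffled[:n_train], shuffled[n_train:]

# Toy class codes 1-4, standing in for the four eukaryotic locations.
codes = [1, 2, 3, 4, 1, 1, 2, 3, 4, 4]
binary = one_vs_rest(codes, positive=1)
train, test = split_examples(binary)
print(len(train), len(test))
```

With 2427 examples, rounding down gives 1941 training and 486 test examples, and 997 examples give 797 and 200, which matches the file sizes listed above.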
SVM software
We can use any of the following SVM implementations:
So that the same example file format can be used with both programs, Hattori-san wrote a patch for SVM Light.
Example: using mySVM for eukaryote cytoplasmic prediction
- Install mySVM following the instructions on its web page:
    $ mkdir mysvm
    $ cd mysvm
    $ wget http://www-ai.cs.uni-dortmund.de/SOFTWARE/MYSVM/mySVM-latest.tar.gz
    $ tar xvzf mySVM-latest.tar.gz
    $ make
    $ cd ..
  This creates two executables:
  - mysvm/bin/Linux-2.2.18-0vl4.2smp/mysvm to train an SVM
  - mysvm/bin/Linux-2.2.18-0vl4.2smp/predict to predict with an SVM
- Preparing a parameter file: param.dat
- The example files for mySVM need to be modified to include two lines at the beginning:
- Running mySVM:
    $ mysvm/bin/Linux-2.2.18-0vl4.2smp/mysvm data/eucy.mysvm.train data/param.dat
    *** Convergence
    Done training: 10525 iterations.
    Target function: -1356.3503
    ----------------------------------------
    WARNING: The results were obtained using a relaxed epsilon of 0.014405143 on the KKT conditions!
    Average loss : 0.50792432 (loo-estim: 221.53854)
    Avg. loss pos : 0.61668036 (875 occurences)
    Avg. loss neg : 0.42912096 (1040 occurences)
    Mean absolute error : 1.205965
    Mean squared error : 3.2353862
    Support Vectors : 1037
    Bounded SVs : 1003
    min SV: -1
    max SV: 2
    |w| = 1.3819927
    max |x| = 13.536422
    VCdim <= 350.9607
    performance (+estimators):
    Accuracy : 0.77022154 (0.46625451)
    Precision : 0.56809339 (0.2272216)
    Recall : 0.79491833 (0.36660617)
    Predicted values:
       |   +   |   -
    ---+-------+-------
     + |   438 |   113 (true pos)
     - |   333 |  1057 (true neg)
    w[0] = 0.027785219
    w[1] = -0.12550412
    w[2] = 0.18695814
    w[3] = 0.15833423
    w[4] = 0.21961691
    w[5] = 0.12170837
    w[6] = 0.0040019082
    w[7] = 0.21221096
    w[8] = -0.030101022
    w[9] = 0.11343848
    w[10] = 0.050352712
    w[11] = -0.14782239
    w[12] = -0.022585763
    w[13] = -0.083205901
    w[14] = -0.12793245
    w[15] = -0.17766168
    w[16] = 0.0301567
    w[17] = 0.1637128
    w[18] = 0.27190155
    w[19] = -0.046270986
    b = -4.2755588
    Time for learning:
    init : 0s
    optimizer : 2s
    convergence : 1s
    update ws : 10s
    calc ws : 3s
    =============
    all : 26s
    Saving trained SVM to data/eucy.mysvm.train.svm
    mysvm ended successfully.

    $ mysvm/bin/Linux-2.2.18-0vl4.2smp/predict data/eucy.mysvm.train.svm data/param.dat data/eucy.mysvm.test
    *** mySVM version 2.1 ***
    Reading data/eucy.mysvm.train.svm
    read 1941 examples, format sparse, dimension = 20.
    Reading data/param.dat
    Reading data/eucy.mysvm.test
    read 486 examples, format sparse, dimension = 20.
    ----------------------------------------
    Predicting
    Testing examples from file data/eucy.mysvm.test
    Average loss : 0.606319
    Avg. loss pos : 0.651721 (217 occurences)
    Avg. loss neg : 0.578293 (265 occurences)
    Mean absolute error : 1.38074
    Mean squared error : 5.22022
    Accuracy : 0.755144
    Precision : 0.538043
    Recall : 0.744361
    Predicted values:
       |  +   |  -
    ---+------+------
     + |   99 |   34 (true pos)
     - |   85 |  268 (true neg)
    Predicting examples from file data/eucy.mysvm.test
    Prediction saved in file data/eucy.mysvm.test.pred
    mysvm ended successfully.
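The accuracy, precision and recall printed by mySVM can be recomputed from the "Predicted values" table, reading each row as the true class and each column as the predicted class. The short Python sketch below does this for the test-set table; the function name is hypothetical and not part of mySVM.

```python
def metrics(tp, fn, fp, tn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total   # fraction of all examples classified correctly
    precision = tp / (tp + fp)     # fraction of predicted positives that are true positives
    recall = tp / (tp + fn)        # fraction of true positives that were found
    return accuracy, precision, recall

# Counts from the test-set table: row "+" is 99 | 34, row "-" is 85 | 268.
acc, prec, rec = metrics(tp=99, fn=34, fp=85, tn=268)
print(f"accuracy={acc:.6f} precision={prec:.6f} recall={rec:.6f}")
# prints: accuracy=0.755144 precision=0.538043 recall=0.744361
```

The same function applied to the training table (438, 113, 333, 1057) gives the accuracy, precision and recall reported after training, up to rounding.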