Seminar "Introduction to Support Vector Machines (SVM) and applications to bioinformatics"

Protein cellular localization contest

Data

The following files are the data used in the paper by Hua and Sun, and were downloaded from http://www.doe-mbi.ucla.edu/~astrid/astrid.html:

  1. Cytoplasmic proteins of eukaryotic origin

  2. Extracellular proteins of eukaryotic origin

  3. Nuclear proteins of eukaryotic origin

  4. Mitochondrial proteins of eukaryotic origin

  5. Cytoplasmic proteins of prokaryotic origin

  6. Extracellular proteins of prokaryotic origin

  7. Periplasmic proteins of prokaryotic origin

From these files Park-san built the following example files to train various support vector machines. For each learning problem, a file with all examples was created (*.dat) and randomly split into a training set (*.train) and a test set (*.test) with the Perl script split.pl. The training sets contain 80% of the data, and the test sets contain the remaining 20%.
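The random split described above can be sketched in Python. This is not the split.pl script itself, whose contents are not shown here. It is only a minimal illustration that assumes each example occupies one line of the *.dat file. The function names, the fixed seed and the file names in the usage note are hypothetical.

```python
import random


def split_examples(lines, train_fraction=0.8, seed=0):
    """Randomly split a list of example lines into (train, test) lists."""
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(lines)         # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def split_file(dat_path, train_path, test_path, train_fraction=0.8, seed=0):
    """Read one example per line from dat_path and write the two subsets."""
    with open(dat_path) as f:
        # Assumption: one example per line; blank lines are skipped.
        lines = [line.rstrip("\n") for line in f if line.strip()]
    train, test = split_examples(lines, train_fraction, seed)
    with open(train_path, "w") as f:
        f.writelines(line + "\n" for line in train)
    with open(test_path, "w") as f:
        f.writelines(line + "\n" for line in test)
    return len(train), len(test)
```

For example, `split_file("problem.dat", "problem.train", "problem.test")` would write about 80% of the lines of a hypothetical problem.dat to problem.train and the rest to problem.test. Every example ends up in exactly one of the two files.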

SVM software

We can use any of the following SVM implementations:

So that the same example file format can be used with both programs, Hattori-san wrote a patch for SVM Light.

Example: using mySVM to predict eukaryotic cytoplasmic proteins