Data used
- Phylogenetic profiles for 2,465 genes of the yeast Saccharomyces cerevisiae BLASTed against 24 genomes. The values in this file represent "-log(E-value)". In the paper, each bit of the phylogenetic profile was set to 0 if this number is negative (E-value > 1), 1 otherwise.
- Functional categories (HTML or text) downloaded from the Munich Information Center for Protein Sequences (MIPS) Comprehensive Yeast Genome Database (CYGD)
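The binarization rule described above can be sketched in a few lines. This is only an illustration of the stated rule, not the code used for the paper; the function name `binarize_profile` and the example scores are hypothetical.

```python
def binarize_profile(scores):
    """Turn a list of -log(E-value) scores into a 0/1 phylogenetic profile."""
    # A bit is 0 when the score is negative (E-value > 1), and 1 otherwise.
    return [0 if s < 0 else 1 for s in scores]


# Hypothetical scores for one gene against five genomes (not real data).
example_scores = [12.3, -0.7, 0.0, 45.1, -2.0]
example_bits = binarize_profile(example_scores)
print(example_bits)
```

Note that a score of exactly 0 (E-value equal to 1) maps to 1 under this rule, since only strictly negative values are set to 0.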
Software
The SVM and kernel computations are based on the software SVM Light v.3.50 by Thorsten Joachims. We added the following files:
and modified the file:
After compiling SVM Light with the modified files, you should be able to use it with the option "-t 4" (user defined kernel, see the SVM Light documentation).
Warning: the kernel in kernel.c contains the definition of the Bayesian tree model for the particular example studied in this paper. If you want to use it for another data set, don't forget to modify this file accordingly (or to rewrite it in a cleaner way...).
Useful Perl scripts
- prep.pl: prepares positive and negative example files for SVM Light from the phylogenetic profile file (with -log(E-value) values) and a MIPS category text file
- randsplit.pl: randomly splits a file into two files (line by line)
- test.pl: performs the experiment described in the reference paper, which consists of comparing the performance of an SVM with a linear kernel and with the tree kernel on the prediction of the MIPS categories
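For readers who do not use Perl, the two preparation steps listed above (writing examples in SVM Light format and randomly splitting a file line by line) might look roughly like the following in Python. This is a sketch under assumptions, not a port of prep.pl or randsplit.pl: the function names, the choice to omit zero-valued features, and the 50/50 split probability are all hypothetical.

```python
import random


def to_svmlight_line(label, bits):
    """Format one example as an SVM Light line: '<target> <index>:<value> ...'."""
    target = "+1" if label > 0 else "-1"
    # Feature indices start at 1; zero-valued features are omitted (sparse format).
    features = [f"{i}:{b}" for i, b in enumerate(bits, start=1) if b != 0]
    return " ".join([target] + features)


def random_split(lines, fraction=0.5, seed=None):
    """Assign each line independently to the first list with probability `fraction`."""
    rng = random.Random(seed)
    first, second = [], []
    for line in lines:
        (first if rng.random() < fraction else second).append(line)
    return first, second


# Hypothetical binary profiles with class labels (not real data).
examples = [(1, [1, 0, 1, 1, 0]), (-1, [0, 1, 0, 0, 0]), (1, [1, 1, 1, 0, 0])]
lines = [to_svmlight_line(label, bits) for label, bits in examples]
train, test = random_split(lines, seed=0)
print(len(train), len(test))
```

Because each line is assigned independently, the two parts are only approximately equal in size; a script that needs exact halves would shuffle the lines and cut at the midpoint instead.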
Results
- The directory here contains the ROC curves for each MIPS class.
- The file roc50 contains the ROC50 index for all MIPS classes, sorted by decreasing performance of the linear SVM.
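This page does not spell out how the ROC50 index is computed. A minimal sketch, assuming the common definition (the area under the ROC curve up to the first 50 false positives, normalized to lie between 0 and 1), is given below. The function name `roc_n` and the normalization by `min(n, number of negatives)` are assumptions, and the values in the roc50 file may have been computed differently.

```python
def roc_n(scores, labels, n=50):
    """Area under the ROC curve up to the first n false positives, scaled to [0, 1]."""
    # Rank examples by decreasing score; ties keep their input order (stable sort).
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    total_pos = sum(1 for _, y in ranked if y > 0)
    total_neg = len(ranked) - total_pos
    limit = min(n, total_neg)
    if total_pos == 0 or limit == 0:
        return 0.0
    true_pos = 0
    false_pos = 0
    area = 0
    for _, y in ranked:
        if y > 0:
            true_pos += 1
        else:
            # Each false positive adds the number of positives ranked above it.
            false_pos += 1
            area += true_pos
            if false_pos == limit:
                break
    return area / (limit * total_pos)


# Hypothetical decision values and labels (not real data).
print(roc_n([0.9, 0.8, 0.7, 0.6], [1, -1, 1, -1]))
```

A perfect ranking gives 1.0 and a ranking that places every negative above every positive gives 0.0.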
Reproducing the experiments
To reproduce the experiment on your Linux machine, download the file tree_kernel.tgz, which includes all files described on this page, and run:
% tar xvzf tree_kernel.tgz
% cd tree_kernel/perl/
% ./test.pl
This may take a while. When the experiment is done, check the results in the "result" directory.
Known issues/bugs
The goal of this page is to provide the exact data and code that were used in the publication. Please let me know if you discover a bug; I will list it here without changing the files above.
- There appear to be convergence issues for some MIPS classes in this experiment. This might be because the kernel is not positive definite, due to scaling the original kernel described in the paper by a power smaller than 1. If you observe that SVM Light spends days optimizing a class, consider pressing Control-C; the Perl script will then skip this class and continue with the other classes.
- Sep. 2006: Philip R. Kensche from the Netherlands found a bug in the kernel.c code: in phyloTreeInit() I initialize the tree with nodeNew(0.9, 0.1, ...), but it should be nodeNew(0.1, 0.9, ...) instead. As this is a serious bug, I changed the code of nodeNew() with a comment.
- Feb. 2008: Kyu-Won Kim from Korea noticed an error at line 17 of the prep.pl file: the index 14 appears twice, and one of the two occurrences should be replaced by 19.
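The convergence issue listed above is attributed to a possible loss of positive definiteness. The toy example below illustrates why this can happen, assuming that "scaling by a power" means raising each kernel value to an exponent below 1. The 3x3 matrix is hypothetical and is not taken from the tree kernel; it only shows that an entrywise power smaller than 1 need not preserve positive semidefiniteness.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def elementwise_power(m, p):
    """Raise every entry of the matrix to the power p."""
    return [[x ** p for x in row] for row in m]


a = 1 / math.sqrt(2)
# Symmetric matrix with eigenvalues 0, 1 and 2, hence positive semidefinite.
K = [[1.0, a, 0.0], [a, 1.0, a], [0.0, a, 1.0]]
K_scaled = elementwise_power(K, 0.5)

print(det3(K))         # approximately 0
print(det3(K_scaled))  # negative
```

A negative determinant means the product of the eigenvalues is negative, so `K_scaled` has at least one negative eigenvalue and is not a valid kernel matrix. With such a matrix the SVM optimization problem is no longer convex, which is consistent with the slow convergence described above.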