Machine learning for bioinformatics practical session
This page contains material relevant to the practical sessions of the Machine learning for computational biology module taught by JeanPhilippe Vert at Mines ParisTech.
Prelude
 Download R and install it on your machine.
 Download RStudio and install it on your machine.
 Download this session’s R file and open it in RStudio.
 Lecture slides are available on Prof. Vert’s website.
Using R and RStudio
A fair number of quality entrylevel R tutorials are available on the internet, most notably the R Project’s official An Introduction to R. For those interested, Google provides the Google R style guide, a set of coding guidelines which also makes a good syntax primer.
One of the best features of the R system is its builtin, high quality help system. When you see a function you don’t know in the code, for example ksvm
, you can type ?ksvm
at the R prompt to access the help for this function. Help files are comprehensive, clear, and include examples for the most common usecases.
A related feature of RStudio is tabcompletion: when you write R code in RStudio, pressing the Tab
key will autocomplete whatever you are writing, or prompt you with possible choices. This means that writing Som
will autocomplete to SomeFunction
if this function is defined, and SomeFunction(
will prompt you a choice of x=
, y=
, z=
, some.option=
.
Linear SVM
The first few lines of the file serve the purpose of generating the dataset we will be working with in this section. The interesting stuff starts here:
34 35 36 

This will plot the dataset on the bottomright panel of the RStudio window. Note that you can display all the plots created in a session by using the arrows at the top of the panel.
Question 1: How can you characterise this dataset? Next, we train a linear SVM on the training set and plot the model.
Question 2: How does the plot relate to what you learnt during the lecture? Give a qualitative assessment of the quality of the model. Play with the C
parameter. Comment. We now want to quantify the performance of the model. ROC curves and Precisionrecall curves are widely used tools to visualise the quality of a statistical model.
52 53 54 55 56 57 58 59 60 

Question 3: Is this good? bad? surprising? As explained in the lecture, evaluation of the performance of the model can be done using crossvalidation.
Question 4: Write a function CrossValidation( data, n.folds=10 )
implementing crossvalidation for this model. The ksvm
function can also compute performance by k
fold crossvalidation if you add the parameter cross=k
in the function call.
Question 5: Compare the results of your crossvalidation and the ksvm
builtin crossvalidation. As explained in the lecture, it is crucial for the performance of the model to choose a good value for C
. There is no heuristic for this parameter, so the usual way is to evaluate the performance of a range of models by varying C
on a logarithmic scale.
Question 6: Implement this step for C
ranging in 2^seq( 10, 10 )
. Plot the models and their performance for different values of C
. In RStudio, you can adjust C
interactively on a plot of the model:
61 62 63 64 65 

Question 7: Plot a curve of the performance as a function of the parameter C
. Read about the biasvariance tradeoff. Explain how it relates to what you just did. More interesting things happen on a “harder” dataset.
Question 8: Use the GenerateDatasetNonlinear
function to generate a new dataset. Plot it, train a linear SVM, evaluate its performance by crossvalidation and plot the performance versus C
curve. What is happening?
Nonlinear SVM
This is where using a kernel becomes useful. Read the lecture slides again to familiarise with the concept.
90 91 92 93 

The Gaussian kernel has an additional sigma
parameter. Both sigma
(the kernel parameter) and C
(the SVM parameter) have an influence on the model. You can try several values of sigma
and C
to see the influence of each parameter. Fortunately, ksvm
has a heuristic to find a good value for sigma
; to use it, just omit the lpar=list(sigma=1)
part of the function call.
Question 9: Plot the performance versus C
curve for the nonlinear SVM with Gaussian kernel. Try other kernels (use ?kernels
at the R prompt to see a list of available kernels). Here again, using RStudio you can plot the model and select both C and the kernel interactively.
95 96 97 98 99 100 

You should obtain images similar to these for a Gaussian kernel when increasing C
; level sets represent the decision function, filled points are the support vectors:
Application: gene expression data analysis
You learnt how to use linear and nonlinear SVMs on example “toy” datasets. The goal of this section is to use what you learnt on a realworld dataset; namely a set of gene expression data for 128 patients suffering of acute lymphoblastic leukemia (ALL).
Question 10: Use what you learnt in the previous sections to build a SVM classifier to distinguish between Bcell and Tcell leukemia.
Bonus: If you take the disease stage into account, there are more than 2 classes. How would you design a system for multiclass classification from binary classifiers? ksvm
implements multiclass classification. Test it on the dataset and comment the results. It is recommended to use several kernels and check the influence of the C
parameter. Please remember that the choice C
parameter depends on the kernel, and there is no a priori good choice of value.