Machine learning for bioinformatics practical session
This page contains material relevant to the practical sessions of the Machine learning for computational biology module taught by Jean-Philippe Vert at Mines ParisTech.
Prelude
- Download R and install it on your machine.
- Download RStudio and install it on your machine.
- Download this session’s R file and open it in RStudio.
- Lecture slides are available on Prof. Vert’s website.
Using R and RStudio
A fair number of quality entry-level R tutorials are available on the internet, most notably the R Project’s official An Introduction to R. For those interested, Google provides the Google R style guide, a set of coding guidelines which also makes a good syntax primer.
One of the best features of the R system is its built-in, high-quality help system. When you see a function you don’t know in the code, for example ksvm, you can type ?ksvm at the R prompt to access the help for this function. Help files are comprehensive, clear, and include examples for the most common use cases.
A related feature of RStudio is tab completion: when you write R code in RStudio, pressing the Tab key will auto-complete whatever you are writing, or prompt you with possible choices. This means that writing Som will auto-complete to SomeFunction if this function is defined, and SomeFunction( will prompt you with a choice of arguments such as x=, y=, z=, some.option=.
Linear SVM
The first few lines of the file serve the purpose of generating the dataset we will be working with in this section. The interesting stuff starts here:
[Code listing: lines 34 to 36 of the session’s R file]
This will plot the dataset on the bottom-right panel of the RStudio window. Note that you can display all the plots created in a session by using the arrows at the top of the panel.
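For readers without the session file at hand, here is a minimal sketch of what generating and plotting a two-class toy dataset can look like in base R. The function name MakeToyData, the Gaussian clouds and all parameter values are assumptions made for illustration; they are not taken from the session's code.

```r
# Hypothetical stand-in for the dataset-generation code of the session file.
MakeToyData <- function(n = 100, shift = 1.5, seed = 1) {
  set.seed(seed)
  n.pos <- n %/% 2
  n.neg <- n - n.pos
  # Two Gaussian clouds in the plane, centred at (shift, shift) and (-shift, -shift).
  x <- rbind(matrix(rnorm(2 * n.pos, mean = shift), ncol = 2),
             matrix(rnorm(2 * n.neg, mean = -shift), ncol = 2))
  y <- factor(c(rep(1, n.pos), rep(-1, n.neg)), levels = c(-1, 1))
  list(x = x, y = y)
}

toy <- MakeToyData()

# One colour and one symbol per class.
plot(toy$x,
     col = ifelse(toy$y == 1, "red", "blue"),
     pch = ifelse(toy$y == 1, 1, 2),
     xlab = "x1", ylab = "x2", main = "Toy dataset")
legend("topleft", legend = c("y = +1", "y = -1"),
       col = c("red", "blue"), pch = c(1, 2))
```

Changing shift moves the two clouds closer together or further apart, which is a quick way to see how class overlap affects the questions below.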
Question 1: How can you characterise this dataset?
Next, we train a linear SVM on the training set and plot the model.
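Training and plotting a linear SVM with kernlab might look like the sketch below. It is not the session's own listing: the toy data, the 80/20 split and the value C = 1 are assumptions chosen for illustration. The ksvm arguments used (type, kernel, kpar, C, scaled) are those documented in ?ksvm.

```r
library(kernlab)

set.seed(1)
# Toy data: two Gaussian clouds (an assumption, not the session's dataset).
x <- rbind(matrix(rnorm(100, mean = 1.5), ncol = 2),
           matrix(rnorm(100, mean = -1.5), ncol = 2))
y <- factor(rep(c(1, -1), each = 50))

# Hold out 20 of the 100 points for testing.
test.idx <- sample(nrow(x), size = 20)
x.train <- x[-test.idx, ]
y.train <- y[-test.idx]
x.test <- x[test.idx, ]
y.test <- y[test.idx]

# Linear SVM: the "vanilladot" kernel is the plain dot product.
model <- ksvm(x.train, y.train, type = "C-svc",
              kernel = "vanilladot", kpar = list(), C = 1, scaled = FALSE)

# Decision function and support vectors on the training set.
plot(model, data = x.train)

# Fraction of held-out points that are misclassified.
test.error <- mean(predict(model, x.test) != y.test)
print(test.error)
```

Passing kpar = list() avoids the message ksvm prints when it sets default kernel parameters, and scaled = FALSE keeps the plot in the original coordinates.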
Question 2: How does the plot relate to what you learnt during the lecture? Give a qualitative assessment of the quality of the model. Play with the C parameter. Comment.
We now want to quantify the performance of the model. ROC curves and precision-recall curves are widely used tools to visualise the quality of a statistical model.
[Code listing: lines 52 to 60 of the session’s R file]
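To make the idea of a ROC curve concrete, the following base R sketch computes the curve and the area under it from real-valued scores. The helper name ComputeRoc and the six made-up scores are assumptions for illustration; the session's listing may well rely on a dedicated package instead. The sketch assumes higher scores mean "more positive" and ignores tied scores.

```r
# Hypothetical helper: ROC points and AUC from real-valued scores.
# labels: TRUE for the positive class, FALSE for the negative class.
ComputeRoc <- function(scores, labels) {
  ord <- order(scores, decreasing = TRUE)
  labels <- labels[ord]
  # Lower the threshold one point at a time and count what lies above it.
  tpr <- c(0, cumsum(labels) / sum(labels))
  fpr <- c(0, cumsum(!labels) / sum(!labels))
  # Area under the curve by the trapezoidal rule.
  auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
  list(fpr = fpr, tpr = tpr, auc = auc)
}

# Made-up decision scores for six test points, three of each class.
scores <- c(2.1, 1.3, 0.4, -0.2, -1.0, -1.7)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
roc <- ComputeRoc(scores, labels)

plot(roc$fpr, roc$tpr, type = "s",
     xlab = "False positive rate", ylab = "True positive rate",
     main = "ROC curve")
abline(0, 1, lty = 2)  # a random classifier lies on the diagonal
print(roc$auc)         # 8/9: eight of the nine positive/negative pairs are ranked correctly
```

With an SVM, the scores would typically be the decision values returned by predict(model, x.test, type = "decision"); check which class the positive sign corresponds to before interpreting the curve.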
Question 3: Is this good? Bad? Surprising?
As explained in the lecture, evaluation of the performance of the model can be done using cross-validation.
Question 4: Write a function CrossValidation(data, n.folds=10) implementing cross-validation for this model.
The ksvm function can also compute performance by k-fold cross-validation if you add the parameter cross=k in the function call.
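As a rough illustration of the built-in option, the sketch below fits a linear SVM with cross = 5 and reads the estimate back with cross(). The toy data and the choice of 5 folds are assumptions; the last lines show one common way to assign points to folds, which may serve as a starting point for a hand-written version.

```r
library(kernlab)

set.seed(1)
# Toy data: two overlapping Gaussian clouds (an assumption for illustration).
x <- rbind(matrix(rnorm(100, mean = 1), ncol = 2),
           matrix(rnorm(100, mean = -1), ncol = 2))
y <- factor(rep(c(1, -1), each = 50))

# cross = 5 asks ksvm for 5-fold cross-validation on the data it is given.
model <- ksvm(x, y, type = "C-svc", kernel = "vanilladot",
              kpar = list(), C = 1, cross = 5)

cv.error <- cross(model)     # cross-validation error estimate
train.error <- error(model)  # training error, for comparison
print(c(cv = cv.error, train = train.error))

# For a hand-written version: shuffle fold labels 1..5 across the points.
folds <- sample(rep(1:5, length.out = nrow(x)))
print(table(folds))
```

Because the folds are drawn at random, the built-in estimate changes slightly from run to run unless the seed is fixed.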
Question 5: Compare the results of your cross-validation and the ksvm built-in cross-validation.
As explained in the lecture, it is crucial for the performance of the model to choose a good value for C. There is no heuristic for this parameter, so the usual way is to evaluate the performance of a range of models by varying C on a logarithmic scale.
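One possible shape for such a scan is sketched below, using the built-in cross-validation of ksvm as the performance measure. The toy data, the 5 folds and the coarser grid (every second power of two) are assumptions made to keep the example quick; they are not prescribed by the session.

```r
library(kernlab)

set.seed(1)
# Toy data: two overlapping Gaussian clouds (an assumption for illustration).
x <- rbind(matrix(rnorm(100, mean = 1), ncol = 2),
           matrix(rnorm(100, mean = -1), ncol = 2))
y <- factor(rep(c(1, -1), each = 50))

# Candidate values of C, evenly spaced on a log2 scale.
c.values <- 2^seq(-10, 10, by = 2)

# 5-fold cross-validation error for each candidate.
cv.errors <- sapply(c.values, function(c.value) {
  model <- ksvm(x, y, type = "C-svc", kernel = "vanilladot",
                kpar = list(), C = c.value, cross = 5)
  cross(model)
})

# Log-scaled horizontal axis, so the candidates appear evenly spaced.
plot(c.values, cv.errors, log = "x", type = "b",
     xlab = "C", ylab = "5-fold cross-validation error")

best.c <- c.values[which.min(cv.errors)]
print(best.c)
```

The curve is noisy on a dataset this small, so the minimum should be read as a region of reasonable values rather than as a single optimal C.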
Question 6: Implement this step for C ranging in 2^seq(-10, 10). Plot the models and their performance for different values of C.
In RStudio, you can adjust C interactively on a plot of the model:
[Code listing: lines 61 to 65 of the session’s R file]
Question 7: Plot a curve of the performance as a function of the parameter C. Read about the bias-variance tradeoff. Explain how it relates to what you just did.
More interesting things happen on a “harder” dataset.
Question 8: Use the GenerateDatasetNonlinear function to generate a new dataset. Plot it, train a linear SVM, evaluate its performance by cross-validation and plot the performance versus C curve. What is happening?
Non-linear SVM
This is where using a kernel becomes useful. Read the lecture slides again to familiarise yourself with the concept.
[Code listing: lines 90 to 93 of the session’s R file]
The Gaussian kernel has an additional sigma parameter. Both sigma (the kernel parameter) and C (the SVM parameter) have an influence on the model. You can try several values of sigma and C to see the influence of each parameter. Fortunately, ksvm has a heuristic to find a good value for sigma; to use it, just omit the kpar=list(sigma=1) part of the function call.
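The difference between a fixed and an automatically chosen sigma can be sketched as follows. The ring-shaped toy data and the value C = 1 are assumptions for illustration; kpar, kernelf and error are kernlab functions described in the package help.

```r
library(kernlab)

set.seed(1)
# Toy data that no straight line can separate: an inner disc (class 1)
# surrounded by an outer ring (class -1). An assumption for illustration.
n <- 200
radius <- c(runif(n / 2, 0, 1), runif(n / 2, 1.5, 2.5))
angle <- runif(n, 0, 2 * pi)
x <- cbind(radius * cos(angle), radius * sin(angle))
y <- factor(rep(c(1, -1), each = n / 2))

# Gaussian kernel with an explicit sigma, passed through kpar.
model.fixed <- ksvm(x, y, type = "C-svc", kernel = "rbfdot",
                    kpar = list(sigma = 1), C = 1)

# Same call without kpar: ksvm falls back to its automatic estimate of sigma.
model.auto <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = 1)

# The value that was actually used is stored with the kernel function.
sigma.auto <- kpar(kernelf(model.auto))$sigma
print(c(sigma.auto = sigma.auto,
        error.fixed = error(model.fixed),
        error.auto = error(model.auto)))
```

The automatic estimate is based on a random subsample of the data, so sigma.auto varies between runs unless the seed is fixed.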
Question 9: Plot the performance versus C curve for the nonlinear SVM with Gaussian kernel. Try other kernels (use ?kernels at the R prompt to see a list of available kernels).
Here again, using RStudio you can plot the model and select both C and the kernel interactively.
[Code listing: lines 95 to 100 of the session’s R file]
You should obtain images similar to these for a Gaussian kernel when increasing C; level sets represent the decision function, filled points are the support vectors:
Application: gene expression data analysis
You have learnt how to use linear and nonlinear SVMs on “toy” example datasets. The goal of this section is to apply what you learnt to a real-world dataset, namely a set of gene expression data for 128 patients suffering from acute lymphoblastic leukemia (ALL).
Question 10: Use what you learnt in the previous sections to build an SVM classifier to distinguish between B-cell and T-cell leukemia.
Bonus: If you take the disease stage into account, there are more than 2 classes. How would you design a system for multi-class classification from binary classifiers? ksvm implements multi-class classification. Test it on the dataset and comment on the results.
It is recommended to use several kernels and check the influence of the C parameter. Please remember that the choice of the C parameter depends on the kernel, and there is no a priori good choice of value.
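As a small illustration of the multi-class support in ksvm, the sketch below uses R's built-in iris data as a stand-in for the leukemia data, which is not bundled with base R. The dataset, the Gaussian kernel and C = 1 are assumptions for illustration only. When the response is a factor with more than two levels, ksvm combines binary classifiers internally (the kernlab help describes a one-against-one scheme for C-svc).

```r
library(kernlab)

set.seed(1)
# Stand-in three-class problem (an assumption, not the leukemia data).
data(iris)
x <- as.matrix(iris[, 1:4])
y <- iris$Species  # factor with three levels

# Same call as in the binary case; cross = 5 adds 5-fold cross-validation.
model <- ksvm(x, y, type = "C-svc", kernel = "rbfdot", C = 1, cross = 5)

# Confusion matrix on the training data, and cross-validation error.
confusion <- table(predicted = predict(model, x), actual = y)
print(confusion)
cv.error <- cross(model)
print(cv.error)
```

The off-diagonal entries of the confusion matrix show which pairs of classes are hardest to tell apart, which is a useful starting point for commenting on the results.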