Machine learning for bioinformatics practical session
This page contains material relevant to the practical sessions of the Machine learning for computational biology module taught by Jean-Philippe Vert at Mines ParisTech.
- Download R and install it on your machine.
- Download RStudio and install it on your machine.
- Download this session’s R file and open it in RStudio.
- Lecture slides are available on Prof. Vert’s website.
Using R and RStudio
A fair number of quality entry-level R tutorials are available on the internet, most notably the R Project’s official An Introduction to R. For those interested, Google provides the Google R style guide, a set of coding guidelines which also makes a good syntax primer.
One of the best features of the R system is its built-in, high quality help system. When you see a function you don’t know in the code, for example
ksvm, you can type
?ksvm at the R prompt to access the help for this function. Help files are comprehensive, clear, and include examples for the most common use-cases.
A related feature of RStudio is tab-completion: when you write R code in RStudio, pressing the
Tab key will auto-complete whatever you are writing, or prompt you with possible choices. This means that writing
Som will auto-complete to
SomeFunction if this function is defined, and
SomeFunction( will prompt you a choice of
The first few lines of the file serve the purpose of generating the dataset we will be working with in this section. The interesting stuff starts here:
34 35 36
This will plot the dataset on the bottom-right panel of the RStudio window. Note that you can display all the plots created in a session by using the arrows at the top of the panel.
Question 1: How can you characterise this dataset? Next, we train a linear SVM on the training set and plot the model.
Question 2: How does the plot relate to what you learnt during the lecture? Give a qualitative assessment of the quality of the model. Play with the
C parameter. Comment. We now want to quantify the performance of the model. ROC curves and Precision-recall curves are widely used tools to visualise the quality of a statistical model.
52 53 54 55 56 57 58 59 60
Question 3: Is this good? bad? surprising? As explained in the lecture, evaluation of the performance of the model can be done using cross-validation.
Question 4: Write a function
CrossValidation( data, n.folds=10 ) implementing cross-validation for this model. The
ksvm function can also compute performance by
k-fold cross-validation if you add the parameter
cross=k in the function call.
Question 5: Compare the results of your cross-validation and the
ksvm built-in cross-validation. As explained in the lecture, it is crucial for the performance of the model to choose a good value for
C. There is no heuristic for this parameter, so the usual way is to evaluate the performance of a range of models by varying
C on a logarithmic scale.
Question 6: Implement this step for
C ranging in
2^seq( -10, 10 ). Plot the models and their performance for different values of
C. In RStudio, you can adjust
C interactively on a plot of the model:
61 62 63 64 65
Question 7: Plot a curve of the performance as a function of the parameter
C. Read about the bias-variance tradeoff. Explain how it relates to what you just did. More interesting things happen on a “harder” dataset.
Question 8: Use the
GenerateDatasetNonlinear function to generate a new dataset. Plot it, train a linear SVM, evaluate its performance by cross-validation and plot the performance versus
C curve. What is happening?
This is where using a kernel becomes useful. Read the lecture slides again to familiarise with the concept.
90 91 92 93
The Gaussian kernel has an additional
sigma parameter. Both
sigma (the kernel parameter) and
C (the SVM parameter) have an influence on the model. You can try several values of
C to see the influence of each parameter. Fortunately,
ksvm has a heuristic to find a good value for
sigma; to use it, just omit the
lpar=list(sigma=1) part of the function call.
Question 9: Plot the performance versus
C curve for the nonlinear SVM with Gaussian kernel. Try other kernels (use
?kernels at the R prompt to see a list of available kernels). Here again, using RStudio you can plot the model and select both C and the kernel interactively.
95 96 97 98 99 100
You should obtain images similar to these for a Gaussian kernel when increasing
C ; level sets represent the decision function, filled points are the support vectors:
Application: gene expression data analysis
You learnt how to use linear and nonlinear SVMs on example “toy” datasets. The goal of this section is to use what you learnt on a real-world dataset; namely a set of gene expression data for 128 patients suffering of acute lymphoblastic leukemia (ALL).
Question 10: Use what you learnt in the previous sections to build a SVM classifier to distinguish between B-cell and T-cell leukemia.
Bonus: If you take the disease stage into account, there are more than 2 classes. How would you design a system for multi-class classification from binary classifiers?
ksvm implements multi-class classification. Test it on the dataset and comment the results. It is recommended to use several kernels and check the influence of the
C parameter. Please remember that the choice
C parameter depends on the kernel, and there is no a priori good choice of value.