Machine learning for bioinformatics practical session

This page contains material relevant to the practical sessions of the Machine learning for computational biology module taught by Jean-Philippe Vert at Mines ParisTech.

Prelude

  1. Download R and install it on your machine.
  2. Download RStudio and install it on your machine.
  3. Download this session’s R file and open it in RStudio.
  4. Lecture slides are available on Prof. Vert’s website.

Using R and RStudio

A fair number of quality entry-level R tutorials are available on the internet, most notably the R Project’s official An Introduction to R. For those interested, Google provides the Google R style guide, a set of coding guidelines which also makes a good syntax primer.

One of the best features of the R system is its built-in, high quality help system. When you see a function you don’t know in the code, for example ksvm, you can type ?ksvm at the R prompt to access the help for this function. Help files are comprehensive, clear, and include examples for the most common use-cases.

A related feature of RStudio is tab-completion: when you write R code in RStudio, pressing the Tab key will auto-complete whatever you are typing, or prompt you with possible choices. This means that writing Som will auto-complete to SomeFunction if this function is defined, and SomeFunction( will prompt you with a choice of x=, y=, z=, some.option=.

Linear SVM

The first few lines of the file generate the dataset we will work with in this section. The interesting stuff starts here:

# Plot the data
require( 'ggplot2' )
qplot( data=linear.data, x.1, x.2, colour=factor(y), shape=factor(train) )

This will plot the dataset on the bottom-right panel of the RStudio window. Note that you can display all the plots created in a session by using the arrows at the top of the panel.

Question 1: How can you characterise this dataset?

Next, we train a linear SVM on the training set and plot the model.
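A minimal sketch of what this step looks like with kernlab (the exact code is in the session's R file; the value C=10 used here is an arbitrary assumption):

# Train a linear SVM (vanilladot is the linear kernel) and plot its decision boundary
require( 'kernlab' )
linear.svm <- ksvm( y ~ ., data=linear.train, type='C-svc', kernel='vanilladot',
                    C=10, scale=c() )
plot( linear.svm, data=linear.train )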

Question 2: How does the plot relate to what you learnt during the lecture? Give a qualitative assessment of the quality of the model. Play with the C parameter. Comment.

We now want to quantify the performance of the model. ROC curves and precision-recall curves are widely used tools to visualise the quality of a statistical model.

# Compute ROC and Precision-Recall curves
require( 'ROCR' )
linear.roc.curve <- performance( prediction( linear.prediction.score, linear.test$y ),
                                 measure='tpr', x.measure='fpr' )
plot( linear.roc.curve )

linear.pr.curve <- performance( prediction( linear.prediction.score, linear.test$y ),
                                measure='prec', x.measure='rec' )
plot( linear.pr.curve )
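
If you want to summarise the ROC curve by a single number, ROCR can also compute the area under the curve (AUC); a sketch reusing the same prediction object:

# Area under the ROC curve: 1 is a perfect ranking, 0.5 is a random one
linear.auc <- performance( prediction( linear.prediction.score, linear.test$y ),
                           measure='auc' )
linear.auc@y.values[[1]]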

Question 3: Is this good? Bad? Surprising?

As explained in the lecture, evaluation of the performance of the model can be done using cross-validation.

Question 4: Write a function CrossValidation( data, n.folds=10 ) implementing cross-validation for this model.

Note that the ksvm function can also compute performance by k-fold cross-validation if you add the parameter cross=k to the function call.
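For instance (a sketch; the C value is an arbitrary choice):

# 10-fold cross-validation built into ksvm: the error is stored in the model object
linear.svm.cv <- ksvm( y ~ ., data=linear.train, type='C-svc', kernel='vanilladot',
                       C=10, scale=c(), cross=10 )
cross( linear.svm.cv )   # average cross-validation error rate over the 10 folds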

Question 5: Compare the results of your cross-validation and the ksvm built-in cross-validation.

As explained in the lecture, it is crucial for the performance of the model to choose a good value for C. There is no heuristic for this parameter, so the usual way is to evaluate the performance of a range of models obtained by varying C on a logarithmic scale.

Question 6: Implement this step for C ranging in 2^seq( -10, 10 ). Plot the models and their performance for different values of C; a possible skeleton for this scan is sketched after the next code block.

In RStudio, you can adjust C interactively on a plot of the model:

# Plot the model, interactively adjust C
require( 'manipulate' )
manipulate( plot( ksvm( y ~ ., data=linear.train, type='C-svc', kernel='vanilladot',
                        C=2^c.exponent, scale=c() ), data=linear.train ),
            c.exponent=slider(-10,10) )
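
A possible skeleton for the scan of Question 6, using ksvm's built-in cross-validation (a sketch; the fold count and plotting choices are assumptions):

# Cross-validated error for each value of C on a logarithmic scale
c.values <- 2^seq( -10, 10 )
cv.errors <- sapply( c.values, function( C ) {
    cross( ksvm( y ~ ., data=linear.train, type='C-svc', kernel='vanilladot',
                 C=C, scale=c(), cross=10 ) )
} )
plot( log2( c.values ), cv.errors, type='b', xlab='log2(C)', ylab='cross-validation error' )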

Question 7: Plot a curve of the performance as a function of the parameter C. Read about the bias-variance tradeoff. Explain how it relates to what you just did.

More interesting things happen on a “harder” dataset.

Question 8: Use the GenerateDatasetNonlinear function to generate a new dataset. Plot it, train a linear SVM, evaluate its performance by cross-validation and plot the performance versus C curve. What is happening?

Non-linear SVM

This is where using a kernel becomes useful. Read the lecture slides again to familiarise yourself with the concept.

# Train a nonlinear SVM
nonlinear.svm <- ksvm( y ~ ., data=nonlinear.train, type='C-svc', kernel='rbf',
                       kpar=list(sigma=1), C=100, scale=c() )
plot( nonlinear.svm, data=nonlinear.train )

The Gaussian kernel has an additional parameter, sigma. Both sigma (the kernel parameter) and C (the SVM parameter) influence the model; try several values of each to see their effect. Fortunately, ksvm has a heuristic to find a good value for sigma: to use it, simply omit the kpar=list(sigma=1) part of the function call.
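For example (a sketch; when the kernel is given as 'rbfdot', ksvm's default kpar='automatic' triggers the heuristic):

# Let ksvm estimate sigma from the data; printing the model shows the chosen value
nonlinear.svm.auto <- ksvm( y ~ ., data=nonlinear.train, type='C-svc', kernel='rbfdot',
                            C=100, scale=c() )
nonlinear.svm.auto   # the 'Hyperparameter' line reports the sigma found by the heuristic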

Question 9: Plot the performance versus C curve for the nonlinear SVM with Gaussian kernel. Try other kernels (type ?kernels at the R prompt to see a list of available kernels).

Here again, using RStudio you can plot the model and select both C and the kernel interactively:

# Plot the model, interactively adjust C and the kernel
manipulate( plot( ksvm( y ~ ., data=nonlinear.train, type='C-svc', kernel=k,
                        C=2^c.exponent, scale=c() ), data=nonlinear.train ),
            c.exponent=slider(-10,10),
            k=picker('Gaussian'='rbfdot', 'Linear'='vanilladot', 'Hyperbolic'='tanhdot',
                     'Spline'='splinedot', 'Laplacian'='laplacedot') )

You should obtain images similar to these for a Gaussian kernel when increasing C; level sets represent the decision function, filled points are the support vectors:

Gaussian SVM with C=2^(-6)
Gaussian SVM with C=1
Gaussian SVM with C=2^5

Application: gene expression data analysis

You learnt how to use linear and nonlinear SVMs on example “toy” datasets. The goal of this section is to apply what you learnt to a real-world dataset: a set of gene expression data for 128 patients suffering from acute lymphoblastic leukemia (ALL).

Question 10: Use what you learnt in the previous sections to build an SVM classifier to distinguish between B-cell and T-cell leukemia.
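One possible way to load the data and set up the labels, assuming you use the Bioconductor ALL package (the variable names here are arbitrary):

# Load the ALL gene expression data (Bioconductor) and build the B/T labels
require( 'ALL' )
data( 'ALL' )
all.x <- t( exprs( ALL ) )                                  # 128 patients x ~12625 probes
all.y <- factor( substr( as.character( ALL$BT ), 1, 1 ) )   # 'B' or 'T', stages collapsed
# ksvm also accepts a matrix interface: ksvm( all.x, all.y, type='C-svc', ... )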

Bonus: If you take the disease stage into account, there are more than 2 classes. How would you design a system for multi-class classification from binary classifiers? ksvm implements multi-class classification. Test it on the dataset and comment on the results. It is recommended to try several kernels and to check the influence of the C parameter. Remember that the choice of the C parameter depends on the kernel, and there is no a priori good value.
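
For reference, kernlab's C-svc handles a multi-class response with a one-against-one scheme, so a call like the following is enough (a sketch reusing the hypothetical all.x from above; the C value and fold count are arbitrary):

# Multi-class SVM on the full stage labels ('B1', 'B2', ..., 'T4')
stage.y <- factor( ALL$BT )
stage.svm <- ksvm( all.x, stage.y, type='C-svc', kernel='rbfdot', C=1, cross=5 )
cross( stage.svm )   # cross-validated multi-class error rate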