Bio-info: SVMs and Graph Kernels

Introduction

This page contains the practical session for the "Support Vector Machines" module taught by Chloé-Agathe Azencott.

Material

The practical session is done in R. If you need a quick refresher, an R reference card or a short R tutorial is a useful companion.

Practical Session

Required material: the .RData dataset files loaded throughout the session.

Part A: Linear dataset.

A.1/A.2) Visualization of the first dataset:

load('linear1.RData')

Three new variables are now in the environment (type ls() to check): linear1.data, linear1.train and linear1.test.input.

Then visualize the dataset:

require( 'ggplot2' )
qplot( data=linear1.data, x.1, x.2, colour=factor(y),
       shape=factor(train) )

Question 1: How would you characterise this dataset?

Given what we learned during the lecture, we train a linear SVM on the training set and use it to predict labels on the test set.

A.3/A.4/A.5) Training the SVM:

### A.3) Train a linear SVM
require( 'kernlab' )
linear1.svm <- ksvm( y ~ ., data=linear1.train, type='C-svc', 
                        kernel='vanilladot',
                        C=100, scale=c() )

### A.4) Plot the model
plot( linear1.svm, data=linear1.train )

### A.5) Overlay a random sample of ten test points on the plot
points( linear1.test.input[ sample.int(nrow(linear1.test.input), 10), ], pch=4 )

Question 2: What are the black points in the figure?

A.6/A.7) Testing on the held-out test set:

### A.6) Prediction
linear1.prediction <- predict( linear1.svm, linear1.test.input )

### A.7) Look at accuracy
load('linear1Sol.RData')
# contains linear1.test.output

accuracy <- sum( linear1.prediction == linear1.test.output ) / length( linear1.test.output )
print( paste0( 'Accuracy: ', floor(100*accuracy), '%' ) )

Let's now consider a slightly more complex dataset:

A.9 to A.14) Repeat the steps of Part A on the linear2 dataset (a non-separable dataset); a sketch of the whole pipeline follows.
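
The sketch below assumes the linear2 files mirror the linear1 naming scheme (linear2.RData providing linear2.data, linear2.train and linear2.test.input, and linear2Sol.RData providing linear2.test.output):

### A.9-A.14) Sketch: same pipeline on the linear2 dataset
load('linear2.RData')
linear2.svm <- ksvm( y ~ ., data=linear2.train, type='C-svc',
                     kernel='vanilladot', C=100, scale=c() )
plot( linear2.svm, data=linear2.train )
linear2.prediction <- predict( linear2.svm, linear2.test.input )
load('linear2Sol.RData')  # assumed to contain linear2.test.output
linear2.accuracy <- sum( linear2.prediction == linear2.test.output ) /
                    length( linear2.test.output )
print( paste0( 'Accuracy: ', floor(100*linear2.accuracy), '%' ) )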

Question 3: Is accuracy alone sufficient to assess the performance of a model?

Let's look at a few ways to assess performance more thoroughly:

A.15) Separate positive and negative examples:

### A.15) A confusion matrix gives more information than accuracy alone
print( 'Confusion matrix:' )
print( table( linear2.prediction, linear2.test.output,
              dnn=c('prediction','reality') ) )
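
Each diagonal entry counts the correct predictions for one class. As a small, label-agnostic sketch (it assumes the prediction and ground-truth factors share the same class labels in the same order), per-class recall can be read off directly:

## Sketch: per-class recall from the confusion matrix
## (assumes both factors share identically ordered class labels)
cm <- table( linear2.prediction, linear2.test.output,
             dnn=c('prediction','reality') )
print( diag(cm) / colSums(cm) )  # fraction of each true class recovered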

A.16) ROC curves (see Wikipedia):

linear2.prediction.score <- predict( linear2.svm, linear2.test.input, type='decision' )

require( 'ROCR' )

## ROC
linear2.pred.object <- prediction( linear2.prediction.score, linear2.test.output )
linear2.roc.curve <- performance( linear2.pred.object, measure='tpr', x.measure='fpr' )
plot( linear2.roc.curve )
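
ROCR can also summarise the whole curve as a single number, the area under the ROC curve (AUC): 1 means a perfect ranking, 0.5 a random one. A quick sketch reusing the prediction object built above:

## Area under the ROC curve
linear2.auc <- performance( linear2.pred.object, measure='auc' )
print( linear2.auc@y.values[[1]] )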

Part B: Non-Linear dataset.

This part deals with the 'nonlinear' dataset.

B.1/B.2/B.3) Trying a linear SVM on a dataset where it is not appropriate; a sketch follows.
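
A sketch of these steps, assuming the data ships as nonlinear.RData with the same layout as the linear datasets (the file name is an assumption):

### B.1-B.3) Sketch: linear SVM on the nonlinear dataset
load('nonlinear.RData')  # file name assumed by analogy with Part A
nonlinear.linear.svm <- ksvm( y ~ ., data=nonlinear.train, type='C-svc',
                              kernel='vanilladot', C=100, scale=c() )
plot( nonlinear.linear.svm, data=nonlinear.train )
# The linear decision boundary cannot separate the two classes.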

Let's try a different kernel.

B.4) RBF Kernel:

nonlinear.svm <- ksvm( y ~ ., data=nonlinear.train, type='C-svc',
                       kernel='rbfdot', kpar=list(sigma=1),
                       C=100, scale=c() )
plot( nonlinear.svm, data=nonlinear.train )

Question 4: Recall what the parameter C of the SVM represents. In what follows, we will see what happens when we vary C.

B.5) Impact of C

require( 'manipulate' )
manipulate( plot( ksvm( y ~ ., data=nonlinear.train, type='C-svc',
                        kernel=k, C=2^c.exponent, scale=c() ),
                  data=nonlinear.train ),
            c.exponent=slider( -10, 10 ),
            k=picker( 'Gaussian'='rbfdot', 'Linear'='vanilladot',
                      'Hyperbolic'='tanhdot', 'Spline'='splinedot',
                      'Laplacian'='laplacedot' ) )

B.6) Visualization of the impact of C on the prediction accuracy (Bias-Variance Tradeoff):

### B.6) Bias-Variance Tradeoff
BiasVarianceTradeoff <- function( dataset, cross=10, c.seq=2^seq(-10, 10), ... ) {
  # For each candidate C, train an SVM with n-fold cross-validation and
  # record its error via kernlab's cross() accessor (R's function lookup
  # skips the numeric argument of the same name).
  err <- sapply( c.seq, function( c ) {
    cross( ksvm( y ~ ., data=dataset, C=c, cross=cross, ... ) )
  })
  return( data.frame( c=c.seq, error=err ) )
}

qplot( c, error, data=BiasVarianceTradeoff( nonlinear.train, type='C-svc', kernel='rbfdot' ), geom='line', log='x' )

Question 5: How should you choose C? One possible approach is sketched below.
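
As a sketch: reuse BiasVarianceTradeoff and keep the value of C with the lowest cross-validation error.

## Sketch: choose C by minimising the cross-validation error
tradeoff <- BiasVarianceTradeoff( nonlinear.train, type='C-svc', kernel='rbfdot' )
best.c <- tradeoff$c[ which.min( tradeoff$error ) ]
print( best.c )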

Part C: Acute lymphoblastic leukemia dataset.

In this part, we work on a real public dataset of acute lymphoblastic leukemia (ALL) patients. We are interested in classifying leukemia patients into two classes, B-cell ALL vs T-cell ALL, because this classification can impact a patient's prognosis and their response to a given treatment.

source( 'http://bioconductor.org/biocLite.R' )
biocLite( 'ALL' )
require( 'ALL' )
data( 'ALL' )

# Inspect the data
?ALL
print( summary(pData(ALL)) )  # summarise the available clinical annotations

# Look at the B/T class of each patient
print( ALL$BT )

C.1) Can we train a classifier that separates B-cell leukemia from T-cell leukemia with good accuracy?

## Preprocessing of the data:
all.features <- t(exprs(ALL))              # input X: (n=128 x p=12625) matrix of gene expression profiles of ALL patients
all.class_binary <- substr( ALL$BT, 1, 1 ) # output y: 'B' vs 'T' (B-cell ALL vs T-cell ALL)
In Parts A and B, the input and output were combined in a single variable (the last column of the data frame was the output). Here the design matrix, i.e. the input X, is stored in the variable "all.features" and the output in "all.class_binary". Adapt the ksvm calls from Parts A and B accordingly: instead of ksvm(y ~ ., data=dataset, ...), call ksvm(x=all.features, y=all.class_binary, ...).
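
For example, a minimal sketch (the linear kernel, C value and number of folds are illustrative choices, not prescribed by the exercise):

## Sketch: linear SVM separating B-cell from T-cell ALL,
## assessed by 5-fold cross-validation (kernel, C and folds are illustrative)
all.svm <- ksvm( x=all.features, y=as.factor(all.class_binary), type='C-svc',
                 kernel='vanilladot', C=100, cross=5, scale=c() )
print( cross(all.svm) )  # cross-validation error estimate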

C.2) When we add supplementary clinical information, we have more than 2 classes (B1, B2, ...). How can you derive, from what you have learned, an SVM classifier that predicts more than 2 classes? A hint is sketched below.
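
As a hint: kernlab's ksvm already handles more than two classes by training one-against-one binary SVMs and combining them by voting, so passing the full factor of labels is enough (the kernel and C value are again illustrative):

## Sketch: multiclass SVM on the full B1/B2/.../T1/... labels
## (ksvm trains pairwise one-against-one binary SVMs under the hood)
all.class_multi <- factor( ALL$BT )
all.multi.svm <- ksvm( x=all.features, y=all.class_multi, type='C-svc',
                       kernel='vanilladot', C=100, scale=c() )
print( all.multi.svm )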