bio-info, SVM and Graph-kernels
27 May 2016
Introduction
This page is the practical session of the "Support Vector Machines" module taught by Chloé-Agathe Azencott.
Material
The practical session is done using R. For a quick tutorial, follow this reference card and this link.
Practical Session
Required material:
- The lecture slides
- The practical session requires R installed on your machine, with the packages kernlab, ggplot2 and ROCR
- The practical session R script and the datasets file
Part A: Linear dataset.
A.1/A.2) Visualization of the first dataset:
load('linear1.RData')
Three new variables are now in the environment (type ls() to check):
- linear1.train : a matrix of size n1 x 3 (first column is x1, second column is x2 and third column is the output y).
- linear1.test.input : a matrix of size n2 x 2 (same as the train but without the output y).
- linear1.data : the combined dataset (for visualization).
Then visualize the dataset:
require( 'ggplot2' )
qplot( data=linear1.data, x.1, x.2, colour=factor(y),
shape=factor(train) )
Question 1: How can you characterise this dataset?
Given what we learned during the lecture, we train a linear SVM on the training set and use it as our predictor for the test set.
A.3/A.4/A.5) Training the SVM:
### A.3) Train a linear SVM
require( 'kernlab' )
linear1.svm <- ksvm( y ~ ., data=linear1.train, type='C-svc',
                     kernel='vanilladot',
                     C=100, scale=c() )
### A.4) Plot the model
plot( linear1.svm, data=linear1.train )
### A.5) Adding points of test on the graph
points( linear1.test.input[ sample.int(nrow(linear1.test.input),10), ], pch=4 )
Question 2: What are the black points in the figure?
A.6/A.7) Testing on another dataset:
### A.6) Prediction
linear1.prediction <- predict( linear1.svm, linear1.test.input )
### A.7) Look at accuracy
load('linear1Sol.RData')
# contains linear1.test.output
print(paste0('Accuracy: ', floor(100*sum( linear1.prediction == linear1.test.output )/length(linear1.test.output)), '%'))
Let's consider a slightly more complex dataset:
A.9 to A.14) Do the exact same thing as the first part on the linear2 dataset (non-separable dataset).
Question 3: Is accuracy alone a sufficient measure to assess the performance of a model?
Let's discuss a few ways to improve the assessment of the performance:
A.15) Separate positive and negative examples:
### A.15) A confusion matrix gives more information than just accuracy
print('Confusion Matrix: ')
print(table( linear2.prediction, linear2.test.output, dnn=c("prediction","reality") ))
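The confusion matrix also gives per-class rates. As a minimal sketch, the code below builds a table in the same way from made-up toy labels (toy.reality and toy.prediction are hypothetical vectors, not part of the session's data) and reads sensitivity and specificity off it, assuming the positive class is labelled 1 and the negative class -1.

```r
# Toy labels for illustration only (NOT taken from the linear2 dataset)
toy.reality    <- factor( c( 1,  1,  1,  1, -1, -1, -1, -1, -1, -1), levels=c(-1, 1) )
toy.prediction <- factor( c( 1,  1,  1, -1, -1, -1, -1, -1,  1,  1), levels=c(-1, 1) )

toy.confusion <- table( toy.prediction, toy.reality, dnn=c("prediction", "reality") )
print( toy.confusion )

# Rows are predictions, columns are the true classes
tp <- toy.confusion["1",  "1"]    # predicted positive, truly positive
fn <- toy.confusion["-1", "1"]    # predicted negative, truly positive
tn <- toy.confusion["-1", "-1"]   # predicted negative, truly negative
fp <- toy.confusion["1",  "-1"]   # predicted positive, truly negative

toy.sensitivity <- tp / (tp + fn)               # true positive rate
toy.specificity <- tn / (tn + fp)               # true negative rate
toy.accuracy    <- (tp + tn) / sum(toy.confusion)
```

On these toy labels the two rates differ from each other and from the accuracy, which is the kind of information a single accuracy figure hides.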
A.16) ROC curves (see Wikipedia):
linear2.prediction.score <- predict( linear2.svm, linear2.test.input, type='decision' )
require( 'ROCR' )
## ROC
linear2.roc.curve <- performance( prediction( linear2.prediction.score, linear2.test.output ), measure='tpr', x.measure='fpr' )
plot( linear2.roc.curve )
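A ROC curve is often summarised by the area under it (AUC). As a rough sketch in base R, the hypothetical helper toy.auc below computes the AUC from decision scores through the rank (Mann-Whitney) statistic, on made-up toy scores rather than the session's data. ROCR can report the same quantity through performance() with measure='auc'.

```r
# Hypothetical helper: AUC as the probability that a randomly chosen positive
# example gets a higher score than a randomly chosen negative example
toy.auc <- function( score, label, positive=1 ) {
  is.pos <- label == positive
  n.pos  <- sum( is.pos )
  n.neg  <- sum( !is.pos )
  ranks  <- rank( score )   # average ranks, so ties count as one half
  ( sum( ranks[is.pos] ) - n.pos * (n.pos + 1) / 2 ) / ( n.pos * n.neg )
}

# Toy decision scores and labels for illustration only
toy.score <- c( 2.1, 0.4, -0.5, -1.2, 0.9, -2.0 )
toy.label <- c(   1,   1,    1,   -1,  -1,   -1 )

toy.auc.value <- toy.auc( toy.score, toy.label )
print( toy.auc.value )   # 7 of the 9 positive/negative pairs are correctly ordered
```

An AUC of 1 corresponds to a perfect ranking of positives above negatives, and 0.5 to a random ranking.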
Part B: Non-Linear dataset.
This part deals with the 'nonlinear' dataset.
B.1/B.2/B.3) Trying linear SVM on a dataset where it is not appropriate.
Let's try a different kernel.
B.4) RBF Kernel:
nonlinear.svm <- ksvm( y ~ ., data=nonlinear.train, type='C-svc',
                       kernel='rbfdot', kpar=list(sigma=1),
                       C=100, scale=c() )
plot( nonlinear.svm, data=nonlinear.train )
Question 4: Recall what the parameter C is in the SVM. In the following we will see what happens when we vary C.
B.5) Impact of C
# The manipulate package is only available inside RStudio
require('manipulate')
manipulate( plot( ksvm( y ~ ., data=nonlinear.train, type='C-svc',
                        kernel=k, C=2^c.exponent, scale=c() ),
                  data=nonlinear.train ),
            c.exponent=slider(-10,10),
            k=picker('Gaussian'='rbfdot', 'Linear'='vanilladot',
                     'Hyperbolic'='tanhdot','Spline'='splinedot',
                     'Laplacian'='laplacedot') )
B.6) Visualization of the impact of C on the prediction accuracy (Bias-Variance Tradeoff):
### B.6) Bias-Variance Tradeoff
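BiasVarianceTradeoff <- function( dataset, cross=10, c.seq=2^seq(-10, 10), ... ) {
  # For each value of C, train an SVM and record its cross-validation error
  err <- sapply( c.seq, function( c ) {
    # cross() is kernlab's accessor for the cross-validation error;
    # R skips the numeric argument 'cross' when it looks up a function
    cross( ksvm( y ~ ., data=dataset, C=c, cross=cross, ...) )
  })
  return(data.frame( c=c.seq, error=err ))
}
qplot( c, error, data=BiasVarianceTradeoff( nonlinear.train, type='C-svc', kernel='rbfdot' ), geom='line', log='x' )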
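The cross argument of ksvm requests a k-fold cross-validation on the training data, and cross() returns the resulting error estimate. As an illustration of the idea only (this is not kernlab's implementation), the sketch below splits a toy set of observations into folds in base R. All names and sizes (n.obs, k, fold) are assumptions chosen for the example.

```r
set.seed(42)   # reproducible fold assignment
n.obs <- 20    # toy number of observations
k     <- 5     # number of folds

# Assign each observation to one of k folds of equal size, in random order
fold <- sample( rep( seq_len(k), length.out=n.obs ) )

# Each fold is held out once for testing while the others are used for training
held.out.count <- integer( n.obs )
for ( i in seq_len(k) ) {
  test.idx  <- which( fold == i )
  train.idx <- which( fold != i )
  # A model would be trained on train.idx and evaluated on test.idx here
  held.out.count[test.idx] <- held.out.count[test.idx] + 1L
}
```

Averaging the test error over the k folds gives an estimate of the error on unseen data that does not touch the real test set.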
Question 5: How should C be chosen?
Part C: Acute lymphoblastic leukemia dataset.
In this part, we work on a real public dataset of acute lymphoblastic leukemia (ALL) patients. We are interested in classifying leukemia patients into two classes, B-cell ALL vs T-cell ALL, because this classification can have an impact on a patient's prognosis or response to a given treatment.
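# biocLite() was the Bioconductor installer when this session was written;
# recent Bioconductor releases use BiocManager instead
if ( !requireNamespace( 'BiocManager', quietly=TRUE ) ) install.packages( 'BiocManager' )
BiocManager::install( 'ALL' )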
require( 'ALL' )
data( 'ALL' )
# Inspect the data
?ALL
print( ALL ) # Display a summary of the ExpressionSet object
print( summary(pData(ALL)) ) # Display the available data name, type and amount
# Look at the class of each patient
print( ALL$BT )
C.1) Can we train a classifier that separates B-cell leukemia from T-cell leukemia with good accuracy?
## Preprocessing of the data:
all.features <- t(exprs(ALL)) ## matrix of size (n=128 x p=12625) containing gene expression profiles of patients with Acute lymphoblastic leukemia.
all.class_binary <- substr( ALL$BT, 1, 1 ) ## output y (2 classes) (classification of leukemia: B-cell ALL vs T-cell ALL).
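The substr call keeps only the first letter of each subtype label, which collapses the subtypes into the two classes B and T. A small sketch on a hypothetical toy factor (toy.BT is made up for illustration and only mimics the style of the ALL$BT labels):

```r
# Toy subtype labels for illustration only
toy.BT <- factor( c('B', 'B1', 'B2', 'T', 'T1', 'T2') )

# substr() converts the factor to character and keeps the first letter
toy.class_binary <- substr( toy.BT, 1, 1 )
print( table( toy.class_binary ) )

# Classifiers usually expect a factor as output, so one option is to convert back
toy.class_factor <- factor( toy.class_binary )
```

Converting the result to a factor is an assumption about how the labels are used afterwards. The session code above keeps them as a character vector.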
C.2) When we add supplementary clinical information, we have more than 2 classes (B1, B2,...). How can you derive, from what you have learned, an SVM classifier that can predict more than 2 classes?