Machine learning in computational biology

Jean-Philippe Vert

ENSAE
Spring 2012

Modern technologies like DNA microarrays or high-throughput sequencing are revolutionising biology and medical research. By allowing the collection of large numbers of measurements at the molecular level on living organisms, they pave the way to a quantitative and rational analysis of biological systems. Unsurprisingly, statistics and machine learning play an important role in this revolution. By processing large collections of datasets, they make it possible to extract new biological knowledge and infer predictive models.

The goal of this course is to present a few modern statistical learning techniques, and to touch upon a selection of applications in computational and systems biology. We will study in particular support vector machines (SVM) and kernels, as well as feature selection techniques including lasso regression. Applications include protein annotation, virtual screening in drug design, prognostic and predictive models for personalised medicine in oncology, and gene network inference in systems biology.

Slides

Schedule

Date	Morning (11h-13h)	Afternoon (13h-15h)
Friday March 2, 2012	Introduction, SVM	Kernels
Friday March 9, 2012	String kernels, protein classification	String kernels
Friday March 16, 2012	Graph kernels, virtual screening	Feature selection
Friday April 6, 2012	Diagnosis, prognosis	Network inference

Project

We have collected gene expression levels for 4654 genes on 184 early-stage breast cancer samples: xtrain.txt (each row is a gene, each column a sample). After surgical removal of the tumour, some patients unfortunately relapsed within 5 years (label=+1), while others did not (label=-1). The labels of the 184 samples are available in the file ytrain.txt.

Propose and test different techniques to predict relapse from gene expression data. Check the effect of the parameters and estimate the performance of each technique.
Make a prediction of relapse for the following 92 samples: xtest.txt. You should send me a file with 92 lines, each line containing a score which should be large if you think the corresponding sample has label +1 (relapse), small otherwise.
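
As a rough illustration of the expected output format and of one possible way to estimate performance, the sketch below computes the area under the ROC curve (AUC) of a vector of scores against +1/-1 labels, and writes scores one per line. It is written in Python for brevity, although the practical sessions use R. The choice of AUC as a criterion, the function names and the toy values are assumptions made for this example, not requirements of the project.

```python
def auc(scores, labels):
    """Area under the ROC curve: fraction of (relapse, no-relapse) pairs
    in which the relapse sample gets the larger score (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def format_predictions(scores):
    """One score per line, in the same order as the test samples."""
    return "".join(f"{s:.6f}\n" for s in scores)


# Toy values (hypothetical, not taken from the course data).
toy_scores = [0.9, 0.2, 0.7, 0.4]
toy_labels = [1, -1, 1, -1]
toy_auc = auc(toy_scores, toy_labels)
toy_file = format_predictions(toy_scores)
```

In practice the AUC would be computed on held-out samples (for example by cross-validation on the 184 training samples), never on the samples used to fit the model.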

Please send your report and predictions to jean-philippe.vert@mines.org before April 6, 10am.

Practical sessions

Prerequisites

The practical sessions require the following free software:

R, with packages kernlab, lars, bioDist.
Cytoscape, to visualize networks

P0: R basics

Goal: learn basic R commands. R is the language we will use for all other practical sessions.
R reference card
R tutorial

P1: SVM and kernel methods basics

Goal: learn and manipulate SVM and kernel PCA, understand how they work on simple data, play with kernels and parameters.
Application: cancer diagnosis from gene expression data.
Notes
Useful R functions
Code

P2: Using your own kernels

Goal: Use precomputed kernels, and define your own kernels.
Notes
Code
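
To give an idea of what "precomputed kernel" means, here is a minimal sketch that evaluates a user-defined kernel on every pair of points and stores the result in a Gram matrix, which is what a kernel method needs in place of the raw data. It is written in Python rather than the R used in the session; the function names, the toy points and the Gaussian parametrisation (which may differ from the one used in kernlab) are assumptions made for this example.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def gram_matrix(points, kernel):
    """Precompute K[i][j] = kernel(points[i], points[j]) for all pairs."""
    n = len(points)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            # A kernel is symmetric, so each pair is evaluated only once.
            K[i][j] = K[j][i] = kernel(points[i], points[j])
    return K


# Toy points (hypothetical).
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = gram_matrix(points, gaussian_kernel)
```

Any other function of two arguments can be passed as `kernel`, provided it is symmetric and positive semidefinite.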

P3: Classification of sequences with string kernels

Goal: understand and test a few string kernels, for classification of protein and DNA sequences
Notes
Code
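
As an example of a string kernel, the sketch below implements the k-spectrum kernel: each sequence is represented by the counts of its length-k substrings (k-mers), and the kernel is the inner product of the two count vectors. This is a Python illustration with hypothetical function names; it is not necessarily the implementation or the set of kernels used in the session, which relies on R.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every contiguous substring of length k in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k):
    """k-spectrum kernel: inner product of the k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    # A Counter returns 0 for k-mers that are absent from t.
    return sum(count * ct[kmer] for kmer, count in cs.items())
```

For instance, `spectrum_kernel("ACGT", "CGTA", 2)` counts the shared 2-mers CG and GT. Sequences shorter than k have no k-mers, so the kernel value is then 0.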

P4: Reconstruction of regulatory networks from expression

Goal: understand and implement several methods to reconstruct the gene regulatory network from a matrix of expression data.
data.tar.gz: E. coli expression and regulation data set used by Faith et al.
Notes
Code
Code for stability selection
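
One of the simplest baselines for this task is to score every pair of genes by the absolute Pearson correlation of their expression profiles and to predict the top-ranked pairs as edges. The Python sketch below illustrates the idea on made-up values; the gene names, the data and the choice of correlation as a score are assumptions for this example and are not necessarily among the methods covered in the session. Note that such a score is symmetric, so it does not by itself orient edges from regulator to target.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0.0 or vy == 0.0:
        return 0.0  # constant profile: treated as uncorrelated
    return cov / math.sqrt(vx * vy)


def rank_edges(expression):
    """expression maps a gene name to its expression values across samples.
    Returns (score, gene1, gene2) tuples sorted by decreasing score."""
    edges = [
        (abs(pearson(expression[g], expression[h])), g, h)
        for g, h in combinations(sorted(expression), 2)
    ]
    edges.sort(key=lambda e: -e[0])
    return edges


# Toy expression profiles (hypothetical, four samples per gene).
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [1.0, -1.0, 1.0, -1.0],
}
ranked = rank_edges(toy)
```

Here the pair (geneA, geneB) comes first because the two profiles are proportional. Predicted edges can then be compared with the known regulations to draw precision-recall curves.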

P5: Reconstruction of PPI and metabolic networks

Goal: understand and test several methods for the prediction of protein-protein interactions and edges in the metabolic network.
ppimetabo.tar.gz: yeast datasets (PPI and metabolic networks)
Notes
Code
