Machine learning in computational biology

Jean-Philippe Vert

ENSAE
Spring 2017

Modern technologies like DNA microarrays or high-throughput sequencing are revolutionising biology and medical research. By allowing the collection of large numbers of measurements at the molecular level on living organisms, they pave the way to a quantitative and rational analysis of biological systems. Unsurprisingly, statistics and machine learning play an important role in this revolution. By processing large collections of datasets, they make it possible to extract new biological knowledge and infer predictive models.

The goal of this course is to present a few modern statistical learning techniques, and to touch upon a selected panel of applications in computational and systems biology. We will study in particular support vector machines (SVM) and kernels, as well as feature selection techniques including lasso regression. Applications include protein annotation, virtual screening in drug design, prognostic and predictive models for personalised medicine in oncology, and gene network inference in systems biology.

Results

Slides

Schedule

When	What	Slides
Friday, Feb 3, 4:30pm-6:30pm	Introduction, learning in high dimension, ridge regression	1-48
Friday, Feb 10, 4:30pm-7:30pm	Logistic regression, linear SVM	49-98
Friday, Feb 24, 4:30pm-7:30pm	Large-margin classifiers, kernels	99-141
Friday, Mar 3, 4:30pm-7:30pm	Cancelled
Friday, Mar 10, 4:30pm-7:30pm	Kernels
Friday, Mar 17, 4:30pm-7:30pm

Project

We have collected gene expression levels for 4654 genes on 184 early-stage breast cancer samples: xtrain.txt (each row is a gene, each column a sample). After surgical removal of the tumour, some patients unfortunately relapsed within 5 years (label=+1), while others did not (label=-1). The labels of the 184 samples are available in the file ytrain.txt.

Propose and test different techniques to predict the relapse from gene expression data. Check the effect of parameters, estimate the performance.
Make a prediction of relapse for the following 92 samples: xtest.txt. You should send me a file with 92 lines, each line containing a score which should be large if you think the corresponding sample has label +1 (relapse), small otherwise.
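
As an illustration of the workflow above, here is a minimal sketch of one possible baseline, not a recommended solution. It assumes NumPy is available and uses ridge regression on the +1/-1 labels, solved through the n x n Gram matrix because there are far more genes than samples. The function name fit_ridge, the value of lam and the synthetic data are all hypothetical; the commented loadtxt calls assume the files are plain whitespace-separated numeric tables, which should be checked against the actual files.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Ridge regression with intercept, solved in the dual (n x n system).

    X: (n_samples, n_genes) matrix, y: (n_samples,) labels in {-1, +1}.
    Returns the weight vector w and the intercept b.
    """
    mu = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - mu                               # centre each gene
    K = Xc @ Xc.T                             # Gram matrix, cheap when n << p
    alpha = np.linalg.solve(K + lam * np.eye(len(y)), y - y_mean)
    w = Xc.T @ alpha
    return w, y_mean - mu @ w

# With the real data one would presumably start from (format is an assumption):
#   G_train = np.loadtxt("xtrain.txt"); y = np.loadtxt("ytrain.txt")
#   G_test = np.loadtxt("xtest.txt")
# Here a synthetic stand-in with the same orientation (genes x samples) is used.
rng = np.random.default_rng(0)
n_genes, n_train, n_test = 50, 40, 20
y = np.array([1.0, -1.0] * (n_train // 2))
G_train = rng.normal(size=(n_genes, n_train))
G_train[0] += y                               # gene 0 carries a weak signal
G_test = rng.normal(size=(n_genes, n_test))

# The files store one gene per row, so transpose to get one sample per row.
X_train, X_test = G_train.T, G_test.T

w, b = fit_ridge(X_train, y, lam=10.0)        # lam is an arbitrary choice here
train_scores = X_train @ w + b
test_scores = X_test @ w + b                  # larger score = more likely +1

# One score per line, as requested for the prediction file.
submission = "\n".join(f"{s:.6f}" for s in test_scores)
```

Scores computed on the training samples are optimistic, so the regularisation parameter should be chosen and the performance estimated on held-out samples, for instance by cross-validation.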

Please send report and prediction to jean-philippe.vert@mines-paristech.fr before May 10, 2017. The predictions will be scored in terms of area under the ROC curve (AUC).
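
For reference, the AUC equals the probability that a randomly chosen relapse sample (+1) receives a higher score than a randomly chosen non-relapse sample (-1), with ties counted as one half. The sketch below computes it directly from that definition in plain Python; the function name auc and the toy labels and scores are illustrative only, and this is not necessarily the exact script used for grading.

```python
def auc(labels, scores):
    """Area under the ROC curve from pairwise comparisons.

    labels: sequence of +1 / -1, scores: sequence of numbers (same length).
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == -1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0       # positive ranked above negative
            elif p == n:
                wins += 0.5       # ties count for one half
    return wins / (len(pos) * len(neg))

# Toy example: 2 relapses, 3 non-relapses.
toy_labels = [1, 1, -1, -1, -1]
toy_scores = [0.9, 0.4, 0.5, 0.4, 0.1]
toy_auc = auc(toy_labels, toy_scores)   # 4.5 favourable pairs out of 6 -> 0.75
```

Since only the ranking of the scores matters, any strictly increasing transformation of the scores leaves the AUC unchanged; the scores do not need to be calibrated probabilities.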

Although discussions are allowed and encouraged, each student must work on the challenge individually and submit a report and a prediction.
