Machine learning in computational biology

Jean-Philippe Vert

ENSAE
Spring 2014

Modern technologies such as DNA microarrays and high-throughput sequencing are revolutionising biology and medical research. By allowing the collection of large numbers of molecular-level measurements on living organisms, they pave the way to a quantitative and rational analysis of biological systems. Unsurprisingly, statistics and machine learning play an important role in this revolution: by processing large collections of datasets, they make it possible to extract new biological knowledge and infer predictive models.

The goal of this course is to present a few modern statistical learning techniques, and to touch upon a selected panel of applications in computational and systems biology. We will study in particular support vector machines (SVM) and kernels, as well as feature selection techniques including lasso regression. Applications include protein annotation, virtual screening in drug design, prognostic and predictive models for personalised medicine in oncology, and gene network inference in systems biology.

Slides

Schedule

All lectures take place from 1pm to 6:15pm.
Date              Topic
Tuesday, Feb 11   Introduction, learning in high dimension
Tuesday, Feb 25   SVM
Tuesday, Mar 4    Kernels for strings and graphs; lasso
Tuesday, Mar 11   Group lasso, fused lasso, structured sparsity

Project: breast cancer prognosis from gene expression data using a gene network as prior knowledge

We have collected gene expression levels for 12065 genes on 184 early-stage breast cancer samples: xtrain.txt (each row is a gene, each column a sample; the first element of each row is the gene ID). After surgical removal of the tumour, some patients unfortunately relapsed within 5 years (label=+1), while others did not (label=-1). The labels of the 184 samples are available in the file ytrain.txt. Our goal is to design a model that predicts, from a tumour's gene expression values, whether it is likely to relapse. If we can predict the risk of relapse accurately, then we may be able to adapt the treatment given to the patient, e.g., limiting chemotherapy if the risk is low, or intensifying it if it is high.
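
As a starting point, here is a minimal sketch of how a table laid out as described (one gene per row, gene ID first, then one value per sample) could be read and transposed into the sample-by-gene layout most learning tools expect. It assumes whitespace-separated fields and no header line, neither of which is stated above, so check the actual files first. The function names read_expression and read_labels are hypothetical, and the demo values are made up.

```python
import io

def read_expression(handle):
    """Parse a gene-by-sample table whose first field per row is the gene ID."""
    gene_ids, rows = [], []
    for line in handle:
        fields = line.split()  # assumption: whitespace-separated fields
        if not fields:
            continue  # skip blank lines
        gene_ids.append(fields[0])
        rows.append([float(v) for v in fields[1:]])
    # Transpose: one row per sample, one column per gene.
    samples = [list(col) for col in zip(*rows)]
    return gene_ids, samples

def read_labels(handle):
    """Parse the labels (+1 = relapse, -1 = no relapse), one per sample."""
    return [int(float(tok)) for tok in handle.read().split()]

# Made-up toy input (3 genes, 2 samples), not the real data.
demo_x = io.StringIO("g1 0.5 1.5\ng2 -1.0 2.0\ng3 0.0 0.25\n")
demo_y = io.StringIO("+1\n-1\n")
gene_ids, samples = read_expression(demo_x)
labels = read_labels(demo_y)
print(len(samples), "samples,", len(gene_ids), "genes, labels:", labels)
```

With the real files, the same functions would be called on open("xtrain.txt") and open("ytrain.txt"); it is worth checking that the result has 184 samples and 12065 genes.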

In addition, we know that genes interact with each other in the cell. We have collected in the file ppi.txt a set of 430815 known interactions between genes. Each row gives the IDs of two genes known to interact.
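
One possible way to turn such an interaction list into something usable is sketched below: build an undirected neighbour map restricted to the genes that were actually measured, then average each gene's expression with that of its neighbours. This network smoothing is only an illustrative choice, not a method prescribed here; structured-sparsity penalties such as those listed in the schedule are another option. The sketch assumes whitespace-separated gene IDs that match those in the expression file, and the function names and demo values are hypothetical.

```python
import io

def read_interactions(handle, measured_genes):
    """Build an undirected neighbour map, keeping only measured genes."""
    measured = set(measured_genes)
    neighbours = {g: set() for g in measured}
    for line in handle:
        fields = line.split()  # assumption: two whitespace-separated IDs
        if len(fields) < 2:
            continue
        a, b = fields[0], fields[1]
        # Drop self-interactions and genes without expression data.
        if a == b or a not in measured or b not in measured:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    return neighbours

def smooth_over_network(expression, neighbours):
    """Replace each gene's value by the mean over itself and its neighbours."""
    smoothed = {}
    for gene, value in expression.items():
        group = [value] + [expression[n] for n in neighbours.get(gene, ())]
        smoothed[gene] = sum(group) / len(group)
    return smoothed

# Made-up toy network and expression profile for one sample.
demo_ppi = io.StringIO("g1 g2\ng2 g3\ng2 g9\ng1 g1\n")
expression = {"g1": 1.0, "g2": 2.0, "g3": 6.0, "g4": 4.0}
neighbours = read_interactions(demo_ppi, expression.keys())
smoothed = smooth_over_network(expression, neighbours)
print(smoothed)
```

Genes with no known interaction partner (g4 in the toy example) keep their original value.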

  1. Propose and test different techniques to predict the relapse from gene expression data. Check the effect of parameters, estimate the performance.
  2. Propose and test different techniques to predict the relapse from gene expression data, using the known interactions between genes as prior knowledge. Check the effect of parameters, estimate the performance. Do you think that the prior knowledge helps?
  3. Using your best model, make a prediction of relapse for the following 92 samples: xtest.txt. You should send me a file with 92 lines, each line containing a score which should be large if you think the corresponding sample has label +1 (relapse), small otherwise.
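
Two practical pieces of the steps above can be sketched as follows. The first helper assigns samples to cross-validation folds while keeping the proportion of relapses similar in each fold, which is one common way to estimate performance and compare parameter settings on 184 samples. The second writes one score per line and refuses to write a file with the wrong number of lines. The function names, the default of 5 folds, and the number format are assumptions made for illustration.

```python
import io
import random

def stratified_folds(labels, n_folds=5, seed=0):
    """Return a fold index per sample, balancing each class across folds."""
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into folds.
        for rank, i in enumerate(idx):
            fold_of[i] = rank % n_folds
    return fold_of

def write_scores(scores, handle, expected=92):
    """Write one score per line; higher means more likely to relapse (+1)."""
    if len(scores) != expected:
        raise ValueError(f"expected {expected} scores, got {len(scores)}")
    for s in scores:
        handle.write(f"{s:.6g}\n")  # assumption: plain decimal text is accepted

# Made-up labels: 4 relapses and 8 non-relapses, split into 4 folds.
toy_labels = [1] * 4 + [-1] * 8
toy_folds = stratified_folds(toy_labels, n_folds=4, seed=1)
print(toy_folds)

# Made-up scores, written to an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_scores([0.01 * i for i in range(92)], buffer)
print(len(buffer.getvalue().splitlines()), "lines written")
```

For each fold, a model would be trained on the samples outside the fold and scored on the samples inside it, so that every training sample receives a held-out score.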

Please send a short report and your predictions to jean-philippe.vert@mines.org before April 15, 2014. The predictions will be scored in terms of area under the ROC curve (AUC).
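
Because the predictions are scored by AUC, it can help to compute the same quantity on held-out training samples. The sketch below uses the standard rank-sum (Mann-Whitney) formulation: the AUC is the fraction of (relapse, non-relapse) pairs in which the relapse sample receives the higher score, with ties counted as one half. The function name auc is hypothetical and the toy values are made up; how the course's own scoring script treats ties is not stated.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum statistic (ties get midranks)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Find the run of tied scores starting at position pos.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        midrank = (pos + end) / 2 + 1  # average 1-based rank of the tied run
        for k in range(pos, end + 1):
            ranks[order[k]] = midrank
        pos = end + 1
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one sample of each class")
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Made-up example: 3 of the 4 (relapse, non-relapse) pairs are ordered correctly.
print(auc([1, 1, -1, -1], [0.9, 0.4, 0.5, 0.1]))  # prints 0.75
```

Only the ordering of the scores matters for this measure, so the scores do not need to be calibrated probabilities.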

You may do this project alone or in groups of 2 or 3 students (encouraged).



