See the GitHub repository for links to source code and exercises: https://github.com/tdhock/change-tutorial

Before executing the code in this tutorial Rmd, make sure to install the required packages:

packages.R <- if(file.exists("packages.R")){
  "packages.R"
}else{
  "https://raw.githubusercontent.com/tdhock/change-tutorial/master/packages.R"
}
source(packages.R)
## Loading required package: neuroblastoma
## Loading required package: future
## Loading required package: Segmentor3IsBack
## Loading required package: methods
## Segmentor3IsBack v2.0 Loaded
## Loading required package: changepoint
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## Successfully loaded changepoint package version 2.2.2
##  NOTE: Predefined penalty values changed in version 2.2.  Previous penalty values with a postfix 1 i.e. SIC1 are now without i.e. SIC and previous penalties without a postfix i.e. SIC are now with a postfix 0 i.e. SIC0. See NEWS and help files for further details.
## Warning in pkg_ok_have(pkg, vers, packageVersion(pkg)): works with changepoint version 2.2, have
## 2.2.2
## Loading required package: directlabels
## Loading required package: data.table
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:future':
## 
##     cluster
## Loading required package: ggplot2
## Loading required package: requireGitHub
## Loading required package: penaltyLearning
options(width=100)

What is the difference between unsupervised and supervised changepoint detection?

There are two features of a data set that make it possible to do a supervised changepoint analysis:

The goal of the supervised analysis is to find a model that learns from the limited labeled data, and provides accurate changepoint predictions for all of the data (a different number of changepoints for each separate data sequence).

Neuroblastoma data set

We will use the neuroblastoma data set to explore supervised changepoint detection.

data(neuroblastoma, package="neuroblastoma")
str(neuroblastoma)
## List of 2
##  $ profiles   :'data.frame': 4616846 obs. of  4 variables:
##   ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 7 7 7 7 7 7 7 7 7 7 ...
##   ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
##   ..$ position  : int [1:4616846] 809681 928433 987423 1083595 1125548 1199359 1392490 1672814 1985786 2046695 ...
##   ..$ logratio  : num [1:4616846] -0.01742 0.06626 0.06764 0.04264 0.00576 ...
##  $ annotations:'data.frame': 3418 obs. of  5 variables:
##   ..$ profile.id: Factor w/ 575 levels "1","2","4","5",..: 207 322 379 144 388 497 564 331 189 315 ...
##   ..$ chromosome: Factor w/ 24 levels "1","2","3","4",..: 11 11 11 11 11 11 11 11 11 11 ...
##   ..$ min       : int [1:3418] 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 53700000 ...
##   ..$ max       : int [1:3418] 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 135006516 ...
##   ..$ annotation: Factor w/ 2 levels "breakpoint","normal": 2 2 2 2 2 2 1 2 2 2 ...
lapply(neuroblastoma, head, 10)
## $profiles
##    profile.id chromosome position     logratio
## 1           8          1   809681 -0.017417053
## 2           8          1   928433  0.066261442
## 3           8          1   987423  0.067638717
## 4           8          1  1083595  0.042644337
## 5           8          1  1125548  0.005759269
## 6           8          1  1199359  0.000000000
## 7           8          1  1392490  0.069014678
## 8           8          1  1672814  0.204140717
## 9           8          1  1985786  0.039840265
## 10          8          1  2046695  0.073134705
## 
## $annotations
##    profile.id chromosome      min       max annotation
## 1         213         11 53700000 135006516     normal
## 2         333         11 53700000 135006516     normal
## 3         393         11 53700000 135006516     normal
## 4         148         11 53700000 135006516     normal
## 5         402         11 53700000 135006516     normal
## 6         521         11 53700000 135006516     normal
## 7         590         11 53700000 135006516 breakpoint
## 8         342         11 53700000 135006516     normal
## 9         195         11 53700000 135006516     normal
## 10        326         11 53700000 135006516     normal

The neuroblastoma data set consists of two data.frames:

  • profiles contains the noisy data (logratio), which is approximate DNA copy number (measured at chromosome, position), for a particular patient (profile.id). Every unique combination of (profile.id, chromosome) defines a separate multiple changepoint detection problem. The logratio measures how many copies of each part of the human genome are present in the patient’s tumor. Normally there are two copies of each chromosome (logratio=0), but in cancer samples there can be gains (changes up) and losses (changes down) of specific regions or entire chromosomes.
  • annotations contains the labels, which indicate presence (breakpoint) or absence (normal) of changepoints in specific regions (profile.id, chromosome, min, max) of the data.

These data come from children who were treated at the Institut Curie (Paris, France). For six children we also have follow-up data on whether they recovered (ok) or had a relapse, several years after treatment:

followup <- data.frame(
  profile.id=paste(c(10, 8, 4, 6, 11, 1)),
  status=c("ok", "relapse", "relapse", "ok", "ok", "relapse"))
followup
##   profile.id  status
## 1         10      ok
## 2          8 relapse
## 3          4 relapse
## 4          6      ok
## 5         11      ok
## 6          1 relapse

Unsupervised: no labels, model selection via theory

When there are no labels, the input for a changepoint analysis just uses the noisy DNA copy number data in neuroblastoma$profiles$logratio. We begin by selecting the profiles of the six patients for which we have follow-up data.

rownames(followup) <- followup$profile.id
followup$status.profile <- with(followup, paste(status, profile.id))
some.ids <- rownames(followup)
library(data.table)
someProfiles <- function(all.profiles){
  some <- subset(all.profiles, paste(profile.id) %in% some.ids)
  status.profile <- followup[paste(some$profile.id), "status.profile"]
  some$status.profile <- ifelse(
    is.na(status.profile), paste(some$profile.id), status.profile)
  data.table(some)
}
six.profiles <- someProfiles(neuroblastoma$profiles)

For these six patients, the total number of separate changepoint detection problems is 24 chromosomes x 6 patients = 144 problems. We plot each problem below in a separate facet (panel).

library(ggplot2)
gg.unsupervised <- ggplot()+
  ggtitle("unsupervised changepoint detection = only noisy data sequences")+
  theme(
    panel.margin=grid::unit(0, "lines"),
    panel.border=element_rect(fill=NA, color="grey50")
  )+
  facet_grid(status.profile ~ chromosome, scales="free", space="free_x")+
  geom_point(aes(position/1e6, logratio),
             data=six.profiles,
             shape=1)+
  scale_x_continuous(
    "position on chromosome (mega bases)",
    breaks=c(100, 200))+
  scale_y_continuous(
    "logratio (approximate DNA copy number)",
    limits=c(-1,1)*1.1)
## Warning in ggproto(NULL, ScalesList): bytecode version mismatch; using eval
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property instead
print(gg.unsupervised)
## Warning: Removed 65 rows containing missing values (geom_point).

Note the facet titles on the right:

  • The top three profiles are “ok” because those children completely recovered.
  • The bottom three profiles are “relapse” because the neuroblastoma cancer came back, and required another treatment at the hospital.

Question: can you visually identify any differences between the “ok” and “relapse” profiles?

How to choose the number of changepoints? In the unsupervised setting, our only option is to use theoretical/statistical arguments, which give information criteria such as Aikaike Information Criterion (AIC, penalty=2) or Bayesian/Schwarz Information Criterion (BIC/SIC, penalty=log n). More on this later.

Supervised: learn a penalty function with minimal incorrect labels

In supervised changepoint detection, there are labels which indicate presence and absence of changepoints in particular data subsets. For the same set of 6 profiles, we superimpose the labels in the plot below.

six.labels <- someProfiles(neuroblastoma$annotations)
change.colors <- c(
  "1change"="#ff7d7d",
  breakpoint="#a445ee",
  normal="#f6c48f"
  )
gg.supervised <- gg.unsupervised+
  ggtitle(paste(
    "supervised changepoint detection = data + labels",
    "that indicate specific regions with or without changepoints"))+
  penaltyLearning::geom_tallrect(aes(
    xmin=min/1e6, xmax=max/1e6, fill=annotation),
    alpha=0.5,
    color=NA,
    data=six.labels)+
  scale_fill_manual("label", values=change.colors)+
  theme(legend.position="bottom")
print(gg.supervised)
## Warning: Removed 65 rows containing missing values (geom_point).