Skip to content

Latest commit

 

History

History
200 lines (174 loc) · 8.19 KB

README.org

File metadata and controls

200 lines (174 loc) · 8.19 KB
testshttps://github.com/tdhock/PeakSegJoint/workflows/R-CMD-check/badge.svg
coveragehttps://codecov.io/gh/tdhock/PeakSegJoint/branch/master/graph/badge.svg

PeakSegJoint: supervised joint peak detection (arXiv:1506.01286)

There are two major differences between PeakSegJoint and all of the other peak detection algorithms for ChIP-seq data analysis:

  • Supervised machine learning: PeakSegJoint is trained by providing labels that indicate regions with and without peaks. So if you see false positives (a peak called where there is clearly only background noise) or false negatives (no peak called where there should be one) you can add labels to correct PeakSegJoint, and it learns and gets more accurate as more labels are added. In contrast, other peak detection methods are unsupervised, meaning that they usually have 10-20 parameters, and no obvious way to train them, yielding arbitrarily inaccurate peaks that can NOT be corrected using labels.
  • Joint peak detection in any number of samples or cell types so the model can be easily interpreted to find similarities or differences between samples (PeakSegJoint outputs a binary matrix, samples x peaks). In contrast, it is not easy to find similarities and differences between samples using single-sample peak detection methods (e.g. MACS), and other multi-sample peak detection methods are limited to one (e.g. JAMM) or two (e.g. PePr) cell types (assuming all samples of the same cell type are replicates with the same peak pattern).

To use PeakSegJoint in practice on large genomic data sets

The PeakSegPipeline repository contains a pipeline which includes PeakSegJoint and our more recent PeakSegFPOP algorithm. We recommend to use this other package (which uses PeakSegJoint as a sub-routine) for predicting peaks in the whole genome of several samples.

R installation instructions and usage examples

Install and load/attach package:

install.packages("PeakSegJoint")
library(PeakSegJoint)

To run PeakSegJoint you need to get your data set in bedGraph-like data.frame format (chromStart, chromEnd, count) where “count” is the number of aligned DNA sequence reads at positions (chromStart,chromEnd]. There should also be columns “sample.id” (ID for each sample/track) and “sample.group” (ID for each group of samples with same expected peak up/down status, e.g. replicates or cell types).

library(PeakSegJoint)
data(H3K36me3.TDH.other.chunk1, envir=environment())
some.counts <- subset(
  H3K36me3.TDH.other.chunk1$counts,
  43000000 < chromEnd &
  chromStart < 43200000)
some.counts$sample.group <- some.counts$cell.type
head(some.counts)
> head(some.counts)
  cell.type  sample.id chromStart chromEnd count sample.group
1    kidney McGill0023   43119140 43119199     3       kidney
2    kidney McGill0023   43119199 43119211     2       kidney
3    kidney McGill0023   43119211 43119216     3       kidney
4    kidney McGill0023   43119216 43119222     4       kidney
5    kidney McGill0023   43119222 43119240     3       kidney
6    kidney McGill0023   43119240 43119284     2       kidney
> 

Then you can run algo using the PeakSegJointFaster function giving your data.frame as the first argument.

fit <- PeakSegJointFaster(some.counts)

The resulting model fit object is a named list, with the following components. peak_start_end is a vector with start/end positions of the most likely common peak. data_start_end is a vector with start/end positions of the data, which was used to compute the likelihood/loss.

> with(fit, rbind(peak_start_end, data_start_end))
                   [,1]     [,2]
peak_start_end 43157280 43180756
data_start_end 43119015 43201099
> 

mean_mat is a matrix with one row for each sample and three columns (mean before peak, mean in peak, mean after peak).

> fit$mean_mat
                                [,1]      [,2]     [,3]
kidney/McGill0023           3.598615  8.728489 2.933491
kidneyCancer/McGill0022     2.895257 10.498850 2.139557
skeletalMuscleMD/McGill0013 1.220855  1.691600 1.366809
skeletalMuscleMD/McGill0016 1.357977  1.967243 1.752986
> 

To select the number of samples with a common peak, you can use a criterion based on the decrease in Poisson loss. The loss decrease values for the group-sparse model are

> fit$group.loss.diff.vec
    kidneyCancer           kidney skeletalMuscleMD 
      -94886.231       -45108.124        -2913.815 
> 

The above can be interpreted as follows. The sample.group with the most likely common peak is the first entry (kidneyCancer) – the difference in Poisson loss between the model with and without a peak is -94886.231. The sample group with the next most likely common peak is kidney, then skeletalMuscleMD. The same kind of output is shown for each individual sample below:

> fit$sample.loss.diff.vec
    kidneyCancer/McGill0022           kidney/McGill0023 
                 -94886.231                  -45108.124 
skeletalMuscleMD/McGill0016 skeletalMuscleMD/McGill0013 
                  -1781.187                   -1132.627 
> 

Again the most likely sample with the common peak is the first entry (kidneyCancer/McGill0022), the second most likely sample is the second entry, etc. Based on the loss difference values above it seems pretty clear that the peaks in kidney/kidneyCancer samples are quite likely, whereas the skeletalMuscleMD peaks are not. We plot that model below:

peak.names <- names(fit$sample.loss.diff.vec[1:2])
peak.df <- data.frame(
  sample.id=sub(".*/", "", peak.names),
  sample.group=sub("/.*", "", peak.names),
  peakStart=fit$peak_start_end[1],
  peakEnd=fit$peak_start_end[2])
ggplot()+
  theme_bw()+
  theme(panel.margin=grid::unit(0, "lines"))+
  facet_grid(sample.id + sample.group ~ ., scales="free")+
  geom_rect(aes(
    xmin=chromStart, xmax=chromEnd,
    ymin=0, ymax=count),
    data=some.counts)+
  geom_segment(aes(
    peakStart, 0,
    xend=peakEnd, yend=0),
    data=peak.df,
    color="deepskyblue",
    size=2)

Finally the sample.modelSelection and group.modelSelection components may be useful for supervised learning of a penalty function. They are computed via penaltyLearning::modelSelection and describe the solution of P*(lambda)=min_p L_p + lambda*p, where L_p is the Poisson loss of the model with p samples/groups with the common peak, and lambda is a non-negative penalty parameter (larger for fewer samples/groups with a common peak). For example

> fit$group.modelSelection
        min.lambda max.lambda min.log.lambda max.log.lambda      loss
3groups      0.000   2913.815           -Inf       7.977218 -469853.8
2groups   2913.815  45108.124       7.977218      10.716818 -466940.0
1groups  45108.124  94886.231      10.716818      11.460434 -421831.9
0groups  94886.231        Inf      11.460434            Inf -326945.6
        complexity
3groups          3
2groups          2
1groups          1
0groups          0
> 

The first row above means that for lambda in (0, 2913.815) or equivalently log(lambda) in (-Inf, 7.977218) we have p=3 groups with a common peak. We can use this for supervised learning of a penalty function if there are labels which indicate which model should be selected. For example if the labels indicate the model with p=2 groups has minimum label error, then we want to predict log(lambda) in (7.977218, 10.716818) for this data set – this is the “target interval” of log(penalty) values. PeakSegPipeline::problem.joint.target can be used to compute this target interval on large labeled genomic data sets. We therefore recommend using PeakSegPipeline when you want to do joint peak detection for several samples with labels and aligned read coverage over the entire genome.