Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

mpiersonsmela · 2024-01-19T02:40:12Z

I am consistently getting the following error while running permutations. It doesn't always happen on the same permutation.
Example output:

Beginning permutation 1
...Chromosome chr1: Smoothed (1.26 min). 38 CpG(s) excluded due to zero coverage. 72 regions scored (0.07 min).
...Chromosome chr2: Smoothed (1.87 min). 10 regions scored (0.04 min).
...Chromosome chr3: Smoothed (1.34 min). 6 regions scored (0.03 min).
...Chromosome chr4: Smoothed (0.68 min). 11 regions scored (0.04 min).
...Chromosome chr5: Smoothed (1.65 min). 2 CpG(s) excluded due to zero coverage. Error in .start_as_unnamed_integer(start) :
each range must have a start that is < 2^31 and > - 2^31
Calls: dmrseq ... .new_IRanges_from_start_end -> .start_as_unnamed_integer
In addition: Warning messages:
1: In FUN(X[[i]], ...) : no non-missing arguments to min; returning Inf
2: In FUN(X[[i]], ...) : no non-missing arguments to max; returning -Inf
Execution halted

I am using the latest version of all packages.

mpiersonsmela · 2024-01-19T02:42:29Z

The input command was:
naive_new_reg <- dmrseq(bs = sort(oog_naive_new), testCovariate=testCovariate, block=TRUE, minInSpan=150, bpSpan=5000, maxGapSmooth=10000)

I previously ran it with default values for minInSpan, bpSpan, and maxGapSmooth but it gave a warning to increase them. (Also, what values are recommended for running with block=TRUE?)

kdkorthauer · 2024-01-29T19:24:30Z

Hi @mpiersonsmela,

Thanks for reporting this issue. I'm not able to determine the root cause from the output alone. Can you let me know what (if any) filtering you've done on the input object (e.g. removing loci with low coverage across samples)? If you're able to share a small portion of your input data that reproduces the output (e.g. chromosome 5 in the example above - you can use set.seed() to provide a reproducible example), that would greatly help. You may upload the small example input here.

Regarding your question about parameter settings when using block=TRUE, you may refer to the documentation section here which has some guidance on how to interpret the parameters. The bpSpan parameter governs the size of the window over which methylation values are smoothed, as long as they have at least minInSpan covered loci that are spaced apart my a max of maxGapSmooth. Your settings seem like a reasonable starting point, but in general I'd recommend examining the output candidate regions after scaling up with blocks - if you still see many candidate regions that are relatively close together, where you'd otherwise interpret them as one contiguous region, then I'd recommend to scale up all of those parameters further to make sure the smoothing is at a more relevant scale. Hope that helps!

mpiersonsmela · 2024-01-30T02:50:57Z

OK, I'll upload the example data tomorrow. Regarding the filtering, I'm removing all the loci that have zero reads for all samples in a group. This filtering strategy worked OK when not running block=TRUE.

This is the filtering code I'm using:

#Read in packages
suppressPackageStartipMessages(library(tidyverse))
suppressPackageStartipMessages(library(magrittr))
suppressPackageStartipMessages(library(bsseq))
suppressPackageStartipMessages(library(dmrseq))
suppressPackageStartipMessages(library(BiocParallel))
suppressPackageStartipMessages(library(argparse))
register(MulticoreParam(16)) 

# Make sure metadata has the following structure:
#    Files   | Group
# -----------------------
# <file_name>| <int 1 or 2>

args = commandArgs(trailingOnly=TRUE)

metadata_path <- args[1]

# Read in data
metadata <- read.csv(metadata_path)

# get file names
cov_files <- metadata %>%
  pull(Files)

# make BSseq object
bismark_cov <- read.bismark(files = cov_files, rmZeroCov = TRUE)

# trt = vector of condition labels for each sample
trt <- metadata %>%
  pull(Group)
pData(bismark_cov)$Condition <- trt

# filter for loci that are not 0 in all samples of a condition
## get group 1 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 1)
cov_temp_1 <- bismark_cov[, sample_1]
loci_1_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_1, type="Cov")) > 0)

## get group 2 rows that are not 0 for the group
sample_1 <- which(pData(bismark_cov)$Condition == 2)
cov_temp_2 <- bismark_cov[, sample_2]
loci_2_keep <- which(DelayedMatrixStats::rowSums2(getCoverage(cov_temp_2, type="Cov")) > 0)

# only keep loci that have at least 1 read in both sample groups 
loci.idx <- sort(unique(intersect(loci_1_keep, loci_2_keep)))
bismark_cov_filt <- bismark_cov[loci.idx,]

testCovariate <- "Condition"
bismark_cov_dmr <- dmrseq(bs = sort(bismark_cov_filt),
                          testCovariate = testCovariate, minInSpan=150, bpSpan=5000, maxGapSmooth=10000, block=TRUE)

mpiersonsmela · 2024-01-30T15:25:58Z

OK, I uploaded the files to your Dropbox link. I wasn't able to figure out how to take a small portion that replicates the results, so I just uploaded the whole thing.

kdkorthauer · 2024-02-17T18:07:26Z

Thanks again for reporting this issue and for sending me your files to debug. It took me a while to track it down, but I found the implicated line that didn't properly filter loci with zero coverage in all samples of one permuted condition in the permutations during the construction of candidate blocks. I suspect I never encountered this before because I typically use more stringent filters for coverage. I pushed a fix, and was able to run your code successfully. You can access this version on the devel branch of GitHub immediately (BiocManager::install_github("kdkorthauer/dmrseq", "devel"), and it will also be available in Bioconductor Devel (version 3.19) in either today or Monday's build. Assuming it passes all checks, I'll port it over to the current Bioc release version (version 3.18) early next week.

As a side note, I noticed that the number of loci in your dataset and the high number of consecutive genomic coordinates look like you may be including loci on both strands separately. You may consider collapsing strands. Happy to provide some more information on that if you're interested.

Thanks again for helping me to improve the software.

mpiersonsmela · 2024-02-17T20:56:11Z

Thanks! I'll follow up with you by email about collapsing strands.

mpiersonsmela · 2024-02-17T21:00:13Z

Regarding collapsing strands, does this basically mean treating each CpG site as one genomic coordinate? I am not sure this would be appropriate for me, since I expect a substantial amount of hemimethylated sites in my data. But I am pretty new to methylation analysis so I would appreciate it if you could explain more. Thanks for all your help.

…

On Sat, Feb 17, 2024 at 1:07 PM Keegan Korthauer ***@***.***> wrote: Thanks again for reporting this issue and for sending me your files to debug. It took me a while to track it down, but I found the implicated line that didn't properly filter loci with zero coverage in all samples of one permuted condition in the permutations during the construction of candidate blocks. I suspect I never encountered this before because I typically use more stringent filters for coverage. I pushed a fix, and was able to run your code successfully. You can access this version on the devel branch of GitHub immediately (BiocManager::install_github("kdkorthauer/dmrseq", "devel"), and it will also be available in Bioconductor Devel (version 3.19) in either today or Monday's build. Assuming it passes all checks, I'll port it over to the current Bioc release version (version 3.18) early next week. As a side note, I noticed that the number of loci in your dataset and the high number of consecutive genomic coordinates look like you may be including loci on both strands separately. You may consider collapsing strands. Happy to provide some more information on that if you're interested. Thanks again for helping me to improve the software. — Reply to this email directly, view it on GitHub <#64 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AO3OAGJ5WN4MHZMPPGGWW63YUDWWTAVCNFSM6AAAAABCBIHFAWVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMYTSNJQGI3TEMZTHA> . You are receiving this because you were mentioned.Message ID: ***@***.***>

kdkorthauer · 2024-02-18T00:02:50Z

Yes, exactly - collapsing would treat each CpG as one observation. If you are interested in hemimethylation, then it would not be appropriate to collapse.

kdkorthauer referenced this issue Feb 16, 2024

bug fix to test for #54

1b2b0e7

kdkorthauer added a commit that referenced this issue Feb 20, 2024

bux fix for #64 in release

daaf07a

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

mpiersonsmela commented Jan 19, 2024

mpiersonsmela commented Jan 19, 2024

kdkorthauer commented Jan 29, 2024

mpiersonsmela commented Jan 30, 2024

mpiersonsmela commented Jan 30, 2024

kdkorthauer commented Feb 17, 2024

mpiersonsmela commented Feb 17, 2024

mpiersonsmela commented Feb 17, 2024 via email

kdkorthauer commented Feb 18, 2024

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

Running block=TRUE crashes during permutations with "Error in .start_as_unnamed_integer(start)" #64

Comments

mpiersonsmela commented Jan 19, 2024

mpiersonsmela commented Jan 19, 2024

kdkorthauer commented Jan 29, 2024

mpiersonsmela commented Jan 30, 2024

mpiersonsmela commented Jan 30, 2024

kdkorthauer commented Feb 17, 2024

mpiersonsmela commented Feb 17, 2024

mpiersonsmela commented Feb 17, 2024 via email

kdkorthauer commented Feb 18, 2024