readBed parsing failure #185

Open
socanas opened this issue Jun 4, 2019 · 8 comments

socanas commented Jun 4, 2019

I have been using the readBed function of genomation to read BED files into R for use in Enriched Heatmaps. Recently I have been getting parsing failure warnings, e.g.:

Warning: 62474 parsing failures.
row col expected actual file
199 X4 no trailing characters .3333 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
430 X4 no trailing characters .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
1046 X4 no trailing characters .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'

Any number in the bed file that has a decimal value is changed to NA in the GRanges object. I do not want to truncate or round these values. I am sure that I am missing something simple. I have used the readBed function for similar files and never have had an issue. Any suggestions?

Thanks!

Member

al2na commented Jun 4, 2019 via email

Author

socanas commented Jun 4, 2019

Thank you for your quick reply! The file test1.txt gives the parsing error and test2.txt does not give the parsing error. Both files were formatted with the same scripts. Columns are "chr, start, stop, methylation %, coverage, strand (+/-)".

test1<-readBed("test1.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_character()
)
Warning: 12 parsing failures.
row col expected actual file
199 X4 no trailing characters .3333 'test1.txt'
430 X4 no trailing characters .6667 'test1.txt'
1046 X4 no trailing characters .6667 'test1.txt'
1127 X4 no trailing characters .3333 'test1.txt'
1199 X4 no trailing characters .6667 'test1.txt'
.... ... ...................... ...... ...........
See problems(...) for more details.

test2<-readBed("test2.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_character()
)

test1.txt
test2.txt

Contributor

katwre commented Jun 5, 2019

Hi @socanas, test1.txt gives the parsing error and test2.txt does not because readBed uses the first 30 rows to detect the column classes (character, integer, decimal number, etc.). In test1.txt the 4th column has no decimal number in the first 30 rows, whereas in test2.txt it does. For now I would just add .0 to one of the first 30 numbers in that column, e.g. the first one:

chr1 3000827 3000827 100.0 1 +
chr1 3001007 3001007 100 1 +
chr1 3001018 3001018 100 2 +
chr1 3001019 3001019 100 1 -
chr1 3003339 3003339 100 1 +
chr1 3003340 3003340 100 1 -
chr1 3003380 3003380 100 1 -
chr1 3003582 3003582 100 2 +

cheers,
Kasia
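
A minimal base-R sketch of the effect described above, in case it helps to see it in isolation. The toy file, the column names and the use of read.table are assumptions for illustration only; this is not the actual readBed code. It shows that classes guessed from the first 30 rows differ from classes guessed from the whole file, and that stating the classes up front avoids editing the file.

```r
# Hypothetical toy data: 40 BED-like rows; only row 35 has a decimal in column 4.
pos <- 3000000L + 1:40
bed_lines <- sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos)
bed_lines[35] <- "chr1\t3000035\t3000035\t33.3333\t3\t+"
bed_file <- tempfile(fileext = ".txt")
writeLines(bed_lines, bed_file)

# Classes guessed from the first 30 rows only: column 4 looks like an integer.
head_classes <- sapply(
  read.table(bed_file, sep = "\t", nrows = 30, stringsAsFactors = FALSE),
  class
)

# Classes guessed from every row: column 4 is numeric.
full_classes <- sapply(
  read.table(bed_file, sep = "\t", stringsAsFactors = FALSE),
  class
)

print(head_classes[["V4"]])
print(full_classes[["V4"]])

# Stating the classes explicitly removes the guess altogether.
# Column names here are illustrative, following the layout described in the thread.
bed <- read.table(
  bed_file, sep = "\t",
  col.names  = c("chr", "start", "end", "meth", "coverage", "strand"),
  colClasses = c("character", "integer", "integer",
                 "numeric", "integer", "character")
)
print(bed$meth[35])
```

The resulting data frame can then be converted to a GRanges object in whatever way fits the rest of the pipeline.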

Contributor

katwre commented Jun 5, 2019

Hmm, I've also run into this issue in the past. Maybe rewriting the function that reads BED files (https://github.com/BIMSBbioinfo/genomation/blob/master/R/readData.R#L68) to use e.g. data.table::fread instead of readr::read_delim could be an option. @al2na, what do you think? Right now it's a bit dangerous, because readTableFast might truncate decimal numbers if, e.g., the first 30 rows contain only integers and the 31st is a decimal number.
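
A rough sketch of what reading with data.table::fread could look like, under the assumption that its type detection copes with a late decimal (as far as I know it samples rows across the file and not just the top). The toy file and column names are made up for illustration, and the GRanges conversion at the end is one possible way to do it, not what genomation does internally.

```r
library(data.table)

# Hypothetical toy data: integers in column 4 except for row 35.
pos <- 3000000L + 1:40
bed_lines <- sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos)
bed_lines[35] <- "chr1\t3000035\t3000035\t33.3333\t3\t+"
bed_file <- tempfile(fileext = ".txt")
writeLines(bed_lines, bed_file)

# fread guesses the column types itself; the names are illustrative.
dt <- fread(
  bed_file, sep = "\t", header = FALSE,
  col.names = c("chr", "start", "end", "meth", "coverage", "strand")
)
print(sapply(dt, class))
print(dt$meth[35])

# Optional: convert to GRanges if GenomicRanges is installed.
if (requireNamespace("GenomicRanges", quietly = TRUE)) {
  gr <- GenomicRanges::makeGRangesFromDataFrame(
    as.data.frame(dt), keep.extra.columns = TRUE
  )
  print(gr)
}
```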

Author

socanas commented Jun 5, 2019

@katwre Thank you for the solution! That works great!

Member

al2na commented Jun 5, 2019

Thank you @katwre!! I think we went with read_delim because it could read gzipped files at the time, but now data.table::fread can also read gzipped files without piping, afaik. If that's the case, we can use fread.

Contributor

katwre commented Jun 5, 2019

@al2na ah true! But it looks like fread reads gzipped files now too, so it could work:

> data.table::fread("test2.txt.gz")
        V1      V2      V3  V4 V5 V6
   1: chr1 3001630 3001630 100  2  -
   2: chr1 3003227 3003227 100  1  -
   3: chr1 3003340 3003340 100  2  -
   4: chr1 3003380 3003380   0  1  -
   5: chr1 3003582 3003582 100  1  +
  ---                               
1996: chr1 3670743 3670743   0  1  -
1997: chr1 3670752 3670752   0  1  -
1998: chr1 3670776 3670776   0  2  +
1999: chr1 3670821 3670821   0  1  +
2000: chr1 3670861 3670861   0  1  +
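
A small sketch of the gzipped case, for anyone who wants to try it without the attached files. Assumptions: fread handles .gz input only when the R.utils package is installed, so the hypothetical helper read_bed_gz below falls back to base read.table on a gzfile connection otherwise. The toy data and column names are illustrative.

```r
# Hypothetical helper: read a gzipped BED-like file into a data frame.
read_bed_gz <- function(path) {
  cols <- c("chr", "start", "end", "meth", "coverage", "strand")
  has_fread <- requireNamespace("data.table", quietly = TRUE) &&
    requireNamespace("R.utils", quietly = TRUE)
  if (has_fread) {
    # fread decompresses .gz input itself when R.utils is available.
    as.data.frame(
      data.table::fread(path, sep = "\t", header = FALSE, col.names = cols)
    )
  } else {
    # Base R fallback: read through a gzfile connection.
    read.table(gzfile(path), sep = "\t", col.names = cols,
               stringsAsFactors = FALSE)
  }
}

# Hypothetical toy data written as a gzipped file; row 35 has a decimal.
pos <- 3000000L + 1:40
bed_lines <- sprintf("chr1\t%d\t%d\t100\t1\t+", pos, pos)
bed_lines[35] <- "chr1\t3000035\t3000035\t33.3333\t3\t+"
gz_file <- tempfile(fileext = ".txt.gz")
con <- gzfile(gz_file, "w")
writeLines(bed_lines, con)
close(con)

bed_gz <- read_bed_gz(gz_file)
print(head(bed_gz, 3))
print(bed_gz$meth[35])
```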

Member

al2na commented Jun 5, 2019 via email
