readBed parsing failure #185

socanas · 2019-06-04T13:17:51Z

I have been using the readBed function of genomation to read BED files into R for use in Enriched Heatmaps. Recently I have been getting a parsing-failure warning, e.g.:

Warning: 62474 parsing failures.
row col expected actual file
199 X4 no trailing characters .3333 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
430 X4 no trailing characters .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'
1046 X4 no trailing characters .6667 'EnrichedHM.final.sort.formatted.a1-bs_input_CpG.txt'

Any number in the BED file that has a decimal value is changed to NA in the GRanges object. I do not want to truncate or round these values. I am sure that I am missing something simple. I have used the readBed function for similar files and have never had an issue. Any suggestions?

Thanks!

al2na · 2019-06-04T13:52:56Z

This could be due to changes in the dependencies we use to read the files faster. If you send a reproducible example, I will take a look.

socanas · 2019-06-04T14:12:57Z

Thank you for your quick reply! The file test1.txt gives the parsing error and test2.txt does not give the parsing error. Both files were formatted with the same scripts. Columns are "chr, start, stop, methylation %, coverage, strand (+/-)".

test1<-readBed("test1.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_character()
)
Warning: 12 parsing failures.
row col expected actual file
199 X4 no trailing characters .3333 'test1.txt'
430 X4 no trailing characters .6667 'test1.txt'
1046 X4 no trailing characters .6667 'test1.txt'
1127 X4 no trailing characters .3333 'test1.txt'
1199 X4 no trailing characters .6667 'test1.txt'
.... ... ...................... ...... ...........
See problems(...) for more details.

test2<-readBed("test2.txt", remove.unusual=TRUE)
Parsed with column specification:
cols(
X1 = col_character(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_double(),
X6 = col_character()
)

test1.txt
test2.txt

katwre · 2019-06-05T15:45:05Z

Hi @socanas, test1.txt gives the parsing error and test2.txt does not because readBed uses the first 30 rows to detect the column classes (character, integer, decimal number, etc.). In test1.txt the 4th column has no decimal number in its first 30 rows, while in test2.txt it does. For now I would just add .0 to one of the first 30 numbers in that column, e.g. the first one:

chr1 3000827 3000827 100.0 1 +
chr1 3001007 3001007 100 1 +
chr1 3001018 3001018 100 2 +
chr1 3001019 3001019 100 1 -
chr1 3003339 3003339 100 1 +
chr1 3003340 3003340 100 1 -
chr1 3003380 3003380 100 1 -
chr1 3003582 3003582 100 2 +

cheers,
Kasia
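
A minimal sketch of the effect described above, using only base R rather than genomation's actual reader: it guesses column classes from the first 30 rows of a made-up file, then reads the whole file with explicit classes so the decimal on row 31 survives. The file contents, the variable names and the use of read.table are assumptions for illustration only; they are not how readBed is implemented.

```r
# Build a small BED-like file: 30 rows with an integer score, then one decimal.
bed_path <- tempfile(fileext = ".txt")
rows <- data.frame(
  chr    = "chr1",
  start  = 3000000L + seq_len(31),
  end    = 3000000L + seq_len(31),
  score  = c(rep("100", 30), "66.6667"),
  cov    = 1L,
  strand = "+"
)
write.table(rows, bed_path, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)

# Guess column classes from the first 30 rows only.
head30  <- read.table(bed_path, sep = "\t", nrows = 30, stringsAsFactors = FALSE)
guessed <- vapply(head30, function(x) class(x)[1], character(1))
print(guessed[["V4"]])  # the score column looks like an integer column here

# Reading the whole file with explicit classes keeps the decimal on row 31.
bed <- read.table(bed_path, sep = "\t", stringsAsFactors = FALSE,
                  colClasses = c("character", "integer", "integer",
                                 "numeric", "integer", "character"))
print(bed$V4[31])
```

Declaring the score column as numeric up front avoids having to edit the input file, at the cost of knowing the column layout in advance.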

katwre · 2019-06-05T16:01:15Z

Hmm, I've also had this issue in the past. Maybe rewriting the function that reads BED files (https://github.com/BIMSBbioinfo/genomation/blob/master/R/readData.R#L68) to use e.g. data.table::fread instead of readr::read_delim could be an option. @al2na, what do you think? Right now it's a bit dangerous, because readTableFast might truncate decimal numbers if, e.g., the first 30 rows contain only integers and the 31st has a decimal number.

socanas · 2019-06-05T16:27:04Z

@katwre Thank you for the solution! That works great!

al2na · 2019-06-05T17:25:17Z

Thank you @katwre! I think we went with read_delim because it could read gzipped files at the time, but now data.table::fread can also read gzipped files without piping, AFAIK. If that's the case, we can use fread.

katwre · 2019-06-05T17:48:20Z

@al2na ah, true! It looks like fread reads gzipped files now too, so it could work:

> data.table::fread("test2.txt.gz")
        V1      V2      V3  V4 V5 V6
   1: chr1 3001630 3001630 100  2  -
   2: chr1 3003227 3003227 100  1  -
   3: chr1 3003340 3003340 100  2  -
   4: chr1 3003380 3003380   0  1  -
   5: chr1 3003582 3003582 100  1  +
  ---                               
1996: chr1 3670743 3670743   0  1  -
1997: chr1 3670752 3670752   0  1  -
1998: chr1 3670776 3670776   0  2  +
1999: chr1 3670821 3670821   0  1  +
2000: chr1 3670861 3670861   0  1  +
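
As a rough sketch of how an fread-based read would treat the problematic case (integers in the first 30 rows, a decimal on row 31), assuming the data.table package is installed: the file below is made up for illustration and is plain text rather than gzipped, since reading .gz files with fread additionally needs the R.utils package. This is not genomation's code, only a way to check the behaviour by hand.

```r
library(data.table)

# Made-up BED-like file: 30 integer scores followed by one decimal score.
bed_path <- tempfile(fileext = ".txt")
writeLines(c(
  paste("chr1", 3000000L + 1:30, 3000000L + 1:30, 100, 1, "+", sep = "\t"),
  paste("chr1", 3000031L, 3000031L, 66.6667, 1, "-", sep = "\t")
), bed_path)

# fread detects column types itself; no per-column specification is passed here.
bed_dt <- fread(bed_path, sep = "\t", header = FALSE)

print(class(bed_dt$V4))   # type chosen for the score column
print(bed_dt$V4[31])      # the decimal value on row 31
print(anyNA(bed_dt$V4))   # whether any score was lost to NA
```

If the layout is known in advance, fread's colClasses argument can pin the types explicitly instead of relying on detection.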

al2na · 2019-06-05T21:41:10Z

Then we can change it to fread :)
