Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genomeCov bed breaks during iteration. #145

Closed
gpratt opened this issue Sep 14, 2015 · 2 comments
Closed

genomeCov bed breaks during iteration. #145

gpratt opened this issue Sep 14, 2015 · 2 comments

Comments

@gpratt
Copy link

gpratt commented Sep 14, 2015

It looks like genome coverage bed breaks when we are just trying to generate the standard histogram. Here is the minimally reproducible code.

 bedtool = pybedtools.BedTool("""
    chr1    1   100 feature1    0   +
    chr1    1   100 feature1    0   +
    """, from_string=True).saveas("foo.bed")

    for line in bedtool.genome_coverage(g="/Users/gpratt/bioinformatics/genomes/hg19/hg19.chrom.sizes", stream=True, **{'5': True}):
        print line

This is the stacktrace:

chr1    0   249250620   249250621   1

---------------------------------------------------------------------------
MalformedBedLineError                     Traceback (most recent call last)
<ipython-input-30-57d113f47b93> in <module>()
      4 """, from_string=True).saveas("foo.bed")
      5 
----> 6 for line in bedtool.genome_coverage(g="/Users/gpratt/bioinformatics/genomes/hg19/hg19.chrom.sizes", stream=True, **{'5': True}):
      7     print line

/Users/gpratt/anaconda/lib/python2.7/site-packages/pybedtools-0.7.0-py2.7-macosx-10.5-x86_64.egg/pybedtools/cbedtools.pyx in pybedtools.cbedtools.IntervalIterator.__next__ (pybedtools/cbedtools.cxx:10344)()
    770         # TODO: optimization: create_interval_from_list should have a version
    771         # that accepts C++ string instances
--> 772         return create_interval_from_list(fields)
    773 
    774 

/Users/gpratt/anaconda/lib/python2.7/site-packages/pybedtools-0.7.0-py2.7-macosx-10.5-x86_64.egg/pybedtools/cbedtools.pyx in pybedtools.cbedtools.create_interval_from_list (pybedtools/cbedtools.cxx:9238)()
    680 
    681     if pyb.start > pyb.end:
--> 682         raise MalformedBedLineError("Start is greater than stop")
    683     pyb._bed.fields = list_to_vector(orig_fields)
    684     return pyb

MalformedBedLineError: Start is greater than stop

And the output from genomecov:

be-scrm-228-233-145:ipython_notebooks gpratt$ genomeCoverageBed -i foo.bed -g /Users/gpratt/bioinformatics/genomes/hg19/hg19.chrom.sizes -5 
chr1    0   249250620   249250621   1
chr1    2   1   249250621   4.01203e-09

pybedtools is trying to create an interval out of something that shouldn't be an interval. Got any ideas for a fix?

@daler
Copy link
Owner

daler commented Sep 14, 2015

Pretty sure this is caused by the #143 "fix" that is breaking more than it fixed . . . I'll take a closer look.

@daler
Copy link
Owner

daler commented Sep 15, 2015

Ah right, this is actually expected behavior and is related to #110.

bedtools genomecov in default mode does not return a BED file. It's a general tab-delimited file representing histograms.

Demo time. Turns out we need the {"5":True} argument along with the specially-created x to get this issue to pop up in a minimal example.

import pybedtools
x = pybedtools.BedTool("""
    chr1    1   100 feature1    0   +
    chr1    1   100 feature1    0   +
    """, from_string=True)
g = pybedtools.chromsizes_to_file({'chr1': (0, 200)})
y = x.genome_coverage(g=g, **{'5': True})

As you showed, the following raises MalformedBedLineError. pybedtools sees the first line and thinks it's a BED file, but then chokes on the second line:

for line in y:
    print (line)
# MalformedBedLineError: Start is greater than stop

The solution is to bypass the interpretation of BED format. For a file-based BedTool like this example, open the file indicated by y.fn and iterate over it. Or better yet, use pandas:

import pandas
t = pandas.read_table(y.fn, names=['chrom', 'depth', 'bases', 'size', 'frac'])
#     chrom  depth  bases  size   frac
# 0    chr1      0    199   200  0.995
# 1    chr1      2      1   200  0.005
# 2  genome      0    199   200  0.995
# 3  genome      2      1   200  0.005

There you can see the problem on line 1. When it was trying to iterate over results, pybedtools thought "depth" was start position and "bases" was the stop, and sure enough depth is greater than bases.

For streaming genome_coverage results, you'll need to iterate over .fn, which contains the generator of lines:

y = x.genome_coverage(g=g, stream=True, **{'5': True})
for line in y.fn:
    print (line)

Note that everything works fine if you use one of the genomecov arguments that returns BED-like output, like bg=True which gets you bedGraph output:

for line in x.genome_coverage(g=g, bg=True, stream=True, **{'5': True}):
    print (line)

Of course, the question is should pybedtools know about non-BED-like output and supply an appropriate warning or iterate over the file rather than convert to Interval objects. There's an issue #113 for that, but honestly it hasn't been a high priority and I'm not sure it warrants the additional complexity. Suggestions welcome!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants