Is there a flushing process for temp files #112

Closed
radaniba opened this issue Aug 28, 2014 · 14 comments

Comments

@radaniba

When we use pybedtools, each call to one of its functions creates a temporary file for later access (the file referenced as x.fn).

My concern is that these files can be big, and they will keep piling up in the user's /tmp or on servers.

Is there a way to flush these temp files when the program exits? I don't think it is reasonable to flush while the program is still running, but it would definitely be useful to clean up a bit afterwards.

Any thoughts?

Thanks

@radaniba
Author

I guess one can specify the output file and remove it from within the program when it ends, but in case they don't, it would be good if pybedtools 'remembered' all the files generated in a given session and cleared them out when the program exits.

@daler
Owner

daler commented Aug 28, 2014

Yep, any tempfiles created are automatically cleaned up when the Python interpreter exits. Specifically, the last line in helpers.py registers helpers.cleanup() to be called upon exit.
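The mechanism can be illustrated with a minimal, standard-library-only sketch. This is not the code in helpers.py; the names `TEMPFILES`, `new_tempfile`, and `cleanup` below are hypothetical stand-ins that only show the general pattern of tracking created files and registering a cleanup function with `atexit`.

```python
import atexit
import os
import tempfile

# Hypothetical stand-in for the list of tempfiles a session has created.
TEMPFILES = []


def new_tempfile(prefix="sketch.", suffix=".tmp"):
    """Create an empty temp file, remember its path, and return the path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=suffix)
    os.close(fd)
    TEMPFILES.append(path)
    return path


def cleanup():
    """Delete every file this session created; ignore already-missing ones."""
    while TEMPFILES:
        path = TEMPFILES.pop()
        try:
            os.remove(path)
        except FileNotFoundError:
            pass


# Run cleanup() automatically on a normal interpreter exit.
atexit.register(cleanup)
```

Note that `atexit` handlers only run on a normal interpreter exit; they are skipped if the process is killed by a signal Python does not handle or if `os._exit()` is called.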

Within a single session, you can always call pybedtools.cleanup() to get rid of any files created so far in that session.

By default, cleanup() only gets rid of the files in BedTool.TEMPFILES so that if other users on the same filesystem are using pybedtools, their files won't get deleted inadvertently. But if you don't care about that, you can use pybedtools.cleanup(remove_all=True) to get rid of anything matching $TEMPDIR/pybedtools.*.tmp. This could be slow if you have hundreds of thousands of files; see below for a solution to this.

If you kill a running Python process that created a lot of tempfiles, cleanup() will never run, and that can cause temp files to accumulate. For example, in the past I've killed a running Python process that was doing a lot of randomizations using multiple processors. This resulted in a LOT (hundreds of thousands) of tempfiles that never got cleaned up from a normal exit. From a terminal, rm /tmp/pybedtools.*.tmp gave an "argument list too long" error. The solution was to use find and xargs, as described here.
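For anyone who would rather do this from Python than from the shell, here is a rough sketch of the same idea: delete the matching files one at a time, so no single command ever receives the full list of names. The function name `remove_matching` is hypothetical, and the default pattern is simply the one quoted above.

```python
import glob
import os


def remove_matching(directory, pattern="pybedtools.*.tmp"):
    """Delete files in `directory` matching `pattern`, one at a time.

    Returns the number of files removed.
    """
    removed = 0
    # iglob returns an iterator, so each path is handled individually and
    # no shell command line is ever built from the full list of names.
    for path in glob.iglob(os.path.join(glob.escape(directory), pattern)):
        try:
            os.remove(path)
            removed += 1
        except OSError:
            # Skip files that vanished or that we are not allowed to delete.
            pass
    return removed
```

Calling `remove_matching("/tmp")` would target the same files as the `rm` command above. Be aware that, like `remove_all=True`, this also deletes files belonging to other pybedtools sessions that use the same directory.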

Also note that if you're creating new tempfiles across multiple processes, the list of tempfiles is not shared across process boundaries. That's why the functions in stats.py are careful about deleting files as they go.
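A sketch of the "delete as you go" pattern, under the assumption that each worker only needs its temp file for the duration of one call. This is not the code in stats.py; `process_chunk` is a hypothetical worker that uses try/finally so the file is removed by the same process that created it.

```python
import os
import tempfile


def process_chunk(lines):
    """Write `lines` to a private temp file, count them, then delete the file.

    Returns (line_count, path). The path is returned only so callers can
    confirm that the file is gone.
    """
    fd, path = tempfile.mkstemp(prefix="worker.", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.writelines(line + "\n" for line in lines)
        with open(path) as handle:
            count = sum(1 for _ in handle)
        return count, path
    finally:
        # Runs in whichever process created the file, so no shared
        # bookkeeping across process boundaries is needed.
        os.remove(path)
```

A function shaped like this can be handed to something like `multiprocessing.Pool.map`; each worker then cleans up after itself regardless of what the parent process tracks.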

@radaniba
Author

hmm good to know, thanks @daler for explaining this

The reason I am asking is that I thought leftover temp files might be causing a MalformedBedLineError:


coverage_result = alignment.genome_coverage(genome="hg19")

coverage_result.head(100)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "../python2.7/site-packages/pybedtools/bedtool.py", line 1036, in head
    for i, line in enumerate(iter(self)):
  File "cbedtools.pyx", line 680, in pybedtools.cbedtools.IntervalFile.__next__ (pybedtools/cbedtools.cpp:8685)
pybedtools.cbedtools.MalformedBedLineError: malformed line: ['1', '30', '22', '249250621', '8.82646e-08']

This is the section from the temp file

1   28  52  249250621   2.08625e-07
1   29  83  249250621   3.32998e-07
1   30  22  249250621   8.82646e-08
1   31  59  249250621   2.3671e-07
1   32  29  249250621   1.16349e-07
1   33  23  249250621   9.22766e-08

This looks good to me, though, and it used to work before. I don't really understand the reason for this message.

Any idea?

PS: I cleaned all temp files before running.

@daler
Owner

daler commented Aug 28, 2014

Ah, that's because your start coord is greater than your stop coord for the third line in your example.

See this recent BEDtools mailing list post for details.

@radaniba
Author

But those aren't supposed to be coordinates. That's pybedtools' genome_coverage called with no bg or bga option, which returns chromosome, depth, number of reads, size, fraction.

No?

@daler
Owner

daler commented Aug 28, 2014

Sorry, I missed that. In that case, this is similar to issue #110, where it's not actually a valid BED/GTF/GFF/VCF/BAM format file.

The problem here is that sometimes BedTool.genome_coverage (i.e. bedtools genomecov) returns a valid bedGraph file (if you use -d, -bg, -bga) and sometimes not (as in the default).
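To make the mismatch concrete, here is a small sketch that parses a few of the lines quoted earlier in this thread as five-column histogram rows rather than as intervals. The field labels are illustrative choices, not pybedtools or bedtools identifiers; the third column is labeled `count` here, and the bedtools documentation describes it as the number of bases at that depth.

```python
from collections import namedtuple

# Illustrative labels for the five columns of the default histogram output.
HistogramRow = namedtuple(
    "HistogramRow", "chrom depth count chrom_size fraction"
)

# Three of the lines quoted from the temp file earlier in the thread.
SAMPLE = """\
1   28  52  249250621   2.08625e-07
1   29  83  249250621   3.32998e-07
1   30  22  249250621   8.82646e-08
"""


def parse_histogram(text):
    """Parse whitespace-separated five-column histogram lines."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        chrom, depth, count, size, fraction = line.split()
        rows.append(
            HistogramRow(chrom, int(depth), int(count), int(size), float(fraction))
        )
    return rows


rows = parse_histogram(SAMPLE)
# Rows where column 2 exceeds column 3 are the ones an interval parser
# would reject, because it reads them as start > stop.
bed_invalid = [r for r in rows if r.depth > r.count]
```

With this sample, `bed_invalid` holds only the `1 30 22 ...` row, which is the line named in the MalformedBedLineError above.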

I suppose I could manually check for which parameters were passed, and detect whether a file will be formatted to work nicely with a BedTool object. If so, return a BedTool object. But if the default settings are used, what should be returned if not a BedTool object?

So far, I've chosen to not pay attention to kwargs passed and just always return a BedTool object, relying on the user to decide if their file is a valid format or not. But I'm certainly open to suggestions for how to improve this.
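A rough sketch of what such a parameter check might look like. This is not existing pybedtools code: `yields_interval_output` is a hypothetical helper, and treating `d`, `bg`, and `bga` as the complete list of relevant options is an assumption taken from the comment above.

```python
# Option names taken from the discussion above; treating them as the
# complete list of "interval-like" modes is an assumption of this sketch.
INTERVAL_OUTPUT_FLAGS = ("d", "bg", "bga")


def yields_interval_output(**kwargs):
    """Return True if any flag that switches genomecov away from the
    default histogram output is set to a truthy value."""
    return any(kwargs.get(flag) for flag in INTERVAL_OUTPUT_FLAGS)
```

A wrapper could call this with the same kwargs it forwards to genome_coverage and use the result to decide what kind of object to hand back, for example `yields_interval_output(bg=True, genome="hg19")` versus `yields_interval_output(genome="hg19")`.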

@radaniba
Author

Hmm, I see. I guess this could be solved with another function similar to #110, but instead of returning a BedTool + DataFrame it would return 2 DataFrames.

In general, I think it would be better to add a watcher-style function: something that checks whether the relevant kwargs are provided and, if so, returns a BedTool; otherwise it returns some other exploitable / parsable kind of data. A DataFrame would be ideal.

I guess for now I can play with the solution provided in #110, but that's good to know. Thanks for clarifying this, @daler.

@radaniba
Author

By the way, is there another utility similar to pybedtools.create_interval_from_list? Something like pybedtools.create_csv?

@daler
Owner

daler commented Aug 28, 2014

What are you aiming to do? If you'd like a CSV version of a BedTool, you could use the new to_dataframe method:

import pybedtools
a = pybedtools.example_bedtool('a.bed')
a.to_dataframe().to_csv('output.csv')

@radaniba
Author

Is that merged into the master branch? Should I pull the repo again?

@daler
Owner

daler commented Aug 28, 2014

Yep -- I committed it yesterday after you said the method I proposed would work for your purposes. It's in the master branch now.

@radaniba
Author

Awesome, thanks a lot @daler, I will update. I am putting together a couple of examples of pybedtools usage and will be publishing some runnable examples at CodersCrowd soon.

@daler
Owner

daler commented Aug 28, 2014

OK. Closing this for now, but feel free to re-open if needed. Also, I opened #113 for detecting valid BedTool output as you mentioned.

@daler daler closed this as completed Aug 28, 2014
@radaniba
Author

👍
