Right now ProcessQueue -p removes all files from the process queue, calculates all possible things to run for them, and then executes them all. There are a few problems with this:

1. If ProcessQueue crashes in the middle of the run, the full state of the queue is lost.
2. As the database gets bigger and the calculating step takes longer, it takes longer before the parallel step (the actual processing) is entered.
3. Eventually I'd like some means of prioritizing new processing over reprocessing, or at least not having to completely halt all new processing to do reprocessing. This one change won't completely address that want, but it's probably a prerequisite (since it will involve injecting new items into the ProcessQueue mid-process).
So I'd like to have ProcessQueue -p handle the files on the queue in a "chunked" fashion.
Proposed enhancement
This is going to involve interweaving the calculation of the runMe objects with their execution, whereas before they've been pretty separate, so I'm guessing there needs to be a good round of code cleanup and testing in there. We probably also need some way of maintaining serialization.
Right now I believe we pull everything off the queue at once to check for redundant processing; if we can do a quick skim for redundancy first, and then pull one item at a time off the queue, that would make life a lot better. So currently the flow is (a rough sketch in code follows the list):

1. Retrieve all file_ids from the queue, emptying it
2. Calculate all command lines (including some sort of redundancy check; will need to look into it)
3. While command lines remain to execute:
   a. Start commands up to some process limit
   b. Wait for commands to terminate
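In code, the current flow amounts to something like the sketch below. This is a minimal illustration, not the actual implementation: dbu, pop_all_file_ids, build_commands, and MAX_PROCS are hypothetical stand-ins, not real dbprocessing names.

```python
import subprocess
import time

MAX_PROCS = 4  # hypothetical process limit


def run_batch(dbu):
    # Drain the process queue entirely; if we crash after this point,
    # the full state of the queue is lost (problem 1).
    file_ids = pop_all_file_ids(dbu)  # hypothetical stand-in
    # Calculate every command line up front; this serial step grows with
    # the database before any parallel work starts (problem 2).
    commands = [cmd for fid in file_ids for cmd in build_commands(dbu, fid)]
    # Only now does the parallel step begin.
    running = []
    while commands or running:
        while commands and len(running) < MAX_PROCS:
            running.append(subprocess.Popen(commands.pop(0)))
        running = [p for p in running if p.poll() is None]
        time.sleep(1)  # crude wait for something to terminate
```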
The new proposal is (again sketched in code after the list):

1. Perform a redundancy check/sweep of the process queue
2. While the process queue is not empty:
   a. Pop one item from the database process queue (potentially moving it to a "currently processing" queue, another table)
   b. Calculate its possible outputs and command lines (runMe objects) and add them to a Python queue of runMe objects
3. While there are runMe objects:
   a. While the process limit is not reached, pop one from the runMe queue and start it
   b. Wait for a process to terminate
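Roughly, in code (same caveats as the sketch above; sweep_redundant, pop_one_file_id, and build_runmes are hypothetical, and I'm pretending a runMe exposes its command line as a cmdline attribute):

```python
import queue
import subprocess
import time

MAX_PROCS = 4  # hypothetical process limit


def run_chunked(dbu):
    sweep_redundant(dbu)     # step 1: redundancy check/sweep (hypothetical)
    runme_q = queue.Queue()  # Python-side queue of runMe objects
    running = []
    while True:
        # Step 2 (producer): pop ONE entry from the database queue and
        # expand it; this is also where the entry could move to a
        # "currently processing" table.
        fid = pop_one_file_id(dbu)  # hypothetical; None when queue is empty
        if fid is not None:
            for runme in build_runmes(dbu, fid):  # hypothetical
                runme_q.put(runme)
        # Step 3 (consumer): start runMes up to the process limit.
        while not runme_q.empty() and len(running) < MAX_PROCS:
            running.append(subprocess.Popen(runme_q.get().cmdline))
        # Finished only when the DB queue, the runMe queue, and the
        # running processes are all exhausted.
        if fid is None and runme_q.empty() and not running:
            break
        running = [p for p in running if p.poll() is None]
        time.sleep(1)  # crude wait for a process to terminate
```

The single-item pop is what bounds the damage from a crash: at worst we lose one queue entry plus whatever runMe objects were in flight, rather than the whole queue.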
This does wind up in a fairly complicated producer-consumer relationship, because an entry in the process queue potentially maps to multiple runMe objects (it may be an input for several processes). The "currently processing" queue is tempting because it would help with crash recovery, but it would require some record of the relationship between runMe objects and queue entries, so we know when an entry from the queue is complete (and can be removed). I haven't noted that bookkeeping above.
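One possible shape for that bookkeeping, assuming each runMe can be traced back to the queue entry that produced it (every name here is hypothetical, not an existing dbprocessing call):

```python
from collections import Counter

outstanding = Counter()  # queue-entry id -> count of unfinished runMe objects


def note_runmes(dbu, entry_id, runmes):
    # Called when an entry is expanded into its runMe objects.
    if not runmes:  # nothing to run, so the entry is already complete
        remove_from_currently_processing(dbu, entry_id)  # hypothetical
    else:
        outstanding[entry_id] = len(runmes)


def note_runme_done(dbu, entry_id):
    # Called as each runMe finishes; the entry is complete (and can be
    # removed from "currently processing") when its count hits zero.
    outstanding[entry_id] -= 1
    if outstanding[entry_id] <= 0:
        del outstanding[entry_id]
        remove_from_currently_processing(dbu, entry_id)  # hypothetical
```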
Alternatives
Speeding up the calculation of outputs would help in some regards, but not with the crash-recovery question or the ability to do some level of start/stop.
We could also update the process queue to contain not files to process, but all their possible outputs, moving the calculation step to when a file is created instead of when it is processed. This might cause consistency problems, since adding new files would change the equation for files that have already been added, but perhaps not: ultimately the file is only used to calculate "process plus UTC file day", and storing that seems reasonable, as well as being very easy to check for duplicates. (Pull the first item off the queue, see if the same process plus day exists later; if so, just throw the first one away.) This would remove the one-to-many problem of process queue entries to runMe objects.
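The duplicate check then becomes trivial. A sketch, with the queue as a plain list of (process_id, utc_day) tuples standing in for the database table:

```python
def pop_unless_superseded(q):
    """Pop the first (process_id, utc_day) entry; if the same pair appears
    later in the queue, throw this one away and return None."""
    entry = q.pop(0)
    return None if entry in q else entry


q = [(1, '2021-01-01'), (2, '2021-01-01'), (1, '2021-01-01')]
assert pop_unless_superseded(q) is None  # same pair occurs again later
assert pop_unless_superseded(q) == (2, '2021-01-01')
```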
It would probably help this analysis to figure out how much processing is going into calculating possible outputs (what days and products can be made from a file) vs. calculating up-to-date (does this need to be run?) vs. calculating command lines. If calculating the command line is the slow part, then there could be a big win in moving it to the parallel bit--i.e., making the runMe very late in the game. And if the up-to-date check is fast, then having the same date/process combination on the queue multiple times isn't a big loss.
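A first pass at answering that could be as simple as accumulating wall-clock time per phase; a sketch (the phase names and the wrapped call are hypothetical):

```python
import time
from contextlib import contextmanager

timings = {'outputs': 0.0, 'uptodate': 0.0, 'cmdline': 0.0}


@contextmanager
def timed(phase):
    # Accumulate elapsed wall-clock time into the named phase.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] += time.perf_counter() - start

# Usage: wrap each phase of the calculation, then dump timings at the end:
# with timed('outputs'):
#     outputs = calc_possible_outputs(fid)  # hypothetical call
```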
OS, Python version, and dependency version information:
Version of dbprocessing: Current master
Closure condition
Obviously this enhancement will be updated as I do research and design, but closure would be when we have a commit that pulls from the process queue in bite-sized pieces.