I am from Thales, and I work in collaboration with the French Space Agency (CNES) on an openEO backend.
One of our concerns is estimating the resources required to compute a ProcessGraph (we are using openeo-processes-dask on top of an HPC cluster with a SLURM resource manager).
There are two parts to achieving this:
1. estimate the required resources for each individual process, based on the size of the data it works on;
2. estimate a whole process graph according to those per-process estimations.
Process Estimation
The goal is to execute each openEO process from openeo-processes-dask individually and time it for each combination of parameters, mainly:
- the Dask SLURMCluster configuration (mostly the number of workers; see the sketch right after this list)
- the size of the data to process
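For the cluster dimension, a minimal sketch of what the sweep with dask-jobqueue could look like (the resource values below are placeholders, not our actual configuration):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the real values depend on the HPC partition
cluster = SLURMCluster(cores=4, memory="16GB", walltime="00:30:00")
client = Client(cluster)

for n_workers in (1, 2, 4, 8):
    cluster.scale(n_workers)              # ask SLURM for that many worker jobs
    client.wait_for_workers(n_workers)    # make sure they are up before timing anything
    # ... run the per-process benchmarks at this cluster size ...
```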
Current method
We implement a wrapper function for each process to call; for the 'round' process, for example, the wrapper wraps the call into 'apply' so that the rounding runs on a whole raster cube.
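A minimal sketch of what such a wrapper can look like, using plain xarray rounding as a stand-in for the openeo-processes-dask implementations of 'apply' and 'round' (the names and default values here are only illustrative):

```python
import dask.array as da
import xarray as xr


def make_round_wrapper(cube: xr.DataArray, decimals: int = 2):
    """Return a zero-argument callable that lazily rounds every pixel of the cube.

    In the real wrapper this is the openeo-processes-dask 'apply' with 'round'
    as the child process; plain xarray rounding stands in for it here.
    """
    def process() -> xr.DataArray:
        return cube.round(decimals)

    return process


# A small lazily-chunked cube standing in for a RasterCube
cube = xr.DataArray(
    da.random.random((4, 1024, 1024), chunks=(1, 512, 512)),
    dims=("time", "y", "x"),
)
process = make_round_wrapper(cube, decimals=3)
```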
In the estimation process, the datacubes required by the process are persisted in the Dask workers' memory before we start the process execution and monitor it (currently we mostly monitor time).
The execution is done like this:

```python
import distributed

result = process()                                  # build the lazy graph for the wrapped process
persisted_result = result.persist(pure=True)        # run it on the workers and keep the result in memory
futures_res = distributed.wait(persisted_result)    # block until all underlying futures are finished
```
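Putting the pieces together, one benchmark point could be captured roughly like this (a simplified sketch of our own measurement; only wall-clock time is recorded, which matches the "mostly time" monitoring we do today):

```python
import time

import distributed


def benchmark_once(process) -> float:
    """Run one wrapped process (inputs already persisted) and return its wall-clock time."""
    start = time.perf_counter()
    persisted = process().persist()
    distributed.wait(persisted)
    return time.perf_counter() - start
```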
From these estimations, we build a JSON file containing, for each process, its different resource profiles according to the size of the data to process.
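As an illustration, the structure of that file could look roughly like the following (keys, units and values are made up for the example):

```python
import json

# Hypothetical resource-profile structure; keys, units and numbers are illustrative
profiles = {
    "round": [
        {"data_size_mb": 100, "n_workers": 2, "duration_s": 1.4},
        {"data_size_mb": 1000, "n_workers": 4, "duration_s": 6.8},
    ],
}

with open("process_profiles.json", "w") as f:
    json.dump(profiles, f, indent=2)
```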
Automatic method?
I was wondering whether we could have an automatic implementation that looks into the process spec and determines whether the process can be called directly or needs to be wrapped into 'apply', for example, and that also determines the input arguments to pass.
My feeling is that this is difficult for some processes. For example, 'round' takes a number and an integer: we would like to pass our raster cube as the number and a number of decimals as the integer (and wrap the call into the process 'apply'), but it seems hard to determine automatic values for the arguments without "automatically" knowing what they really stand for.
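To make the idea concrete, the kind of spec inspection we have in mind could start like this (the fields follow the openEO process JSON; the heuristic is only a first guess and does not solve the argument-value problem described above):

```python
def needs_apply_wrapper(process_spec: dict) -> bool:
    """Heuristic: if no parameter accepts a datacube, the process works on scalar
    values and has to be wrapped into 'apply' to be benchmarked on a raster cube."""
    for param in process_spec.get("parameters", []):
        schemas = param["schema"]
        if isinstance(schemas, dict):
            schemas = [schemas]
        if any(s.get("subtype") in ("raster-cube", "datacube") for s in schemas):
            return False
    return True
```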
ProcessGraph Estimation
The goal is to integrate a ProcessGraph estimation into our data-processing service (for the /jobs/<job_id>/estimate route and/or before starting a batch job).
Here we have taken OpenEOProcessGraph._map_node_to_callable from openeo-pg-parser-networkx and adapted it to extract data from the processes that run inside the inner "node_callable" function.
The Dask graph is built by calling pg_callable(), but we skip the result-saving part.
This assumes everything runs lazily.
For each lazily-run process, we get the size of the arguments that are RasterCubes.
Based on the processes that need to be executed and the size of their data arguments, we estimate the duration and the required resource profile for the ProcessGraph.
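A sketch of what that lookup against the measured profiles could look like (nearest measured data size; the structure mirrors the JSON file above, and interpolation would be a natural refinement):

```python
import xarray as xr


def raster_cube_size_mb(arg) -> float | None:
    """Size in MB if the argument is an xarray-backed raster cube (nbytes is
    derived from shape and dtype, so it does not trigger any computation)."""
    return arg.nbytes / 1e6 if isinstance(arg, xr.DataArray) else None


def estimate_duration_s(process_id: str, data_size_mb: float, profiles: dict) -> float:
    """Pick the measured profile of this process whose data size is closest."""
    candidates = profiles[process_id]
    best = min(candidates, key=lambda c: abs(c["data_size_mb"] - data_size_mb))
    return best["duration_s"]
```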
We are wondering whether someone else is working on such an estimation as well.
We would be glad to exchange on this topic!