I am from Thales, and I work in collaboration with the French Space Agency (CNES) on an openEO backend.
One of our concerns is estimating the resources required to compute a ProcessGraph (we are using openeo-processes-dask on top of an HPC cluster with a SLURM resource manager).
There are two parts to achieving this:
1. estimate the required resources for each individual process, based on the size of the data it works on;
2. estimate a whole process graph according to those per-process estimations.
Process Estimation
The goal is to execute each openEO process from openeo-processes-dask individually and time it for each combination of parameters, mainly:
- the Dask SLURMCluster configuration (mostly the number of workers; see the sketch right after this list)
- the size of the data to process
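For the cluster dimension, a minimal sketch of what the sweep with dask-jobqueue could look like (the resource values below are placeholders, not our actual configuration):

```python
from dask.distributed import Client
from dask_jobqueue import SLURMCluster

# Placeholder resources; the real values depend on the HPC partition
cluster = SLURMCluster(cores=4, memory="16GB", walltime="00:30:00")
client = Client(cluster)

for n_workers in (1, 2, 4, 8):
    cluster.scale(n_workers)              # ask SLURM for that many worker jobs
    client.wait_for_workers(n_workers)    # make sure they are up before timing anything
    # ... run the per-process benchmarks at this cluster size ...
```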
Current method
We implement a wrapper function for each process to call; for the 'round' process, for example, the wrapper wraps the call into 'apply' so that the rounding runs on a whole raster cube.
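A minimal sketch of what such a wrapper can look like, using plain xarray rounding as a stand-in for the openeo-processes-dask implementations of 'apply' and 'round' (the names and default values here are only illustrative):

```python
import dask.array as da
import xarray as xr


def make_round_wrapper(cube: xr.DataArray, decimals: int = 2):
    """Return a zero-argument callable that lazily rounds every pixel of the cube.

    In the real wrapper this is the openeo-processes-dask 'apply' with 'round'
    as the child process; plain xarray rounding stands in for it here.
    """
    def process() -> xr.DataArray:
        return cube.round(decimals)

    return process


# A small lazily-chunked cube standing in for a RasterCube
cube = xr.DataArray(
    da.random.random((4, 1024, 1024), chunks=(1, 512, 512)),
    dims=("time", "y", "x"),
)
process = make_round_wrapper(cube, decimals=3)
```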
In the estimation process, the datacubes required by the process are persisted in the Dask workers' memory before we start the process execution and monitor it (currently we mostly monitor time).
The execution is done like this:

```python
import distributed

result = process()                                  # build the lazy graph for the wrapped process
persisted_result = result.persist(pure=True)        # run it on the workers and keep the result in memory
futures_res = distributed.wait(persisted_result)    # block until all underlying futures are finished
```
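Putting the pieces together, one benchmark point could be captured roughly like this (a simplified sketch of our own measurement; only wall-clock time is recorded, which matches the "mostly time" monitoring we do today):

```python
import time

import distributed


def benchmark_once(process) -> float:
    """Run one wrapped process (inputs already persisted) and return its wall-clock time."""
    start = time.perf_counter()
    persisted = process().persist()
    distributed.wait(persisted)
    return time.perf_counter() - start
```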
From these estimations, we build a JSON file containing, for each process, its different resource profiles according to the size of the data to process.
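As an illustration, the structure of that file could look roughly like the following (keys, units and values are made up for the example):

```python
import json

# Hypothetical resource-profile structure; keys, units and numbers are illustrative
profiles = {
    "round": [
        {"data_size_mb": 100, "n_workers": 2, "duration_s": 1.4},
        {"data_size_mb": 1000, "n_workers": 4, "duration_s": 6.8},
    ],
}

with open("process_profiles.json", "w") as f:
    json.dump(profiles, f, indent=2)
```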
Automatic method?
I was wondering whether we could have an automatic implementation that looks into the process spec and determines whether the process can be called directly or needs to be wrapped into 'apply', for example, and that also determines the input arguments to pass.
My feeling is that this is difficult for some processes. For example, 'round' takes a number and an integer: we would like to pass our raster cube as the number and a number of decimals as the integer (and wrap the call into the process 'apply'), but it seems hard to determine automatic values for the arguments without "automatically" knowing what they really stand for.
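To make the idea concrete, the kind of spec inspection we have in mind could start like this (the fields follow the openEO process JSON; the heuristic is only a first guess and does not solve the argument-value problem described above):

```python
def needs_apply_wrapper(process_spec: dict) -> bool:
    """Heuristic: if no parameter accepts a datacube, the process works on scalar
    values and has to be wrapped into 'apply' to be benchmarked on a raster cube."""
    for param in process_spec.get("parameters", []):
        schemas = param["schema"]
        if isinstance(schemas, dict):
            schemas = [schemas]
        if any(s.get("subtype") in ("raster-cube", "datacube") for s in schemas):
            return False
    return True
```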
ProcessGraph Estimation
The goal is to integrate a ProcessGraph estimation into our data-processing service (for the /jobs/<job_id>/estimate route and/or before starting a batch job).
Here we have taken OpenEOProcessGraph._map_node_to_callable from openeo-pg-parser-networkx and adapted it to extract data from the processes that run inside the inner "node_callable" function.
The Dask graph is built by calling pg_callable(), but we skip the result-saving part.
This assumes everything runs lazily.
For each lazily-run process, we get the size of the arguments that are RasterCubes.
Based on the processes that need to be executed and the size of their data arguments, we estimate the duration and the required resource profile for the ProcessGraph.
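A sketch of what that lookup against the measured profiles could look like (nearest measured data size; the structure mirrors the JSON file above, and interpolation would be a natural refinement):

```python
import xarray as xr


def raster_cube_size_mb(arg) -> float | None:
    """Size in MB if the argument is an xarray-backed raster cube (nbytes is
    derived from shape and dtype, so it does not trigger any computation)."""
    return arg.nbytes / 1e6 if isinstance(arg, xr.DataArray) else None


def estimate_duration_s(process_id: str, data_size_mb: float, profiles: dict) -> float:
    """Pick the measured profile of this process whose data size is closest."""
    candidates = profiles[process_id]
    best = min(candidates, key=lambda c: abs(c["data_size_mb"] - data_size_mb))
    return best["duration_s"]
```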
We are wondering whether someone else is working on such an estimation as well.
We would be glad to exchange on this topic!