feat: add fit_curve and predict_curve #139
Conversation
Codecov Report

```diff
@@            Coverage Diff             @@
##             main     #139      +/-   ##
==========================================
+ Coverage   78.95%   79.64%   +0.69%
==========================================
  Files          28       29       +1
  Lines        1207     1253      +46
==========================================
+ Hits          953      998      +45
- Misses        254      255       +1
```

... and 2 files with indirect coverage changes.
@LukeWeidenwalker can you clarify how many time steps and how many pixels in the spatial dimension are present in your test set?
What is the chunk size? We would also need to check what happens if we increase the size in the spatial domain. In any case, for this functionality to be useful it should also work with a larger area, like a Sentinel-2 tile (~10000x10000).
Chunk size after rechunking was
The time seems to scale linearly as far as I can see - doing half the number of pixels took ~50% of the time. Therefore we should expect a full Sentinel-2 tile to take ~50h with the current setup.
I will try doubling the dask cluster size and report back on how that changes runtime - although I don't expect that to help, because as I said above, most of the time is spent on single cores working through this `vectorize__wrapper` call.
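As a quick back-of-the-envelope check of that extrapolation (using the rough figures quoted above, not measured values):

```python
# Rough sanity check of the ~50h estimate for a full Sentinel-2 tile
# (~10000x10000 pixels); all numbers are the approximate figures quoted above.
pixels = 10_000 * 10_000              # ~1e8 pixels in one tile
estimated_hours = 50
throughput = pixels / (estimated_hours * 3600)
print(f"~{throughput:.0f} pixels/second implied by the 50h estimate")  # ~556
```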
That should totally work, only subject to resource limitations on our backend!
Where is the re-chunking size set? From the experiments I did, I remember that overall it performed much better with small chunks like 128x128, 256x256, ...
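For reference, a minimal sketch of what rechunking to such small spatial chunks could look like on a dask-backed xarray DataArray (the dimension names and chunk sizes here are assumptions for illustration, not the PR's actual defaults):

```python
# `data` is assumed to be a dask-backed xarray.DataArray with dims ("t", "y", "x").
# Keep the temporal dimension in a single chunk (needed for per-pixel fitting)
# and use small spatial chunks, e.g. the 128x128 suggested above.
small_chunks = data.chunk({"t": -1, "y": 128, "x": 128})
print(small_chunks.chunks)
```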
Actually @clausmichele, do you still have performance characteristics from your experiments around? Would be interesting to compare the throughput (pixels/second)! If your numbers were higher than this, I think we could expect more low-hanging fruit and try optimising a bit further - otherwise, we should probably just accept this level of performance!
I've also just launched and pre-emptively killed a few experiments on a 10000x10000 spatial extent - my dask cluster did not run out of memory handling this array along the way, so although it would take a fair bit longer, I'm fairly confident that this computation would run through!
Thanks Lukas for the tests. I actually don't have my old ones anymore, but they showed that I was getting better performance with "small" chunks.
Ah, I'm not calling this with the Python client, just locally with a remote dask cluster attached. Are you suggesting that we merge this, deploy it to prod as an experimental process, and you start testing on that? Or should we add `predict_curve` too first and then deploy?
We should have `predict_curve` too.
@clausmichele @ValentinaHutter I've now also implemented `predict_curve`.
LGTM! There are just some things I was wondering about - see comments :)
openeo_processes_dask/process_implementations/ml/curve_fitting.py
@clausmichele it would be great if you could find some time to look over this PR this week - once this is merged, you should be able to start testing the usecase in short order!
openeo_processes_dask/process_implementations/ml/curve_fitting.py
```python
parameters = {f"param_{i}": v for i, v in enumerate(parameters)}

# The dimension along which to predict cannot be chunked!
rechunked_data = data.chunk({dimension: -1})
```
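As a side note, a small illustration of what that dict comprehension produces (the input list here is made up for the example, not taken from the PR):

```python
# Hypothetical input: initial parameter guesses supplied as a plain list.
parameters = [0.1, 0.5, 0.2]
named = {f"param_{i}": v for i, v in enumerate(parameters)}
# -> {"param_0": 0.1, "param_1": 0.5, "param_2": 0.2}

# data.chunk({dimension: -1}) then merges all chunks along `dimension` into a
# single chunk, so each per-pixel series is available in one piece.
```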
Shouldn't we do the same in `predict_curve` as well?
I don't think it matters for predict, because there each timestep can be inferred independently of the other steps!
@LukeWeidenwalker that's true. But I was wondering if it would be faster when predicting on a datacube with many timesteps?
Hmm - not sure tbh, I haven't profiled `predict_curve` at all yet - I think I'll merge and deploy this now so we can start a training run at least, and revisit this if performance of inference turns out to be a problem!
@LukeWeidenwalker I left two comments, other than that it seems fine.
I've now taken a shot at implementing `fit_curve` too, and thought I'd just start from scratch to see what the problem really is. With just the xarray built-in function `curvefit`, I get a throughput of 1000 pixels/sec with a vanilla OpenEO dask cluster (6 workers w/ 4 CPUs and 12GB RAM each). Dask added constant-memory rechunking a while ago with their P2P rechunking scheme, so I'm somewhat hopeful that this will be okay for larger datacubes too. I haven't tried a truly humongous dataset yet (the largest I tried was 1 million pixels, 2x what was used in the SRR2 notebook, which took 30 minutes), but I'm inclined not to worry about the rechunking anymore unless it comes up again.

@clausmichele do you think this level of performance allows you to make progress on this usecase?
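For context, a minimal sketch of the kind of `curvefit`-based approach described above, assuming a dask-backed DataArray `data` with dimensions ("t", "y", "x") and an illustrative harmonic model (not necessarily the exact function or chunking used in this PR):

```python
import numpy as np

def harmonic(t, a, b, c):
    # Illustrative seasonal model: an offset plus one sine/cosine pair.
    return a + b * np.cos(2 * np.pi * t) + c * np.sin(2 * np.pi * t)

# curvefit needs the fitting dimension to sit in a single chunk per pixel.
rechunked = data.chunk({"t": -1, "y": 128, "x": 128})

fit = rechunked.curvefit(
    coords="t",                       # fit along the temporal coordinate
    func=harmonic,
    param_names=["a", "b", "c"],
)
params = fit.curvefit_coefficients    # one fitted (a, b, c) triple per pixel
```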
There are some open questions about the interface:

- I don't think `fit_curve` should be run through `apply_dimension` or `reduce_dimension`, as was changed in fit_curve return schema openeo-processes#425, because this makes everything awkward. I've therefore added the `dimension` parameter back in.
- `ignore_nodata` is not handled right now.
- `predict_curve` is fine to run as a reducer though! I haven't had a chance to take a closer look though, so might need changes there too!
- The parameter unpacking I've only kind of eyeballed so far, could use a closer look to confirm that what I'm doing there makes sense!
Note on performance: most of the wall time is spent in a `vectorize__wrapper` step - afaict this is where `apply_ufunc` vectorizes the function before passing it to scipy's `curve_fit`, and this work isn't parallelising. I've already tried pre-compiling the function with numba before, but to no avail. Not entirely sure how we'd go about speeding this up further, but maybe the current throughput is already good enough to demonstrate the usecase?
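For readers unfamiliar with where that `vectorize__wrapper` step comes from, here is a rough sketch (assumed dimension names and model, not the PR's actual code) of the `apply_ufunc` pattern in question: with `vectorize=True`, xarray wraps the per-pixel fit in `numpy.vectorize`, which loops over pixels serially inside each chunk.

```python
import numpy as np
import xarray as xr
from scipy.optimize import curve_fit

def model(t, a, b):
    return a + b * t

def fit_single_pixel(y, t):
    # Fit one time series; np.vectorize calls this once per pixel.
    popt, _ = curve_fit(model, t, y, p0=[0.0, 0.0])
    return popt

params = xr.apply_ufunc(
    fit_single_pixel,
    data,                                # dask-backed DataArray with a "t" dim
    data["t"].astype(float),
    input_core_dims=[["t"], ["t"]],
    output_core_dims=[["param"]],
    vectorize=True,                      # this np.vectorize wrapper is what
    dask="parallelized",                 # shows up as vectorize__wrapper
    output_dtypes=[float],
    dask_gufunc_kwargs={"output_sizes": {"param": 2}},
)
```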