Restructure parallelization and Caching documentation #609

AnesBenmerzoug · 2024-07-02T13:42:51Z

Description

This PR addresses part of #581

Changes

Created a how-to guide for speeding up computations with parallelization, based on existing documentation and adapted to the changes in Feature/refactor value #558.
Created a how-to guide for speeding up computations with caching, based on existing documentation and adapted to the changes in Feature/refactor value #558.

Checklist

~~Wrote Unit tests (if necessary)~~
Updated Documentation (if necessary)
~~Updated Changelog~~
~~If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]~~

mdbenito

Just a couple of comments. Haven't reviewed all of it.

A more general note is: does this really count as a "how to" guide? If I understood correctly, those should focus on answering the question, not on explaining how things work. Opinions anyone?

docs/how-to-guides/how-to-speedup-caching.md

janosg · 2024-07-02T15:42:41Z

> A more general note is: does this really count as a "how to" guide? If I understood correctly, those should focus on answering the question, not on explaining how things work. Opinions anyone?

I would also say that in its current form it mixes a how-to guide with explanations. According to the Diátaxis framework, an ideal how-to guide is tightly focused on solving one specific user need. Users are usually in a hurry and don't read anything carefully, so the how-to guide needs to be very easy to skim: less text and more code snippets. But you do need a bit of text (headings and introductory sentences) to reassure readers that they are in the right place, or to tell them where they should go instead.

I think I have also made the mistake of adding explanations in almost all the how-to guides I wrote. But I'll try to give an example of how I understand how-to guides now.

An example of a specific situation could be: A user wants to do data valuation. It's slow but they have a big computer and wonder how they can use it.

Then the how-to guide could be structured as follows:

How to speed up data valuation with parallelization

This guide will show you how to speed up data valuation algorithms by using parallel hardware. For alternative ways to speed up data valuation see [caching]. For parallelization of influence functions see [...].

Parallelization on a single computer

This approach is a good idea if you are working on a laptop or small server.

# code snippet here
# the code should already make it obvious how one sets the number of cores
# no need to mention default backends or stuff

(Explain what the code does but do not describe in words how it could be changed or how joblib works. If you want to show variations, add a second code snippet. If there are many code snippets, use tabs.)

For advanced configuration see [joblib docs]

Parallelization on multiple machines

...
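To make the outline above concrete, here is a minimal sketch of the kind of snippet the single-computer section could contain. It is not pyDVL's or joblib's API: `evaluate_subset`, `evaluate_all` and `n_workers` are hypothetical names, and a standard-library thread pool stands in for the real backend so that the example runs without extra dependencies.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_subset(subset: tuple[int, ...]) -> float:
    """Hypothetical stand-in for one expensive, independent evaluation."""
    return sum(x * x for x in subset) / len(subset)


def evaluate_all(subsets: list[tuple[int, ...]], n_workers: int = 4) -> list[float]:
    """Evaluate every subset using a pool of n_workers workers.

    Assumption: a thread pool is used only to keep the sketch
    dependency-free. For CPU-bound work, a process pool or joblib
    would be the realistic choice.
    """
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # pool.map returns results in the same order as the inputs.
        return list(pool.map(evaluate_subset, subsets))


subsets = [(1, 2), (3, 4, 5), (6,)]
results = evaluate_all(subsets, n_workers=2)
```

The only thing a reader has to change is `n_workers`, which is the point of the outline: the snippet itself shows how to set the degree of parallelism, without prose about backends or defaults.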

AnesBenmerzoug · 2024-07-08T09:14:24Z

Thanks for the feedback! I have restructured and updated the content.

I want your opinion on a few ideas:

I think we should write the how-to guides as notebooks so that they always run on CI, which reduces the likelihood of them going stale.
I think we should move the notebooks to the specific place where they're needed / used inside the docs directory. This would simplify the documentation build process because we can get rid of the build script that copies the notebooks every time.

AnesBenmerzoug · 2024-07-09T08:18:22Z

@janosg, @schroedk I made a few more changes and added more information. This should be ready for a final review.

docs/how-to-guides/how-to-scale-up-if-with-parallel.md

Co-authored-by: Kristof Schröder <[email protected]>

janosg

I think it's very good now. Just have two minor comments we can discuss later.

janosg · 2024-07-15T06:27:04Z

docs/how-to-guides/how-to-speed-up-value-with-parallel.md

+)
+```
+
+## Parallelization


This section is very good, but it explains background instead of showing how to get stuff done. I would keep it in this document but switch the order of "Local Parallelization" and "Parallelization", and rename "Parallelization" to something like "Understanding the pattern" or "Understanding what happened". That way people first see the code example they can copy and paste, and can then decide for themselves whether to continue reading.

janosg · 2024-07-15T06:28:09Z

docs/how-to-guides/how-to-scale-up-if-with-parallel.md

+    see [[speed-up-value-with-parallel]].
+
+
+### Sequential Computation


I think the sequential computation should be explained in a separate document that is marked as a prerequisite for this one.

AnesBenmerzoug added 3 commits July 2, 2024 15:23

Add a how-to guide for parallelization

ec8018f

Add a how-to guide for caching

8515574

Remove redundant documentation

5263160

AnesBenmerzoug self-assigned this Jul 2, 2024

AnesBenmerzoug changed the base branch from develop to docs/restructure July 2, 2024 13:43

AnesBenmerzoug requested a review from schroedk July 2, 2024 14:00

mdbenito reviewed Jul 2, 2024

docs/how-to-guides/how-to-speedup-caching.md (outdated, resolved)

docs/how-to-guides/how-to-speedup-caching.md (outdated, resolved)

AnesBenmerzoug added 3 commits July 7, 2024 22:53

Restructure and improve how-to guides

2cd2c3b

Rename how-to files for consistency

e2296ed

Fix and improve how-to guides' cards

933776c

AnesBenmerzoug added 7 commits July 8, 2024 13:11

Apply feedback to how-to guide for valuation parallelization

b8d3c1a

Add warning to if scale up guide

0a97284

Fix call to Dataset.from_sklearn

51d959f

Expand caching speed up guide

b93607d

Add conclusion to caching speed up guide

8af7e59

Split code blocks for readability

3f4d40b

Simplify how-to guides index page

4dbda45

schroedk reviewed Jul 10, 2024

docs/how-to-guides/how-to-scale-up-if-with-parallel.md (outdated, resolved)

schroedk reviewed Jul 10, 2024

docs/how-to-guides/how-to-scale-up-if-with-parallel.md (outdated, resolved)

schroedk reviewed Jul 10, 2024

docs/how-to-guides/how-to-scale-up-if-with-parallel.md (outdated, resolved)

schroedk reviewed Jul 10, 2024

docs/how-to-guides/how-to-scale-up-if-with-parallel.md (outdated, resolved)

AnesBenmerzoug and others added 3 commits July 11, 2024 09:31

Remove warning about speeding up influence function

04828a2

Add warning about deprecation of in-memory cache backend

8b8a240

Apply suggestions from code review

6e78519

Co-authored-by: Kristof Schröder <[email protected]>

janosg reviewed Jul 15, 2024

Split IF how-to guide into two guides

c1e303e
