Restructure parallelization and Caching documentation #609

Open
wants to merge 17 commits into
base: docs/restructure

Conversation

AnesBenmerzoug
Collaborator

Description

This PR addresses part of #581

Changes

  • Created a how-to guide for speeding up computations with parallelization, based on the existing documentation and adapted to the changes in Feature/refactor value #558.
  • Created a how-to guide for speeding up computations with caching, based on the existing documentation and adapted to the changes in Feature/refactor value #558.

Checklist

  • Wrote Unit tests (if necessary)
  • Updated Documentation (if necessary)
  • Updated Changelog
  • If notebooks were added/changed, added boilerplate cells are tagged with "tags": ["hide"] or "tags": ["hide-input"]

@AnesBenmerzoug AnesBenmerzoug self-assigned this Jul 2, 2024
@AnesBenmerzoug AnesBenmerzoug changed the base branch from develop to docs/restructure July 2, 2024 13:43
@AnesBenmerzoug AnesBenmerzoug requested a review from schroedk July 2, 2024 14:00
Collaborator

@mdbenito mdbenito left a comment

Just a couple of comments. Haven't reviewed all of it.

A more general note is: does this really count as a "how to" guide? If I understood correctly, those should focus on answering the question, not on explaining how things work. Opinions anyone?

docs/how-to-guides/how-to-speedup-caching.md (two review threads, outdated and resolved)
@janosg
Collaborator

janosg commented Jul 2, 2024

> A more general note is: does this really count as a "how to" guide? If I understood correctly, those should focus on answering the question, not on explaining how things work. Opinions anyone?

I would also say that, in its current form, it mixes a how-to guide with explanations. An ideal how-to guide according to the Diátaxis framework is tightly focused on solving one specific user need. Users are usually in a hurry and do not read anything carefully, so the how-to guide needs to be very easy to skim: less text and more code snippets are better. But you do need a bit of text (headings and introductory sentences) to reassure readers that they are in the right place, or to tell them where they should go instead.

I think I have also made the mistake of adding explanations to almost all the how-to guides I wrote. But I'll try to give an example of how I understand how-to guides now.

An example of a specific situation could be: A user wants to do data valuation. It's slow but they have a big computer and wonder how they can use it.

Then the how-to guide could be structured as follows:

How to speed up data valuation with parallelization

This guide will show you how to speed up data valuation algorithms by using parallel hardware. For alternative ways to speed up data valuation see [caching]. For parallelization of influence functions see [...].

Parallelization on a single computer

This approach is a good idea if you are working on a laptop or small server.

# code snippet here
# the code should already make it obvious how one sets the number of cores
# no need to mention default backends or stuff

(Explain what the code does but do not describe in words how it could be changed or how joblib works. If you want to show variations, add a second code snippet. If there are many code snippets, use tabs.)

For advanced configuration see [joblib docs]
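
As a rough illustration of what the snippet in this section might look like, here is a minimal sketch that assumes joblib is installed and used directly. It does not show the library's own API: `marginal_utility` and `evaluate_all` are hypothetical stand-ins for an expensive valuation step.

```python
from joblib import Parallel, delayed


def marginal_utility(subset_size: int) -> float:
    """Hypothetical stand-in for an expensive utility evaluation."""
    # Deterministic busy work so there is something to parallelize.
    return sum(i * i for i in range(subset_size)) / max(subset_size, 1)


def evaluate_all(sizes: list[int], n_jobs: int = 4) -> list[float]:
    """Evaluate the utility for every size using n_jobs workers."""
    # Results come back in the same order as the inputs.
    return Parallel(n_jobs=n_jobs)(delayed(marginal_utility)(s) for s in sizes)


if __name__ == "__main__":
    sizes = [10_000, 20_000, 30_000, 40_000]
    print(evaluate_all(sizes, n_jobs=4))
```

The number of cores is the only knob visible in the call, which is the point made above: the reader can copy the snippet and change `n_jobs` without reading about backends.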

Parallelization on multiple machines

...

@AnesBenmerzoug
Collaborator Author

Thanks for the feedback! I have restructured and updated the content.

I want your opinion on a few ideas:

  • I think we should write the how-to guides as notebooks so that they always run on CI, which reduces the likelihood of them going stale.
  • I think we should move the notebooks to the specific places where they are needed or used inside the docs directory. This would simplify the documentation build process because we could get rid of the build script that copies the notebooks every time.
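
To make the first idea concrete, here is a hedged sketch of a script that a CI job could use to execute every notebook. It assumes the `jupyter nbconvert` command-line tool is available in the CI environment; the function names and any directory passed to them are hypothetical, not the project's actual layout or build script.

```python
import subprocess
from pathlib import Path


def find_notebooks(root: Path) -> list[Path]:
    """Return all notebooks under root, skipping Jupyter checkpoint copies."""
    return sorted(
        p for p in root.rglob("*.ipynb") if ".ipynb_checkpoints" not in p.parts
    )


def build_command(notebook: Path, output_dir: Path) -> list[str]:
    """Build the command that executes one notebook from top to bottom."""
    return [
        "jupyter", "nbconvert",
        "--to", "notebook",
        "--execute",
        "--output-dir", str(output_dir),
        str(notebook),
    ]


def run_all(root: Path, output_dir: Path) -> int:
    """Execute every notebook and return the number of failures."""
    failures = 0
    for notebook in find_notebooks(root):
        result = subprocess.run(build_command(notebook, output_dir))
        if result.returncode != 0:
            failures += 1
    return failures
```

A CI step would call `run_all` on the documentation directory and fail the job when the returned count is non-zero, so a guide that no longer runs is caught before it is published.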

@AnesBenmerzoug
Collaborator Author

@janosg, @schroedk I made a few more changes and added more information. This should be ready for a final review.

Collaborator

@janosg janosg left a comment

I think it's very good now. Just have two minor comments we can discuss later.

)
```

## Parallelization
Collaborator

This section is very good but it is explaining background instead of showing how to get stuff done. I would keep it in this document but switch the order of "Local Parallelization" and "Parallelization" and rename "Parallelization" to something like "Understanding the pattern" or "Understanding what happened". That way people first see the code example they can copy and paste and then can decide on their own if they want to continue reading.

see [[speed-up-value-with-parallel]].


### Sequential Computation
Collaborator

I think the sequential computation should be explained in a separate document that is marked as prerequisite for this one.


4 participants