Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring MetaDataObject out of DenseMatrix #758

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

corepointer
Copy link
Collaborator

This PR moves the MetaDataObject (MDO) functionality out of DenseMatrix and generalizes it to be used by other classes derived from Structure as well.

Furthermore, this contains a performance improvement to prevent excessive allocation ID lookups and a separation of ranged and full allocations.

All tests are running except the distributed ones.

corepointer added a commit to corepointer/daphne that referenced this pull request Jul 22, 2024
* This commit introduces the meta data object to the CSR data type

* Memory pinning

To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.
@corepointer corepointer mentioned this pull request Jul 22, 2024
corepointer added a commit to corepointer/daphne that referenced this pull request Jul 29, 2024
* This commit introduces the meta data object to the CSR data type

* Memory pinning

To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.
corepointer added a commit to corepointer/daphne that referenced this pull request Aug 19, 2024
* This commit introduces the meta data object to the CSR data type

* Memory pinning

To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
* This commit introduces the meta data object to the CSR data type

* Memory pinning

To prevent excessive allocation ID lookups in the hot path when using --vec, this change "pins" memory by allocation type of previous accesses.
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from df4702e to cfd8053 Compare October 18, 2024 15:15
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from cfd8053 to 17d3baa Compare October 18, 2024 15:31
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from 17d3baa to d9d1b59 Compare October 18, 2024 17:04
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from d9d1b59 to 9016ae9 Compare October 18, 2024 17:06
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from 9016ae9 to 6f6da3b Compare October 18, 2024 17:10
corepointer added a commit to corepointer/daphne that referenced this pull request Oct 18, 2024
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from 6f6da3b to ce36921 Compare October 18, 2024 17:12
@corepointer
Copy link
Collaborator Author

The numerous force pushes are a result of my local clang-format disagreeing with the CI's clang-format:

 --- src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(original)
+++ src/runtime/local/datastructures/AllocationDescriptorGRPC.h	(reformatted)
@@ -35,7 +35,7 @@
   public:
     AllocationDescriptorGRPC() = default;
     AllocationDescriptorGRPC(DaphneContext *ctx, const std::string &address, const DistributedData &data)
-        : ctx(ctx), workerAddress(address), distributedData(data) {};
+        : ctx(ctx), workerAddress(address), distributedData(data){};
 
     ~AllocationDescriptorGRPC() override = default;
     [[nodiscard]] ALLOCATION_TYPE getType() const override { return type; };

@corepointer corepointer marked this pull request as ready for review October 18, 2024 17:15
@corepointer corepointer added feature missing/requested features performance label for PRs of perf++ and issues of perf-- Accelerators Distributed Issues and PRs related to distributed computation labels Oct 18, 2024
@corepointer
Copy link
Collaborator Author

Explaining the labels:

  • feature: CUDA handling CSRMatrix is new
  • Accelerator: it's (also) about CUDA ops
  • Distributed: the refactoring affects this component
  • Performance: besides this one explaining itself, the pinning and being able to run sparse stuff on GPU help with performance 💪

@corepointer corepointer requested a review from pdamme October 18, 2024 17:20
…not throw

Changing the behavior of fileExists() to a boolean operation as suggested by the method's name. Throwing an exception us up to the caller of this method.

Closes daphne-eu#867
… Pinning

* This commit introduces the meta data object to the CSRMatrix data type
  To implement this change, handling of the AllocationDescriptors has been refactored out of DenseMatrix.

* Separate handling of ranges
  Since tracking of ranges of data is only used in the distributed setting for now, we will handle this separately and assume always a full allocation for local computation. This should result in less unnecessary "if range not null do this, else do that".

* Memory pinning
  To prevent excessive allocation ID lookups in the hot path, especially when using --vec, this change "pins" memory by allocation type of previous accesses.
  Simply put, as long as there is no different access type (e.g., call getValues() for host vs device memory) it is assumed, that the data is not changed and no query of the meta data object needs to be done.

Closes daphne-eu#758
Due to the use of ptr to local var the distributed (GRPC_SYNC) mode crashed in test cases. This patch fixes this by using std::unique_ptr appropriately.
Furthermore, a check for nullptr is performed before getting distributed data to add a message indicating that execution failed here.
@corepointer corepointer force-pushed the mdo_csr_cuda_refactor branch from ce36921 to d434bf5 Compare October 19, 2024 00:59
@philipportner
Copy link
Collaborator

Hi @corepointer, thanks for putting so much effort into improving this code.

This PR is pretty big and includes quite a few important changes at the core of our data structures.

As there's a lot of changes to this core code, but no changes to the documentation, could you please update the documentation describing the overall idea behind the whole MetaDataObject and IAllocatoinDescriptor functionality?

Especially the knowledge that is hard to get from just reading the code, like what it can and what it cannot do (and what it should do?). From what I can remember of previous discussions and you mention "ranged and full allocations", but maybe I'm wrong, it should handle parts of a DenseMatrix being allocated on CPU and parts of it being moved to GPU? Stuff like that :)

Additionally, it would be great if this can explicitly be tested as part of our test suite. Proper testing of this core functionality would be incredibly beneficial for working on this code in the future. This should be doable by providing mock implementations of the IAllocationDescriptor.

Lastly, could you please provide some performance results for these changes that measure these improvements you mention concerning excessive allocation ID lookups?

I'd also be happy to do a review of the PR if you like.

@pdamme
Copy link
Collaborator

pdamme commented Oct 24, 2024

I was also just finishing reading the code: Thanks for this contribution, @corepointer. Refactoring the use of the meta data objects to also support CSRMatrix and to solve some of the performance issues of the vectorized engine runtime is highly welcome!

Major comments:

  1. I think I have a very rough idea of how the meta data object etc. works on main, but I cannot claim that my understanding is deep enough to be able to judge all implications of your changes. While I didn't spot any obvious big problems, I totally second @philipportner's request for more developer documentation (e.g., a page about the ideas behind the meta data object and related entities and how they are used by the data structures etc.).
  2. Likewise about the performance implications: You mentioned that the "pinning" of the allocation can make the repeated calls to getValues() in the vectorized engine faster. Do you have a concrete example where this performance improvement can be observed? A quick experiment of bin/daphne --vec test/api/cli/algorithms/kmeans.daphne r=1000000 f=1000 c=5 i=10 didn’t show any improvement compared to the main branch on my system (but I didn’t try hard to find a good example).

A few detailed comments:

  1. CSRMatrix.h/cpp: The renaming of getValues() to getRowValues() and getColIdxs() to getColIdxsOfRow() is inconsistent. Could we call the former getValuesOfRow() (or, do we really need the renaming at all)?
  2. AggCum.h: You introduced variables arg_inc and res_inc, which are either the numCols or the rowSkip, depending on whether the respective matrix is a vew. However, the rowSkip is meant to abstract exactly that: no matter if a matrix is an "original" or a view, the rowSkip is always the number by which one must increase a pointer in order to get to the element below.

@corepointer
Copy link
Collaborator Author

Thanks for the feedback @philipportner and @pdamme. And thanks for a taste of my one medicine (I frequently note the lacking documentation and test cases in code reviews) 😆 I will, of course, fix this.

  • Pinning performance improvement: The idea originated from an observation of @philipportner that a lot of time is spent in getDataPlacementByID(), which ran a double for loop on every get/set of a value. Do you remember (@philipportner) what that script was where you encountered the issue? I'd use that as a starting point.
  • Adding a getRowValues() (in addition to getValues()) was an ad-hoc decision concerning the naming and can certainly be changed.
  • In the AggCum.h issue I can't remember off the top of my head now what exactly the issue was. I think I found a case where the increment in the algorithm needed to be numCols for some cases and rowSkip for other cases depending on whether there is a view or a "real" matrix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Accelerators Distributed Issues and PRs related to distributed computation feature missing/requested features performance label for PRs of perf++ and issues of perf--
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants