
Start work on moving the calculation of derivatives to the apply loop for actions that pass vectors and matrices #1093

Draft · wants to merge 63 commits into base: master

Conversation

@gtribello (Member)

Description

This is the PR for the benchmarking that we discussed on Wednesday.

DO NOT MERGE

I will provide more explanation tomorrow.

Target release

I would like my code to appear in release 2.11

Type of contribution
  • changes to code or doc authored by PLUMED developers, or additions of code in the core or within the default modules
  • changes to a module not authored by you
  • new module contribution or edit of a module authored by you
Copyright
  • I agree to transfer the copyright of the code I have written to the PLUMED developers or to the author of the code I am modifying.
  • the module I added or modified contains a COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.
Tests
  • I added a new regtest or modified an existing regtest to validate my changes.
  • I verified that all regtests are passed successfully on GitHub Actions.

Gareth Aneurin Tribello added 2 commits July 4, 2024 16:07
@gtribello gtribello added the wip Do not merge label Jul 4, 2024
@gtribello gtribello requested a review from GiovanniBussi July 4, 2024 15:34
@gtribello (Member Author)

Hi @Iximiel and @GiovanniBussi

I ran out of time to give the details of this yesterday so here they are.

You can calculate a coordination number with this version of the code using the following input:

c1: CONTACT_MATRIX GROUPA=1-3 GROUPB=1-512 SWITCH={EXP D_0=4 R_0=0.5 D_MAX=6}
ones: ONES SIZE=512
cc: MATRIX_VECTOR_PRODUCT ARG=c1,ones
mts: SUM ARG=cc PERIODIC=NO
PRINT ARG=mts FILE=colvar
BIASVALUE ARG=mts
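
To spell out what this computes: element (i,j) of the contact matrix is the switching function value sigma(r_ij), so row i of the matrix times the ones vector is the coordination number of atom i, and SUM then adds these up:

  c_i = \sum_j \sigma(r_{ij}), \qquad \text{mts} = \sum_i c_i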

I think it should be possible to reproduce the benchmarks that Daniele was doing on the various versions of the coordination number using the input above.

By default, the calculation is run in the same way as in the current master version of the code. That is why all the regtests pass for this branch: they are doing the same thing they would do in the current master. In other words, when the regtests are run, all the linked lists are still used.

To run without using the linked lists, set the following environment variable:

export PLUMED_FORBID_CHAINS=yes

All the calculation of forces is then done during the backwards (apply) loop.
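
In case it helps anyone reading along, checking a yes/no environment variable like this is the standard getenv pattern; a minimal sketch (the variable name is the one above, but the function and where it would be called from are hypothetical, not the actual PLUMED code path):

#include <cstdlib>
#include <cstring>

// Returns true when the user has exported PLUMED_FORBID_CHAINS=yes.
bool chainsForbidden() {
  const char* env = std::getenv("PLUMED_FORBID_CHAINS");
  return env != nullptr && std::strcmp(env, "yes") == 0;
}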

It would be really nice if we could look at the performance of the code with the following options:

  • Calculation without BIASVALUE (i.e. no derivatives) and without the PLUMED_FORBID_CHAINS option set.
  • Calculation without BIASVALUE (i.e. no derivatives) and with PLUMED_FORBID_CHAINS=yes set.
  • Calculation with BIASVALUE (i.e. with derivatives) and without the PLUMED_FORBID_CHAINS option set.
  • Calculation with BIASVALUE (i.e. with derivatives) and with PLUMED_FORBID_CHAINS=yes set.

If we could have some slides with the performance of those four options by next Wednesday, so that we can discuss next steps in the meeting, that would be really great. The benchmarks I am thinking of would look at the performance of the code as you increase the number of atoms: essentially the sorts of graphs @Iximiel has been producing to look at the various GPU flavours.

Thanks

@Iximiel (Member) commented Jul 8, 2024

@gtribello I am setting up the benches right now

A couple of questions:

  • In the way the atom systems for the benches are set up, the first 3 atoms of each system are in a corner, so I think that for most of the pairs this might mainly test the "if dist > dmax then skip" branch of the various systems
    • If I do not change the input file: how should I scale the systems? 3*n vs 512*n, or 3 vs 512*n?

@gtribello (Member Author)

Hi @Iximiel

I don't understand the queries. However, I made the input more complicated than it needed to be. You can do:

c1: CONTACT_MATRIX GROUP=1-100 SWITCH={EXP D_0=4 R_0=0.5 D_MAX=6}
ones: ONES SIZE=100
cc: MATRIX_VECTOR_PRODUCT ARG=c1,ones
mts: SUM ARG=cc PERIODIC=NO
PRINT ARG=mts FILE=colvar
BIASVALUE ARG=mts

To increase the number of atoms, you change the number of indices specified to GROUP in the CONTACT_MATRIX. You must also increase the SIZE in ONES to the number of atoms.

Does that help?

@Iximiel (Member) commented Jul 8, 2024

Ok, I'll go with GROUP=@mdatoms then; I was wondering if there was a reason for the GROUPA/GROUPB input.

On that note, is there a special keyword for the number of atoms to use with ONES SIZE=?

@gtribello (Member Author)

There is no special keyword for the number of atoms for ONES, sorry.

@gtribello gtribello marked this pull request as draft July 10, 2024 07:52
@Iximiel (Member) commented Jul 10, 2024

As you asked, here's my benchmark setup. The example is for a local run; on LEONARDO I create a single run file per natoms-distribution combination:

#!/bin/bash
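# Usage: pass 'run' as the first argument to run the benchmarks and write one
# .out file per distribution/natoms combination; the timing collection at the
# bottom of the script always runs.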

nsteps=1000
list_of_natoms="500 2000 4000 6000 8000 10000 12000 14000 16000"
export PLUMED_MAXBACKUP=0

export PLUMED_NUM_THREADS=1
mpiprocesses=1
useDistr="line sc"

function setFname() {
  echo "${distr}_${mpiprocesses}mpi_${PLUMED_NUM_THREADS}threads_${natoms}_Steps${nsteps}"
}
mpicmd=""
if [[ $mpiprocesses -gt 1 ]]; then
  mpicmd="mpirun -n $mpiprocesses"
fi

if [[ $1 = run ]]; then
  module purge
  module load gnu9/9.4.0 openmpi4/4.1.1
  module load plumed/master09-07
  for distr in $useDistr; do

    for natoms in $list_of_natoms; do

      fname=$(setFname).out

      echo "output file: ${fname}"
      $mpicmd plumed benchmark \
        --plumed=plumed_master.dat:plumed_other.dat \
        --kernel=this:../../src/lib/install/libplumedKernel.so \
        --nsteps="${nsteps}" \
        --natoms="${natoms}" \
        --atom-distribution="${distr}" \
        >"${fname}"

      grep -B1 Comparative "${fname}"

    done
  done
fi
#postprocess the output
for distr in $useDistr; do
  echo "collecting ./$distr"
  {
    for natoms in $list_of_natoms; do
      echo -n "$natoms "
      fname=$(setFname).out
      grep Calculating "${fname}" |
        awk '{printf "%f ", $7}'
      echo ""
    done
  } >"timing_calculate_${distr}_s${nsteps}.data"
  {
    for natoms in $list_of_natoms; do
      echo -n "$natoms "
      fname=$(setFname).out
      sed -n '/PLUMED: *Cycles *Total *Average *Minimum *Maximum/{n ; p}' "${fname}" |
        awk '{printf "%f ", $3}'
      echo ""
    done
  } >"timing_total_${distr}_s${nsteps}.data"
done

--plumed is the plumed file or a list of files; --kernel is the kernel or a list of kernels.
You can have 1 input file with n kernels, n input files with 1 kernel, or n input files with n kernels; in the last case kernel1 gets input1, kernel2 gets input2, and so on.

@gtribello (Member Author)

Hi @Iximiel

Thanks for the script for the benchmarking. I quickly looked at the code and found something obvious that was causing a massive slowdown. Can you quickly rerun the benchmarks we discussed in today's meeting with this new version of the code and see what difference it makes?

The two options might now be comparable in cost. It would be great if you could post the benchmarks here and tag me in the post.

Thanks

@Iximiel (Member) commented Jul 10, 2024

Hi @gtribello!
Here are the results with the latest commit d15c12f (green) and the old version b9b8969 (red):
[image: benchmark results]

The old one uses the timings I measured yesterday.

@gtribello (Member Author)

Hi @Iximiel

Thanks for running the benchmarks again.

if( actionInChain() ) { ActionWithMatrix::performTask( task_index, myvals ); return; }

if( sumrows ) {
  unsigned n=getNumberOfArguments()-1; Value* myvec = getPntrToArgument(n);

Check notice (Code scanning / CodeQL): Unused local variable. Variable myvec is not used.
@gtribello (Member Author) commented Jul 16, 2024

Hi @Iximiel and @GiovanniBussi

I had another look at the code in this branch and did some optimisations. These are the new timings against the master branch:

[image: plumed-speed]

To be clear, the master here is doing the derivatives in the forward loop. When the master code runs, all the mechanics with the linked lists are used.

For the tests with this branch, I am using the PLUMED_FORBID_CHAINS=yes environment variable and turning off all the linked lists. Derivatives are thus calculated during the backwards (apply) loop.

It would be great to have one more benchmark on Leonardo, like the one in @Iximiel's previous message. Can you please set that up, @Iximiel?

If @Iximiel's benchmark also shows similar performance between the two versions, we can (perhaps) justify making the more significant change. In other words, we can use forces calculated in the backwards (apply) loop everywhere. This change will simplify the code. Removing the linked lists should also make it easier to add parallelism with GPUs to this part of the code.
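
To make the difference concrete, here is a toy sketch of the two schemes for a CV s = sum_i f(x_i) (my own illustration with made-up structs, not PLUMED's actual classes): the forward-loop scheme stores every df/dx_i while s is computed, while the apply-loop scheme stores nothing and recomputes the derivatives when the bias force arrives.

#include <cmath>
#include <cstddef>
#include <vector>

// Forward-loop scheme: derivatives are stored while s is computed and are
// reused when the bias force is applied.
struct ForwardScheme {
  std::vector<double> deriv;
  double calc(const std::vector<double>& x) {
    double s = 0.0;
    deriv.resize(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
      double f = std::exp(-x[i]);
      s += f;
      deriv[i] = -f;  // df/dx_i, kept in memory for the whole step
    }
    return s;
  }
  void apply(double dVds, std::vector<double>& force) const {
    for (std::size_t i = 0; i < deriv.size(); ++i) force[i] -= dVds * deriv[i];
  }
};

// Backward (apply-loop) scheme: nothing is stored; df/dx_i is recomputed in
// the apply pass, trading a second evaluation for much lower memory use.
struct BackwardScheme {
  double calc(const std::vector<double>& x) const {
    double s = 0.0;
    for (double xi : x) s += std::exp(-xi);
    return s;
  }
  void apply(const std::vector<double>& x, double dVds, std::vector<double>& force) const {
    for (std::size_t i = 0; i < x.size(); ++i) force[i] -= dVds * (-std::exp(-x[i]));
  }
};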

What do you think? Would you agree that this is an OK way to proceed?

To clarify, these changes will be made for v2.11 and there will be no changes of syntax.

@Iximiel (Member) commented Jul 17, 2024

@gtribello here are the benchmarks I did on LEONARDO:
[images: benchmark results on LEONARDO]

The color code is the usual: the lighter color is the "4 Calculating (forward loop)" part.
Initially I did the benches with export PLUMED_NUM_THREADS=8, but they diverged a lot from the ones you posted, so I added a run with PLUMED_NUM_THREADS=1.
Everything was run with:

fname=sc_${PLUMED_NUM_THREADS}threads_2000nat_500steps
plumed benchmark --plumed="sc_cm2000master.dat:sc_cm2000target.dat" \
  --kernel="/path/to/master/lib/libplumedKernel.so:/path/to/plumed-derivatives-from-backpropagation/lib/libplumedKernel.so" \
  --maxtime=840 \
  --natoms=2000 --nsteps=500 --atom-distribution=sc >"${fname}.out"

The plumed file, adapted to the number of atoms, is:

c1: CONTACT_MATRIX GROUP=1-4000 SWITCH={EXP D_0=4 R_0=0.5 D_MAX=6}
ones: ONES SIZE=4000
cc: MATRIX_VECTOR_PRODUCT ARG=c1,ones
mts: SUM ARG=cc PERIODIC=NO
PRINT ARG=mts FILE=sc_colvar-bias-master1706-4000
BIASVALUE ARG=mts
I also did the same calculations but with `--atom-distribution=line`; they are in the spoiler here:

[images: line-distribution benchmark results]

@gtribello (Member Author)

Hello @Iximiel and @GiovanniBussi

I've been working more on writing a version of PLUMED where the derivatives are calculated during the backward loop. I've also gotten confused. It would help me, @Iximiel, if you could run another benchmark, just so that I can check that my conclusions about the state of the project are correct. Sorry, and thank you in advance.

So, firstly, the confusion. I realised that the switching function parameters we used in that example input were not particularly sensible. With those switching function parameters, every single atom in the box with 8000 atoms was within the D_MAX cutoff of 6 nm that we set, which is not really how this type of CV is used.

I thus created a new input with more sensible switching function parameters. It is below:

c1: CONTACT_MATRIX GROUP=@mdatoms SWITCH={EXP D_0=1.0 R_0=0.1 D_MAX=2.0} 
ones: ONES SIZE=8000
cc: MATRIX_VECTOR_PRODUCT ARG=c1,ones
mts: SUM ARG=cc PERIODIC=NO 
PRINT ARG=mts FILE=colvar 
BIASVALUE ARG=mts 

Separately, @Iximiel, it would be useful to add some detail to the manual for plumed benchmark about how the various configurations that can be used for benchmarking are set up. I picked the switching function parameters above by trial and error; it would be useful to understand what the configurations are. In other words, it would be good to know the typical spacing between atoms in the configurations used for benchmarking.

When I run benchmarks, I get the following results:

[image: new_speed_update]

I am thus concluding the following:

  1. My version of coordination is WAY faster than COORDINATION - hurrah for link cells.
  2. The new version of my coordination is faster than the version in master if you are not calculating derivatives.
  3. The new version of my coordination is slower than the version in master if you are calculating derivatives.

Thanks.

@Iximiel (Member) commented Jul 19, 2024

Hi @gtribello, with coordination did you simply use

c: COORDINATION GROUPA=@mdatoms SWITCH={EXP D_0=4 R_0=0.5 D_MAX=6}

PRINT ARG=* FILE=@distr@_COORDColvar@natoms@ FMT=%8.4f STRIDE=1

FLUSH STRIDE=1

or something more elaborate?


I'll write something in the --help for plumed benchmark.
The rule of thumb is that the mean density should be ~1 atom per nm^3 for the 100% random distributions (sphere, cube, etc.) and that the distance between first neighbors oscillates around 1 nm in the line and sc (simple cubic) distributions:

  • the line one is a line of atoms, 1x1xnatoms in dimension, and I don't remember if it is compatible with PBCs
  • with "sc" the atoms are placed on a simple cubic lattice. The fill rule is: take an sc lattice of qxqxq sites, where q^3 is the nearest cube number above natoms (q^3 = 1, 8, 27, ...), and fill the lattice from the bottom up (sketched after this list)

I think it is better to use sc for a "realistic system" and line for stressing the NN optimizations
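
For what it's worth, the sc fill rule described above amounts to something like the following sketch (my reading of the description, not the actual plumed benchmark source):

#include <cmath>
#include <cstdio>
#include <vector>

struct Vec3 { double x, y, z; };

// Find the smallest q with q^3 >= natoms, then fill a q x q x q simple cubic
// lattice (unit spacing) from the bottom up.
std::vector<Vec3> fillSimpleCubic(unsigned natoms) {
  unsigned q = static_cast<unsigned>(std::round(std::cbrt(static_cast<double>(natoms))));
  while (q * q * q < natoms) ++q;  // guard against floating-point rounding
  std::vector<Vec3> pos;
  pos.reserve(natoms);
  for (unsigned k = 0; k < q && pos.size() < natoms; ++k)      // planes, bottom up
    for (unsigned j = 0; j < q && pos.size() < natoms; ++j)    // rows
      for (unsigned i = 0; i < q && pos.size() < natoms; ++i)  // columns
        pos.push_back({double(i), double(j), double(k)});      // ~1 nm spacing
  return pos;
}

int main() {
  std::printf("placed %zu atoms\n", fillSimpleCubic(10).size());  // q=3, since 8 < 10 <= 27
}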

@gtribello (Member Author)

Hello again @Iximiel

I did indeed write the input for coordination in the way you said.

I also used sc for the atom distribution, as I agree it is a better test for the benchmarking.

Separately, if you could add the list of options for the --atom-distribution flag to its description, that would be a good first addition to the documentation.

src/valtools/VStack.cpp (code scanning alert fixed)
@gtribello (Member Author)

Hello again

I found a bottleneck and got the new version to go a bit faster. You can see the plot here:

[image: further-speed]

What I wrote in my previous message is still true, though:

  • With one thread and no derivatives, the new version is faster
  • With four threads and no derivatives, master is faster
  • With derivatives, master is faster with one and four threads

…hat we already worked out that some of the elements of the matrix are zero and that the forces on these elements are zero
Gareth Aneurin Tribello added 5 commits September 6, 2024 19:56
…new scheme for forces

Changes to reference files are because of accumulated small differences because the calculations
are done in a different way.
I have removed the test on the forces for this regtest as the code with the new derivatives
performs this calculation very slowly: I am no longer able to work out the subset of
atoms on which to calculate the q6 parameter. I also think this test is not very robust when the calculation
is done in different ways.

I am pretty certain that the forces for this type of calculation are calculated correctly, as there are other tests
that check forces in similar situations. However, some careful benchmarking of this type of calculation
needs to be performed before this branch is merged.
See the previous commit for an explanation of why the test on forces has been removed.
Gareth Aneurin Tribello added 4 commits September 10, 2024 14:31
… by default.

Changes to regtest config files mostly remove code for setting this option on in some regtests.
A few tests don't yet work with the new implementation and have been changed so as to run with the old
implementation until I get them fixed.
…being calculated during the backwards loop

For the beta sheet variables all the strands cutoff functionality is done in separate DISTANCE and CUSTOM actions.
I prefer this way of doing things as it makes all the steps of the calculations visible in the expanded version of the
shortcut's input file.
…me. Now everything in the code can run with the new scheme
std::string nopbcstr=""; bool nopbc; parseFlag("NOPBC",nopbc); if( nopbc ) nopbcstr = " NOPBC";
if( strands_cutoff.length()>0 ) strands_cutoff=" CUTOFF_ATOMS=6,21 STRANDS_CUTOFF="+strands_cutoff;
// if( strands_cutoff.length()>0 ) strands_cutoff=" CUTOFF_ATOMS=6,21 STRANDS_CUTOFF="+strands_cutoff;

Check notice (Code scanning / CodeQL): Commented-out code. This comment appears to contain commented-out code.
Gareth Aneurin Tribello and others added 4 commits September 12, 2024 11:17
… over matrix elements rather than over the rows of the matrix

These classes now inherit from ActionWithVector rather than ActionWithMatrix.

Changes to regtests revert changes that were made in an earlier commit when I started reworking the
derivatives. In other words, the forces in these actions are back to the values that we had when
we were using the chain derivatives. I had introduced an error in an earlier commit.
src/core/Value.cpp (code scanning alerts fixed)
@@ -320,7 +323,7 @@
   action->log.printf(" with variables :");
   for(unsigned i=0; i<var.size(); i++) action->log.printf(" %s",var[i].c_str());
   action->log.printf("\n"); function.set( func, var, action );
-  std::vector<double> zeros( action->getNumberOfArguments(), 0 ); double fval = abs(function.evaluate(zeros));
+  std::vector<double> zeros( nargs, 0 ); double fval = abs(function.evaluate(zeros));

Check warning (Code scanning / CodeQL): Lossy function result cast. Return value of type double is implicitly converted to int.
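
For context, this is the classic C++ trap the checker is pointing at: unless the std:: overloads from <cmath> are used, abs is C's integer abs, so the double returned by function.evaluate is silently truncated. A tiny self-contained illustration (std::fabs is one way to avoid it):

#include <cmath>
#include <cstdio>

int main() {
  double v = -0.75;
  // int truncated = abs(v);   // with only C's abs(int) in scope this becomes 0
  double kept = std::fabs(v);  // 0.75: no double -> int conversion
  std::printf("%f\n", kept);
}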
src/function/FunctionOfMatrix.h (code scanning alert fixed)
src/matrixtools/MatrixTimesVector.cpp (code scanning alert fixed)
for(unsigned i=0; i<getNumberOfComponents(); ++i) {
  Value* myval = const_cast<Value*>( getConstPntrToComponent(i) ); unsigned ncols = myval->getNumberOfColumns();
  if( myval->getRank()!=2 || myval->hasDerivatives() || ncols>=myval->getShape()[1] ) continue;
  myval->matrix_bookeeping[task_index*(1+ncols)]=0;

Check failure (Code scanning / CodeQL): Multiplication result converted to larger type (High). Multiplication result may overflow 'unsigned int' before it is converted to 'size_type'.

Copilot Autofix (AI), 4 months ago

To fix the problem, we need to ensure that the multiplication is performed using a larger integer type to prevent overflow. This can be done by casting one of the operands to the larger type before performing the multiplication. In this case, we will cast task_index to size_t before the multiplication.

  • General fix: Cast one of the operands to a larger type before performing the multiplication.
  • Detailed fix: In the file src/core/ActionWithMatrix.cpp, on line 58, cast task_index to size_t before performing the multiplication.
  • Specific changes: Modify line 58 to cast task_index to size_t.
  • Requirements: No additional methods, imports, or definitions are needed.
Suggested changeset 1: src/core/ActionWithMatrix.cpp

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/src/core/ActionWithMatrix.cpp b/src/core/ActionWithMatrix.cpp
--- a/src/core/ActionWithMatrix.cpp
+++ b/src/core/ActionWithMatrix.cpp
@@ -57,3 +57,3 @@
     if( myval->getRank()!=2 || myval->hasDerivatives() || ncols>=myval->getShape()[1] ) continue;
-    myval->matrix_bookeeping[task_index*(1+ncols)]=0;
+    myval->matrix_bookeeping[static_cast<size_t>(task_index)*(1+ncols)]=0;
   }
EOF
matpos=p;
ncols = myarg->getNumberOfColumns(); matrix_bookeeping.resize( myarg->matrix_bookeeping.size() );
for(unsigned i=0; i<matrix_bookeeping.size(); ++i) matrix_bookeeping[i] = myarg->matrix_bookeeping[i];
data.resize( shape[0]*ncols ); inputForce.resize( shape[0]*ncols );

Check failure (Code scanning / CodeQL): Multiplication result converted to larger type (High). Multiplication result may overflow 'unsigned int' before it is converted to 'size_type'.

Copilot Autofix (AI), 4 months ago

To fix the problem, we need to ensure that the multiplication is performed using a larger integer type before the result is used. This can be achieved by casting one of the operands to std::size_t before performing the multiplication. This way, the multiplication will be done using the larger type, preventing overflow.

The specific changes required are:

  1. Cast shape[0] to std::size_t before multiplying it by ncols in the reshapeMatrixStore and copyBookeepingArrayFromArgument methods.
Suggested changeset 1: src/core/Value.cpp

Run the following command in your local git repository to apply this patch:
cat << 'EOF' | git apply
diff --git a/src/core/Value.cpp b/src/core/Value.cpp
--- a/src/core/Value.cpp
+++ b/src/core/Value.cpp
@@ -275,3 +275,3 @@
   ncols=n; if( ncols>shape[1] ) ncols=shape[1];
-  unsigned size=shape[0]*ncols;
+  std::size_t size=static_cast<std::size_t>(shape[0])*ncols;
   if( matrix_bookeeping.size()!=(size+shape[0]) ) {
@@ -292,3 +292,3 @@
   for(unsigned i=0; i<matrix_bookeeping.size(); ++i) matrix_bookeeping[i] = myarg->matrix_bookeeping[i];
-  data.resize( shape[0]*ncols ); inputForce.resize( shape[0]*ncols );
+  data.resize( static_cast<std::size_t>(shape[0])*ncols ); inputForce.resize( static_cast<std::size_t>(shape[0])*ncols );
 }
EOF
A second, identical CodeQL finding on src/core/Value.cpp (lines 276 and 293) received the same autofix suggestion.
unsigned kind = myarg->getRowIndex( index1, i );
addDerivativeOnMatrixArgument( stored_matrix1, 0, 0, index1, kind, dvec1[i]/(2*matval), myvals );
addDerivativeOnMatrixArgument( stored_matrix2, 0, 1, kind, ind2, dvec2[i]/(2*matval), myvals );
unsigned kind = myarg->getRowIndex( index1, i ); if( colno[i]<0 ) continue;

Check notice (Code scanning / CodeQL): Unused local variable. Variable kind is not used.
src/matrixtools/MatrixTimesVector.cpp (code scanning alert fixed)
@gtribello (Member Author) commented Sep 23, 2024

Hello @Iximiel and @GiovanniBussi

I am finally in a position to discuss how we move forward with this tomorrow morning. The latest timings for Coordination number, Q6 and LQ6 are shown below. The bars show the times for 20 steps. You can see that the new code performs much better than the current master.

[image: latest-times]

In the graph above, the colours are as follows:

Blue = Current master 1 thread
Orange = New version 1 thread
Green = Current master 4 threads
Red = New version 4 threads

A small problem is that the code crashes when you try to run calculations of Q6 with derivatives for 8000 atoms with 4 threads. I think my laptop doesn't have enough memory to run calculations this large. This problem can probably be fixed. Before working too much on that, however, I would like to try to work out how to involve Daniele in this project and how we can start having some of this code run on GPUs.

See you tomorrow.

--

For reference, input for Q6 is:

q6: Q6 SPECIES=@mdatoms SWITCH={EXP D_0=1.0 R_0=0.1 D_MAX=2.0} SUM 
PRINT ARG=q6_sum FILE=colvar 
BIASVALUE ARG=q6_sum

and for LQ6

q6: Q6 SPECIES=@mdatoms SWITCH={EXP D_0=1.0 R_0=0.1 D_MAX=2.0} 
lq6: LOCAL_Q6 SPECIES=q6 SWITCH={EXP D_0=1.0 R_0=0.1 D_MAX=2.0} SUM 
PRINT ARG=lq6_sum FILE=colvar 
BIASVALUE ARG=lq6_sum

@GiovanniBussi (Member)

Thanks @gtribello

I guess there's some wrong x-label, but if they all refer to something describing an increase in the number of atoms, I think this is very promising. A few comments/questions:

  • If I understand correctly, the difference is in how you handle backpropagation. The new way of handling it is both better for the number of operations (so it gets faster now) and easier to port to GPU.
  • I am a bit confused by the timings. In the blue bar, are they expected to grow quadratically with N or linearly? If the correct x-label is the one in the top right figure I guess they are growing linearly, right? Do you change the scaling when you pass to the new implementation?
  • It's interesting that the memory problem is only with 4 threads. Perhaps the way you allocate the buffers for OMP reduction can be modified?
  • Again about the memory issue, would it be conceivable to rewrite this variable in blocks? For coordination it would be GROUPA=1-10 GROUPB=11-20, then GROUPA=11-20 GROUPB=1-10, then GROUPA=1-10 GROUPB=1-10, then GROUPA=11-20 GROUPB=11-20, and then take the sum. This could even be done automatically (a sketch of such a blocked evaluation is below).
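
A hypothetical sketch of that blocked evaluation (pairTerm stands in for the switching function; none of this is tied to PLUMED's classes): only one blockSize x blockSize tile of intermediates needs to be alive at a time, which bounds the memory.

#include <algorithm>
#include <cstddef>
#include <functional>

// Accumulate the sum over i != j of pairTerm(i, j), one block pair at a time.
double blockedPairSum(std::size_t natoms, std::size_t blockSize,
                      const std::function<double(std::size_t, std::size_t)>& pairTerm) {
  double total = 0.0;
  for (std::size_t a0 = 0; a0 < natoms; a0 += blockSize)
    for (std::size_t b0 = 0; b0 < natoms; b0 += blockSize)
      for (std::size_t i = a0; i < std::min(a0 + blockSize, natoms); ++i)
        for (std::size_t j = b0; j < std::min(b0 + blockSize, natoms); ++j)
          if (i != j) total += pairTerm(i, j);  // each ordered pair visited exactly once
  return total;
}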

@gtribello (Member Author)

Ah yes, sorry. The number of atoms is always 500, 2000, 4000 and 8000. I forgot to change the tick labels on all the subfigures.

@gtribello (Member Author)

I think the timings should grow linearly with the number of atoms because we are using a neighbour list.
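
For the record, this is why link cells give linear scaling: atoms are binned into cells whose edge is at least the cutoff, so each atom only ever looks at its own and the 26 surrounding cells, which at fixed density is O(N) work rather than the O(N^2) all-pairs loop. A minimal self-contained sketch (a toy cell list, not PLUMED's actual link-cell implementation):

#include <algorithm>
#include <cstddef>
#include <vector>

struct Atom { double x, y, z; };

// Count pairs closer than 'cutoff' in a cubic box of edge L (no PBCs).
std::size_t countNeighbours(const std::vector<Atom>& atoms, double L, double cutoff) {
  const int nc = std::max(1, static_cast<int>(L / cutoff));  // cells per edge
  const double cell = L / nc;
  std::vector<std::vector<int>> bins(static_cast<std::size_t>(nc) * nc * nc);
  auto cellIndex = [&](double v) { return std::min(nc - 1, static_cast<int>(v / cell)); };
  for (int n = 0; n < static_cast<int>(atoms.size()); ++n)
    bins[(static_cast<std::size_t>(cellIndex(atoms[n].z)) * nc + cellIndex(atoms[n].y)) * nc
         + cellIndex(atoms[n].x)].push_back(n);
  std::size_t pairs = 0;
  for (int n = 0; n < static_cast<int>(atoms.size()); ++n) {
    const Atom& a = atoms[n];
    const int ci = cellIndex(a.x), cj = cellIndex(a.y), ck = cellIndex(a.z);
    for (int dk = -1; dk <= 1; ++dk)
      for (int dj = -1; dj <= 1; ++dj)
        for (int di = -1; di <= 1; ++di) {
          const int i = ci + di, j = cj + dj, k = ck + dk;
          if (i < 0 || j < 0 || k < 0 || i >= nc || j >= nc || k >= nc) continue;
          for (int m : bins[(static_cast<std::size_t>(k) * nc + j) * nc + i]) {
            if (m <= n) continue;  // count each unordered pair once
            const double dx = atoms[m].x - a.x, dy = atoms[m].y - a.y, dz = atoms[m].z - a.z;
            if (dx * dx + dy * dy + dz * dz < cutoff * cutoff) ++pairs;
          }
        }
  }
  return pairs;
}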

@gtribello (Member Author)

OK @GiovanniBussi and @Iximiel

I've added some further code and the memory problem is now fixed. The times, including 4 threads with 8000 atoms, are as follows:

[image: latest-times-2]

Comment on lines +279 to +286
const ActionWithMatrix* am = dynamic_cast<const ActionWithMatrix*>(this);
myvals.setTaskIndex(current); myvals.vector_call=true; performTask( current, myvals );
for(unsigned i=0; i<getNumberOfComponents(); ++i) {
  const Value* myval = getConstPntrToComponent(i);
  if( am || myval->hasDerivatives() ) continue;
  Value* myv = const_cast<Value*>( myval );
  if( getName()=="RMSD_VECTOR" && myv->getRank()==2 ) continue;
  myv->set( current, myvals.get( i ) );
Member

This is a const function in which non-const things happen. It is called from non-const functions (runAllTasks and checkForForces), so it does not need the const modifier. It may be easier to remove its constness and change the code to something like the following, so that there is no need to const_cast the values in order to modify them:

Suggested change (replacing the lines quoted above):

myvals.setTaskIndex(current);
myvals.vector_call=true;
performTask( current, myvals );
if (const ActionWithMatrix* am = dynamic_cast<const ActionWithMatrix*>(this); !am) {
  const bool IamRMSD_VECTOR = getName()=="RMSD_VECTOR";
  for(unsigned i=0; i<getNumberOfComponents(); ++i) {
    if (Value* myval = getPntrToComponent(i);
        !myval->hasDerivatives() &&
        !( IamRMSD_VECTOR && myval->getRank()==2 )) {
      myval->set( current, myvals.get( i ) );
    }
  }
}

I also moved some conditions outside the loop and changed the scope of am.

Maybe the discrimination for RMSD_VECTOR could be done in it? Or with an ad hoc overloaded method, virtual bool skipIf(Value *) const { return true; }?

Then, for the parallelization, we can turn MultiValue into a data sitter for CPU/GPU addresses.

Labels: wip Do not merge
Projects: None yet
4 participants