Start work on moving calculation of derivatives to apply loop for actions that pass vectors and matrices #1093
base: master
Conversation
…oring values as they are only needed on the backward pass to calculate forces
Hi @Iximiel and @GiovanniBussi I ran out of time to give the details of this yesterday, so here they are. You can calculate a coordination number with this version of the code using the following input:
I think it should be possible to reproduce the benchmarks that Daniele was doing on the various versions of the coordination number using the input above. By default, the calculation is run in the same way as in the current master version of the code, which is why all the regtests are passing for this branch. Basically, the regtests are doing the same thing they would do on current master; in other words, when the regtests run, all the linked lists are still used. To run without using the linked lists, you set the following environment variable:
It then does all the calculation of forces during the backwards (apply) loop. It would be really nice if we could look at the performance of the code with the following options:
If we could have some slides with the performance of those four options by next Wednesday, so we can discuss next steps in the meeting, that would be really great. The benchmarks I am thinking of would look at the performance of the code as you increase the number of atoms: essentially the sorts of graphs @Iximiel has been producing to look at the various GPU flavours. Thanks
@gtribello I am setting up the benchmarks right now. A couple of questions:
Hi @Iximiel I don't understand the queries. However, I made the input more complicated than it needed to be. You can do:
To increase the number of atoms, you change the number of indices specified to GROUP in the CONTACT_MATRIX. You must also increase the SIZE in ONES to the number of atoms. Does that help?
Ok, I'll go with that. On that note, is there a special keyword for the number of atoms for ONES?
There is no special keyword for the number of atoms for ONES, sorry.
As you asked, here's my benchmark setup. The example is for a local run; on LEONARDO I create a single run file per natoms-distr combination:
#!/bin/bash
nsteps=1000
list_of_natoms="500 2000 4000 6000 8000 10000 12000 14000 16000"
export PLUMED_MAXBACKUP=0
export PLUMED_NUM_THREADS=1
mpiprocesses=1
useDistr="line sc"
function setFname() {
echo "${distr}_${mpiprocesses}mpi_${PLUMED_NUM_THREADS}threads_${natoms}_Steps${nsteps}"
}
mpicmd=""
if [[ $mpiprocesses -gt 1 ]]; then
mpicmd="mpirun -n $mpiprocesses"
fi
if [[ $1 = run ]]; then
module purge
module load gnu9/9.4.0 openmpi4/4.1.1
module load plumed/master09-07
for distr in $useDistr; do
for natoms in $list_of_natoms; do
fname=$(setFname).out
echo "output file: ${fname}"
$mpicmd plumed benchmark \
--plumed=plumed_master.dat:plumed_other.dat \
--kernel=this:../../src/lib/install/libplumedKernel.so \
--nsteps="${nsteps}" \
--natoms="${natoms}" \
--atom-distribution="${distr}" \
>"${fname}"
grep -B1 Comparative "${fname}"
done
done
fi
#postprocess the output
for distr in $useDistr; do
echo "collecting ./$distr"
{
for natoms in $list_of_natoms; do
echo -n "$natoms "
fname=$(setFname).out
grep Calculating "${fname}" |
awk '{printf "%f ", $7}'
echo ""
done
} >"timing_calculate_${distr}_s${nsteps}.data"
{
for natoms in $list_of_natoms; do
echo -n "$natoms "
fname=$(setFname).out
sed -n '/PLUMED: *Cycles *Total *Average *Minimum *Maximum/{n ; p}' "${fname}" |
awk '{printf "%f ", $3}'
echo ""
done
} >"timing_total_${distr}_s${nsteps}.data"
done
Hi @Iximiel Thanks for the script for the benchmarking. I quickly looked at the code and found something obvious that was causing a massive slowdown. Can you quickly rerun the benchmarks we discussed in today's meeting with this new version of the code and see what difference it makes? The two options might now be comparable in cost. It would be great if you could post the benchmarks here and tag me in the post. Thanks
Hi @gtribello! The old one uses the time I measured yesterday.
Hi @Iximiel Thanks for running the benchmarks again.
Hi @Iximiel and @GiovanniBussi I had another look at the code in this branch and did some optimisations. These are the new timings against the master branch: To be clear, the master here is doing the derivatives in the forward loop. When the master code runs, all the mechanics with the linked lists are used. For the tests with this branch, I am using the
It would be great to have one more benchmark on Leonardo, like the one in @Iximiel's previous message. Can you please set that up, @Iximiel? If @Iximiel's benchmark also shows similar performance between the two versions, we can (perhaps) justify making the more significant change. In other words, we can use forces calculated in the backwards (apply) loop everywhere. This change will simplify the code. Removing the linked lists should also make it easier to add parallelism with GPUs to this part of the code. What do you think? Would you agree that this is an OK way to proceed? To clarify, these changes will be made for v2.11, and there will be no changes of syntax.
@gtribello here are the benchmarks I did on LEONARDO: The color code is the usual: the "4 Calculating (forward loop)" part is in the lighter color.

fname=sc_${PLUMED_NUM_THREADS}threads_2000nat_500steps
plumed benchmark --plumed="sc_cm2000master.dat:sc_cm2000target.dat" \
  --kernel="/path/to/master/lib/libplumedKernel.so:/path/to/plumed-derivatives-from-backpropagation/lib/libplumedKernel.so" \
  --maxtime=840 \
  --natoms=2000 --nsteps=500 --atom-distribution=sc >"${fname}.out"

The plumed file, adapted to the number of atoms, is:
Hello @Iximiel and @GiovanniBussi I've been working more on writing a version of PLUMED where the derivatives are calculated during the backward loop. I've also gotten confused. It would help me @Iximiel if you could run another benchmark just so that I can check that my conclusions about the state of the project are correct. Sorry and thank you in advance. So, firstly, the confusion. I realised that the switching function parameters we used in that example input were not particularly sensible. With those switching function parameters, every single atom in the box with 8000 atoms was within the D_MAX cutoff of 6 nm that we set, which is not really how this type of CV is used. I thus created a new input with more sensible switching function parameters. It is below:
Separately @Iximiel it would be useful to add some detail to the manual for When I run benchmarks, I get the following results: I am thus concluding the following:
Thanks.
Hi @gtribello, with coordination you simply used
or something more elaborate? I'll write something in the --help for plumed benchmark.
I think it is better to use sc for a "realistic system" and line for stressing the NN optimizations |
Hello again @Iximiel I did indeed write the input for coordination in the way you said. I also used the sc for the atom distribution as I agree that it is a better test for the benchmarking. Separately, if you could add the list of options for the --atom-distribution flag in its description that would be a good first thing to add to the documentation. |
Hello again. I found a bottleneck and got the new version to go a bit faster. You can see the plot here: What I wrote in my previous message is still true, though:
…hat we already worked out that some of the elements of the matrix are zero and that the forces on these elements are zero
…new scheme for forces. Changes to reference files are due to small differences that accumulate because the calculations are done in a different way.
I have removed the test on the forces for this regtest, as the code with the new derivatives performs this calculation very slowly because I am no longer able to work out the subset of atoms on which to calculate the q6 parameter. I also think this test is not very robust when the calculation is done in different ways. I am pretty certain that the forces for this type of calculation are calculated correctly, as there are other tests that check forces in similar situations. However, some careful benchmarking of this type of calculation needs to be performed before this branch is merged.
See previous commit for explanation as to why test on forces has been removed.
… by default. Changes to regtest config files mostly remove code that turned this option on in some regtests. A few tests don't yet work with the new implementation and have been changed to run with the old implementation until I get them fixed.
…being calculated during the backwards loop For the beta sheet variables all the strands cutoff functionality is done in separate DISTANCE and CUSTOM actions. I prefer this way of doing things as it makes all the steps of the calculations visible in the expanded version of the shortcut's input file.
…me. Now everything in the code can run with the new scheme
std::string nopbcstr=""; bool nopbc; parseFlag("NOPBC",nopbc); if( nopbc ) nopbcstr = " NOPBC";
if( strands_cutoff.length()>0 ) strands_cutoff=" CUTOFF_ATOMS=6,21 STRANDS_CUTOFF="+strands_cutoff;
// if( strands_cutoff.length()>0 ) strands_cutoff=" CUTOFF_ATOMS=6,21 STRANDS_CUTOFF="+strands_cutoff;
CodeQL code scanning notice: Commented-out code.
…BondProductMatrix
… over matrix elements rather than over the rows of the matrix. These classes now inherit from ActionWithVector rather than ActionWithMatrix. Changes to regtests revert changes that were made in an earlier commit when I started reworking the derivatives. In other words, the forces in these actions are back to the values that we had when we were using the chain derivatives. I had introduced an error in an earlier commit.
@@ -320,7 +323,7 @@
  action->log.printf(" with variables :");
  for(unsigned i=0; i<var.size(); i++) action->log.printf(" %s",var[i].c_str());
  action->log.printf("\n"); function.set( func, var, action );
- std::vector<double> zeros( action->getNumberOfArguments(), 0 ); double fval = abs(function.evaluate(zeros));
+ std::vector<double> zeros( nargs, 0 ); double fval = abs(function.evaluate(zeros));
CodeQL code scanning warning: Lossy function result cast.
for(unsigned i=0; i<getNumberOfComponents(); ++i) {
  Value* myval = const_cast<Value*>( getConstPntrToComponent(i) ); unsigned ncols = myval->getNumberOfColumns();
  if( myval->getRank()!=2 || myval->hasDerivatives() || ncols>=myval->getShape()[1] ) continue;
  myval->matrix_bookeeping[task_index*(1+ncols)]=0;
CodeQL code scanning failure: Multiplication result converted to larger type (High severity).
Copilot Autofix (AI, 4 months ago): To fix the problem, we need to ensure that the multiplication is performed using a larger integer type to prevent overflow. This can be done by casting one of the operands to the larger type before performing the multiplication. In this case, we will cast task_index to size_t before the multiplication.
- General fix: Cast one of the operands to a larger type before performing the multiplication.
- Detailed fix: In the file src/core/ActionWithMatrix.cpp, on line 58, cast task_index to size_t before performing the multiplication.
- Specific changes: Modify line 58 to cast task_index to size_t.
- Requirements: No additional methods, imports, or definitions are needed.
@@ -57,3 +57,3 @@
  if( myval->getRank()!=2 || myval->hasDerivatives() || ncols>=myval->getShape()[1] ) continue;
- myval->matrix_bookeeping[task_index*(1+ncols)]=0;
+ myval->matrix_bookeeping[static_cast<size_t>(task_index)*(1+ncols)]=0;
}
matpos=p;
ncols = myarg->getNumberOfColumns(); matrix_bookeeping.resize( myarg->matrix_bookeeping.size() );
for(unsigned i=0; i<matrix_bookeeping.size(); ++i) matrix_bookeeping[i] = myarg->matrix_bookeeping[i];
data.resize( shape[0]*ncols ); inputForce.resize( shape[0]*ncols );

CodeQL code scanning failure: Multiplication result converted to larger type (High severity).
Copilot Autofix (AI, 4 months ago): To fix the problem, we need to ensure that the multiplication is performed using a larger integer type before the result is used. This can be achieved by casting one of the operands to std::size_t before performing the multiplication. This way, the multiplication will be done using the larger type, preventing overflow.
The specific change required is to cast shape[0] to std::size_t before multiplying it by ncols in the reshapeMatrixStore and copyBookeepingArrayFromArgument methods.

@@ -275,3 +275,3 @@
  ncols=n; if( ncols>shape[1] ) ncols=shape[1];
- unsigned size=shape[0]*ncols;
+ unsigned size=static_cast<std::size_t>(shape[0])*ncols;
  if( matrix_bookeeping.size()!=(size+shape[0]) ) {
@@ -292,3 +292,3 @@
  for(unsigned i=0; i<matrix_bookeeping.size(); ++i) matrix_bookeeping[i] = myarg->matrix_bookeeping[i];
- data.resize( shape[0]*ncols ); inputForce.resize( shape[0]*ncols );
+ data.resize( static_cast<std::size_t>(shape[0])*ncols ); inputForce.resize( static_cast<std::size_t>(shape[0])*ncols );
}
(The same CodeQL alert and Copilot autofix are repeated for src/core/Value.cpp, lines 276 and 293, with an identical suggested fix.)
unsigned kind = myarg->getRowIndex( index1, i );
addDerivativeOnMatrixArgument( stored_matrix1, 0, 0, index1, kind, dvec1[i]/(2*matval), myvals );
addDerivativeOnMatrixArgument( stored_matrix2, 0, 1, kind, ind2, dvec2[i]/(2*matval), myvals );
unsigned kind = myarg->getRowIndex( index1, i ); if( colno[i]<0 ) continue;

CodeQL code scanning notice: Unused local variable.
… lot of malloc and optimise the code
Hello @Iximiel and @GiovanniBussi I am finally in a position to discuss how we move forward with this tomorrow morning. The latest timings for Coordination number, Q6 and LQ6 are shown below. The bars show the times for 20 steps. You can see that the new code performs much better than the current master. In the graph above, the colours are as follows: Blue = Current master 1 thread A small problem is that the code crashes when you try to run calculations of Q6 with derivatives for 8000 atoms with 4 threads. I think my laptop doesn't have enough memory to run calculations this large. This problem can probably be fixed. Before working too much on that, however, I would like to try to work out how to involve Daniele in this project and how we can start having some of this code run on GPUs. See you tomorrow. -- For reference, input for Q6 is:
and for LQ6
Thanks @gtribello. I guess there's some wrong x-label, but if they all refer to something describing an increase in the number of atoms, I think it's very promising. A few comments/questions:
Ah yes, sorry. The number of atoms is always 500, 2000, 4000 and 8000. I forgot to change the tick labels on all the subfigs.
I think the timings should grow linearly with the number of atoms because we are using a neighbour list.
OK @GiovanniBussi and @Iximiel I've added some further code and the memory problem is now fixed. The times including the 4 threads and 8000 atoms are as follows:
const ActionWithMatrix* am = dynamic_cast<const ActionWithMatrix*>(this);
myvals.setTaskIndex(current); myvals.vector_call=true; performTask( current, myvals );
for(unsigned i=0; i<getNumberOfComponents(); ++i) {
  const Value* myval = getConstPntrToComponent(i);
  if( am || myval->hasDerivatives() ) continue;
  Value* myv = const_cast<Value*>( myval );
  if( getName()=="RMSD_VECTOR" && myv->getRank()==2 ) continue;
  myv->set( current, myvals.get( i ) );
This is a const function in which non-const things happen; it is called from non-const functions (runAllTasks and checkForForces), so it does not need the const modifier. It may be easier to remove the constness and change the code to something like the following, so there is no need to const_cast the values to modify them:
Suggested change:
- const ActionWithMatrix* am = dynamic_cast<const ActionWithMatrix*>(this);
- myvals.setTaskIndex(current); myvals.vector_call=true; performTask( current, myvals );
- for(unsigned i=0; i<getNumberOfComponents(); ++i) {
-   const Value* myval = getConstPntrToComponent(i);
-   if( am || myval->hasDerivatives() ) continue;
-   Value* myv = const_cast<Value*>( myval );
-   if( getName()=="RMSD_VECTOR" && myv->getRank()==2 ) continue;
-   myv->set( current, myvals.get( i ) );
+ myvals.setTaskIndex(current);
+ myvals.vector_call=true;
+ performTask( current, myvals );
+ if (const ActionWithMatrix* am = dynamic_cast<const ActionWithMatrix*>(this); !am) {
+   const bool IamRMSD_VECTOR = getName()=="RMSD_VECTOR";
+   for(unsigned i=0; i<getNumberOfComponents(); ++i) {
+     if(Value* myval = getPntrToComponent(i);
+        !myval->hasDerivatives() &&
+        !( IamRMSD_VECTOR && myval->getRank()==2 )) {
+       myval->set( current, myvals.get( i ) );
+     }
+   }
I also moved some conditions outside the loop and changed the scope of am. Maybe the discrimination for RMSD_VECTOR can be done in it? Or with an ad-hoc overloaded method virtual bool skipIf(Value *) const { return true; }? Then, for the parallelization, we can manipulate MultiValue into a data sitter for CPU/GPU addresses.
…atrices and grids with this command.
Description
This is the PR for the benchmarking that we discussed on Wednesday.
DO NOT MERGE
I will provide more explanation tomorrow.
Target release
I would like my code to appear in release 2.11
Type of contribution
Copyright
COPYRIGHT file with the correct license information. Code should be released under an open source license. I also used the command cd src && ./header.sh mymodulename in order to make sure the headers of the module are correct.
Tests