Checkpointing (quasi-Newton solver) #693

Merged on Sep 27, 2024 (27 commits)

Commits
7a939b7
added notes and some drafty interface
cnpetra Aug 26, 2024
181f7f9
added draft of the api for checkpointing
cnpetra Aug 28, 2024
0e428f5
fixed compilation issues
cnpetra Aug 28, 2024
354b82b
integrated AXOM
cnpetra Aug 28, 2024
a14a445
added user options for checkpointing
cnpetra Aug 28, 2024
d497955
more work on load checkpoint EOD
cnpetra Sep 3, 2024
8b2342c
Merge branch 'develop' into chkpnt-dev
cnpetra Sep 4, 2024
d4900a9
semi-operation checkpointing
cnpetra Sep 4, 2024
3579828
removed save checkpoint callback from the interface
cnpetra Sep 4, 2024
d213774
fixed typos in comments
cnpetra Sep 4, 2024
4a9fbf1
moved sidre-related code from Algorithm class to a "utils" helper
cnpetra Sep 7, 2024
0498564
switched to refs; some testing of options-based checkpointing
cnpetra Sep 8, 2024
085eb88
added sidre copy to/from dense matrices
cnpetra Sep 11, 2024
b21c3c5
instrumentation for saving quasi-Newton internals to sidre
cnpetra Sep 11, 2024
bfe1b40
updated iteration counter to keep track of total number over restarts
cnpetra Sep 12, 2024
5bc2af6
updated doc; replace all #
cnpetra Sep 13, 2024
a664e35
added example on how to use checkpoint API
cnpetra Sep 13, 2024
261ccf9
clean up
cnpetra Sep 13, 2024
0506d38
added metadata
cnpetra Sep 14, 2024
b785475
testing and clean up
cnpetra Sep 14, 2024
adf30e6
Merge branch 'develop' into chkpnt-dev
cnpetra Sep 22, 2024
4e673d5
update user manual with checkpointing
cnpetra Sep 22, 2024
77c88a2
updated pdf user manual
cnpetra Sep 22, 2024
5631b6c
fix ci errors (compilation)
cnpetra Sep 23, 2024
ea5cafd
fix adtl compilation issues
cnpetra Sep 23, 2024
28c4567
fixed compil error
cnpetra Sep 23, 2024
27d15f2
addresed reviews
cnpetra Sep 25, 2024
16 changes: 15 additions & 1 deletion CMakeLists.txt
@@ -27,7 +27,8 @@ option(HIOP_USE_EIGEN "Build with Eigen support" ON)
option(HIOP_USE_MPI "Build with MPI support" ON)
option(HIOP_USE_GPU "Build with support for GPUs - CUDA or HIP libraries" OFF)
option(HIOP_TEST_WITH_BSUB "Use `jsrun` instead of `mpirun` commands when running tests" OFF)
option(HIOP_USE_RAJA "Build with portability abstraction library RAJA" OFF)
option(HIOP_USE_RAJA "Build with portability abstraction library RAJA" OFF)
option(HIOP_USE_AXOM "Build with AXOM to use Sidre for scalable checkpointing" OFF)
option(HIOP_DEEPCHECKS "Extra checks and asserts in the code with a high penalty on performance" OFF)
option(HIOP_WITH_KRON_REDUCTION "Build Kron Reduction code (requires UMFPACK)" OFF)
option(HIOP_DEVELOPER_MODE "Build with extended warnings and options" OFF)
@@ -289,6 +290,19 @@ if(HIOP_USE_RAJA)
message(STATUS "Found umpire pkg-config: ${umpire_CONFIG}")
endif()

if(HIOP_USE_AXOM)
  if(HIOP_USE_MPI)
    find_package(AXOM CONFIG
      PATHS ${AXOM_DIR} ${AXOM_DIR}/lib/cmake/
      REQUIRED)
    target_link_libraries(hiop_tpl INTERFACE axom)
    message(STATUS "Found AXOM pkg-config: ${AXOM_CONFIG}")
  else()
    message(FATAL_ERROR "Error: HIOP_USE_MPI is required when HIOP_USE_AXOM is ON")
  endif()
endif()


if(HIOP_WITH_KRON_REDUCTION)
set(HIOP_UMFPACK_DIR CACHE PATH "Path to UMFPACK directory")
include(FindUMFPACK)
Binary file modified doc/hiop_usermanual.pdf
Binary file not shown.
10 changes: 10 additions & 0 deletions doc/src/sections/solver_options.tex
@@ -423,6 +423,16 @@ \subsubsection{Problem preprocessing}
\medskip


\subsubsection{Checkpointing of the solver state and restarting}\label{sec:checkpoint}
As detailed in Section~\ref{sec:checkpoint_API}, \Hi can save/load its internal state to/from disk. All the options in this section require an Axom-enabled build (use ``-DHIOP\_USE\_AXOM=ON'' with cmake) and are supported only by the quasi-Newton IPM solver (\texttt{hiopAlgFilterIPMQuasiNewton} class) for the \texttt{hiopInterfaceDenseConstraints} NLP formulation/interface.

\noindent \textbf{checkpoint\_save}: Save the state of the NLP solver to the file indicated by the value of option ``checkpoint\_file''. String values ``yes'' or ``no'', default ``no''.

\noindent \textbf{checkpoint\_load\_on\_start}: On (re)start, the NLP solver will load the checkpoint file specified by the ``checkpoint\_file'' option. String values ``yes'' or ``no'', default ``no''.

\noindent \textbf{checkpoint\_file}: Path to the checkpoint file to load from or save to. If present, the character ``\#'' is replaced with the iteration number at which the checkpoint is saved (but \textit{not} when loaded). \Hi adds a ``.root'' extension internally if the value of the option is a directory. If this option is not specified and loading or saving checkpoints is enabled, \Hi will use a file named ``hiop\_state\_chk''.

\noindent \textbf{checkpoint\_save\_every\_N\_iter}: Iteration frequency of saving checkpoints to disk if ``checkpoint\_save'' is ``yes''. Takes positive integer values with a default value of $10$.
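
The ``#'' substitution and the save cadence described by these options can be sketched in a few lines of C++. This is only an illustration and not HiOp's implementation: `expand_checkpoint_path` and `checkpoint_due` are hypothetical names, replacing every ``#'' is an assumption based on the message of commit 5bc2af6 ("replace all #"), and skipping iteration 0 is borrowed from the example callback in NlpDenseConsEx1.cpp rather than from the options code.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not part of HiOp): expands every '#' in a checkpoint
// path template to the decimal iteration number.
std::string expand_checkpoint_path(const std::string& path_template, int iter)
{
  const std::string iter_str = std::to_string(iter);
  std::string out;
  out.reserve(path_template.size() + iter_str.size());
  for(char ch : path_template) {
    if(ch == '#') {
      out += iter_str;
    } else {
      out += ch;
    }
  }
  return out;
}

// Hypothetical helper: true when a checkpoint is due at iteration `iter`
// for a save frequency of `every_n` iterations. Skipping iteration 0 is an
// assumption taken from the example callback, not from HiOp's options code.
bool checkpoint_due(int iter, int every_n)
{
  assert(every_n > 0 && "checkpoint_save_every_N_iter takes positive values");
  return iter > 0 && iter % every_n == 0;
}
```

Under these assumptions, a template such as `hiop_state_#` with a frequency of 10 would yield `hiop_state_10`, `hiop_state_20`, and so on, while a template without ``#'' would be overwritten at every save.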


\subsubsection{Miscellaneous options}
27 changes: 25 additions & 2 deletions doc/src/techrep_main.tex
@@ -133,7 +133,7 @@
\vspace{3cm}

{\huge\bfseries \Hi\ -- User Guide} \\[14pt]
{\large\bfseries version 1.03}
{\large\bfseries version 1.1.0}

\vspace{3cm}

@@ -155,7 +155,7 @@
\vspace{4.75cm}

\textcolor{violet}{{\large\bfseries Oct 15, 2017} \\
{\large\bfseries Updated Feb 5, 2024}}
{\large\bfseries Updated Sept 22, 2024}}

\vspace{0.75cm}

@@ -474,6 +474,29 @@ \subsubsection{Calling \Hi for a \texttt{hiopInterfaceDenseConstraints} formulat
\end{lstlisting}
The standalone drivers \texttt{NlpDenseConsEx1}, \texttt{NlpDenseConsEx2}, and \texttt{NlpDenseConsEx3} inside directory \texttt{src/Drivers/} under the \Hi's root directory contain more detailed examples of the use of \Hi.

\subsubsection{Checkpointing}\label{sec:checkpoint_API}
File checkpointing is available for \Hi's quasi-Newton IPM solver, which is used exclusively to solve \texttt{hiopInterfaceDenseConstraints} formulations. This can be helpful when running a job on
a cluster that enforces limits on the job’s running time.
Later, this feature will also be provided for other solvers, such as the Newton IPM (used exclusively with sparse NLP) and HiOp-PriDec.

The checkpointing I/O is based on Axom's scalable Sidre data manager (see \url{https://axom.readthedocs.io/en/develop/axom/sidre/docs/sphinx/index.html} for more information) and, thus, requires an Axom-enabled build (use ``-DHIOP\_USE\_AXOM=ON'' with cmake).

There are two ways to use \Hi's checkpointing. The first is via the quasi-Newton solver's API, namely, the methods
\begin{lstlisting}
void load_state_from_sidre_group(const ::axom::sidre::Group& group);
void save_state_to_sidre_group(::axom::sidre::Group& group);
\end{lstlisting}
of the \texttt{hiopAlgFilterIPMQuasiNewton} solver class. New Sidre views will be created (or reused) within the group passed as an argument to load / save the state variables of the quasi-Newton solver. Alternatively, the \texttt{hiopAlgFilterIPMQuasiNewton} solver class offers similar methods that work directly with a file, namely,
\begin{lstlisting}
bool load_state_from_file(const ::std::string& path) noexcept;
bool save_state_to_file(const ::std::string& path) noexcept;
\end{lstlisting}
These two methods will create the Sidre group internally and checkpoint to/from it using the first two methods.

A second avenue to checkpoint is via user options. This is detailed in Section~\ref{sec:checkpoint}.

\warningcp{Note:} A few particularities stemming from the use of Sidre must be acknowledged. First, a checkpoint file should be loaded using HiOp with the same number of MPI ranks as when it was saved. Second, checkpointing is not available for non-MPI builds because Axom has MPI as a dependency. Finally, when loading from or saving to a checkpoint file, the sizes of the file's variables (Sidre views) must match the sizes of the HiOp variables to which the data is loaded or saved; \Hi will throw an exception if an existing file is (re)used to load or save an algorithm state for a problem whose sizes changed since the file was created.
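
A restart driver built on the file-based methods might fall back to a cold start when no usable checkpoint is found. The following is a minimal sketch under stated assumptions: it assumes `load_state_from_file` returns `true` on success (the manual only gives the `bool ... noexcept` signature), and `StartMode`, `start_from_checkpoint_or_cold`, and `StubSolver` are hypothetical names introduced here so the example compiles without a HiOp build.

```cpp
#include <string>

// Outcome of the start-up attempt (hypothetical type, not part of HiOp).
enum class StartMode { WarmFromCheckpoint, Cold };

// Works with any solver type exposing
//   bool load_state_from_file(const std::string&) noexcept
// which is the signature documented for hiopAlgFilterIPMQuasiNewton.
// Assumption: a return value of true means the state was loaded.
template<class Solver>
StartMode start_from_checkpoint_or_cold(Solver& solver, const std::string& path)
{
  return solver.load_state_from_file(path) ? StartMode::WarmFromCheckpoint
                                           : StartMode::Cold;
}

// Stand-in used only to exercise the sketch without HiOp (hypothetical).
struct StubSolver
{
  bool has_checkpoint;
  std::string last_path;

  bool load_state_from_file(const std::string& path) noexcept
  {
    last_path = path;       // remember which file was requested
    return has_checkpoint;  // pretend the load succeeded or failed
  }
};
```

In real code the template would be instantiated with the HiOp solver class and followed by a call to `run()` in either case; only the starting point of the iteration would differ.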

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% NLP Sparse
61 changes: 59 additions & 2 deletions src/Drivers/Dense/NlpDenseConsEx1.cpp
@@ -4,6 +4,14 @@
#include <cstdio>
#include <cassert>

#ifdef HIOP_USE_AXOM
#include <axom/sidre/core/DataStore.hpp>
#include <axom/sidre/core/Group.hpp>
#include <axom/sidre/core/View.hpp>
#include <axom/sidre/spio/IOManager.hpp>
using namespace axom;
#endif

using namespace hiop;

Ex1Meshing1D::Ex1Meshing1D(double a, double b, size_type glob_n, double r, MPI_Comm comm_)
@@ -178,10 +186,59 @@ void DiscretizedFunction::setFunctionValue(index_type i_global, const double& va
this->data_[i_local]=value;
}



/* DenseConsEx1 class implementation */

bool DenseConsEx1::iterate_callback(int iter,
                                    double obj_value,
                                    double logbar_obj_value,
                                    int n,
                                    const double* x,
                                    const double* z_L,
                                    const double* z_U,
                                    int m_ineq,
                                    const double* s,
                                    int m,
                                    const double* g,
                                    const double* lambda,
                                    double inf_pr,
                                    double inf_du,
                                    double onenorm_pr,
                                    double mu,
                                    double alpha_du,
                                    double alpha_pr,
                                    int ls_trials)
{
#ifdef HIOP_USE_AXOM
  //save state to sidre::Group every 5 iterations if a solver/algorithm object was provided
  if(iter > 0 && (iter % 5 == 0) && nullptr!=solver_) {
    //
    //Example of how to save HiOp state to axom::sidre::Group
    //

    //We first manufacture a Group. User code supposedly already has one.
    sidre::DataStore ds;
    sidre::Group* group = ds.getRoot()->createGroup("HiOp quasi-Newton alg state");

    //the actual saving of state to group
    try {
      solver_->save_state_to_sidre_group(*group);
    } catch(const std::runtime_error& e) {
      //user chooses action when an error occurred in saving the state...
      //we choose to stop HiOp
      return false;
    }

    //User code can further inspect the Group or add additional info to DataStore, with the end goal
    //of saving it to file before HiOp starts next iteration. Here we just save it.
    sidre::IOManager writer(comm);
    int n_files;
    MPI_Comm_size(comm, &n_files);
    writer.write(ds.getRoot(), n_files, "hiop_state_ex1", sidre::Group::getDefaultIOProtocol());
  }
#endif
  return true;
}

/*set c to
* c(t) = 1-10*t, for 0<=t<=1/10,
* 0, for 1/10<=t<=1.
35 changes: 33 additions & 2 deletions src/Drivers/Dense/NlpDenseConsEx1.hpp
@@ -78,7 +78,7 @@ class Ex1Meshing1D
MPI_Comm comm;
int my_rank, comm_size;
index_type* col_partition;

friend class DiscretizedFunction;

private:
@@ -112,7 +112,9 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
{
public:
DenseConsEx1(int n_mesh_elem=100, double mesh_ratio=1.0)
: n_vars(n_mesh_elem), comm(MPI_COMM_WORLD)
: n_vars(n_mesh_elem),
comm(MPI_COMM_WORLD),
solver_(nullptr)
{
//create the members
_mesh = new Ex1Meshing1D(0.0,1.0, n_vars, mesh_ratio, comm);
@@ -218,6 +220,31 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
}
return true;
}

inline void set_solver(hiop::hiopAlgFilterIPM* alg_obj)
{
solver_ = alg_obj;
}

bool iterate_callback(int iter,
double obj_value,
double logbar_obj_value,
int n,
const double* x,
const double* z_L,
const double* z_U,
int m_ineq,
const double* s,
int m,
const double* g,
const double* lambda,
double inf_pr,
double inf_du,
double onenorm_pr,
double mu,
double alpha_du,
double alpha_pr,
int ls_trials);
private:
int n_vars;
MPI_Comm comm;
@@ -228,6 +255,10 @@ class DenseConsEx1 : public hiop::hiopInterfaceDenseConstraints
DiscretizedFunction* c;
DiscretizedFunction* x; //proxy for taking hiop's variable in and working with it as a function

/// Pointer to the solver, to be used to checkpoint
hiop::hiopAlgFilterIPM* solver_;

private:
//populates the linear term c
void set_c();
};
109 changes: 95 additions & 14 deletions src/Drivers/Dense/NlpDenseConsEx1Driver.cpp
@@ -7,10 +7,23 @@
#include <cstdlib>
#include <string>

#ifdef HIOP_USE_AXOM
#include <axom/sidre/core/DataStore.hpp>
#include <axom/sidre/core/Group.hpp>
#include <axom/sidre/core/View.hpp>
#include <axom/sidre/spio/IOManager.hpp>
using namespace axom;
#endif


using namespace hiop;

static bool self_check(size_type n, double obj_value);

#ifdef HIOP_USE_AXOM
static bool do_load_checkpoint_test(const size_type& mesh_size,
const double& ratio,
const double& obj_val_expected);
#endif
static bool parse_arguments(int argc, char **argv, size_type& n, double& distortion_ratio, bool& self_check)
{
n = 20000; distortion_ratio=1.; self_check=false; //default options
@@ -67,24 +80,27 @@ int main(int argc, char **argv)
err = MPI_Init(&argc, &argv); assert(MPI_SUCCESS==err);
err = MPI_Comm_rank(MPI_COMM_WORLD,&rank); assert(MPI_SUCCESS==err);
err = MPI_Comm_size(MPI_COMM_WORLD,&numRanks); assert(MPI_SUCCESS==err);
if(0==rank) printf("Support for MPI is enabled\n");
if(0==rank) {
printf("Support for MPI is enabled\n");
}
#endif
bool selfCheck; size_type mesh_size; double ratio;
if(!parse_arguments(argc, argv, mesh_size, ratio, selfCheck)) { usage(argv[0]); return 1;}

bool selfCheck;
size_type mesh_size;
double ratio;
double objective = 0.;
if(!parse_arguments(argc, argv, mesh_size, ratio, selfCheck)) {
usage(argv[0]);
return 1;
}

DenseConsEx1 problem(mesh_size, ratio);
//if(rank==0) printf("interface created\n");
hiop::hiopNlpDenseConstraints nlp(problem);
//if(rank==0) printf("nlp formulation created\n");

//nlp.options->SetIntegerValue("verbosity_level", 4);
//nlp.options->SetNumericValue("tolerance", 1e-4);
//nlp.options->SetStringValue("duals_init", "zero");
//nlp.options->SetIntegerValue("max_iter", 2);

hiop::hiopAlgFilterIPM solver(&nlp);
problem.set_solver(&solver);

hiop::hiopSolveStatus status = solver.run();
double objective = solver.getObjective();
objective = solver.getObjective();

//this is used for testing when the driver is called with -selfcheck
if(selfCheck) {
@@ -97,7 +113,19 @@ int main(int argc, char **argv)
}
}

if(0==rank) printf("Objective: %18.12e\n", objective);
if(0==rank) {
printf("Objective: %18.12e\n", objective);
}

#ifdef HIOP_USE_AXOM
  // example/test for HiOp's load checkpoint API.
  if(!do_load_checkpoint_test(mesh_size, ratio, objective)) {
    if(rank==0) {
      printf("Load checkpoint and restart test failed.\n");
    }
    return -1;
  }
#endif
#ifdef HIOP_USE_MPI
MPI_Finalize();
#endif
@@ -134,3 +162,56 @@ static bool self_check(size_type n, double objval)

return true;
}

#ifdef HIOP_USE_AXOM
/**
 * An illustration of how to use the load_state_from_sidre_group API method of HiOp's algorithm class.
 */
static bool do_load_checkpoint_test(const size_type& mesh_size,
                                    const double& ratio,
                                    const double& obj_val_expected)
{
  //Pretend this is a new job and recreate the HiOp objects.
  DenseConsEx1 problem(mesh_size, ratio);
  hiop::hiopNlpDenseConstraints nlp(problem);

  hiop::hiopAlgFilterIPM solver(&nlp);

  //
  // example of how to use load_state_from_sidre_group to warm-start
  //

  //Supposedly, the user code should have the group in hand before asking HiOp to load from it.
  //We will manufacture it by loading a sidre checkpoint file. Here the checkpoint file
  //"hiop_state_ex1.root" was created from the interface class' iterate_callback method
  //(saved every 5 iterations)
  sidre::DataStore ds;

  try {
    sidre::IOManager reader(MPI_COMM_WORLD);
    reader.read(ds.getRoot(), "hiop_state_ex1.root", false);
  } catch(const std::exception& e) {
    printf("Failed to read checkpoint file. Error: [%s]\n", e.what());
    return false;
  }

  //the actual API call
  try {
    const sidre::Group* group = ds.getRoot()->getGroup("HiOp quasi-Newton alg state");
    solver.load_state_from_sidre_group(*group);
  } catch(const std::runtime_error& e) {
    printf("Failed to load from sidre::group. Error: [%s]\n", e.what());
    return false;
  }

  hiop::hiopSolveStatus status = solver.run();
  double obj_val = solver.getObjective();
  if(obj_val != obj_val_expected) {
    return false;
  }
  return true;
}
#endif // HIOP_USE_AXOM
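
The restart test above compares the two objective values with `!=`, which passes only if the restarted run reproduces the first run bit for bit. Where that is not guaranteed (for example, under a different reduction order across MPI ranks), a tolerance-based comparison may be more robust. The sketch below is an assumption-laden alternative, not part of the PR: `objectives_match` is a hypothetical helper, and the default tolerance of `1e-10` is an arbitrary illustrative choice.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper (not part of HiOp): true when a and b agree to within
// rel_tol, scaled by the larger magnitude. The scale is floored at 1 so that
// values near zero are compared on an absolute scale.
bool objectives_match(double a, double b, double rel_tol = 1e-10)
{
  const double scale = std::max({1.0, std::fabs(a), std::fabs(b)});
  return std::fabs(a - b) <= rel_tol * scale;
}
```

With such a helper, the final check in `do_load_checkpoint_test` could read `return objectives_match(obj_val, obj_val_expected);`. A NaN objective fails the comparison, which is the desired outcome for a test.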
6 changes: 3 additions & 3 deletions src/Interface/hiopInterface.hpp
@@ -467,8 +467,8 @@ class hiopInterfaceBase
}

/**
* This method is used to provide an user all the hiop iterate
* procedure. @see solution_callback() for an explanation of the parameters.
* This method provides the user with all the internal HiOp iterates. @see solution_callback()
* for an explanation of the parameters.
*
* @param[in] x array of (local) entries of the primal variables (managed by Umpire, see note below)
* @param[in] z_L array of (local) entries of the dual variables for lower bounds (managed by Umpire, see note below)
@@ -496,7 +496,7 @@ class hiopInterfaceBase
{
return true;
}

/**
* A wildcard function used to change the primal variables.
*