Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixed C and C++ implementation #31

Open
nychiang opened this issue Oct 6, 2023 · 15 comments
Open

Mixed C and C++ implementation #31

nychiang opened this issue Oct 6, 2023 · 15 comments

Comments

@nychiang
Copy link
Collaborator

nychiang commented Oct 6, 2023

I compiled Exago on two different LLNL platforms, using both gcc-8.3.1 and gcc-12.1.1. In both cases, I encountered segmentation faults at runtime. Upon investigation, I discovered that certain objects were not being initialized correctly. For instance, the opflow->modelname variable is a C++ std::string object, but it was not being allocated memory using the c-type calloc function.

This issue led to a segmentation fault at the line where opflow->modelname was being assigned a value:
opflow->modelname = OPFLOWOptions::model.opt

For more information about this type of error, you can refer to this discussion on Stack Overflow:
See Here.

It's worth noting that when I used new to initialize the opflow object and removed the PetscCalloc function like this:

  OPFLOW opflow = new _p_OPFLOW;

I was able to compile and run Exago to solve test problems. However, I encountered segmentation faults when Petsc was used to release memory allocated for the opflow object. (It shouldn't since the memory is not managed/allocated by Petsc).

@cameronrutherford @jaelynlitz @abhyshr

exago@develop%[email protected]+cuda+mpi+hiop+raja+ipopt+python cuda_arch=70
    ^[email protected]~hdf5~hypre~mumps~superlu-dist 
    ^hiop@develop+cuda+raja+sparse+mpi+deepchecking+ginkgo+cusolver_lu cuda_arch=70
    ^[email protected]+cuda~examples~exercises~ipo+openmp~rocm+shared~tests cuda_arch=70 
    ^[email protected]+cuda cuda_arch=70
    ^[email protected]~c+cuda~deviceconst~examples~fortran~ipo~numa~openmp~rocm~shared cuda_arch=70 tests=none
    ^[email protected]

This is the configuration that I got segmentation faults.

@abhyshr
Copy link
Collaborator

abhyshr commented Oct 6, 2023

@cameronrutherford have we seen this behavior on other platforms or other GCC compilers? What's different here? I don't know much about C++. In C, one would allocate memory for a string and then use it.

@abhyshr
Copy link
Collaborator

abhyshr commented Oct 6, 2023

The object creation via PetscCalloc has nothing to do with the options. So, we need to fix the issue with options directly. The issue you are seeing with new is because PetscFree gets called on the struct when it's being destroyed and it requires the object to be created with PetscMalloc or PetscCalloc.

@cameronrutherford
Copy link
Contributor

This is the first time I've seen this issue @abhyshr. Quite a few bugs have cropped up recently for the first time which is very mysterious but we seem to be squashing them...

@nychiang please provide a way to reproduce so we can try and fix things (spack spec I assume?)

If we are using new to allocate the OPFLOW struct (we should be), then the string should allocate memory just fine.

After using new, it sounds like the next error is instead in how we are freeing memory, so perhaps we need to look into some of the cleanup code.

We did plan to fully transition ExaGO to C++ classes one day, but that isn't today.

@abhyshr
Copy link
Collaborator

abhyshr commented Oct 7, 2023

Hmm...what are these bugs? Have we documented them somewhere? Still feel using new may resolve this issue, but it is related to some string memory not being allocated correctly. When was the last time anyone ran Valgrind on ExaGO?

@bjpalmer
Copy link
Collaborator

bjpalmer commented Oct 9, 2023

I just ran a SCOPFLOW test on newell with Valgrind. Is there some way of uploading a file to an issue? I don't see anything that allows you to do that but then, I'm an idiot...

@nychiang
Copy link
Collaborator Author

nychiang commented Oct 9, 2023

@abhyshr @cameronrutherford
new cannot fully solve this issue. I tried the followings:

  1. PetscCalloc + PetscFree: Segmentation fault immediately appears when [opflow->modelname = OPFLOWOptions::model.opt]. GDB shows there is no access to opflow->modelname`.
  2. new to create the object + PetscCalloc + PetscFree: PetscCalloc will wipe out the memory allocation assigned to string variable by new and leaves it un-allocated. Segmentation fault immediately appears when [opflow->modelname = OPFLOWOptions::model.opt](https://github.com/pnnl/ExaGO/blob/2a20d5c225b1899018b405c726de8f7089cce9dd/src/opflow/interface/opflow.cpp#L749-L765) is called.
  3. new to create the object +PetscFree, no PetscCalloc: ExaGo can solve the optimization problem and got segmentation fault when PetscFree is called to free Jacobian matrix.

@cameronrutherford
Copy link
Contributor

I just ran a SCOPFLOW test on newell with Valgrind. Is there some way of uploading a file to an issue? I don't see anything that allows you to do that but then, I'm an idiot...

@bjpalmer When creating a comment in GitHub, there is some text below the text box that says "attach files by dragging & dropping, selecting or pasting them". I have uploaded a screenshot of what I am seeing for reference.
Screenshot 2023-10-10 at 11 49 20 AM

Hmm...what are these bugs? Have we documented them somewhere? Still feel using new may resolve this issue, but it is related to some string memory not being allocated correctly. When was the last time anyone ran Valgrind on ExaGO?

@abhyshr This is the first time that I am seeing this. I still would like @nychiang to provide a more concrete spec of how ExaGO was built, so that we can reproduce. Very hard to try and implement a fix without a reproducer, or an error in our pipelines to catch this moving forward.

@abhyshr @cameronrutherford
new cannot fully solve this issue. I tried the followings:

  1. PetscCalloc + PetscFree: Segmentation fault immediately appears when [opflow->modelname = OPFLOWOptions::model.opt]. GDB shows there is no access to opflow->modelname`.
  2. new to create the object + PetscCalloc + PetscFree: PetscCalloc will wipe out the memory allocation assigned to string variable by new and leaves it un-allocated. Segmentation fault immediately appears when opflow->modelname = OPFLOWOptions::model.opt is called.
  3. new to create the object +PetscFree, no PetscCalloc: ExaGo can solve the optimization problem and got segmentation fault when PetscFree is called to free Jacobian matrix.

@nychiang despite not having a reproducer, I think I have followed from your links and debugging process.

To summarize, the issue occurs because the OPFLOW struct has an std::string member object. When the OPFLOW struct is initialized (be it through PetscCalloc, new, or some combination - they will all have the same error), the memory for the various std::string objects is not allocated (per SO Post that you linked).

To fix this, I think the easiest approach is to have opflow->modelname and opflow->solvername just be pointers. That way the memory for the pointer will be allocated, and then we can initialize the strings at runtime with opflow->modelname = new std::string(OPFLOWOptions::model.opt).

Only tricky part would be to make sure that we free the associated memory created with new once the OPFLOW object is destroyed? This would all be easier if OPFLOW and ExaGO apps were C++ classes, but I digress.

@nychiang
Copy link
Collaborator Author

exago@develop%[email protected]+cuda+mpi+hiop+raja+ipopt+python cuda_arch=70
    ^[email protected]~hdf5~hypre~mumps~superlu-dist 
    ^hiop@develop+cuda+raja+sparse+mpi+deepchecking+ginkgo+cusolver_lu cuda_arch=70
    ^[email protected]+cuda~examples~exercises~ipo+openmp~rocm+shared~tests cuda_arch=70 
    ^[email protected]+cuda cuda_arch=70
    ^[email protected]~c+cuda~deviceconst~examples~fortran~ipo~numa~openmp~rocm~shared cuda_arch=70 tests=none
    ^[email protected]

This is the configuration that I got segmentation faults. @cameronrutherford

@cameronrutherford
Copy link
Contributor


exago@develop%[email protected]+cuda+mpi+hiop+raja+ipopt+python cuda_arch=70

    ^[email protected]~hdf5~hypre~mumps~superlu-dist 

    ^hiop@develop+cuda+raja+sparse+mpi+deepchecking+ginkgo+cusolver_lu cuda_arch=70

    ^[email protected]+cuda~examples~exercises~ipo+openmp~rocm+shared~tests cuda_arch=70 

    ^[email protected]+cuda cuda_arch=70

    ^[email protected]~c+cuda~deviceconst~examples~fortran~ipo~numa~openmp~rocm~shared cuda_arch=70 tests=none

    ^[email protected]

This is the configuration that I got segmentation faults. @cameronrutherford

Okay, so this sounds like something that is unique to hiop@develop, and so is it a follow on from #28?

It sounds like we can get some of the builds passing, but now this issue is specific to runtime errors that are happening with hiop@develop. I imagine there are some new compiler flags that are being passed around / +/~deepchecking is to be considered

@nychiang
Copy link
Collaborator Author

nychiang commented Oct 12, 2023

umm... I don't think this comes from hiop@develop, since this segmentation fault happens at line
opflow->modelname = OPFLOWOptions::model.opt. This runtime error happens when ExaGo tries to access one C++ string member variable of class OPFLOW, while this C++ object is initialized/allocated by PetscCalloc.

I guess maybe some newly added cflags introduce this error into exago@develop

@cameronrutherford
Copy link
Contributor

cameronrutherford commented Oct 12, 2023

umm... I don't think this comes from hiop@develop, since this segmentation fault happens at line opflow->modelname = OPFLOWOptions::model.opt. This runtime error happens when ExaGo tries to access one C++ string member variable of class OPFLOW, while this C++ object is initialized/allocated by PetscCalloc.

I guess maybe some newly added cflags introduce this error into exago@develop

To verify this, could you try and run the same build with [email protected] and compare it to hiop@develop? None of our tests are currently catching this error running with 1.0.0.

With #20 merged, we now have an easy way to kick off builds on Ascent, Incline, Newell and Deception too run tests, so if you want a new build of hiop@develop tested on those platforms for comparison, I can do that for you.

I currently have no path forward to reproducing this on our current 1.0.0 builds. We are seeing no issues with [email protected] in testing, and nothing in your spack spec is noticeably different apart from the hiop version. I ask because we are going to work on getting hiop@develop compatible after the release, and if this is also reproducible with 1.0.0 that is important.

EDIT: Tentatively assigning 1.6.1 release unless this is proven to be reproducible with 1.0.0

@nychiang
Copy link
Collaborator Author

I just tried it with [email protected] and I still got the same segmentation faults.
Note that:

  1. I firstly build everything via spack and generate modules for all the dependencies.
  2. I load all the modules and build exago@develop in Debug model, links it to hiop (1.0.0 or develop)
  3. In order to build exago in debug mode, I comment out the code here. . See issue Build ExaGO in Debug mode #46 .

BTW, I think these errors cannot be detected by CI since they are removed from CI, due to opflow fails with hiop, at the moment. See https://gitlab.pnnl.gov/exasgd/frameworks/exago/-/issues/469
Note that I have filed a hiop PR to fix this opflow failure.

I will suggest let's move on first, i.e., assign ExaGo 1.6.1 release with HiOp v1.0.1.

@cameronrutherford
Copy link
Contributor

cameronrutherford commented Oct 13, 2023

Thanks for clarifying @nychiang, this is making some more sense. There are a set of overlapping issues here, but we are getting somewhere.

Order of operations seems to be:

Build errors:

Runtime errors:

Releases:

Again, happy to provide specific builds/tests on our CI systems. Even though we have all these test failures, I think this release schedule makes the most sense.

@cameronrutherford
Copy link
Contributor

Thanks for clarifying @nychiang, this is making some more sense. There are a set of overlapping issues here, but we are getting somewhere.

Order of operations seems to be:

Build errors:

Runtime errors:

Releases:

Again, happy to provide specific builds/tests on our CI systems. Even though we have all these test failures, I think this release schedule makes the most sense.

I have made a meta issue in #80 to keep track of this

@cameronrutherford
Copy link
Contributor

@nychiang since #80 was resolved, is this issue still open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants