[BUG] Memory leak occurs while using lightning.kokkos device for machine learning #6297
Comments
Thanks for opening this @JustinS6626. Is this memory leak specific to lightning.kokkos?
Thank you very much for getting back to me! I tried my large-scale model with …
It looks like the same thing happens when I run the code with …
Thanks for confirming, @JustinS6626. We're looking into this to check if it's due to the scale of the system or something else. Thanks for confirming that it's not only a lightning.kokkos issue; it helps a lot.
Thank you! I tried my code out on PennyLane v0.38.1 and the same thing happens on that version.
Hi @JustinS6626, thanks again for providing the above example. Since you are using Torch's CUDA device target, it is known that Torch will cache CUDA data (e.g., see https://stackoverflow.com/questions/55322434/how-to-clear-cuda-memory-in-pytorch and https://discuss.pytorch.org/t/about-torch-cuda-empty-cache/34232 for two representative discussions). It may be worth trying to see if freeing these caches gives you back some memory for your workload. You also have a somewhat deep circuit with many parameters (…). I will attach the output of the live analysis from https://github.com/bloomberg/memray, which shows memory ownership and allocations on the heap for the given workload. You can repeat the above with … If you think I may be incorrect about the above analysis, feel free to let us know. If that is the case, some more details about the full scale of your problem, the hardware it runs on, the Torch version you are using, or a smaller problem size that replicates the given report may help to identify the issue.
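For anyone following the cache-freeing suggestion above, here is a minimal, self-contained sketch of what that can look like in a training loop (hedged: the tiny linear model, loss, and step count are placeholders, and emptying the cache only hands pooled CUDA blocks back to the driver; it does not by itself fix a genuine leak). The script can also be run under memray's live mode, e.g. memray run --live script.py, to watch heap allocations as it trains.

import gc
import torch

# Tiny stand-in model and data so the loop below runs end to end.
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.MSELoss()
inputs = torch.randn(32, 4)
targets = torch.randn(32, 1)

for step in range(10):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

    # Drop the Python reference to the loss tensor, run the garbage collector,
    # and return cached CUDA blocks to the driver so profilers report actual
    # usage rather than allocator-pool growth.
    del loss
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()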
Hey @JustinS6626, if you click the …
Thanks! I tried using the …
Based on the feedback that I got from memray, it looks like the trouble spots are in the PennyLane-PyTorch interface.
Hi @JustinS6626, some suggestions that may be useful to identify the root cause: …
Thanks for getting back to me again! My PyTorch version is v2.4.1, which I think is the latest stable version. The loss calculation only happens for actual updates, without validation checks. From looking at the memray feedback, it seems that at least part of the root of the problem is in the …
Thanks for your patience @JustinS6626. When you saw the issues with …, could you also try setting …?
That will follow a slightly different logical pathway, and should actually be substantially more efficient for your type of problem (overall scalar function, quantum component with many observables). If … Each instance of the class should only exist inside a single execution, so I'd be rather concerned if the Python garbage collector is somehow broken and not collecting local variables between function calls. The … EDIT: Additional idea: if you want to avoid caching the jacobian on the forward pass, you can also set …
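For context, a minimal sketch of how a differentiation method is selected on a QNode under the Torch interface (hedged: illustrative only; the adjoint method is shown because a later reply mentions trying it, and the two-qubit circuit is a placeholder, not the workload from this thread):

import pennylane as qml
import torch

dev = qml.device("default.qubit", wires=2)

# The diff_method argument chooses the gradient pathway; "backprop" and
# "adjoint" follow different logic inside PennyLane.
@qml.qnode(dev, interface="torch", diff_method="adjoint")
def circuit(weights):
    qml.RX(weights[0], wires=0)
    qml.RY(weights[1], wires=1)
    qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

weights = torch.tensor([0.1, 0.2], requires_grad=True)
loss = circuit(weights)
loss.backward()  # gradient of the expectation value via the adjoint method
print(weights.grad)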
Thanks! I tried it with both the default and adjoint methods, and the memory leak still happens in both cases. I also tried with the …
I tried out the …
Given it happens with both …, just to double-check: how many wires are you using? Because if you are running into simulation memory issues with 8 wires, I am even more confused.
Just as a follow-up, were you able to run the classical part without hitting PennyLane too?
The same issue did not appear to be happening in the classical version, so I think it is PennyLane-specific. The amount of memory consumed increases steadily between optimization iterations, and does not decrease to its original level when the model begins its next update process. The attached file contains the code that is producing the memory leak I am observing. It's a modification of the code that was published with the following paper: https://proceedings.neurips.cc/paper_files/paper/2022/file/69413f87e5a34897cd010ca698097d0a-Paper-Conference.pdf It should allow the memory leak problem to be reproduced. In order to run it, you will need to install the Gymnasium and Minigrid packages from Farama, as well as imageio. The code can be executed with Multi-Agent-Transformer/mat/scripts/train_minigrid.sh. The main process for gradient calculation and updates is controlled from the file Multi-Agent-Transformer/mat/algorithms/mat/mat_trainer.py. Within the latest PennyLane version, there is an error on line 539 of the file pennylane.workflow.execution.py: the …
Thanks again for your input @JustinS6626; we recognize this is an important issue for your workload. The best we can do right now is to track this as an item to investigate on our roadmap, and provide feedback when possible. Per your comment on availability over the weekend, we likely will not be available to respond again until next week. If you have any insights before then, we can try to help out once we are again free to look into it.
Thank you very much! In the meantime, is there something I can do to implement manual control over the memory use of the simulation process while the gradient is being calculated? Also, if someone would be willing to try running my code to reproduce the issue and see if I have made any mistakes that would trick the simulator into holding onto memory, I would really appreciate that.
Hi @JustinS6626, unfortunately we're not able to test your code at the moment, and since we couldn't replicate your issue with the code you shared earlier there's not much more we can do. At this point, we know that it's not tied to any specific device (it happens on default.qubit, lightning.qubit, and lightning.kokkos) or to any specific differentiation method (both backprop and adjoint). Given it still occurs with backprop, it has nothing to do with how we bind jacobians to torch, as we don't manually do so with backprop. I suggest we keep this issue open as well as the thread you made in the Forum. This way, if someone else has either the same issue or a solution, they can post it and we can look into it further. We've noted the issue and will keep an eye on anything that could give us clues about the memory issues you're seeing. We'll post here if we find anything.
Thanks for getting back to me again! I am trying a workaround right now, and I will let you know what happens.
Update: My workaround was successful! It seems that when you implement PennyLane variational quantum circuit layers as PyTorch …
Thanks for letting us know @JustinS6626!! And great work finding the workaround!!
For sure! I can give an example here:
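A minimal sketch of the general pattern being described, i.e. exposing a PennyLane variational circuit as an ordinary PyTorch module (hedged: this is an illustrative reconstruction, not the snippet originally posted in the thread; the QuantumLayer name, the two-qubit circuit, and the weight shapes are assumptions):

import pennylane as qml
import torch
from torch import nn

n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(inputs, weights):
    qml.AngleEmbedding(inputs, wires=range(n_qubits))
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return [qml.expval(qml.PauliZ(w)) for w in range(n_qubits)]

class QuantumLayer(nn.Module):
    """A variational circuit wrapped as a standard torch.nn.Module."""

    def __init__(self, n_layers=3):
        super().__init__()
        shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
        # Registering the circuit weights as an nn.Parameter lets the Torch
        # optimizer own and update them like any other layer's weights.
        self.weights = nn.Parameter(0.1 * torch.randn(shape))

    def forward(self, x):
        return torch.stack(circuit(x, self.weights))

layer = QuantumLayer()
out = layer(torch.tensor([0.3, 0.7]))
out.sum().backward()
print(layer.weights.grad.shape)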
Thanks @JustinS6626! Good example.
I'll close the issue now that we know it's a Torch issue.
Apologies for the delay! I was able to fix the memory leak when I moved the quantum network architecture to the code for a different model, but the issue remains in the original code, and I have learned that it is actually caused by something other than the setup of …
Expected behavior
I apologize for reposting this issue from the forum (https://discuss.pennylane.ai/t/memory-leak-in-when-using-lighning-kokkos-device/5218), but it is a major roadblock for a time-sensitive project. The issue that I am reporting is a possible bug which may be in the QNode class or in one of the PennyLane devices. It causes a memory leak when the QNode object is called while PyTorch is set to calculate a gradient. Instead of releasing the memory once the gradient is calculated, PennyLane/PyTorch keeps holding onto it.
Actual behavior
The memory leak is shown through the profiling tool used in the code example below. The problem was originally spotted in a large-scale quantum machine learning project, where the training process halted early as a result of running out of memory. The output shown is not an actual error, but rather the result of tracking memory usage in the example.
Additional information
No response
Source code
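A minimal sketch of this kind of workload, for orientation only (hedged: the four-qubit circuit, the Torch training loop, and the use of tracemalloc to track Python-side allocations between optimization steps are illustrative assumptions, not the reporter's original code):

import tracemalloc
import pennylane as qml
import torch

n_qubits = 4
# Swap in "default.qubit" or "lightning.qubit" if the Kokkos backend is not installed;
# per the discussion above, the growth was reported on all three devices.
dev = qml.device("lightning.kokkos", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def circuit(weights):
    qml.StronglyEntanglingLayers(weights, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

shape = qml.StronglyEntanglingLayers.shape(n_layers=4, n_wires=n_qubits)
weights = torch.nn.Parameter(0.1 * torch.randn(shape))
optimizer = torch.optim.Adam([weights], lr=0.05)

tracemalloc.start()
for step in range(50):
    optimizer.zero_grad()
    loss = circuit(weights)
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        current, peak = tracemalloc.get_traced_memory()
        # If memory is released between iterations, `current` should stay
        # roughly flat rather than growing on every step.
        print(f"step {step}: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
tracemalloc.stop()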
Tracebacks
System information
Name: PennyLane
Version: 0.38.0
Summary: PennyLane is a cross-platform Python library for quantum computing, quantum machine learning, and quantum chemistry. Train a quantum computer the same way as a neural network.
Home-page: https://github.com/PennyLaneAI/pennylane
Author:
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.11/dist-packages
Requires: appdirs, autograd, autoray, cachetools, networkx, numpy, packaging, pennylane-lightning, requests, rustworkx, scipy, toml, typing-extensions
Required-by: PennyLane-qiskit, pennylane-qulacs, PennyLane_Lightning, PennyLane_Lightning_GPU, PennyLane_Lightning_Kokkos

Platform info: Linux-6.8.0-40-generic-x86_64-with-glibc2.35
Python version: 3.11.0
Numpy version: 1.26.3
Scipy version: 1.12.0
Installed devices:
- lightning.kokkos (PennyLane_Lightning_Kokkos-0.38.0)
- qiskit.aer (PennyLane-qiskit-0.37.0)
- qiskit.basicaer (PennyLane-qiskit-0.37.0)
- qiskit.basicsim (PennyLane-qiskit-0.37.0)
- qiskit.ibmq (PennyLane-qiskit-0.37.0)
- qiskit.ibmq.circuit_runner (PennyLane-qiskit-0.37.0)
- qiskit.ibmq.sampler (PennyLane-qiskit-0.37.0)
- qiskit.remote (PennyLane-qiskit-0.37.0)
- default.clifford (PennyLane-0.38.0)
- default.gaussian (PennyLane-0.38.0)
- default.mixed (PennyLane-0.38.0)
- default.qubit (PennyLane-0.38.0)
- default.qubit.autograd (PennyLane-0.38.0)
- default.qubit.jax (PennyLane-0.38.0)
- default.qubit.legacy (PennyLane-0.38.0)
- default.qubit.tf (PennyLane-0.38.0)
- default.qubit.torch (PennyLane-0.38.0)
- default.qutrit (PennyLane-0.38.0)
- default.qutrit.mixed (PennyLane-0.38.0)
- default.tensor (PennyLane-0.38.0)
- null.qubit (PennyLane-0.38.0)
- lightning.qubit (PennyLane_Lightning-0.38.0)
- lightning.gpu (PennyLane_Lightning_GPU-0.35.1)
- qulacs.simulator (pennylane-qulacs-0.36.0)
Existing GitHub issues