perf: faster and less memory-intensive model [re]quantization #290
  if qmodule is None:
      return None
  with torch.no_grad():
-     qmodule.weight.copy_(module.weight)
+     qmodule.weight = module.weight
I'm not entirely sure the copy is necessary. The copy_ call uses extra RAM when module.weight is lazily loaded (i.e. it hasn't been loaded yet but gets loaded because of the copy_ call), which has caused my computer to run out of memory in the past (e.g. when loading a 12B model with 32 GB of RAM).
The copy_ is not strictly necessary, since so far the models are quantized in place. I had planned to change this, but in-place quantization has proven very convenient and memory efficient.
The only issue I can see is with tied weights, should we want to modify them during calibration.
Anyway, when freezing, new quantized weights are created and even if the weights were still tied they would be untied.
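The trade-off discussed in this thread can be illustrated with plain PyTorch. This is a minimal sketch, not quanto's code: `src`, `dst_copy` and `dst_assign` are hypothetical stand-ins for `module` and `qmodule`, and ordinary `torch.nn.Linear` layers are assumed instead of quantized modules.

```python
import torch

# Hypothetical stand-ins for `module` and `qmodule` (not quanto classes).
src = torch.nn.Linear(8, 4)
dst_copy = torch.nn.Linear(8, 4)
dst_assign = torch.nn.Linear(8, 4)

# copy_ writes into dst_copy's own, already allocated storage,
# so both weight tensors stay resident in memory.
with torch.no_grad():
    dst_copy.weight.copy_(src.weight)

# Assignment rebinds the attribute to the same Parameter object:
# no second buffer is filled, but the two modules now share the weight.
dst_assign.weight = src.weight

copied_shares_storage = dst_copy.weight.data_ptr() == src.weight.data_ptr()
assigned_shares_storage = dst_assign.weight.data_ptr() == src.weight.data_ptr()

# Because the Parameter is shared, an in-place update is visible through
# both modules, while the copied weight keeps its old values.
with torch.no_grad():
    src.weight.add_(1.0)
still_equal_after_update = torch.equal(dst_assign.weight, src.weight)
copy_diverged = not torch.equal(dst_copy.weight, src.weight)
```

The shared Parameter is what makes the tied-weights question above relevant: after assignment, modifying one module's weight in place also modifies the other's.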
@@ -200,13 +201,15 @@ def from_module(
      activations: Optional[qtype] = None,
      optimizer: Optional[Optimizer] = None,
  ):
-     qmodule = cls.qcreate(module, weights, activations, optimizer)
+     with init_empty_weights():
init_empty_weights() is called since it prevents random weight initialization from happening, which was the main cause of the slow performance of quantize/requantize
Your solution is correct, but accelerate is only an optional dependency: see an alternate solution in #291.
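One way to skip random initialization without importing accelerate is PyTorch's own meta device. The sketch below works under that assumption and is not necessarily the approach taken in #291; `make_empty_like` is a hypothetical helper, and a plain `torch.nn.Linear` stands in for a quantized module.

```python
import torch


def make_empty_like(module: torch.nn.Linear) -> torch.nn.Linear:
    """Hypothetical helper: build a Linear whose parameters hold no data."""
    # Parameters created on the "meta" device carry shape and dtype only,
    # so the usual random initialization has nothing to fill in.
    return torch.nn.Linear(
        module.in_features,
        module.out_features,
        bias=module.bias is not None,
        device="meta",
    )


pretrained = torch.nn.Linear(16, 4)
empty = make_empty_like(pretrained)
was_meta = empty.weight.is_meta

# Rebind the parameters to the pretrained ones instead of copying into them.
empty.weight = pretrained.weight
if pretrained.bias is not None:
    empty.bias = pretrained.bias

# The module is now usable and produces the same output as the original.
out = empty(torch.ones(2, 16))
```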
@@ -133,12 +133,11 @@ def move_tensor(t, device):
          setattr(m, name, torch.nn.Parameter(move_tensor(param, "cpu")))
      for name, param in m.named_buffers(recurse=False):
          setattr(m, name, move_tensor(param, "cpu"))
  # Freeze model and move to target device
- freeze(model)
I'm not sure why freeze(model) was called here. In quantize(), it's called so all of the weights get set to their quantized versions, but we're already setting quantized weights in requantize() via model.load_state_dict(), so I don't think the freeze(model) call does anything here.
I think you're correct, but only because when loading the state dict we force an assign if the module is unfrozen (see line 186 of qmodule.py).
Thank you for tracking down the memory issues and useless operations: this is a very valuable contribution. Since I would like to avoid a direct dependency on accelerate, could you rebase on the branch I referenced in the comments?
Rebased and merged as #297
What does this PR do?
Currently, optimum.quanto's quantize/requantize functions run slowly for large models because quantized modules (e.g. QLinear) are initialized with random weights that are immediately replaced with the pretrained weights. This also causes these functions to use more CPU RAM than necessary, which is especially visible with models whose weights are lazily loaded. This PR makes quantize/requantize run essentially instantly while using less RAM for lazily loaded models.
Repro
Results for above code (before fix)
Results for above code (after fix)
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.