[scripts] implement max-change within customized SGD optimizer #4032
base: pybind11
Conversation
egs/aishell/s10/chain/sgd_mc.py (outdated)

from torch.optim.optimizer import Optimizer, required
...
class SGD_MC(Optimizer):
I think SgdMaxChange might be clearer? And called sgd_max_change.py? Probably closer to the Google style guide.
Please make sure the added parameters are documented.
... and how about the results? Does it actually perform better than Adam?
Want to double-check several things. The max_change and max_change_per_layer we are trying to implement are the norms of the proposed tensor delta. But in the case of SGD, we may first get the norm of the gradient, and the gradient and tensor delta differ by a factor of the learning rate? So for individual layers, it should be like:

if norm * group['lr'] > max_change_per_layer:
    d_p.mul_(max_change_per_layer / norm / group['lr'])

Then, when computing the norm for the entire model, should we use the norms of individual layers before or after the adjustment of max_change_per_layer? Lastly, if the max_change constraint works as intended, we no longer need to apply the PyTorch gradient clipping?
> Want to double-check several things. The max_change and max_change_per_layer we are trying to implement are the norms of the proposed tensor delta. But in the case of SGD, we may first get the norm of the gradient, and the gradient and tensor delta differ by a factor of the learning rate?

Yes.

> So for individual layers, it should be like:
>
> if norm * group['lr'] > max_change_per_layer:
>     d_p.mul_(max_change_per_layer / norm / group['lr'])

Sounds right, although you mean a / (b / c), not a / b / c.

> Then, when computing the norm for the entire model, should we use the norms of individual layers before or after the adjustment of max_change_per_layer?

After.

> Lastly, if the max_change constraint works as intended, we no longer need to apply the PyTorch gradient clipping?

Likely, yes. But it's still worthwhile comparing whether there is any advantage in doing it with max-change.
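For reference, this is roughly what the existing PyTorch gradient clipping looks like, using torch.nn.utils.clip_grad_norm_. Note that it constrains the norm of the gradient itself rather than the lr-scaled update, which is exactly the distinction being discussed; the model and threshold below are placeholders, not the PR's actual training setup.

```python
import torch

model = torch.nn.Linear(10, 10)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

loss = model(torch.randn(4, 10)).sum()
loss.backward()
# Rescale gradients so their total 2-norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
optimizer.step()
```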
> although you mean a / (b / c), not a / b / c.

For this one I wasn't sure. norm is the norm of d_p (the gradient adjusted by weight_decay, momentum, etc.), so norm * group['lr'] should be the proposed change to the matrix? If it is greater than max_change, we should limit it by multiplying by max_change / (norm * group['lr']), or max_change / norm / group['lr'], which would be a factor less than 1?
Oh yes, max_change / (norm * group['lr']). I always avoid a / b / c if not
using parentheses, because not everyone remembers the associativity of '/'.
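Putting the pieces of this thread together, here is a minimal sketch of the agreed-upon per-layer and global max-change logic applied to a plain SGD step. This is only an illustration under assumptions, not the code in sgd_mc.py: the class name SgdMaxChange, the default values, and the omission of momentum handling are all simplifications for brevity.

```python
import torch
from torch.optim.optimizer import Optimizer, required


class SgdMaxChange(Optimizer):
    """SGD with per-layer and global limits on the norm of the parameter change.

    max_change_per_layer: largest allowed 2-norm of lr * d_p for one tensor.
    max_change: largest allowed 2-norm of the whole model's update per step.
    """

    def __init__(self, params, lr=required, weight_decay=0,
                 max_change_per_layer=0.75, max_change=1.5):
        defaults = dict(lr=lr, weight_decay=weight_decay,
                        max_change_per_layer=max_change_per_layer,
                        max_change=max_change)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()

        for group in self.param_groups:
            lr = group['lr']
            deltas = []          # (param, proposed update) pairs
            total_sq_norm = 0.0  # accumulated AFTER per-layer limiting

            for p in group['params']:
                if p.grad is None:
                    continue
                d_p = p.grad.clone()
                if group['weight_decay'] != 0:
                    d_p.add_(p, alpha=group['weight_decay'])
                # (momentum handling omitted for brevity)

                # Per-layer max-change: the proposed change is lr * ||d_p||,
                # so the limiting factor is max_change_per_layer / (norm * lr).
                norm = d_p.norm(2).item()
                if norm * lr > group['max_change_per_layer']:
                    d_p.mul_(group['max_change_per_layer'] / (norm * lr))

                deltas.append((p, d_p))
                total_sq_norm += d_p.norm(2).item() ** 2

            # Global max-change, computed from the already-limited layer norms.
            total_norm = total_sq_norm ** 0.5
            scale = 1.0
            if total_norm * lr > group['max_change']:
                scale = group['max_change'] / (total_norm * lr)

            for p, d_p in deltas:
                p.add_(d_p, alpha=-lr * scale)

        return loss
```

With something like this in place, a separate clip_grad_norm_ call before step() would most likely be redundant, though as noted above it is worth comparing the two empirically.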
My bad, I just went for a quick fix, but this is indeed poor coding style.
Let us know the effect on WER. You can't always predict the effect on WER
just from the objective values.
Adam:
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_cer <==
%WER 7.15 [ 7491 / 104765, 178 ins, 465 del, 6848 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/test/cer_10_0.5
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_wer <==
%WER 15.47 [ 9968 / 64428, 918 ins, 1511 del, 7539 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/test/wer_12_0.0
==> exp/chain_pybind/tdnn_sp/train/decode_res/dev/scoring_kaldi/best_cer <==
%WER 6.06 [ 12439 / 205341, 321 ins, 591 del, 11527 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/dev/cer_10_0.0
==> exp/chain_pybind/tdnn_sp/train/decode_res/dev/scoring_kaldi/best_wer <==
%WER 13.79 [ 17608 / 127698, 1454 ins, 2772 del, 13382 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/dev/wer_11_0.0

SgdMaxChange:
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_cer <==
%WER 7.36 [ 7715 / 104765, 187 ins, 474 del, 7054 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/test/cer_10_0.5
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_wer <==
%WER 15.83 [ 10202 / 64428, 804 ins, 1685 del, 7713 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/test/wer_11_0.5
==> exp/chain_pybind/tdnn_sp/train/decode_res/dev/scoring_kaldi/best_cer <==
%WER 6.29 [ 12908 / 205341, 296 ins, 555 del, 12057 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/dev/cer_9_0.5
==> exp/chain_pybind/tdnn_sp/train/decode_res/dev/scoring_kaldi/best_wer <==
%WER 14.13 [ 18048 / 127698, 1583 ins, 2644 del, 13821 sub ] exp/chain_pybind/tdnn_sp/train/decode_res/dev/wer_10_0.0
What are the learning rate schedules?
The learning rate schedule is 1e-3 * pow(0.4, epoch).
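For concreteness, that schedule could be expressed with PyTorch's LambdaLR scheduler as below; this is only an illustration of the stated formula, and the actual training script may set the learning rate differently.

```python
import torch

model = torch.nn.Linear(10, 10)                              # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)     # base lr = 1e-3
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: pow(0.4, epoch))      # lr = 1e-3 * 0.4 ** epoch

for epoch in range(6):
    # ... one epoch of training ...
    scheduler.step()
```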
Try with double the learning rate.
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_cer <==
==> exp/chain_pybind/tdnn_sp/train/decode_res/test/scoring_kaldi/best_wer <==
==> exp/chain_pybind/tdnn_sp/train/decode_res/dev/scoring_kaldi/best_wer <==
OK. It looks like right now this isn't giving us an improvement over Adam: let's merge the code, but please change the top-level script so it still uses Adam, as I don't want to regress the results.
Top-level script reverted to Adam.
Thanks!! @songmeixu do you want to go through this? Or should I just merge?
Please give me two days to go through this. I am doing it now. Thanks @aadps for waiting!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed by a bot strictly because of inactivity. This does not mean that we think that this issue is not important! If you believe it has been closed hastily, add a comment to the issue and mention @kkm000, and I'll gladly reopen it.
This issue has been automatically marked as stale by a bot solely because it has not had recent activity. Please add any comment (simply 'ping' is enough) to prevent the issue from being closed for 60 more days if you believe it should be kept open.
Needs further tests and reviews. The total change per minibatch is logged and should be very easy to add to a TensorBoard plot at a later point.
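As a sketch of that later step, the logged total change could be written to TensorBoard with torch.utils.tensorboard.SummaryWriter as below; the tag name, log directory, and the way the value is obtained from the optimizer are assumptions, not part of this PR.

```python
from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter(log_dir='runs/sgd_max_change')        # hypothetical log dir
for step, total_change in enumerate([0.12, 0.09, 0.11]):
    # Stand-in values; in training these would come from the optimizer's
    # per-minibatch total-change log.
    writer.add_scalar('train/total_change_per_minibatch', total_change, step)
writer.close()
```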