Hi, I used my own data to train the teacher model and the student model. The teacher model trains normally, but the KL loss of the student model becomes NaN at about step 50k. The log output looks like this:
Global Step : 67438, [55, 100] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.301 nan]
100 Step Time : 143.40849208831787
Global Step : 67538, [55, 200] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3145 nan]
100 Step Time : 142.97523188591003
Global Step : 67638, [55, 300] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3899 nan]
100 Step Time : 142.70189571380615
Global Step : 67738, [55, 400] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.2744 nan]
100 Step Time : 142.38775205612183
Global Step : 67838, [55, 500] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.436 nan]
100 Step Time : 142.91834259033203
Global Step : 67938, [55, 600] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3337 nan]
100 Step Time : 142.9723343849182
Global Step : 68038, [55, 700] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.5294 nan]
100 Step Time : 142.89931344985962
Global Step : 68138, [55, 800] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3567 nan]
100 Step Time : 143.14595890045166
Global Step : 68238, [55, 900] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.3591 nan]
100 Step Time : 143.44508004188538
Global Step : 68338, [55, 1000] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.2139 nan]
100 Step Time : 143.32597756385803
Global Step : 68438, [55, 1100] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.402 nan]
100 Step Time : 142.90216040611267
Global Step : 68538, [55, 1200] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.4041 nan]
100 Step Time : 142.8380832672119
55 Epoch Training Loss : nan
100 [Total, KL, Reg, Frame Loss] : [ nan nan 2.042 nan]
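To narrow down which loss terms go NaN and at which step, a log like the one above could be scanned with a small script. This is only a minimal sketch, assuming the log lines follow exactly the "Global Step" format shown above; `parse_loss_line` and `nan_components` are hypothetical helper names for illustration, not part of the training code.

```python
import math
import re

# Assumed line format (copied from the log above):
# Global Step : 67438, [55, 100] [Total Loss, KL Loss, Reg Loss, Frame Loss] : [ nan nan 2.301 nan]
LINE_RE = re.compile(
    r"Global Step\s*:\s*(\d+),\s*\[(\d+),\s*(\d+)\]\s*\[([^\]]+)\]\s*:\s*\[([^\]]+)\]"
)


def parse_loss_line(line):
    """Parse one 'Global Step' log line; return None for any other line."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    names = [name.strip() for name in match.group(4).split(",")]
    # float("nan") is valid, so NaN entries parse like any other number.
    values = [float(value) for value in match.group(5).split()]
    if len(names) != len(values):
        return None
    return {
        "global_step": int(match.group(1)),
        "epoch": int(match.group(2)),
        "step": int(match.group(3)),
        "losses": dict(zip(names, values)),
    }


def nan_components(record):
    """Return the names of the loss terms that are NaN in a parsed record."""
    return [name for name, value in record["losses"].items() if math.isnan(value)]
```

Running this over the full log from the start of training and printing the first record where `nan_components` is non-empty would show the exact global step at which the loss first became NaN.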