From c65e7780c1a1096013e8a4c1f8249fb464588117 Mon Sep 17 00:00:00 2001
From: Erik Dervishi <132947518+ErikDervishi03@users.noreply.github.com>
Date: Sun, 6 Oct 2024 16:17:03 +0200
Subject: [PATCH 1/2] Update callbacks.md

Refactored the gradient clipping section of doc/callbacks.md to reduce
repetition and improve clarity.
---
 doc/callbacks.md | 39 +++++++++++++--------------------------
 1 file changed, 13 insertions(+), 26 deletions(-)

diff --git a/doc/callbacks.md b/doc/callbacks.md
index b80689547..60fbc2826 100644
--- a/doc/callbacks.md
+++ b/doc/callbacks.md
@@ -144,19 +144,17 @@ smorms3.Optimize(lrfTrain, coordinates, cb);
 
 
 
-### GradClipByNorm
-
-One difficulty with optimization is that large parameter gradients can lead an
-optimizer to update the parameters strongly into a region where the loss
-function is much greater, effectively undoing much of the work done to get to
-the current solution. Such large updates during the optimization can cause a
-numerical overflow or underflow, often referred to as "exploding gradients". The
-exploding gradient problem can be caused by: Choosing the wrong learning rate
-which leads to huge updates in the gradients. Failing to scale a data set
-leading to very large differences between data points. Applying a loss function
-that computes very large error values.
-
-A common answer to the exploding gradients problem is to change the derivative
+### Gradient Clipping
+One challenge in optimization is dealing with "exploding gradients," where large parameter gradients can cause the optimizer to make excessively large updates, potentially pushing the model into regions of high loss or causing numerical instability. This can happen due to:
+
+* A high learning rate, leading to large gradient updates.
+* Poorly scaled datasets, resulting in significant variance between data points.
+* A loss function that generates disproportionately large error values.
+
+Common solutions for this problem are:
+
+#### GradClipByNorm
+In this method, the solution is to change the derivative
 of the error before applying the update step. One option is to clip the norm
 `||g||` of the gradient `g` before a parameter update. So given the gradient,
 and a maximum norm value, the callback normalizes the gradient so that its
@@ -186,19 +184,8 @@ arma::mat coordinates = f.GetInitialPoint();
 optimizer.Optimize(f, coordinates, GradClipByNorm(0.3));
 ```
 
-### GradClipByValue
-
-One difficulty with optimization is that large parameter gradients can lead an
-optimizer to update the parameters strongly into a region where the loss
-function is much greater, effectively undoing much of the work done to get to
-the current solution. Such large updates during the optimization can cause a
-numerical overflow or underflow, often referred to as "exploding gradients". The
-exploding gradient problem can be caused by: Choosing the wrong learning rate
-which leads to huge updates in the gradients. Failing to scale a data set
-leading to very large differences between data points. Applying a loss function
-that computes very large error values.
-
-A common answer to the exploding gradients problem is to change the derivative
+#### GradClipByValue
+In this method, the solution is to change the derivative
 of the error before applying the update step. One option is to clip the
 parameter gradient element-wise before a parameter update.
 
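For reference, the norm-clipping rule that the new `GradClipByNorm` text describes can be sketched in a few lines of standalone Armadillo code: if the gradient's norm exceeds the given maximum, the gradient is rescaled so that its norm equals that maximum. This is only an illustration of the technique as worded above, not ensmallen's actual `GradClipByNorm` implementation; the helper name `ClipGradientByNorm` and the sample values are invented for the example.

```c++
#include <armadillo>

// Illustrative sketch: rescale `gradient` so that its L2 (Frobenius) norm is
// at most `maxNorm`, keeping its direction unchanged.
void ClipGradientByNorm(arma::mat& gradient, const double maxNorm)
{
  const double norm = arma::norm(gradient, "fro");
  if (norm > maxNorm)
    gradient *= (maxNorm / norm);
}

int main()
{
  arma::mat g = { { 30.0, -40.0 } };  // Norm is 50.
  ClipGradientByNorm(g, 0.3);         // Same maximum norm as the doc example.
  g.print("clipped gradient:");       // Same direction, norm 0.3.
  return 0;
}
```

In real use none of this is written by hand; the callback is simply passed to `Optimize()`, as in the `GradClipByNorm(0.3)` example visible in the hunk above.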
From 1168dc5a4c37f07ed236171ab26c43e1d92b20f4 Mon Sep 17 00:00:00 2001
From: Erik Dervishi <132947518+ErikDervishi03@users.noreply.github.com>
Date: Mon, 7 Oct 2024 21:15:57 +0200
Subject: [PATCH 2/2] Apply suggestions from code review

Changes to maintain document consistency.

Co-authored-by: Ryan Curtin
---
 doc/callbacks.md | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/doc/callbacks.md b/doc/callbacks.md
index 60fbc2826..3f608e2e0 100644
--- a/doc/callbacks.md
+++ b/doc/callbacks.md
@@ -145,7 +145,11 @@ smorms3.Optimize(lrfTrain, coordinates, cb);
 
 
 ### Gradient Clipping
-One challenge in optimization is dealing with "exploding gradients," where large parameter gradients can cause the optimizer to make excessively large updates, potentially pushing the model into regions of high loss or causing numerical instability. This can happen due to:
+
+One challenge in optimization is dealing with "exploding gradients", where large
+parameter gradients can cause the optimizer to make excessively large updates,
+potentially pushing the model into regions of high loss or causing numerical
+instability. This can happen due to:
 
 * A high learning rate, leading to large gradient updates.
 * Poorly scaled datasets, resulting in significant variance between data points.
@@ -154,6 +158,7 @@ One challenge in optimization is dealing with "exploding gradients", where large
 Common solutions for this problem are:
 
 #### GradClipByNorm
+
 In this method, the solution is to change the derivative
 of the error before applying the update step. One option is to clip the norm
 `||g||` of the gradient `g` before a parameter update. So given the gradient,
@@ -185,6 +190,7 @@ optimizer.Optimize(f, coordinates, GradClipByNorm(0.3));
 ```
 
 #### GradClipByValue
+
 In this method, the solution is to change the derivative
 of the error before applying the update step. One option is to clip the
 parameter gradient element-wise before a parameter update.
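Similarly, the element-wise clipping that the `GradClipByValue` paragraphs describe amounts to clamping every gradient element to a fixed range. Again, this is a minimal standalone sketch of the idea rather than the callback's real implementation; the helper name `ClipGradientByValue` and the bounds are assumed for illustration.

```c++
#include <armadillo>

// Illustrative sketch: clamp every element of `gradient` to the range
// [minValue, maxValue].
void ClipGradientByValue(arma::mat& gradient,
                         const double minValue,
                         const double maxValue)
{
  gradient = arma::clamp(gradient, minValue, maxValue);
}

int main()
{
  arma::mat g = { { -5.0, 0.2, 12.0 } };
  ClipGradientByValue(g, -1.0, 1.0);
  g.print("clipped gradient:");  // Prints -1.0  0.2  1.0.
  return 0;
}
```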