From 55ea20241068516903e50141614ebdb8df0a1ab1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 17:42:04 +0100 Subject: [PATCH 1/6] docs: Add intuitive explanation of JD --- docs/source/index.rst | 1 + docs/source/jacobian_descent.md | 63 +++++++++++++++++++++++++++++++++ 2 files changed, 64 insertions(+) create mode 100644 docs/source/jacobian_descent.md diff --git a/docs/source/index.rst b/docs/source/index.rst index 6da0f4d53..133a1d653 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -61,6 +61,7 @@ TorchJD is open-source, under MIT License. The source code is available on :hidden: installation.md + jacobian_descent.md examples/index.rst .. toctree:: diff --git a/docs/source/jacobian_descent.md b/docs/source/jacobian_descent.md new file mode 100644 index 000000000..63a2cd7a9 --- /dev/null +++ b/docs/source/jacobian_descent.md @@ -0,0 +1,63 @@ +# Jacobian descent + +This guide briefly explains what Jacobian descent is and on what kind of problems it can be used. +For a more theoretical explanation, please read our article +[Jacobian Descent for Multi-Objective Optimization](https://arxiv.org/pdf/2406.16232). + +## Introduction + +The goal of Jacobian descent is to train models with multiple conflicting losses. When you have +multiple losses, your options are: +- Gradient descent: Sum the losses into a single loss, compute the gradient of this loss with + respect to the model parameters, and update them using this vector. This is the standard approach + in the machine learning community. +- Jacobian descent: Compute the gradient of each loss (stacked in the so-called Jacobian matrix), + **aggregate** that matrix into a single update vector, and update the model parameters using this + vector. + +There are many different ways to aggregate the Jacobian matrix. For instance, we may be tempted to +just sum its rows. By linearity of differentiation, this is actually equivalent to summing the +losses and then computing the gradient, so doing that is equivalent to doing gradient descent. + +If you have two gradients with a negative inner product and quite different norms, their sum will +have a negative inner product with the smallest gradient. So, given a sufficiently small learning +rate, a step of gradient descent will **increase** that loss. There are, however, different ways of +aggregating the Jacobian leading to an update that has non-negative inner product with each +gradient. We call these aggregators **non-conflicting**. The one that we have developed ourselves +and that we recommend for most problems is :doc:`UPGrad `. + +## Which problems are multi-objective? + +Many optimization problems are multi-objective. In multitask learning, for example, the loss of each +task can be considered as a separate objective. More interestingly to us, many problems that are +traditionally considered as single-objective can actually be seen as multi-objective. Here are a few +examples: +- We can consider separately the loss of each element in the mini-batch, instead of averaging them. + We call this paradigm instance-wise risk minimization (IWRM). +- We can split a multi-class classification problem with M classes into M binary classification + problems, each one with its own loss. +- When dealing with sequences (text, time series, etc.), we can consider the loss obtained for each + sequence element separately rather than averaging them. + +## When to use Jacobian descent? 
+ +JD should be used to try new approaches to train neural networks, where GD generally struggles due +to gradient conflict. If you have an idea where JD could be interesting, you should start by +verifying that the pairwise inner products between your gradients are sometimes negative. Then, you +should use TorchJD to solve this conflict, and look at training and testing metrics to see if this +helps to solve your problem. + +## When not to use Jacobian descent? + +- If training efficiency is critical (e.g. you're training LLMs with billions of parameters), it's + likely that the memory overhead of JD will not make it worthwhile. +- If you have too many (e.g. more than 64) losses, JD will likely take too much memory to store the + full Jacobian, and the aggregation will be too long with most aggregators. In that case, you could + try to average some of your losses so that you end up with a reasonable number of losses. +- If the inner products between pairs of gradients are never negative, you're most likely good to go + with GD. + +## Getting started + +To start using TorchJD, :doc:`intall ` it and read the :doc:`basic usage example +`. From f7d7d00706fc5064e500fe644cc89da1c10dca05 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 17:44:21 +0100 Subject: [PATCH 2/6] Use rst --- docs/source/{jacobian_descent.md => jacobian_descent.rst} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename docs/source/{jacobian_descent.md => jacobian_descent.rst} (100%) diff --git a/docs/source/jacobian_descent.md b/docs/source/jacobian_descent.rst similarity index 100% rename from docs/source/jacobian_descent.md rename to docs/source/jacobian_descent.rst From f41ec63b347a4f680a391c78f4bd218891fd2ea2 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 17:52:05 +0100 Subject: [PATCH 3/6] fixes --- docs/source/index.rst | 2 +- docs/source/jacobian_descent.rst | 19 +++++++++++-------- 2 files changed, 12 insertions(+), 9 deletions(-) diff --git a/docs/source/index.rst b/docs/source/index.rst index 133a1d653..e2152ae57 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -60,8 +60,8 @@ TorchJD is open-source, under MIT License. The source code is available on :caption: Getting Started :hidden: - installation.md jacobian_descent.md + installation.md examples/index.rst .. toctree:: diff --git a/docs/source/jacobian_descent.rst b/docs/source/jacobian_descent.rst index 63a2cd7a9..ab539fe27 100644 --- a/docs/source/jacobian_descent.rst +++ b/docs/source/jacobian_descent.rst @@ -1,13 +1,15 @@ -# Jacobian descent +Jacobian Descent +================ This guide briefly explains what Jacobian descent is and on what kind of problems it can be used. For a more theoretical explanation, please read our article [Jacobian Descent for Multi-Objective Optimization](https://arxiv.org/pdf/2406.16232). -## Introduction +**Introduction** The goal of Jacobian descent is to train models with multiple conflicting losses. When you have multiple losses, your options are: + - Gradient descent: Sum the losses into a single loss, compute the gradient of this loss with respect to the model parameters, and update them using this vector. This is the standard approach in the machine learning community. @@ -26,20 +28,21 @@ aggregating the Jacobian leading to an update that has non-negative inner produc gradient. We call these aggregators **non-conflicting**. The one that we have developed ourselves and that we recommend for most problems is :doc:`UPGrad `. 
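To make the conflict described above concrete, here is a minimal numeric sketch in plain PyTorch.
The two gradients are made-up values chosen purely for illustration; they do not come from any real
model.

.. code-block:: python

    import torch

    # Two made-up gradients with a negative inner product and very different norms.
    g_small = torch.tensor([1.0, 0.0])    # norm 1
    g_large = torch.tensor([-2.0, 10.0])  # norm ~10.2

    print(torch.dot(g_small, g_large))    # -2.0 -> the two gradients conflict

    # Summing the rows of the Jacobian, i.e. doing plain gradient descent:
    g_sum = g_small + g_large             # [-1., 10.]
    print(torch.dot(g_sum, g_small))      # -1.0 -> a small enough GD step along -g_sum
                                          #         increases the loss behind g_small
    print(torch.dot(g_sum, g_large))      # 102.0

A non-conflicting aggregator such as UPGrad would instead return an update whose inner product with
both g_small and g_large is non-negative.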
-## Which problems are multi-objective? +**Which problems are multi-objective?** Many optimization problems are multi-objective. In multitask learning, for example, the loss of each task can be considered as a separate objective. More interestingly to us, many problems that are traditionally considered as single-objective can actually be seen as multi-objective. Here are a few examples: + - We can consider separately the loss of each element in the mini-batch, instead of averaging them. - We call this paradigm instance-wise risk minimization (IWRM). + We call this paradigm instance-wise risk minimization (:doc:`IWRM `). - We can split a multi-class classification problem with M classes into M binary classification problems, each one with its own loss. - When dealing with sequences (text, time series, etc.), we can consider the loss obtained for each sequence element separately rather than averaging them. -## When to use Jacobian descent? +**When to use Jacobian descent?** JD should be used to try new approaches to train neural networks, where GD generally struggles due to gradient conflict. If you have an idea where JD could be interesting, you should start by @@ -47,7 +50,7 @@ verifying that the pairwise inner products between your gradients are sometimes should use TorchJD to solve this conflict, and look at training and testing metrics to see if this helps to solve your problem. -## When not to use Jacobian descent? +**When not to use Jacobian descent?** - If training efficiency is critical (e.g. you're training LLMs with billions of parameters), it's likely that the memory overhead of JD will not make it worthwhile. @@ -57,7 +60,7 @@ helps to solve your problem. - If the inner products between pairs of gradients are never negative, you're most likely good to go with GD. -## Getting started +**Getting started** -To start using TorchJD, :doc:`intall ` it and read the :doc:`basic usage example +To start using TorchJD, :doc:`install ` it and read the :doc:`basic usage example `. From 683f424ae839b876007173ab8367c73d47cdd607 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 17:56:54 +0100 Subject: [PATCH 4/6] fixes --- docs/source/jacobian_descent.rst | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/docs/source/jacobian_descent.rst b/docs/source/jacobian_descent.rst index ab539fe27..6342602cb 100644 --- a/docs/source/jacobian_descent.rst +++ b/docs/source/jacobian_descent.rst @@ -30,10 +30,10 @@ and that we recommend for most problems is :doc:`UPGrad `). From 16bc35617b77e2bec1383803ff06df8078adb57d Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 17:58:53 +0100 Subject: [PATCH 5/6] Add link to monitoring --- docs/source/jacobian_descent.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/jacobian_descent.rst b/docs/source/jacobian_descent.rst index 6342602cb..4390d6781 100644 --- a/docs/source/jacobian_descent.rst +++ b/docs/source/jacobian_descent.rst @@ -46,7 +46,8 @@ few examples: JD should be used to try new approaches to train neural networks, where GD generally struggles due to gradient conflict. If you have an idea where JD could be interesting, you should start by -verifying that the pairwise inner products between your gradients are sometimes negative. Then, you +verifying that the pairwise inner products between your gradients are sometimes negative. To easily +do that, you can start by following the :doc:`Monitoring ` examples. 
Then, you should use TorchJD to solve this conflict, and look at training and testing metrics to see if this helps to solve your problem. From 91c6aa504496f74c12421005a62de079d4234aa7 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Val=C3=A9rian=20Rey?= Date: Sat, 20 Dec 2025 18:01:38 +0100 Subject: [PATCH 6/6] Add mention of norm imbalance --- docs/source/jacobian_descent.rst | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/source/jacobian_descent.rst b/docs/source/jacobian_descent.rst index 4390d6781..60242ee1a 100644 --- a/docs/source/jacobian_descent.rst +++ b/docs/source/jacobian_descent.rst @@ -47,7 +47,8 @@ few examples: JD should be used to try new approaches to train neural networks, where GD generally struggles due to gradient conflict. If you have an idea where JD could be interesting, you should start by verifying that the pairwise inner products between your gradients are sometimes negative. To easily -do that, you can start by following the :doc:`Monitoring ` examples. Then, you +do that, you can start by following the :doc:`Monitoring ` examples. In +general, the effect of JD will be even greater if the gradients also have norm imbalance. Then, you should use TorchJD to solve this conflict, and look at training and testing metrics to see if this helps to solve your problem.
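To make the "check for conflict first" advice above concrete, here is a minimal sketch of how the
pairwise inner products (and norms) of per-loss gradients can be inspected with plain PyTorch. The
tiny model, data, and losses are made up purely for illustration; the Monitoring examples referenced
in the last two patches show the recommended way to do this with TorchJD.

.. code-block:: python

    import torch

    # A tiny made-up model and two made-up losses, for illustration only.
    model = torch.nn.Linear(4, 2)
    x = torch.randn(8, 4)
    target = torch.randn(8, 2)
    output = model(x)
    losses = [
        torch.nn.functional.mse_loss(output[:, 0], target[:, 0]),
        torch.nn.functional.mse_loss(output[:, 1], target[:, 1]),
    ]

    params = [p for p in model.parameters() if p.requires_grad]

    # One flattened gradient per loss: these are the rows of the Jacobian.
    grads = []
    for loss in losses:
        grad = torch.autograd.grad(loss, params, retain_graph=True)
        grads.append(torch.cat([g.flatten() for g in grad]))

    g1, g2 = grads
    print(torch.dot(g1, g2))     # a negative value means these two losses conflict
    print(g1.norm(), g2.norm())  # very different values mean the norms are imbalanced

If such a check shows that the inner products are sometimes negative (and, per the last patch, that
the norms are imbalanced), trying TorchJD with a non-conflicting aggregator is worth the overhead
discussed above; if they are never negative, the guide's own advice is that plain GD is most likely
fine.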