Dear TorchJD maintainers,
First, thank you for this incredible work; I was able to integrate it easily into my research on multi-task learning for pre-training foundation models. I'm not sure whether this is an issue you have encountered before, or even a well-known one, but I seem to have run into the Jacobian analogue of the vanishing gradient problem. After several training steps of pre-training my foundation model with TorchJD, the Jacobian values of the tensors connected to the shared features turn into NaNs during backpropagation:
File ".local/lib/python3.10/site-packages/torchjd/autojac/mtl_backward.py", line 117, in mtl_backward
backward_transform(EmptyTensorDict())
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
return self.outer(intermediate)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
return self.outer(intermediate)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 78, in __call__
intermediate = self.inner(input)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 27, in __call__
return self.transform(input)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
return self.outer(intermediate)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 78, in __call__
intermediate = self.inner(input)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 49, in __call__
return self._aggregate_group(ordered_matrices, self.aggregator)
File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 83, in _aggregate_group
united_gradient_vector = aggregator(united_jacobian_matrix)
File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 36, in __call__
return super().__call__(matrix)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
return forward_call(*args, **kwargs)
File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 89, in forward
self._check_is_finite(matrix)
File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 23, in _check_is_finite
raise ValueError(
ValueError: Parameter `matrix` should be a tensor of finite elements (no nan, inf or -inf values). Found `matrix = tensor([[nan, nan, nan, ..., nan, nan, nan],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0')`. Matrix shape: torch.Size([6, 106799]).
Note: I modified bases.py locally so that the error also reports the shape of the offending matrix when it is raised.
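For reference, the local change was roughly the following (a paraphrase, not the exact library code; the upstream check may be structured differently):

```python
# Rough paraphrase of my local edit to torchjd/aggregation/bases.py: the
# finiteness check now also reports the shape of the offending matrix.
def _check_is_finite(self, matrix):
    if not matrix.isfinite().all():
        raise ValueError(
            "Parameter `matrix` should be a tensor of finite elements (no nan, inf "
            f"or -inf values). Found `matrix = {matrix}`. Matrix shape: {matrix.shape}."
        )
```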
Have you encountered this issue before? I am not sure whether it is a bug on my end or a known problem in Jacobian descent.
During pre-training I currently use multiple contrastive losses that operate on embedding outputs at different layers of the model, i.e., intermediate losses between layers. These losses use, as the "shared features", the embeddings produced before the first loss, which is where I suspect my implementation might be wrong; a simplified sketch of the setup is below. Let me know if you have any other questions about my training setup that would help narrow down the issue, but I'm afraid I'm not allowed to share everything.
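To make the setup concrete, here is a simplified sketch of what I am doing. The module names, dimensions, and the contrastive loss are placeholders rather than my actual pre-training code, and I am passing the `losses`/`features`/`aggregator` keywords the way the documented `mtl_backward` example does, so it may not reproduce the failure exactly:

```python
import torch
import torch.nn.functional as F
from torch import nn

from torchjd import mtl_backward
from torchjd.aggregation import UPGrad

# Placeholder modules: `encoder` produces the shared embeddings, and the heads
# produce the deeper embeddings that a later contrastive loss acts on.
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())
head_a = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
head_b = nn.Sequential(nn.Linear(64, 64), nn.ReLU())

def contrastive_loss(z1, z2, temperature=0.1):
    # Stand-in for my actual contrastive objective (InfoNCE-style).
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.T / temperature
    targets = torch.arange(z1.shape[0])
    return F.cross_entropy(logits, targets)

params = [*encoder.parameters(), *head_a.parameters(), *head_b.parameters()]
optimizer = torch.optim.AdamW(params, lr=1e-4)
aggregator = UPGrad()

for _ in range(10):
    x = torch.randn(32, 128)
    x_aug = x + 0.01 * torch.randn_like(x)  # second "view" of the same batch

    # Embeddings produced before the first loss -- these are what I pass as
    # the shared features.
    shared = encoder(torch.cat([x, x_aug]))
    z1, z2 = shared.chunk(2)

    loss_early = contrastive_loss(z1, z2)                 # intermediate loss on the shared embeddings
    loss_late = contrastive_loss(head_a(z1), head_b(z2))  # loss on deeper embeddings

    optimizer.zero_grad()
    mtl_backward(losses=[loss_early, loss_late], features=shared, aggregator=aggregator)
    optimizer.step()
```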
Sincerely,
Matthew Chen