Vanishing Jacobian problem #340

@MatthewChen37

Dear TorchJD maintainers,

First, thank you for this incredible work; it was easy to integrate into my research on multi-task learning for pre-training foundation models. I'm not sure whether this is an issue you have encountered before, or even a well-known one, but I seem to have run into the Jacobian analogue of the vanishing gradient problem. After several training steps of pre-training my foundation model with TorchJD, the Jacobian values of the tensors connected to the shared features turn into NaNs during backpropagation:

 File ".local/lib/python3.10/site-packages/torchjd/autojac/mtl_backward.py", line 117, in mtl_backward
    backward_transform(EmptyTensorDict())
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
    return self.outer(intermediate)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
    return self.outer(intermediate)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 78, in __call__
    intermediate = self.inner(input)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 27, in __call__
    return self.transform(input)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 79, in __call__
    return self.outer(intermediate)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/base.py", line 78, in __call__
    intermediate = self.inner(input)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 49, in __call__
    return self._aggregate_group(ordered_matrices, self.aggregator)
  File ".local/lib/python3.10/site-packages/torchjd/autojac/_transform/aggregate.py", line 83, in _aggregate_group
    united_gradient_vector = aggregator(united_jacobian_matrix)
  File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 36, in __call__
    return super().__call__(matrix)
  File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File ".local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 89, in forward
    self._check_is_finite(matrix)
  File ".local/lib/python3.10/site-packages/torchjd/aggregation/bases.py", line 23, in _check_is_finite
    raise ValueError(
ValueError: Parameter `matrix` should be a tensor of finite elements (no nan, inf or -inf values). Found `matrix = tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.],
        [0., 0., 0.,  ..., 0., 0., 0.]], device='cuda:0')`. Matrix shape: torch.Size([6, 106799]).

Note: I modified `bases.py` to include the shape of the offending matrix in the error message when it is thrown.
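For reference, the only change was in `_check_is_finite`, adding the shape to the message, roughly like this (paraphrased from my local copy, so the surrounding code may differ slightly):

```python
# torchjd/aggregation/bases.py, inside _check_is_finite (paraphrased)
if not torch.isfinite(matrix).all():
    raise ValueError(
        "Parameter `matrix` should be a tensor of finite elements (no nan, inf or -inf "
        f"values). Found `matrix = {matrix}`. Matrix shape: {matrix.shape}."  # shape added by me
    )
```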

Have you encountered this issue before? I'm not sure whether it is a bug on my end or a known problem with Jacobian descent.

My pre-training setup uses multiple contrastive losses that operate on embedding outputs at different layers of the model, i.e., intermediate losses between layers. I pass the embeddings produced before the first loss as the "shared_features", which I suspect may be something I am implementing wrong on my end; a stripped-down sketch of the training step is below. Let me know if you have any other questions about my training setup, though I'm afraid I'm not allowed to share everything.
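In the sketch, all module and loss definitions are toy placeholders, and I'm using the `losses=`/`features=`/`aggregator=` keywords as in the README example I adapted from, so the exact call may differ slightly from my real code and from other torchjd versions:

```python
import torch
from torch.nn import Linear, ReLU, Sequential
from torch.optim import SGD

from torchjd import mtl_backward
from torchjd.aggregation import UPGrad

# Toy stand-ins for my backbone stages and contrastive losses.
block_1 = Sequential(Linear(32, 16), ReLU())
block_2 = Sequential(Linear(16, 16), ReLU())
block_3 = Sequential(Linear(16, 16), ReLU())
params = [*block_1.parameters(), *block_2.parameters(), *block_3.parameters()]

def contrastive_loss(emb):  # placeholder for my real contrastive losses
    return emb.pow(2).mean()

optimizer = SGD(params, lr=1e-3)
aggregator = UPGrad()  # the aggregator I use

batch = torch.randn(8, 32)

emb_1 = block_1(batch)            # embeddings produced before the first loss
loss_1 = contrastive_loss(emb_1)

emb_2 = block_2(emb_1)            # intermediate embeddings, each with its own loss
loss_2 = contrastive_loss(emb_2)

emb_3 = block_3(emb_2)
loss_3 = contrastive_loss(emb_3)

optimizer.zero_grad()
mtl_backward(
    losses=[loss_1, loss_2, loss_3],
    features=emb_1,  # the "shared_features": embeddings output before the first loss
    aggregator=aggregator,
)
optimizer.step()
```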

Sincerely,
Matthew Chen
