The TensorFlow Probability implementation of softplus leaks memory and appears to no longer be needed. That is, I think the standard `tf.nn.softplus` implementation can now be used, as the numerical stability issues appear to have been solved.
Currently the implementation of softplus is as follows (from here):

```python
# TODO(b/155501444): Remove this when tf.nn.softplus is fixed.
if JAX_MODE:
  _stable_grad_softplus = tf.nn.softplus
else:

  @tf.custom_gradient
  def _stable_grad_softplus(x):
    """A (more) numerically stable softplus than `tf.nn.softplus`."""
    x = tf.convert_to_tensor(x)
    if x.dtype == tf.float64:
      cutoff = -20
    else:
      cutoff = -9
    y = tf.where(x < cutoff, tf.math.log1p(tf.exp(x)), tf.nn.softplus(x))

    def grad_fn(dy):
      return dy * tf.where(x < cutoff, tf.exp(x), tf.nn.sigmoid(x))

    return y, grad_fn
```
This leaks memory (in non-JAX mode) due to a couple of issues:

- The `grad_fn` closure captures the tensor represented by `x`. This closure then ends up in the gradient registry, which is never cleared, so the tensor represented by `x` hangs around forever.
- For a similar reason, TensorFlow's `custom_gradient` implementation also leaks memory. See 97697 for more details.

Here is a Colab notebook to demonstrate the memory leak.
However, I believe that the numerical stability issues with `tf.nn.softplus` have been solved. Specifically:

- The `tf.nn.softplus` implementation now uses `log1p` as of this commit on May 1, 2020.
- The gradient computation for `tf.nn.softplus` now uses `math_ops.sigmoid` as of this commit on April 4, 2019.
- The Eigen implementation of sigmoid (which I think is here) computes it as `e^x / (1.0 + e^x)`, so the approximation of the gradient by `e^x` in `_stable_grad_softplus` seems unnecessary to me. If `e^x` is very small then `1.0 + e^x` will be exactly `1.0`, so the result is exactly `e^x`. If `e^x > 1.0` then the result of `e^x / (1.0 + e^x)` will be (I think) more accurate than just approximating the gradient as `e^x`. But I am not a numerical stability expert, so I may be wrong.
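The float64 behaviour described above can be checked in pure Python with the `math` module (no TensorFlow needed), since Python floats are IEEE 754 doubles:

```python
import math


def softplus_naive(x):
    # log(1 + e^x): for very negative x, 1.0 + e^x rounds to exactly 1.0,
    # so the result underflows to 0.0.
    return math.log(1.0 + math.exp(x))


def softplus_log1p(x):
    # log1p(e^x) keeps the tiny e^x contribution.
    return math.log1p(math.exp(x))


x = -40.0  # well below the float64 cutoff of -20

print(softplus_naive(x))   # 0.0: precision lost in 1.0 + e^x
print(softplus_log1p(x))   # ~4.25e-18, matching e^-40

# Sigmoid computed as e^x / (1.0 + e^x): the denominator rounds to exactly
# 1.0, so the result is bit-for-bit equal to e^x -- no separate e^x
# approximation of the gradient is needed.
sigmoid = math.exp(x) / (1.0 + math.exp(x))
print(sigmoid == math.exp(x))  # True
```

This matches the argument in the last bullet: in the regime where `_stable_grad_softplus` switches to `e^x`, the sigmoid formula already produces exactly `e^x`.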