Why can the weights of a zero-convolution layer obtain different gradients during training? #550
There is a link on the front page with an explanation; see https://github.com/lllyasviel/ControlNet/blob/main/docs/faq.md. There is also a 2h video of someone explaining the paper: https://www.youtube.com/watch?v=Mp-HMQcB_M4
That answers how the first zero convolution is trained, but the full picture is more complicated, because there is a second zero convolution. It is also zero-initialized and sits at the output, which means it blocks the gradient from flowing back to the first zero convolution. It works anyway, because the input and output zero convolutions are on different residual layers and both consult the original model, so the forward pass is guaranteed to produce a non-zero output and loss.

But even that isn't enough. Although the gradient of the zero-conv weights doesn't depend on the weights themselves, it does depend on the output of the previous layer, and that output is non-zero only if the trainable copy is not zero-initialized or has a residual connection.

To be fair, the question in the title of this discussion still doesn't make much sense. If you define the loss as the sum of squared errors (y_1 - label_1)^2 + (y_2 - label_2)^2 and take the derivative of the loss with respect to y_1 and y_2, you end up with two different gradients, 2*y_1 - 2*label_1 and 2*y_2 - 2*label_2. These gradients differ unless y_1 - label_1 = y_2 - label_2, for example when both y_1 = y_2 and label_1 = label_2, but that is just a different way of saying that your neural network only has one input and output.
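To make the argument above concrete, here is a minimal sketch in plain Python. It is not ControlNet code: it assumes a single pixel, so the 1x1 convolution reduces to y = W x + b, and the function name `zero_conv_grad` and the example values are made up for illustration.

```python
def zero_conv_grad(x, target):
    """One pixel of a zero-initialized 1x1 conv with a squared-error loss.

    Hypothetical helper for illustration; x is the previous layer's output.
    """
    n_out, n_in = len(target), len(x)
    W = [[0.0] * n_in for _ in range(n_out)]  # zero-initialized weights
    b = [0.0] * n_out                         # zero-initialized bias

    # Forward pass: y_o = sum_i W[o][i] * x[i] + b[o]
    y = [sum(W[o][i] * x[i] for i in range(n_in)) + b[o] for o in range(n_out)]

    # loss = sum_o (y_o - target_o)^2, so dL/dy_o = 2 * (y_o - target_o)
    dy = [2.0 * (y[o] - target[o]) for o in range(n_out)]

    # dL/dW[o][i] = dL/dy_o * x[i]: depends on the input, not on W
    dW = [[dy[o] * x[i] for i in range(n_in)] for o in range(n_out)]

    # dL/dx[i] = sum_o W[o][i] * dL/dy_o: zero while W is still zero
    dx = [sum(W[o][i] * dy[o] for o in range(n_out)) for i in range(n_in)]
    return y, dW, dx


x = [1.0, -2.0, 0.5]   # non-zero output of the previous layer (assumed)
target = [0.3, -1.0]   # two different labels (assumed)
y, dW, dx = zero_conv_grad(x, target)
print("output:", y)
print("weight gradient rows:", dW)
print("gradient passed to the input:", dx)
```

In this sketch each output channel gets its own weight-gradient row, because each row is scaled by its own error term. The gradient passed back to the input stays zero until W has been updated at least once, which is the blocking effect described above.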
A common understanding in deep learning is that initializing all the weights with zeros leads the neurons to learn the same features during training [1]. This is because all the weights get exactly the same gradient value and therefore receive exactly the same update during training. This is problematic because the layer effectively has only one neuron (aka channel), regardless of how many neurons it actually has.
It seems to me that the 1×1 zero-convolution layers, which are used throughout the paper, will suffer from this problem as well.
That said, the huge success of ControlNet seems to imply that it doesn't suffer from this problem. Could anyone help me understand why?
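For reference, the symmetry problem described above can be reproduced with a small sketch. It assumes a toy two-layer tanh network in which every weight is set to the same constant; the function name `hidden_weight_grads` and all values are hypothetical, chosen only to illustrate the concern.

```python
import math


def hidden_weight_grads(x, target, init, n_hidden=3):
    """First-layer weight gradients of a tiny 2-layer tanh network in which
    every weight starts at the same constant `init` (illustrative only)."""
    W1 = [[init] * len(x) for _ in range(n_hidden)]  # hidden-layer weights
    w2 = [init] * n_hidden                           # output-layer weights

    # Forward pass: h_j = tanh(W1[j] . x), y = w2 . h
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(w * hj for w, hj in zip(w2, h))

    # loss = (y - target)^2, so dL/dy = 2 * (y - target)
    dy = 2.0 * (y - target)

    # dL/dW1[j][i] = dL/dy * w2[j] * (1 - h_j^2) * x[i]
    return [[dy * w2[j] * (1.0 - h[j] ** 2) * xi for xi in x]
            for j in range(n_hidden)]


x = [1.0, -2.0, 0.5]
const_grads = hidden_weight_grads(x, target=1.0, init=0.1)
zero_grads = hidden_weight_grads(x, target=1.0, init=0.0)
print("constant init:", const_grads)
print("zero init:", zero_grads)
```

With a constant initialization every hidden unit receives the same gradient row, and with an all-zero initialization every row is zero because the output weights are zero too. Both cases rely on every layer in the stack sharing the same initialization, which is the assumption behind the concern.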