Why can the weights of a zero-convolution layer obtain different gradients during training? #550

sieve-github-access · 2023-10-01T00:10:52Z


A common understanding in deep learning is that initializing all the weights with zeros leads the neurons to learn the same features during training [1]. This is because all the neurons get exactly the same gradient and therefore receive exactly the same update during training. This is problematic because the layer effectively has only one neuron (aka channel), regardless of how many neurons it actually has.
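To make the symmetry problem concrete, here is a minimal pure-Python sketch. It assumes a toy MLP with one tanh hidden layer and a squared-error loss; the helper `sgd_step`, the constant 0.3, the learning rate, and the data are all made up for illustration and are not taken from ControlNet. Every weight starts at the same constant, and after 50 SGD steps the three hidden neurons still share identical weights.

```python
import math

def sgd_step(x, W, v, target, lr=0.1):
    """One SGD step on L = 0.5 * (y - target)**2 for y = sum_j v[j] * tanh(W[j] . x)."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # hidden pre-activations
    h = [math.tanh(zj) for zj in z]                           # hidden activations
    y = sum(vj * hj for vj, hj in zip(v, h))                  # single output node
    err = y - target                                          # dL/dy
    new_W = [[w - lr * err * vj * (1.0 - hj ** 2) * xi for w, xi in zip(row, x)]
             for row, vj, hj in zip(W, v, h)]
    new_v = [vj - lr * err * hj for vj, hj in zip(v, h)]
    return new_W, new_v

x, target = [1.0, 2.0], 1.0

# Assumption for illustration: every weight in every layer starts at 0.3.
# (With all-zero weights in BOTH layers, every gradient in this toy is exactly
# zero, so nothing would move at all.)
W = [[0.3, 0.3] for _ in range(3)]
v = [0.3, 0.3, 0.3]

for _ in range(50):
    W, v = sgd_step(x, W, v, target)

# Each hidden neuron performed exactly the same arithmetic, so the rows match.
rows_identical = all(row == W[0] for row in W)
print(rows_identical)
```

The three rows of `W` stay equal because each hidden neuron sees the same input, the same incoming weights, and the same outgoing weight, so nothing ever distinguishes one neuron from another.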

It seems to me that the 1×1 zero-convolution layers, which are used throughout the paper, will suffer from this problem as well.

That said, the huge success of ControlNet seems to imply that it doesn't suffer from this problem. Could anyone help me understand why?

geroldmeisinger · 2023-10-01T08:10:09Z


there is a link on the front page with an explanation, see here https://github.com/lllyasviel/ControlNet/blob/main/docs/faq.md

2h video from a guy explaining the paper https://www.youtube.com/watch?v=Mp-HMQcB_M4

4 replies

sieve-github-access Oct 13, 2023
Author

The link doesn't explain why the weights of a zero-convolution layer won't all end up the same, which is what this question is about.
All the weights start at zero and receive the same gradient, so won't they all be the same after training?

xiechun298 Oct 4, 2024

I am here to ask the same question.
The FAQ page only explains why the gradients are not zero, but the real question is why the gradients are not the same for all the neurons in the zero-initialized layer.

yeswecan Nov 23, 2024

@sieve-github-access @xiechun-tsukuba did you guys reach any conclusion on the topic? Sorry to revive the thread, but I came here wondering the same thing.

xiechun298 Nov 23, 2024

> @sieve-github-access @xiechun-tsukuba did you guys reach any conclusion on the topic? Sorry to revive the thread, but I came here wondering the same thing.

Hi, I asked several 'knowledgeable' people this question, and they all told me to read the FAQ page. However, when I explained why it did not answer my question, they went silent. Interesting.

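Back to our question: I finally figured it out by writing out the gradient of one edge in the first (zero-initialized) layer. Assuming a simple MLP with one hidden layer and one output node, the gradient of the edge connecting input i and hidden node j is

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial y} \, v_j \, \sigma'(z_j) \, x_i$$

where L is the loss, σ is the hidden activation, and

$$z_j = \sum_i w_{ij} x_i, \qquad y = \sum_j v_j \, \sigma(z_j)$$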

You can see that the gradient depends on v_j and x_i. Since v_j is not zero-initialized (think of it as standing in for the pretrained U-Net), the weights w_ij in the first layer do not all receive an identical gradient.

When will there be a problem? Suppose every layer of the network is initialized with a single constant value. In that case, v_j will be identical for every j, and the edges connected to the same input i will all receive an identical gradient.
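The two cases above can be checked numerically. The following is a minimal sketch in plain Python, assuming a tanh hidden layer and the loss L = 0.5 * (y - target)^2; the function name `hidden_weight_grads` and all the numbers are hypothetical, chosen only to illustrate the argument.

```python
import math

def hidden_weight_grads(x, W, v, target):
    """Gradient of L = 0.5 * (y - target)**2 w.r.t. each hidden weight W[j][i]."""
    z = [sum(w * xi for w, xi in zip(row, x)) for row in W]   # pre-activations z_j
    h = [math.tanh(zj) for zj in z]                           # hidden activations
    y = sum(vj * hj for vj, hj in zip(v, h))                  # output node
    dL_dy = y - target
    # Chain rule: dL/dW[j][i] = dL/dy * v_j * tanh'(z_j) * x_i
    return [[dL_dy * vj * (1.0 - hj ** 2) * xi for xi in x]
            for vj, hj in zip(v, h)]

x = [1.0, 2.0]
W_zero = [[0.0, 0.0] for _ in range(3)]   # zero-initialized first layer

# Case 1: the output weights v_j differ (standing in for pretrained weights).
grads_distinct_v = hidden_weight_grads(x, W_zero, v=[0.5, -1.0, 2.0], target=1.0)

# Case 2: the output weights are all the same constant.
grads_same_v = hidden_weight_grads(x, W_zero, v=[0.5, 0.5, 0.5], target=1.0)

print(grads_distinct_v)   # one row per hidden node j
print(grads_same_v)
```

In the first case every hidden node gets its own gradient row even though its weights start at zero; in the second case the rows are identical, which is the symmetry problem from the original question.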

I hope this helps others who have the same question.

Vorlent · 2025-10-17T13:59:39Z


That explains how the first zero convolution is trained, but it is more complicated than that, because there is another zero convolution. It is a second zero-initialized layer, placed at the output, which means that at initialization it prevents the gradient from flowing back to the first zero convolution.

It works anyway, because the input and output zero convolutions are on different residual layers, and the forward pass is guaranteed to produce a non-zero output/loss because both consult the original model.

But even that isn't enough. Even though the gradient of the zero-conv weights doesn't depend on the weights themselves, it still depends on the output of the previous layer, and that output is non-zero only if the trainable copy is not zero-initialized or has a residual connection.
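Here is a toy scalar sketch of how two zero-initialized layers in series can still train. It is only an illustration under strong assumptions: the frozen model is reduced to `b * x`, the trainable copy to `a * u`, and each zero convolution to a single scalar (`w_in`, `w_out`). None of these names or values come from the ControlNet code.

```python
def grads(x, c, target, w_in, w_out, a=2.0, b=1.5):
    """Toy model: y = b*x + w_out * a * (x + w_in * c), with L = 0.5 * (y - target)**2."""
    copy_out = a * (x + w_in * c)    # trainable copy sees x plus the input zero conv of c
    y = b * x + w_out * copy_out     # frozen path plus the output zero conv
    dL_dy = y - target
    g_out = dL_dy * copy_out         # depends on the copy's (non-zero) output
    g_in = dL_dy * w_out * a * c     # blocked while w_out is still zero
    return g_in, g_out

x, c, target, lr = 1.0, 0.5, 2.0, 0.1
w_in, w_out = 0.0, 0.0               # both "zero convolutions" start at zero

# Step 0: only the output layer receives a non-zero gradient.
g_in0, g_out0 = grads(x, c, target, w_in, w_out)
w_in, w_out = w_in - lr * g_in0, w_out - lr * g_out0

# Step 1: w_out has moved away from zero, so gradient now reaches w_in as well.
g_in1, g_out1 = grads(x, c, target, w_in, w_out)

print(g_in0, g_out0)
print(g_in1, g_out1)
```

In this toy setting the input-side weight gets no gradient at the very first step, but the output-side weight does, because its input (the copy's output) is non-zero. One update later the block is gone and both weights train.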

To be fair, the question in the title of this discussion still doesn't make much sense. If you define the loss as the sum of squared errors (y_1 - label_1)^2 + (y_2 - label_2)^2 and take the derivative of the loss with respect to y_1 and y_2, you end up with two different gradients, 2*y_1 - 2*label_1 and 2*y_2 - 2*label_2. The two coincide only when y_1 - label_1 = y_2 - label_2, for example when both y_1 = y_2 and label_1 = label_2, but that is just a different way of saying that your neural network only has one output.

0 replies
