Hi,

Why is a scale applied for each tensor computation, for example in the BERT model?