Related: #143
I'm consistently getting a higher score from the gpt-4o evaluation than the score reported in the paper.
Observation
I re-evaluated the provided generated samples (Step1x-Edit-v1.1) on the English dataset with gpt-4o and got 7.497, whereas the reported score is 6.969.
In particular, the semantic consistency (SC) scores for the motion_change and ps_human categories were higher in my re-evaluation. The difference is in the range of 1.0 to 3.0, far larger than run-to-run variance could explain.
Inspecting the per-sample evaluation results, many samples in those categories have SC=0 in the reported scores, which suggests the SC scoring call to gpt-4o failed somehow.
Many samples in those two categories involve humans, so I suspect the evaluation was affected by a content filtering policy (though in a non-reproducible way).
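For anyone who wants to check this on their own results, here is a minimal sketch of how the SC=0 samples can be counted per category. It assumes the per-sample results have already been loaded into a list of dicts with `category` and `sc` keys. Those key names, the `zero_sc_rate` helper, and the example records are hypothetical, not the repository's actual output format, so adapt them to the real files.

```python
from collections import defaultdict


def zero_sc_rate(records):
    """Return {category: (zero_count, total, fraction)} for samples with SC == 0.

    `records` is assumed to be an iterable of dicts with hypothetical keys
    "category" (str) and "sc" (number).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [zero_count, total]
    for rec in records:
        bucket = counts[rec["category"]]
        bucket[1] += 1
        if rec["sc"] == 0:
            bucket[0] += 1
    return {cat: (z, n, z / n) for cat, (z, n) in counts.items()}


# Made-up records, only to illustrate the assumed input shape.
example = [
    {"category": "motion_change", "sc": 0},
    {"category": "motion_change", "sc": 8},
    {"category": "ps_human", "sc": 0},
    {"category": "some_other_category", "sc": 7},
]

for cat, (zeros, total, frac) in sorted(zero_sc_rate(example).items()):
    print(f"{cat}: {zeros}/{total} samples with SC=0 ({frac:.0%})")
```

A category whose SC=0 fraction is much higher in the reported results than in a fresh re-evaluation would be consistent with failed scoring calls rather than genuinely poor edits.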