Cannot reproduce benchmark score with gpt-4o #145

Related: #143 

I'm consistently getting a higher score from the gpt-4o evaluation than the score reported in the paper.

### Observation
I re-evaluated the provided generated samples (Step1x-Edit-v1.1) on the English dataset with gpt-4o and got 7.497, whereas the reported score is 6.969.

In particular, the semantic consistency (SC) scores for the `motion_change` and `ps_human` categories were higher in my re-evaluation. The difference is in the range of 1.0~3.0, which is far larger than plausible run-to-run variance.

Inspecting the per-sample evaluation results, many samples in those categories have SC=0 in the reported scores, suggesting that the SC scoring call to gpt-4o failed somehow.
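For anyone who wants to repeat this check, here is a minimal sketch of how the SC=0 rate per category could be counted. The record layout (a list of dicts with `category` and `sc` keys) and the helper name `zero_sc_rate` are my own assumptions, not the repo's actual result format, and the sample records are made up purely for illustration.

```python
from collections import defaultdict


def zero_sc_rate(records):
    """Return {category: (zero_count, total, zero_fraction)}.

    `records` is assumed to be an iterable of dicts with a "category"
    string and a numeric "sc" score (hypothetical layout).
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [zero_count, total]
    for rec in records:
        entry = counts[rec["category"]]
        entry[1] += 1
        if rec["sc"] == 0:
            entry[0] += 1
    return {cat: (z, n, z / n) for cat, (z, n) in counts.items()}


# Made-up records for illustration only; not actual benchmark data.
sample = [
    {"category": "motion_change", "sc": 0},
    {"category": "motion_change", "sc": 8},
    {"category": "ps_human", "sc": 0},
    {"category": "ps_human", "sc": 0},
    {"category": "color_alter", "sc": 7},
]

rates = zero_sc_rate(sample)
for cat, (zeros, total, frac) in sorted(rates.items()):
    print(f"{cat}: {zeros}/{total} samples with SC=0 ({frac:.0%})")
```

A category whose SC=0 fraction is much higher in the reported results than in a fresh run would be consistent with failed scoring calls rather than genuinely poor edits.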

Many samples in those two categories appear to involve humans, so I suspect the evaluation was affected by a content filtering policy (though in a non-reproducible way).
