
Cannot reproduce benchmark score with gpt-4o #145

@chiheonk

Related: #143

I'm consistently getting a higher score from the gpt-4o evaluation than the score reported in the paper.

Observation

I re-evaluated the provided generated samples (Step1x-Edit-v1.1) on the English dataset with gpt-4o and got 7.497, whereas the reported score is 6.969.

In particular, the semantic consistency (SC) scores for the motion_change and ps_human categories were higher in my re-evaluation. The difference is in the range of 1.0~3.0, which is much larger than run-to-run variance could plausibly explain.

Inspecting the per-sample evaluation results, many samples from those categories in the reported scores have SC=0, suggesting that the SC scoring call to gpt-4o failed somehow.
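For anyone who wants to run the same check on their own results, here is a minimal sketch of how the SC=0 samples could be counted per category. It assumes the per-sample results have already been loaded into a list of dicts with `category` and `sc` keys. Those key names, the `zero_sc_rate` helper, and the example records are hypothetical and are not taken from the repository's actual output format.

```python
from collections import defaultdict


def zero_sc_rate(records):
    """Return {category: (n_zero, n_total, fraction_zero)} for samples with SC == 0.

    `records` is assumed to be an iterable of dicts with hypothetical
    keys "category" and "sc"; adapt these to the real result files.
    """
    counts = defaultdict(lambda: [0, 0])  # category -> [n_zero, n_total]
    for rec in records:
        cat = rec["category"]
        counts[cat][1] += 1
        if float(rec["sc"]) == 0.0:
            counts[cat][0] += 1
    return {cat: (z, n, z / n) for cat, (z, n) in counts.items()}


# Made-up records for illustration only; these are not real benchmark results.
example = [
    {"category": "motion_change", "sc": 0},
    {"category": "motion_change", "sc": 7},
    {"category": "ps_human", "sc": 0},
    {"category": "other_category", "sc": 8},
]

for cat, (n_zero, n_total, frac) in sorted(zero_sc_rate(example).items()):
    print(f"{cat}: {n_zero}/{n_total} samples with SC=0 ({frac:.0%})")
```

An unusually high fraction of SC=0 in only a few categories would be consistent with failed scoring calls rather than genuinely poor edits, although it does not prove it.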

Many samples from those two categories seem to involve humans, so I suspect the evaluation was affected by the content filtering policy somehow (but in a non-reproducible way).
