The evaluation results using GroundingDINO aren't accurate. #2

Description

@zhenyuZ-HUST

Comparing GroundingDINO's counts against human observation, I found that GroundingDINO is seriously inaccurate. In particular, even in videos whose content barely changes from frame to frame, GroundingDINO reports different object counts for different frames. One specific example is the video generated by Wan2.1-T2V-1.3B with the prompt: 'Two astronauts repairing a satellite as five small debris-pieces fly by and a distant star glows.' In the video, the background and the astronauts stay still; the only significant changes are the astronauts' hand movements and the flying debris. Yet GroundingDINO gave the following per-frame counts for this video (the expected count for each category is shown in parentheses in the column header):

Frame | astronauts(2) | debris(5) | star(1) | Score

  1   |            3✗ |        1✗ |      0✗ | 0.00
  2   |            3✗ |        2✗ |      0✗ | 0.00
  3   |            3✗ |        1✗ |      1✓ | 0.33
  4   |            3✗ |        1✗ |      2✗ | 0.00
  5   |            3✗ |        1✗ |      1✓ | 0.33
  6   |            3✗ |        1✗ |      2✗ | 0.00
  7   |            3✗ |        1✗ |      1✓ | 0.33
  8   |            3✗ |        1✗ |      1✓ | 0.33
  9   |            3✗ |        2✗ |      0✗ | 0.00
 10   |            3✗ |        2✗ |      0✗ | 0.00
 11   |            3✗ |        2✗ |      1✓ | 0.33
 12   |            3✗ |        2✗ |      1✓ | 0.33
 13   |            3✗ |        2✗ |      1✓ | 0.33
 14   |            3✗ |        1✗ |      1✓ | 0.33
 15   |            3✗ |        2✗ |      1✓ | 0.33
 16   |            3✗ |        1✗ |      1✓ | 0.33
 17   |            3✗ |        1✗ |      1✓ | 0.33
 18   |            3✗ |        3✗ |      1✓ | 0.33
 19   |            3✗ |        1✗ |      2✗ | 0.00
 20   |            3✗ |        3✗ |      1✓ | 0.33
 21   |            3✗ |        2✗ |      1✓ | 0.33
 22   |            3✗ |        3✗ |      0✗ | 0.00
 23   |            3✗ |        2✗ |      0✗ | 0.00
 24   |            3✗ |        2✗ |      0✗ | 0.00
 25   |            3✗ |        2✗ |      0✗ | 0.00
 26   |            3✗ |        1✗ |      0✗ | 0.00
 27   |            3✗ |        1✗ |      0✗ | 0.00
 28   |            3✗ |        1✗ |      0✗ | 0.00
 29   |            3✗ |        1✗ |      0✗ | 0.00
 30   |            3✗ |        1✗ |      0✗ | 0.00
 31   |            3✗ |        1✗ |      0✗ | 0.00
 32   |            3✗ |        1✗ |      0✗ | 0.00
 33   |            3✗ |        1✗ |      0✗ | 0.00
 34   |            3✗ |        1✗ |      1✓ | 0.33
 35   |            3✗ |        1✗ |      1✓ | 0.33
 36   |            3✗ |        1✗ |      1✓ | 0.33
 37   |            3✗ |        1✗ |      1✓ | 0.33
 38   |            3✗ |        1✗ |      1✓ | 0.33
 39   |            3✗ |        1✗ |      1✓ | 0.33
 40   |            3✗ |        0✗ |      2✗ | 0.00
 41   |            3✗ |        1✗ |      1✓ | 0.33
 42   |            3✗ |        0✗ |      0✗ | 0.00
 43   |            3✗ |        0✗ |      1✓ | 0.33
 44   |            3✗ |        0✗ |      2✗ | 0.00
 45   |            3✗ |        0✗ |      1✓ | 0.33
 46   |            3✗ |        0✗ |      1✓ | 0.33
 47   |            3✗ |        0✗ |      1✓ | 0.33
 48   |            3✗ |        0✗ |      1✓ | 0.33
 49   |            3✗ |        1✗ |      1✓ | 0.33
 50   |            3✗ |        1✗ |      1✓ | 0.33
 51   |            3✗ |        1✗ |      0✗ | 0.00
 52   |            3✗ |        1✗ |      0✗ | 0.00
 53   |            3✗ |        1✗ |      0✗ | 0.00
 54   |            3✗ |        1✗ |      0✗ | 0.00
 55   |            3✗ |        1✗ |      0✗ | 0.00
 56   |            3✗ |        1✗ |      0✗ | 0.00
 57   |            3✗ |        2✗ |      0✗ | 0.00
 58   |            3✗ |        0✗ |      0✗ | 0.00
 59   |            3✗ |        0✗ |      0✗ | 0.00
 60   |            3✗ |        0✗ |      0✗ | 0.00
 61   |            3✗ |        1✗ |      0✗ | 0.00
 62   |            3✗ |        0✗ |      1✓ | 0.33
 63   |            3✗ |        0✗ |      1✓ | 0.33
 64   |            3✗ |        1✗ |      1✓ | 0.33
 65   |            3✗ |        1✗ |      0✗ | 0.00
 66   |            3✗ |        1✗ |      1✓ | 0.33
 67   |            3✗ |        1✗ |      0✗ | 0.00
 68   |            3✗ |        0✗ |      1✓ | 0.33
 69   |            3✗ |        0✗ |      0✗ | 0.00
 70   |            3✗ |        1✗ |      1✓ | 0.33
 71   |            3✗ |        1✗ |      0✗ | 0.00
 72   |            3✗ |        1✗ |      1✓ | 0.33
 73   |            3✗ |        0✗ |      0✗ | 0.00
 74   |            3✗ |        0✗ |      1✓ | 0.33
 75   |            3✗ |        0✗ |      0✗ | 0.00
 76   |            3✗ |        1✗ |      0✗ | 0.00
 77   |            3✗ |        1✗ |      0✗ | 0.00
 78   |            3✗ |        0✗ |      1✓ | 0.33
 79   |            3✗ |        0✗ |      1✓ | 0.33
 80   |            3✗ |        0✗ |      1✓ | 0.33
 81   |            3✗ |        0✗ |      1✓ | 0.33

Accuracy: 0.1646 (3 categories)

In this result, GroundingDINO gets even the counts of the easiest objects, the astronauts and the star, clearly wrong, let alone smaller objects like the debris. This calls into question whether the evaluation proposed by CountBench is reliable and fair.
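
For reference, here is a minimal sketch of how the Score and Accuracy values in the table above appear to be computed. It is my assumption inferred from the numbers, not the benchmark's actual code: a frame's score is the fraction of categories whose detected count exactly equals the expected count, and the accuracy is the mean of the frame scores. The function names (`frame_score`, `video_accuracy`, `observed_counts`) are hypothetical, and the sample data is just frames 1 to 8 copied from the table.

```python
from typing import Dict, List

Counts = Dict[str, int]


def frame_score(detected: Counts, expected: Counts) -> float:
    """Fraction of categories whose detected count equals the expected count (assumed rule)."""
    hits = sum(1 for name, want in expected.items() if detected.get(name, 0) == want)
    return hits / len(expected)


def video_accuracy(frames: List[Counts], expected: Counts) -> float:
    """Mean of the per-frame scores over all frames (assumed rule)."""
    return sum(frame_score(frame, expected) for frame in frames) / len(frames)


def observed_counts(frames: List[Counts], expected: Counts) -> Dict[str, List[int]]:
    """Distinct counts reported per category; more than one value means the detector is unstable."""
    return {name: sorted({frame.get(name, 0) for frame in frames}) for name in expected}


# Expected counts taken from the prompt; detections are frames 1-8 of the table above.
EXPECTED = {"astronauts": 2, "debris": 5, "star": 1}
FRAMES_1_TO_8 = [
    {"astronauts": 3, "debris": 1, "star": 0},
    {"astronauts": 3, "debris": 2, "star": 0},
    {"astronauts": 3, "debris": 1, "star": 1},
    {"astronauts": 3, "debris": 1, "star": 2},
    {"astronauts": 3, "debris": 1, "star": 1},
    {"astronauts": 3, "debris": 1, "star": 2},
    {"astronauts": 3, "debris": 1, "star": 1},
    {"astronauts": 3, "debris": 1, "star": 1},
]

if __name__ == "__main__":
    print("accuracy over frames 1-8:", round(video_accuracy(FRAMES_1_TO_8, EXPECTED), 4))
    print("distinct counts per category:", observed_counts(FRAMES_1_TO_8, EXPECTED))
```

If this is the rule, the full table is consistent with it: the star count matches in 40 of the 81 frames and the other two categories never match, so 40 / (81 × 3) ≈ 0.1646, which is the reported accuracy. The `observed_counts` helper makes the instability visible: the star count alone varies between 0, 1, and 2 within the first eight frames of an almost static scene.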
