Comparing human counts with the counts from GroundingDINO, I noticed that GroundingDINO has serious problems. Even in videos whose content barely changes from frame to frame, it reports different numbers of objects in different frames. One specific example is the video generated by Wan2.1-T2V-1.3B with the prompt: 'Two astronauts repairing a satellite as five small debris-pieces fly by and a distant star glows.' In this video, the background and the astronauts stay essentially still; the only real changes are the astronauts’ hand movements and the flying debris. Yet GroundingDINO gave the following counts for the individual frames (target count from the prompt in parentheses; ✓ means the count matches the target, ✗ means it does not):
Frame | astronauts(2) | debris(5) | star(1) | Score
1 | 3✗ | 1✗ | 0✗ | 0.00
2 | 3✗ | 2✗ | 0✗ | 0.00
3 | 3✗ | 1✗ | 1✓ | 0.33
4 | 3✗ | 1✗ | 2✗ | 0.00
5 | 3✗ | 1✗ | 1✓ | 0.33
6 | 3✗ | 1✗ | 2✗ | 0.00
7 | 3✗ | 1✗ | 1✓ | 0.33
8 | 3✗ | 1✗ | 1✓ | 0.33
9 | 3✗ | 2✗ | 0✗ | 0.00
10 | 3✗ | 2✗ | 0✗ | 0.00
11 | 3✗ | 2✗ | 1✓ | 0.33
12 | 3✗ | 2✗ | 1✓ | 0.33
13 | 3✗ | 2✗ | 1✓ | 0.33
14 | 3✗ | 1✗ | 1✓ | 0.33
15 | 3✗ | 2✗ | 1✓ | 0.33
16 | 3✗ | 1✗ | 1✓ | 0.33
17 | 3✗ | 1✗ | 1✓ | 0.33
18 | 3✗ | 3✗ | 1✓ | 0.33
19 | 3✗ | 1✗ | 2✗ | 0.00
20 | 3✗ | 3✗ | 1✓ | 0.33
21 | 3✗ | 2✗ | 1✓ | 0.33
22 | 3✗ | 3✗ | 0✗ | 0.00
23 | 3✗ | 2✗ | 0✗ | 0.00
24 | 3✗ | 2✗ | 0✗ | 0.00
25 | 3✗ | 2✗ | 0✗ | 0.00
26 | 3✗ | 1✗ | 0✗ | 0.00
27 | 3✗ | 1✗ | 0✗ | 0.00
28 | 3✗ | 1✗ | 0✗ | 0.00
29 | 3✗ | 1✗ | 0✗ | 0.00
30 | 3✗ | 1✗ | 0✗ | 0.00
31 | 3✗ | 1✗ | 0✗ | 0.00
32 | 3✗ | 1✗ | 0✗ | 0.00
33 | 3✗ | 1✗ | 0✗ | 0.00
34 | 3✗ | 1✗ | 1✓ | 0.33
35 | 3✗ | 1✗ | 1✓ | 0.33
36 | 3✗ | 1✗ | 1✓ | 0.33
37 | 3✗ | 1✗ | 1✓ | 0.33
38 | 3✗ | 1✗ | 1✓ | 0.33
39 | 3✗ | 1✗ | 1✓ | 0.33
40 | 3✗ | 0✗ | 2✗ | 0.00
41 | 3✗ | 1✗ | 1✓ | 0.33
42 | 3✗ | 0✗ | 0✗ | 0.00
43 | 3✗ | 0✗ | 1✓ | 0.33
44 | 3✗ | 0✗ | 2✗ | 0.00
45 | 3✗ | 0✗ | 1✓ | 0.33
46 | 3✗ | 0✗ | 1✓ | 0.33
47 | 3✗ | 0✗ | 1✓ | 0.33
48 | 3✗ | 0✗ | 1✓ | 0.33
49 | 3✗ | 1✗ | 1✓ | 0.33
50 | 3✗ | 1✗ | 1✓ | 0.33
51 | 3✗ | 1✗ | 0✗ | 0.00
52 | 3✗ | 1✗ | 0✗ | 0.00
53 | 3✗ | 1✗ | 0✗ | 0.00
54 | 3✗ | 1✗ | 0✗ | 0.00
55 | 3✗ | 1✗ | 0✗ | 0.00
56 | 3✗ | 1✗ | 0✗ | 0.00
57 | 3✗ | 2✗ | 0✗ | 0.00
58 | 3✗ | 0✗ | 0✗ | 0.00
59 | 3✗ | 0✗ | 0✗ | 0.00
60 | 3✗ | 0✗ | 0✗ | 0.00
61 | 3✗ | 1✗ | 0✗ | 0.00
62 | 3✗ | 0✗ | 1✓ | 0.33
63 | 3✗ | 0✗ | 1✓ | 0.33
64 | 3✗ | 1✗ | 1✓ | 0.33
65 | 3✗ | 1✗ | 0✗ | 0.00
66 | 3✗ | 1✗ | 1✓ | 0.33
67 | 3✗ | 1✗ | 0✗ | 0.00
68 | 3✗ | 0✗ | 1✓ | 0.33
69 | 3✗ | 0✗ | 0✗ | 0.00
70 | 3✗ | 1✗ | 1✓ | 0.33
71 | 3✗ | 1✗ | 0✗ | 0.00
72 | 3✗ | 1✗ | 1✓ | 0.33
73 | 3✗ | 0✗ | 0✗ | 0.00
74 | 3✗ | 0✗ | 1✓ | 0.33
75 | 3✗ | 0✗ | 0✗ | 0.00
76 | 3✗ | 1✗ | 0✗ | 0.00
77 | 3✗ | 1✗ | 0✗ | 0.00
78 | 3✗ | 0✗ | 1✓ | 0.33
79 | 3✗ | 0✗ | 1✓ | 0.33
80 | 3✗ | 0✗ | 1✓ | 0.33
81 | 3✗ | 0✗ | 1✓ | 0.33
Accuracy: 0.1646 (3 categories)
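The Score and Accuracy values above are consistent with a simple exact-match rule, although the scoring code itself is not shown here. Below is a minimal sketch, assuming each frame's score is the fraction of categories whose detected count exactly equals the target, and the accuracy is the mean of those scores over all frames. The names `frame_score` and `video_accuracy` are hypothetical, and only the first five rows of the table are used as input.

```python
from typing import Dict, List

# Target counts taken from the prompt (the numbers in the table header).
TARGET: Dict[str, int] = {"astronauts": 2, "debris": 5, "star": 1}


def frame_score(pred: Dict[str, int], target: Dict[str, int]) -> float:
    """Fraction of categories whose detected count equals the target (assumed rule)."""
    hits = sum(1 for name, want in target.items() if pred.get(name) == want)
    return hits / len(target)


def video_accuracy(frames: List[Dict[str, int]], target: Dict[str, int]) -> float:
    """Mean per-frame score over the whole video (assumed rule)."""
    return sum(frame_score(f, target) for f in frames) / len(frames)


# Frames 1-5 of the table above.
first_five = [
    {"astronauts": 3, "debris": 1, "star": 0},
    {"astronauts": 3, "debris": 2, "star": 0},
    {"astronauts": 3, "debris": 1, "star": 1},
    {"astronauts": 3, "debris": 1, "star": 2},
    {"astronauts": 3, "debris": 1, "star": 1},
]

scores = [round(frame_score(f, TARGET), 2) for f in first_five]
print(scores)                                       # per-frame scores
print(round(video_accuracy(first_five, TARGET), 4))  # mean over these five frames
```

Under this assumed rule, the only hits in the table are the 40 frames with a correct star count, which gives 40 / (3 × 81) ≈ 0.1646 and matches the reported accuracy.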
In this result, GroundingDINO gets even the simplest objects wrong: it reports 3 astronauts in every one of the 81 frames, and its star count jumps between 0, 1, and 2 although the scene is nearly static. Smaller objects like debris fare even worse, with counts between 0 and 3 against a target of 5. So whether the evaluation proposed by CountBench is reliable and fair is seriously in question.
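The frame-to-frame instability described above can also be quantified directly. Here is a rough sketch, using the first ten rows of the table and a hypothetical helper `count_changes`, that counts how often each category's detected count changes between consecutive frames. This is only one possible stability measure, not part of the CountBench evaluation.

```python
from typing import Dict, List


def count_changes(series: List[int]) -> int:
    """Number of consecutive frame pairs whose detected counts differ."""
    return sum(1 for prev, curr in zip(series, series[1:]) if prev != curr)


# Detected counts for frames 1-10, copied from the table above.
FIRST_TEN: Dict[str, List[int]] = {
    "astronauts": [3, 3, 3, 3, 3, 3, 3, 3, 3, 3],
    "debris":     [1, 2, 1, 1, 1, 1, 1, 1, 2, 2],
    "star":       [0, 0, 1, 2, 1, 2, 1, 1, 0, 0],
}

# How many times each category's count flips between neighbouring frames.
flips = {name: count_changes(series) for name, series in FIRST_TEN.items()}
# Which distinct counts were reported for each category.
distinct = {name: sorted(set(series)) for name, series in FIRST_TEN.items()}

print(flips)
print(distinct)
```

In these ten frames the astronaut count never changes (it is stable, just not equal to the target), the debris count changes 3 times, and the star count changes 6 times, which is a lot for a scene that a human viewer sees as nearly static.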