Hi, thank you very much for your great work!
I have a question regarding the grounding results. The paper states that "whenever the model references a ROI region in the image, it explicitly appends the corresponding bounding box coordinates [x1, y1, x2, y2] after the region text. This Chain-of-Box approach ensures the visual information is seamlessly integrated into the reasoning context, enabling VLMs to perform multimodal reasoning effectively."
However, I couldn't find any grounding results (e.g., bounding boxes or coordinate information) in the file eval/logs/rec22_results_cxr_test_qwen2_5vl_7b_instruct_r1_450.json.
Could you please check whether this is the correct file, or if the grounding results are stored elsewhere?
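For reference, here is a minimal sketch of the kind of check that could be used to look for Chain-of-Box coordinates in a results file. It assumes the boxes appear inside string fields as literal `[x1, y1, x2, y2]` text, as the quoted passage describes. The helper names `iter_strings` and `find_boxes` are hypothetical, and I am not assuming anything about the file's actual schema.

```python
import json
import re

# Assumption: a box is written as four numbers in square brackets,
# e.g. "[12, 34, 56, 78]" or "[0.1, 0.2, 0.8, 0.9]".
NUMBER = r"(-?\d+(?:\.\d+)?)"
BOX_PATTERN = re.compile(
    r"\[\s*" + r"\s*,\s*".join([NUMBER] * 4) + r"\s*\]"
)


def iter_strings(node):
    """Yield every string value nested anywhere inside parsed JSON."""
    if isinstance(node, str):
        yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from iter_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_strings(item)


def find_boxes(data):
    """Return (x1, y1, x2, y2) float tuples found in any string value."""
    boxes = []
    for text in iter_strings(data):
        for match in BOX_PATTERN.finditer(text):
            boxes.append(tuple(float(g) for g in match.groups()))
    return boxes


# Hypothetical usage on a results file:
#     with open("path/to/results.json") as f:
#         print(find_boxes(json.load(f)))
```

Note that this only scans string values, so boxes stored as real JSON arrays of numbers would not be matched by it.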
Thank you for your time and help!