How to use the grounding ability for multiple objects and categories? #953

ddsk1 · 2025-03-13T08:36:15Z

I have tried to make InternVL2.5-8B output the bounding boxes of the elements in an image, but it doesn't work. My prompt is:

<image>\nPlease detect and provide all the bounding box of <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image.

The answer looks like this:

Please detect and label all <ref>car</ref>,<ref>truck</ref>,<ref>van</ref>,<ref>bus</ref>,
<ref>pedestrian</ref>,<ref>cyclist</ref>,<ref>tricyclist</ref>,<ref>motorcyclist</ref> in the following image and mark their positions.
Assistant: car[[75, 658, 250, 871], [286, 558, 444, 711]]
truck[[440, 208, 502, 255]]
van[[446, 240, 527, 311]]
bus[[500, 244, 616, 363]]
pedestrian[[0, 1000, 999, 998]]
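
To make answers like the one above easier to inspect, here is a minimal parsing sketch. It assumes the answer follows the `label[[x1, y1, x2, y2], ...]` pattern shown above and that the coordinates are normalized to a 0-1000 range (an assumption suggested by the values in the output, not something this post verifies). `parse_boxes` and its parameters are hypothetical names, not part of InternVL.

```python
import re

# One "label[[x1, y1, x2, y2], ...]" group, as it appears in the answer text.
GROUP_RE = re.compile(r"(\w+)\s*(\[\[.*?\]\])")
# A single "[x1, y1, x2, y2]" box inside a group.
BOX_RE = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def parse_boxes(answer, width, height, scale=1000):
    """Parse grounding output into labeled boxes (hypothetical helper).

    Assumes coordinates are normalized to [0, scale]; rescales them to
    pixel units and flags degenerate boxes where x1 >= x2 or y1 >= y2.
    """
    results = []
    for label, group in GROUP_RE.findall(answer):
        for match in BOX_RE.finditer(group):
            x1, y1, x2, y2 = (int(v) for v in match.groups())
            results.append({
                "label": label,
                "box": (x1, y1, x2, y2),
                "pixel_box": (
                    round(x1 * width / scale),
                    round(y1 * height / scale),
                    round(x2 * width / scale),
                    round(y2 * height / scale),
                ),
                "valid": x1 < x2 and y1 < y2,
            })
    return results
```

Run on the answer above, this marks the `pedestrian` box `[0, 1000, 999, 998]` as invalid, because its top edge (1000) lies below its bottom edge (998).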

So I tried the prompt shown in https://internvl.readthedocs.io/en/latest/get_started/chat_data_format.html#grounding-detection-data

<image>\nPlease provide the bounding box coordinate of <ref>car</ref> in the image.

The answer then looks like this:

User: <image>
Please provide the bounding box coordinate of <ref>car</ref> in the image.
Assistant: car[[561, 660, 695, 1000]]
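
Since the single-category prompt returns a box, one thing to try is issuing that prompt once per category. The sketch below only builds the prompt strings from the template quoted above. `build_prompts` is a hypothetical helper, the model call is left out, and whether per-category queries detect more objects than the combined prompt is untested here.

```python
# Categories from the multi-object prompt in this post.
CATEGORIES = [
    "car", "truck", "van", "bus",
    "pedestrian", "cyclist", "tricyclist", "motorcyclist",
]

# Single-category template, copied from the prompt quoted above.
TEMPLATE = (
    "<image>\n"
    "Please provide the bounding box coordinate of <ref>{name}</ref> in the image."
)


def build_prompts(categories):
    """Return one grounding prompt per category (hypothetical helper)."""
    return {name: TEMPLATE.format(name=name) for name in categories}


prompts = build_prompts(CATEGORIES)
# Each value in `prompts` would be sent to the model as its own query,
# together with the same image.
```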

Questions/Concerns:

Is there a recommended prompt format or additional instructions for detecting multiple objects at once according to the InternVL documentation?
How should I structure the prompt, or do I have to fine-tune the model myself?
