Is Gemini 2.5 good at bounding boxes? Sort of...
July 10, 2025
TL;DR: Gemini 2.5 Pro is a decent object detector, matching YOLOv3 (2018) on the MS-COCO validation set.
Multimodal Large Language Models keep getting better, but are they ready to dethrone CNNs in computer vision tasks like object detection? The allure of skipping dataset collection, annotation, and training is too enticing not to waste a few evenings testing.
I decided to write a small benchmark and check Gemini 2.5 on MS-COCO, focusing on object detection. You can find the code and more results here.
[Interactive figure: ground truth bounding boxes (green) vs. Gemini predictions (blue)]
Dataset
MS-COCO is a classic in the object detection world. Sure, it's a bit dated and the masks and bounding boxes aren't super tight, but it has a long history, so it should be easy to place Gemini among the historical results.
There are 80 classes, from person to toothbrush. Object boundaries can sometimes be ambiguous, but this tends to even out across the dataset.
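To put a number on "not super tight": COCO annotations store each box as [x, y, width, height] in pixels, and detection benchmarks usually score a prediction by its intersection-over-union (IoU) with the ground truth box. Here is a minimal sketch of that overlap measure. The function name `iou_xywh` is my own for illustration, and this is not the benchmark code linked above.

```python
def iou_xywh(box_a, box_b):
    """Intersection-over-union of two boxes in COCO [x, y, width, height] format."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b

    # Width and height of the overlapping rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h

    # Union = sum of both areas minus the part counted twice.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


# A loose ground truth box vs. a tighter prediction around the same object.
ground_truth = [10, 10, 100, 100]
prediction = [20, 20, 80, 80]
print(iou_xywh(ground_truth, prediction))
```

A prediction sitting entirely inside a loose ground truth box still loses IoU, which is one way sloppy annotations can cost a detector points even when its box is arguably the better one.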
The validation set, which you are not supposed to train on, contains 5000 images. However, there is no guarantee that Gemini didn't vacuum it up during training.