# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!
cff-version: 1.2.0
title: Roboflow 100 VL
message: >-
  If you use this dataset, please cite it using the metadata
  from this file.
type: dataset
authors:
  - given-names: Peter
    family-names: Robicheaux
    email: [email protected]
    affiliation: Roboflow
  - given-names: Matvei
    family-names: Popov
    email: [email protected]
    affiliation: Roboflow
  - given-names: Anish
    family-names: Madan
    email: [email protected]
    affiliation: Carnegie Mellon University
  - given-names: Isaac
    family-names: Robinson
    email: [email protected]
    affiliation: Roboflow
  - given-names: Deva
    family-names: Ramanan
    affiliation: Carnegie Mellon University
  - given-names: Neehar
    family-names: Peri
    email: [email protected]
    affiliation: Carnegie Mellon University
repository-code: 'https://github.com/roboflow/rf100-vl/'
url: 'http://rf100-vl.org/'
abstract: >-
  Vision-language models (VLMs) trained on internet-scale data achieve
  remarkable zero-shot detection performance on common objects like
  car, truck, and pedestrian. However, state-of-the-art models still
  struggle to generalize to out-of-distribution tasks (e.g. material
  property estimation, defect detection, and contextual action
  recognition) and imaging modalities (e.g. X-rays, thermal-spectrum
  data, and aerial images) not typically found in their pre-training.
  Rather than simply re-training VLMs on more visual data (the
  dominant paradigm for few-shot learning), we argue that one should
  align VLMs to new concepts with annotation instructions containing a
  few visual examples and rich textual descriptions. To this end, we
  introduce Roboflow 100-VL, a large-scale collection of 100
  multi-modal datasets with diverse concepts not commonly found in VLM
  pre-training. Notably, state-of-the-art models like GroundingDINO
  and Qwen2.5-VL achieve less than 1% AP zero-shot accuracy,
  demonstrating the need for few-shot concept alignment. Our code and
  dataset are available on GitHub and Roboflow.
keywords:
  - few shot object detection
  - VLM
license: Apache-2.0
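# How to cite: a minimal sketch of turning this metadata into a BibTeX entry,
# assuming the third-party cffconvert tool is installed (pip install cffconvert);
# exact flag names may vary between cffconvert versions. Run from the repository
# root, where this CITATION.cff lives:
#
#   cffconvert --infile CITATION.cff --format bibtex
#
# GitHub's "Cite this repository" button reads this same file, so either route
# should yield an equivalent citation.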