Overview
We introduce ImageCoDe, a vision-and-language benchmark that requires contextual language understanding in the form of pragmatics, temporality, long descriptions and visual nuances. We establish baselines that incorporate increasing levels of context and analyse the shortcomings of these models.
Leaderboard
In order to submit to the leaderboard, please read the README section that provides details on data formatting (it's a short read and a simple process!).

Rank | Date | Model | Institution | Accuracy in % (video & static)
---|---|---|---|---
1 | 31 Mar, 2022 | Human Performance | | 90.8
2 | 25 Dec, 2022 | NDCR-v2 | Harbin Institute of Technology | 34.1 (26.1 & 64.3)
3 | 20 Jan, 2022 | ALBEF-finetuned | Carnegie Mellon University | 33.6 (22.7 & 74.2)
4 | 5 Dec, 2022 | NDCR | Anonymous | 32.6 (24.4 & 63.5)
5 | 31 Mar, 2022 | Baseline: ContextualCLIP | Mila / McGill University | 29.9 (22.0 & 59.8)
6 | 1 May, 2022 | DCIG | Harbin Institute of Technology | 28.4 (21.1 & 56.1)
Examples
Explore ImageCoDe in our interactive demo, where you can sample examples from the validation set and try guessing the target image yourself (it's a fun but challenging task!).

For queries and clarifications, please contact benno.krojer (at) mila (dot) quebec