Overview

We introduce ImageCoDe, a vision-and-language benchmark that requires contextual language understanding in the form of pragmatics, temporality, long descriptions, and visual nuances. We establish baselines that incorporate increasing levels of context and analyse the shortcomings of these models.
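
To make the task format concrete: given one contextual description and a set of near-identical candidate images, a model must retrieve the single target image. The sketch below shows zero-shot retrieval with an off-the-shelf CLIP model from Hugging Face; the checkpoint name, file paths, and description are illustrative assumptions, and this is not the official baseline code.

```python
# Minimal zero-shot retrieval sketch for an ImageCoDe-style instance:
# score each candidate image against the description, predict the best match.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical description and candidate paths, for illustration only.
description = "The dog's front paw is just above the water, about to splash."
candidates = [Image.open(f"image-set/img{i}.jpg") for i in range(10)]

inputs = processor(text=[description], images=candidates,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds one text-image similarity score per candidate.
scores = outputs.logits_per_image.squeeze(-1)
predicted_index = scores.argmax().item()
print(f"Predicted target image: {predicted_index}")
```

Note that this scores each candidate independently; the leaderboard baselines go further by incorporating context across the whole candidate set, which is exactly where the benchmark's difficulty lies.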

Leaderboard

In order to submit to the leaderboard, please read the README section that details the expected data format (it's a short read and a quick process!).
| Rank | Date | Model | Institution | Accuracy in % (video & static) |
|------|------|-------|-------------|--------------------------------|
| 1 | 31 Mar, 2022 | Human Performance | | 90.8 |
| 2 | 25 Dec, 2022 | NDCR-v2 | Harbin Institute of Technology | 34.1 (26.1 & 64.3) |
| 3 | 20 Jan, 2022 | ALBEF-finetuned | Carnegie Mellon University | 33.6 (22.7 & 74.2) |
| 4 | 5 Dec, 2022 | NDCR | Anonymous | 32.6 (24.4 & 63.5) |
| 5 | 31 Mar, 2022 | Baseline: ContextualCLIP | Mila / McGill University | 29.9 (22.0 & 59.8) |
| 6 | 1 May, 2022 | DCIG | Harbin Institute of Technology | 28.4 (21.1 & 56.1) |
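
The leaderboard metric is plain retrieval accuracy, reported overall and broken down by image-set source (frames from videos vs. static pictures). Below is a minimal sketch of how such a breakdown could be computed; the field names are hypothetical, and the authoritative submission format is the one described in the README.

```python
def accuracy(examples):
    """Percentage of examples where the predicted index matches the gold index."""
    correct = sum(ex["pred"] == ex["gold"] for ex in examples)
    return 100.0 * correct / len(examples)

# Hypothetical records: a predicted index, the gold target index,
# and whether the candidate set was sourced from a video.
examples = [
    {"pred": 3, "gold": 3, "is_video": True},
    {"pred": 7, "gold": 2, "is_video": True},
    {"pred": 0, "gold": 0, "is_video": False},
]

overall = accuracy(examples)
video = accuracy([ex for ex in examples if ex["is_video"]])
static = accuracy([ex for ex in examples if not ex["is_video"]])
print(f"{overall:.1f} ({video:.1f} & {static:.1f})")  # matches the table format
```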

Examples

Explore ImageCoDe in our interactive demo, where you can sample examples from the validation set and make your own guesses (it's a fun but challenging task!).

For queries and clarifications, please contact benno.krojer (at) mila (dot) quebec