Leaderboard
Below you can find the leaderboard for the WebLINX benchmark. It reports the overall score and the group-level scores for the action model, both computed on the test splits.
- Jump to the Information section for more details on how to add your results.
- The Action Model section contains the leaderboard for the action model, which predicts the action to be taken.
- The Dense Markup Ranker section contains the leaderboard for the ranker used to select HTML candidate elements.
- To understand how the scores are computed, please refer to the Scoring section.
Leaderboard: Action Model
Model | Overall | IM | Element (IoU) | Text (F1) | Finetuned | Links | Tags |
---|---|---|---|---|---|---|---|
????? | 28.88 | 84.36 | 27.44 | 28.88 | ✔️ | ???? | EE |
Llama-2-13B | 25.21 | 81.91 | 22.82 | 26.60 | ✔️ | Code | EE |
S-LLaMA-2.7B | 25.02 | 84.00 | 22.60 | 27.17 | ✔️ | Code | EE |
Llama-2-7B | 24.57 | 82.64 | 22.26 | 26.50 | ✔️ | Code | EE |
Flan-T5-3B | 23.77 | 81.14 | 20.31 | 25.75 | ✔️ | Code | EE |
S-LLaMA-1.3B | 23.73 | 83.32 | 20.54 | 25.85 | ✔️ | Code | EE |
GPT-3.5F | 21.22 | 77.56 | 18.64 | 22.39 | ✔️ | Code | EE |
MindAct-3B | 20.94 | 79.89 | 16.50 | 23.16 | ✔️ | Code | EE |
Fuyu-8B | 19.97 | 80.07 | 15.70 | 22.30 | ✔️ | Code | EE |
Flan-T5-780M | 17.27 | 80.02 | 15.36 | 14.05 | ✔️ | Code | EE |
Pix2Act-1.3B | 16.88 | 81.80 | 8.28 | 25.21 | ✔️ | Code | EE |
MindAct-780M | 15.13 | 75.87 | 13.39 | 13.58 | ✔️ | Code | EE |
Flan-T5-250M | 14.99 | 79.69 | 14.86 | 9.21 | ✔️ | Code | EE |
MindAct-250M | 12.63 | 74.25 | 12.05 | 7.67 | ✔️ | Code | EE |
Pix2Act-282M | 12.51 | 79.71 | 6.20 | 16.40 | ✔️ | Code | EE |
GPT-4T | 10.72 | 41.66 | 10.85 | 6.75 | ❌ | Code | EE |
GPT-4V | 10.45 | 42.36 | 10.91 | 6.21 | ❌ | Code | EE |
GPT-3.5T | 8.51 | 42.77 | 8.62 | 3.45 | ❌ | Code | EE |
Llama-2-13B | 5.16 | 43.68 | 4.80 | 1.31 | ❌ | Code | EE |
Llama-2-7B | 4.04 | 33.96 | 2.92 | 2.14 | ❌ | Code | EE |
Leaderboard: Dense Markup Ranker
Model | Test-Vis | Test-Geo | Test-Cat | Test-Web | Test-OOD | Links | Tags |
---|---|---|---|---|---|---|---|
BGE | 60.07 | 48.82 | 43.61 | 47.55 | 50.01 | Code | EE |
MiniLM | 59.73 | 50.95 | 44.05 | 52.75 | 51.87 | Code | EE |
GTE | 56.91 | 44.46 | 42.74 | 48.39 | 48.16 | Code | EE |
Scoring
You can learn more about the evaluation metrics and splits in our paper. The ranker uses recall@10, whereas the action model’s scoring is more involved:
- We compute an intent match (IM) score that measures how well the predicted intent matches the ground-truth intent.
- Different intents are evaluated separately at the turn level, so each turn gets a score. There are two groups:
  - Element Group: click, textinput, submit use the IoU metric.
  - Text Group: say, textinput, load use the F1 metric.
- We compute the overall score by averaging the per-turn scores in a given split. The element and text group scores are the averages over turns whose intent belongs to that group (an illustrative aggregation sketch follows this list).
- We have 4 out-of-domain test splits (Test-Vis, Test-Geo, Test-Cat, Test-Web) covering different generalization capabilities of the model.
- The scores reported for the action model (overall, group-level, and IM) are averaged across these four splits; we refer to this average as Test-OOD.
- For the ranker, scores on all splits are reported.
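For concreteness, here is a minimal Python sketch of how these numbers could be aggregated. Everything below is illustrative: the exact metric definitions (the tokenization behind F1, how IM interacts with the group metrics, and how textinput, which belongs to both groups, contributes to the overall score) follow the paper and the official evaluation code, so the helper names and simplifications here are assumptions rather than the reference implementation.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List, Optional, Tuple

# Intent groupings taken from the list above; textinput belongs to both groups.
ELEMENT_INTENTS = {"click", "textinput", "submit"}  # scored with IoU
TEXT_INTENTS = {"say", "textinput", "load"}         # scored with F1


@dataclass
class Turn:
    intent: str
    box: Optional[Tuple[float, float, float, float]] = None  # element box (x1, y1, x2, y2)
    text: Optional[str] = None                                # uttered text / typed text / URL


def iou(a, b) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def token_f1(pred: str, ref: str) -> float:
    """Bag-of-tokens F1 on whitespace tokens (the paper defines the exact tokenization)."""
    p, r = pred.split(), ref.split()
    common = sum(min(p.count(t), r.count(t)) for t in set(p))
    if not common:
        return 0.0
    prec, rec = common / len(p), common / len(r)
    return 2 * prec * rec / (prec + rec)


def turn_metrics(pred: Turn, ref: Turn) -> Dict[str, float]:
    """Per-turn group metrics. Assumption: a mismatched intent scores 0 on every metric."""
    match = pred.intent == ref.intent
    out: Dict[str, float] = {}
    if ref.intent in ELEMENT_INTENTS:
        out["element"] = iou(pred.box, ref.box) if match and pred.box and ref.box else 0.0
    if ref.intent in TEXT_INTENTS:
        out["text"] = token_f1(pred.text or "", ref.text or "") if match else 0.0
    return out


def split_scores(preds: List[Turn], refs: List[Turn]) -> Dict[str, float]:
    """Aggregate one test split: IM, element (IoU), text (F1), and the overall score."""
    im = mean(float(p.intent == r.intent) for p, r in zip(preds, refs))
    element, text, per_turn = [], [], []
    for p, r in zip(preds, refs):
        m = turn_metrics(p, r)
        element += [m["element"]] if "element" in m else []
        text += [m["text"]] if "text" in m else []
        # Assumption: a turn in both groups (textinput) contributes the mean of its two
        # metrics to the overall score; intents outside both groups contribute IM only.
        per_turn.append(mean(m.values()) if m else float(p.intent == r.intent))
    return {
        "overall": mean(per_turn),
        "IM": im,
        "element_iou": mean(element) if element else 0.0,
        "text_f1": mean(text) if text else 0.0,
    }


def test_ood(split_results: List[Dict[str, float]]) -> Dict[str, float]:
    """Test-OOD: average each reported number over the four out-of-domain splits."""
    return {k: mean(r[k] for r in split_results) for k in split_results[0]}


def recall_at_10(ranked_candidates: List[str], gold: str) -> float:
    """Ranker metric: 1 if the ground-truth element is among the top-10 candidates."""
    return float(gold in ranked_candidates[:10])
```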
Information
Details on how to add your results, and on the Links and Tags columns, can be found below.
Adding results
To add your results, please fork the repository, add your results to the leaderboard.md file, and submit a pull request to the main branch with the subject line “Add results for [your team name]”.
Date
Please use the YYYY-MM-DD format. If you are submitting multiple results, please use the date of the most recent result.
Links
In the Links column, you can provide links to your code, paper, and/or blog post. Rather than giving the bare URL, please use the following format:
[Code](https://github.com); [Paper](https://arxiv.org/abs/1234.5678); [Blog](https://medium.com)
Please do not exceed 5 links.
Tags
The following tags are allowed:
- SR: Self-reported
- EE: Externally evaluated (must provide link to evaluation script)
- RI: Inference was reproduced by a third party following given instructions (must provide link to instructions and third party’s results)
- TR: Training process was independently reproduced by a third party following given instructions (must provide link to training scripts and third party’s results)
Since EE supersedes SR, please use only the most appropriate tag, or update the existing result if it already carries a tag. If there are multiple tags, please separate them with a comma:
[EE](https://github.com/some-repo/eval.py), [TR](https://github.com/another-repo/train.py)