Overview of BRIDGE. Model responses across benchmarks are used to fit a 2PL IRT model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human completion times yields predictions of human task duration for new benchmarks—enabling capability forecasting without human time annotations.
Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE (Benchmark Response Inferred Difficulty Grounded in Elapsed human time), a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time.
Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and, using no human completion time annotations, independently reproduce METR's exponential scaling result: the 50%-solvable task horizon doubles approximately every 6 months.
BRIDGE connects two complementary regimes: expensive but interpretable human time annotations, and scalable but uncalibrated IRT-based difficulty estimates from model performance.
Given binary outcomes (success/failure) for each model–task pair, we fit a two-parameter logistic (2PL) IRT model. The probability of model $m_j$ succeeding on task $t_i$ is:
$$P(y_{i,j} = 1 \mid \theta_j, a_i, b_i) = \sigma\big(a_i(\theta_j - b_i)\big)$$
where $\theta_j$ is the model's latent ability, $b_i$ is the task's latent difficulty, and $a_i$ is the task's discrimination parameter.
We discover that IRT-inferred latent difficulty closely tracks human completion time. For tasks with human annotations, we fit a log-linear mapping:
$$\log(h_k) = \alpha \, b_k + \beta$$
where $h_k$ is the annotated human completion time for task $k$ and $b_k$ its latent difficulty. This anchors the latent scale to human-interpretable units, enabling prediction of human task completion time from model performance alone, without new human studies.
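The anchoring step is an ordinary least-squares fit in log-time space. The sketch below uses made-up anchoring data (not the paper's annotations); the ground-truth slope is set to $\log(2.26)$ only to mirror the ~2.26x-per-unit-$b$ relationship reported in Figure 1, and the intercept and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy anchoring set (assumed values, not the paper's annotations): latent
# difficulties b_k paired with known human completion times h_k in minutes.
b_anno = rng.normal(0.0, 1.5, 60)
h_anno = np.exp(2.0 + np.log(2.26) * b_anno + rng.normal(0.0, 0.3, 60))

# Least-squares fit of log(h_k) = slope * b_k + intercept.
slope, intercept = np.polyfit(b_anno, np.log(h_anno), deg=1)

def predict_minutes(b_new):
    """Map the IRT difficulty of an unannotated task to estimated human time."""
    return np.exp(slope * b_new + intercept)
```

Once `slope` and `intercept` are fixed on the anchored tasks, `predict_minutes` can be applied to difficulties estimated for any new benchmark that shares the latent scale.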
By tracking the best-performing model in each release window and mapping its ability to solvable task length via the calibration, we forecast how the frontier task-length horizon evolves over time.
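A convenient property makes this forecast simple: at the 50% threshold, $\sigma\big(a_i(\theta_j - b_i)\big) = 0.5$ implies $b_i = \theta_j$, so a model's ability is itself the difficulty it solves half the time, and the calibration maps it directly to a task length. The sketch below uses hypothetical calibration constants and frontier abilities (chosen so the toy trend lands near the reported ~6-month doubling), not the paper's fitted values.

```python
import numpy as np

# Hypothetical calibration from the anchoring step (illustrative values):
# log(minutes) = SLOPE * b + INTERCEPT.
SLOPE, INTERCEPT = 0.815, 2.0

def horizon_minutes(theta):
    """Task length (minutes) a model solves with 50% probability.

    At p = 0.5 the 2PL gives b = theta, so the 50% difficulty equals the
    model's ability; the calibration converts it to human time.
    """
    return np.exp(SLOPE * theta + INTERCEPT)

# Hypothetical frontier abilities by release date (years since some epoch).
dates = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
thetas = np.array([-0.5, 0.35, 1.2, 2.05, 2.9])

# Fit log2(horizon) against release date; doubling time = 1 / slope.
rate_per_year, _ = np.polyfit(dates, np.log2(horizon_minutes(thetas)), deg=1)
doubling_time_months = 12.0 / rate_per_year
print(round(doubling_time_months, 1))  # → 6.0 for these illustrative values
```

Because ability grows linearly in this toy data and the calibration is log-linear, the horizon grows exactly exponentially; on real frontier models the fit is noisy, hence the bootstrap confidence intervals in Figure 6.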
Figure 1. Human completion time vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench). The log-linear fit ($R^2 = 0.81$) shows that each unit increase in $b$ corresponds to ~2.26× longer human completion time. This calibration anchors the IRT latent difficulty scale to human-interpretable units.
We validate BRIDGE's time predictions on out-of-distribution benchmarks: SWE-bench Verified, Cybench, MLE-bench, and GDPval. BRIDGE consistently outperforms both success-rate heuristics and LLM-based estimation baselines.
Figure 2. Alignment between annotated human completion time buckets and estimated completion times on SWE-bench Verified. BRIDGE achieves substantially better alignment with the annotated time buckets than both heuristic and LLM-based baselines (Gemini 3 Pro, GPT-5.2).
Figure 3. Alignment between actual human first-solve time and estimated completion times on Cybench. BRIDGE aligns closely with actual human times, with 92.3% of tasks falling within a 0.5×–2× tolerance band. The logit baseline underestimates; LLM estimates consistently overestimate.
SWE-bench Verified
MLE-bench
GDPval
Figure 4. BRIDGE-estimated human task completion time distributions for benchmarks without exact time annotations. On MLE-bench, tasks yielding only valid submissions require substantially shorter completion times than achieving above-median performance or earning a medal.
Using BRIDGE, we forecast the evolution of frontier model capabilities in human-interpretable units—using only model performance data, without human time annotations.
Figure 5. Success probability vs. estimated human task completion time for different models at the 50% threshold. SOTA models achieve 50% success on tasks estimated to require ~1.4–2.5 hours of human effort. Steeper curves reflect higher task discrimination parameters. Darker blue denotes more recent models.
Figure 6. Forecasting trends of task-length horizon over model release date without human task time annotations. The task length where a model can achieve 50% accuracy grows exponentially, with an estimated doubling time of approximately 6 months. Left: logarithmic scale. Right: linear scale. Shaded regions indicate 95% confidence intervals via bootstrap resampling.
@misc{liu2026bridgepredictinghumantask,
  title={BRIDGE: Predicting Human Task Completion Time From Model Performance},
  author={Fengyuan Liu and Jay Gala and Nilaksh and Dzmitry Bahdanau and Siva Reddy and Hugo Larochelle},
  year={2026},
  eprint={2602.07267},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.07267},
}
Siva Reddy and Dzmitry Bahdanau are supported by the Canada CIFAR AI Chairs program. We acknowledge the support of the IVADO R3AI Grant and a Gemini Academic Program Award. We thank the Mila IDT team and the Digital Research Alliance of Canada for computational resources, and members of McGill University and Mila for valuable feedback.