Overview of BRIDGE. Model responses across benchmarks are used to fit a 2PL IRT model, estimating latent task difficulty and model capability. Calibrating latent difficulty against tasks with known human completion times yields predictions of human task duration for new benchmarks—enabling capability forecasting without human time annotations.
Evaluating the real-world capabilities of AI systems requires grounding benchmark performance in human-interpretable measures of task difficulty. Existing approaches that rely on direct human task completion time annotations are costly, noisy, and difficult to scale across benchmarks. In this work, we propose BRIDGE (Benchmark Response Inferred Difficulty Grounded in Elapsed human time), a unified psychometric framework that learns the latent difficulty scale from model responses and anchors it to human task completion time.
Using a two-parameter logistic Item Response Theory model, we jointly estimate latent task difficulty and model capability from model performance data across multiple benchmarks. We demonstrate that latent task difficulty varies linearly with the logarithm of human completion time, allowing human task completion time to be inferred for new benchmarks from model performance alone. Leveraging this alignment, we forecast frontier model capabilities in terms of human task length and, using no human completion time annotations, independently reproduce METR's exponential scaling result: the 50%-solvable task horizon doubles approximately every 6 months.
BRIDGE connects two complementary regimes: expensive but interpretable human time annotations, and scalable but uncalibrated IRT-based difficulty estimates from model performance.
Given binary outcomes (success/failure) for each model–task pair, we fit a two-parameter logistic (2PL) IRT model. The probability of model $m_j$ succeeding on task $t_i$ is:
$$P(y_{i,j} = 1 \mid \theta_j, a_i, b_i) = \sigma\big(a_i(\theta_j - b_i)\big)$$
where $\theta_j$ is the model's latent ability, $b_i$ is the task's latent difficulty, and $a_i$ is the task's discrimination parameter.
We discover that IRT-inferred latent difficulty closely tracks human completion time. For tasks with human annotations, we fit a log-linear mapping:
$$\log(h_k) = \alpha \, b_k + \beta$$
where $h_k$ is the annotated human completion time for task $k$ and $b_k$ its latent difficulty. This anchors the latent scale to human-interpretable units, enabling prediction of human task completion time from model performance alone, without new human studies.
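The anchoring step is an ordinary least-squares fit in log-time space. The sketch below uses made-up anchoring data (not the paper's annotations); the ground-truth slope is set to $\log(2.26)$ only to mirror the ~2.26x-per-unit-$b$ relationship reported in Figure 1, and the intercept and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy anchoring set (assumed values, not the paper's annotations): latent
# difficulties b_k paired with known human completion times h_k in minutes.
b_anno = rng.normal(0.0, 1.5, 60)
h_anno = np.exp(2.0 + np.log(2.26) * b_anno + rng.normal(0.0, 0.3, 60))

# Least-squares fit of log(h_k) = slope * b_k + intercept.
slope, intercept = np.polyfit(b_anno, np.log(h_anno), deg=1)

def predict_minutes(b_new):
    """Map the IRT difficulty of an unannotated task to estimated human time."""
    return np.exp(slope * b_new + intercept)
```

Once `slope` and `intercept` are fixed on the anchored tasks, `predict_minutes` can be applied to difficulties estimated for any new benchmark that shares the latent scale.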
By tracking the best-performing model in each release window and mapping its ability to solvable task length via the calibration, we forecast how the frontier task-length horizon evolves over time.
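A convenient property makes this forecast simple: at the 50% threshold, $\sigma\big(a_i(\theta_j - b_i)\big) = 0.5$ implies $b_i = \theta_j$, so a model's ability is itself the difficulty it solves half the time, and the calibration maps it directly to a task length. The sketch below uses hypothetical calibration constants and frontier abilities (chosen so the toy trend lands near the reported ~6-month doubling), not the paper's fitted values.

```python
import numpy as np

# Hypothetical calibration from the anchoring step (illustrative values):
# log(minutes) = SLOPE * b + INTERCEPT.
SLOPE, INTERCEPT = 0.815, 2.0

def horizon_minutes(theta):
    """Task length (minutes) a model solves with 50% probability.

    At p = 0.5 the 2PL gives b = theta, so the 50% difficulty equals the
    model's ability; the calibration converts it to human time.
    """
    return np.exp(SLOPE * theta + INTERCEPT)

# Hypothetical frontier abilities by release date (years since some epoch).
dates = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
thetas = np.array([-0.5, 0.35, 1.2, 2.05, 2.9])

# Fit log2(horizon) against release date; doubling time = 1 / slope.
rate_per_year, _ = np.polyfit(dates, np.log2(horizon_minutes(thetas)), deg=1)
doubling_time_months = 12.0 / rate_per_year
print(round(doubling_time_months, 1))  # → 6.0 for these illustrative values
```

Because ability grows linearly in this toy data and the calibration is log-linear, the horizon grows exactly exponentially; on real frontier models the fit is noisy, hence the bootstrap confidence intervals in Figure 6.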
Figure 1. Human completion time vs. latent task difficulty ($b$) estimated via 2PL IRT across METR task suites (SWAA, HCAST, RE-bench). The log-linear fit ($R^2 = 0.81$) shows that each unit increase in $b$ corresponds to ~2.26× longer human completion time. This calibration anchors the IRT latent difficulty scale to human-interpretable units.
We validate BRIDGE's time predictions on out-of-distribution benchmarks: SWE-bench Verified, Cybench, MLE-bench, and GDPval. BRIDGE consistently outperforms both success-rate heuristics and LLM-based estimation baselines.
Figure 2. Alignment between annotated human completion time buckets and estimated completion times on SWE-bench Verified. BRIDGE achieves substantially better alignment with the annotated time buckets than both heuristic and LLM-based baselines (Gemini 3 Pro, GPT-5.2).
Figure 3. Alignment between actual human first-solve time and estimated completion times on Cybench. BRIDGE aligns closely with actual human times, with 92.3% of tasks falling within a 0.5×–2× tolerance band. The logit baseline underestimates; LLM estimates consistently overestimate.
SWE-bench Verified
MLE-bench
GDPval
Figure 4. BRIDGE-estimated human task completion time distributions for benchmarks without exact time annotations. On MLE-bench, tasks yielding only valid submissions require substantially shorter completion times than achieving above-median performance or earning a medal.
Using BRIDGE, we forecast the evolution of frontier model capabilities in human-interpretable units—using only model performance data, without human time annotations.
Figure 5. Success probability vs. estimated human task completion time for different models at the 50% threshold. SOTA models achieve 50% success on tasks estimated to require ~1.4–2.5 hours of human effort. Steeper curves reflect higher task discrimination parameters. Darker blue denotes more recent models.
Figure 6. Forecasting trends of task-length horizon over model release date without human task time annotations. The task length where a model can achieve 50% accuracy grows exponentially, with an estimated doubling time of approximately 6 months. Left: logarithmic scale. Right: linear scale. Shaded regions indicate 95% confidence intervals via bootstrap resampling.
@misc{liu2026bridgepredictinghumantask,
  title={BRIDGE: Predicting Human Task Completion Time From Model Performance},
  author={Fengyuan Liu and Jay Gala and Nilaksh and Dzmitry Bahdanau and Siva Reddy and Hugo Larochelle},
  year={2026},
  eprint={2602.07267},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.07267},
}
Siva Reddy and Dzmitry Bahdanau are supported by the Canada CIFAR AI Chairs program. We acknowledge the support of the IVADO R3AI Grant and a Gemini Academic Program Award. We thank the Mila IDT team and the Digital Research Alliance of Canada for computational resources, and members of McGill University and Mila for valuable feedback.