Task horizon

Last updated: February 5, 2026

Task horizon is a deployment metric, not just a research concept. It estimates how long an agent can run on bounded work before it hits a failure that requires manual rescue.

This matters because supervision cost does not scale linearly with task length. Every manual rescue interrupts the workflow, so a small improvement in reliable task length can unlock much larger workflow chunks.

Research

METR (Model Evaluation and Threat Research) calibrates tasks by how long they take humans to complete, then measures how agent success rates decay as tasks get longer. That gives a clearer picture than one-off benchmark scores.

The trend is upward, but exact values are setup-sensitive. Task definition, tool access, and agent scaffolding all influence measured horizon.
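The calibration idea above can be sketched as a small curve fit. This is a minimal illustration, not METR's actual pipeline: it assumes success probability follows a logistic curve in the log of human completion time, and it reports the task length at which the fitted success rate crosses 50%. The function name `fit_horizon`, the record format, and the sample data are all hypothetical.

```python
import math

def fit_horizon(records, steps=20000, lr=0.1):
    """Fit P(success) = sigmoid(a - b * log2(minutes)) by gradient descent.

    records: list of (human_minutes, succeeded) pairs (hypothetical format).
    Returns (horizon_minutes, a, b), where horizon_minutes is the task
    length at which the fitted success probability equals 0.5.
    """
    a, b = 0.0, 1.0
    n = len(records)
    for _ in range(steps):
        grad_a = 0.0
        grad_b = 0.0
        for minutes, ok in records:
            x = math.log2(minutes)
            p = 1.0 / (1.0 + math.exp(-(a - b * x)))
            err = p - (1.0 if ok else 0.0)  # gradient of log loss w.r.t. the logit
            grad_a += err
            grad_b += -err * x
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    # The logit a - b*x is zero at x = a / b, so that is the 50% point.
    return 2 ** (a / b), a, b

# Made-up runs: success rate falls as human completion time rises.
sample_runs = (
    [(2, True)] * 4
    + [(8, True)] * 3 + [(8, False)]
    + [(32, True)] * 2 + [(32, False)] * 2
    + [(128, True)] + [(128, False)] * 3
    + [(512, False)] * 4
)

horizon_minutes, a, b = fit_horizon(sample_runs)
print(f"Estimated 50% horizon: {horizon_minutes:.1f} minutes")
```

Changing the task set, the tools available, or the scaffolding changes the records that go in, which is why the resulting horizon is setup-sensitive.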

Operational Relevance

Most jobs are bundles of mixed-duration tasks. As the reliable horizon grows, handoff to agents moves from micro-tasks toward full work packets with fewer interruptions.

The governing variable becomes supervision economics: does review time plus failure recovery stay below the value created by delegation?
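The supervision question can be written as a simple inequality. The sketch below assumes that the value created by delegation is the human time saved, and that failure recovery can be approximated as rescue probability times rescue time. Both are simplifying assumptions, and every name and number here is illustrative.

```python
def delegation_net_value(manual_minutes, review_minutes,
                         rescue_probability, rescue_minutes):
    """Return minutes saved by delegating one task (negative means a loss).

    Assumes value created = time the human would have spent doing the task,
    and expected recovery cost = rescue_probability * rescue_minutes.
    """
    expected_overhead = review_minutes + rescue_probability * rescue_minutes
    return manual_minutes - expected_overhead

def worth_delegating(manual_minutes, review_minutes,
                     rescue_probability, rescue_minutes):
    """True when review time plus expected recovery stays below the value created."""
    return delegation_net_value(
        manual_minutes, review_minutes, rescue_probability, rescue_minutes
    ) > 0

# A 60-minute task with 10 minutes of review and a 20% chance of a 45-minute rescue.
print(worth_delegating(60, 10, 0.2, 45))
```

Under these assumptions, a short task with heavy review overhead can be a net loss even when the agent rarely fails.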

Use horizon thresholds to choose tasks for automation.
Measure rescue frequency, not only completion rate.
Separate reversible and irreversible task types.
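The three guidelines above can be combined into one selection rule. This is a sketch under stated assumptions: the `TaskType` record, the rescue-rate cutoff, and the stricter horizon factor for irreversible tasks are hypothetical choices, not values taken from any published source.

```python
from dataclasses import dataclass

@dataclass
class TaskType:
    name: str
    human_minutes: float  # typical human completion time
    reversible: bool      # can a bad outcome be undone cheaply?
    runs: int             # agent attempts observed
    completed: int        # attempts that finished, with or without help
    rescues: int          # attempts that needed manual intervention

def completion_rate(task):
    return task.completed / task.runs if task.runs else 0.0

def rescue_rate(task):
    # With no observed runs there is no evidence, so treat the task as always needing rescue.
    return task.rescues / task.runs if task.runs else 1.0

def select_for_automation(tasks, horizon_minutes,
                          max_rescue_rate=0.2, irreversible_factor=0.25):
    """Return names of task types that fit the horizon and the rescue budget.

    Irreversible tasks get a shorter allowed length (horizon * irreversible_factor).
    """
    selected = []
    for task in tasks:
        limit = horizon_minutes if task.reversible else horizon_minutes * irreversible_factor
        if task.human_minutes <= limit and rescue_rate(task) <= max_rescue_rate:
            selected.append(task.name)
    return selected

# Made-up task mix for illustration.
tasks = [
    TaskType("draft release notes", 20, True, 50, 48, 4),
    TaskType("refactor module", 90, True, 30, 27, 12),
    TaskType("run database migration", 30, False, 20, 20, 1),
    TaskType("rotate test fixtures", 10, False, 40, 40, 2),
]
print(select_for_automation(tasks, horizon_minutes=60))
```

Here "refactor module" shows why completion rate alone is misleading: it completes 90% of the time but needs a rescue in 40% of runs, so it is excluded.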

Current State

Based on current public evidence, agents can often sustain work spanning tens of minutes to a few hours in constrained domains. Multi-day autonomous execution is still unreliable because error compounding and context drift remain unresolved.
