A new benchmark designed to test whether Artificial Intelligence can truly replace human freelancers has found that today’s most advanced AI agents can complete less than three per cent of real-world remote work projects.
The study, titled ‘Remote Labour Index: Measuring AI Automation of Remote Work,’ was developed by researchers from the Center for AI Safety and Scale AI. It introduces the Remote Labour Index (RLI), a dataset of 240 end-to-end freelance projects sourced from real professionals across multiple industries.
Despite rapid advances in reasoning and knowledge benchmarks, the report shows that frontier AI systems remain far from automating economically valuable remote work.
The researchers argue that the RLI provides a more economically grounded measure of AI capability than previous tests, offering policymakers and businesses a clearer picture of automation risks.
“Despite rapid progress on other AI benchmarks, current systems remain far from capable of autonomously handling the diverse and complex demands of the remote labor market,” the report noted.
The findings may temper fears of immediate large-scale displacement of freelance digital workers, while also underscoring the need to track AI progress using real-world economic metrics rather than theoretical performance alone.
AI performance
The highest-performing model in the benchmark, Manus, achieved an automation rate of just 2.5 per cent, meaning it produced work comparable to human freelancers on only a handful of projects.
Other leading systems, including GPT-5, Claude Sonnet 4.5, Grok 4, ChatGPT agent, and Gemini 2.5 Pro, scored between 0.8 per cent and 2.1 per cent.
In practical terms, that means more than 97 per cent of the projects, ranging from 3D product rendering and architectural design to game development, data visualisation and video production, were not completed at a level that would be accepted by a paying client.
The researchers said the findings highlight a stark gap between performance on lab-style benchmarks and the demands of real freelance markets.
Real jobs, real economic value
Unlike many AI tests that focus on narrow tasks such as coding or answering exam-style questions, RLI evaluates full projects drawn directly from online freelance platforms. Each project includes a brief, input files and a ‘gold-standard’ human deliverable.
The projects span 23 categories of remote work and represent more than 6,000 hours of human labour valued at over $140,000. The average project took nearly 29 hours to complete and cost about $633.
Researchers manually evaluated AI outputs against human work, using a holistic standard: would a reasonable client accept the AI’s submission as commissioned work?
Inter-annotator agreement among evaluators exceeded 94 per cent, suggesting strong reliability in the scoring process.
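
To make the scoring concrete, here is a minimal sketch of how the headline automation rate and a simple percent-agreement figure could be computed from binary accept/reject verdicts; the function names and the accept/reject representation are illustrative assumptions, not the study’s actual evaluation code.

```python
# Illustrative sketch only: reduces each project to a single boolean
# "would a reasonable client accept this?" verdict, as the article
# describes, and computes the two statistics the report cites.

def automation_rate(verdicts: list[bool]) -> float:
    """Percentage of projects whose AI deliverable was judged acceptable."""
    return 100.0 * sum(verdicts) / len(verdicts)

def percent_agreement(rater_a: list[bool], rater_b: list[bool]) -> float:
    """Share of projects on which two evaluators reached the same verdict."""
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return 100.0 * matches / len(rater_a)

# Roughly six acceptances out of 240 projects reproduces the ~2.5 per
# cent figure reported for the best-performing model.
verdicts = [True] * 6 + [False] * 234
print(f"{automation_rate(verdicts):.1f}%")  # 2.5%
```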
Common AI failures
The study found recurring weaknesses in AI-generated deliverables: incomplete or truncated outputs, corrupted or unusable files, poor professional quality, and inconsistencies across assets.
For instance, some AI systems produced videos far shorter than requested, child-like graphics for design tasks, or floor plans that failed to match supplied sketches.
While AI performed better on certain creative and text-heavy tasks such as audio editing, report writing and basic data visualisation, these represented a small slice of the broader remote work economy.
Although absolute automation rates remain low, the study’s Elo scoring system, which measures relative improvement between models, indicates steady progress.
Newer models consistently outperformed older ones, suggesting incremental gains. Still, all AI systems fell well below the human baseline score of 1,000 on the benchmark.
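
For context, Elo is a relative rating scheme updated from head-to-head comparisons: beating a stronger opponent moves a rating up more than beating a weaker one. The sketch below shows the standard update rule; the K-factor, pairing scheme and example numbers are illustrative assumptions, since the article only states that the human baseline sits at 1,000.

```python
# Standard Elo update from one pairwise comparison. In the study's
# setting, a "comparison" would pit two deliverables for the same
# project against each other; that framing is assumed here.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Win probability of A over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> tuple[float, float]:
    """Return the new (A, B) ratings after a single comparison."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    return (rating_a + k * (score_a - exp_a),
            rating_b + k * (exp_a - score_a))

# Example: a model rated 800 losing to the 1,000-rated human baseline
# drifts down only slightly, since the loss was already expected.
model, human = elo_update(800.0, 1000.0, a_won=False)
print(round(model), round(human))  # 792 1008
```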