x
N A B I L . O R G
Close
AI - September 21, 2025

Revolutionizing AI Agents: The Rise of Reinforcement Learning Environments and Their Potential to Propel AI Progress

Revolutionizing AI Agents: The Rise of Reinforcement Learning Environments and Their Potential to Propel AI Progress

The pursuit of autonomous AI agents capable of executing software tasks for individuals remains a long-standing aspiration among Big Tech CEOs. However, current consumer AI agents such as OpenAI’s ChatGPT Agent and Perplexity’s Comet reveal the technology’s limitations. To enhance AI agent robustness, industry experts are exploring novel techniques including reinforcement learning (RL) environments.

Similar to how labeled datasets propelled the last wave of AI, RL environments are emerging as a vital element in developing agents. Leading AI labs are increasingly seeking more RL environments, and numerous startups aspire to supply them.

“All the big AI labs are building RL environments in-house,” stated Jennifer Li, general partner at Andreessen Horowitz. “But creating these datasets is a complex task, so AI labs are also looking at third-party vendors that can deliver high-quality environments and evaluations.”

The push for RL environments has spawned a new cohort of well-funded startups like Mechanize and Prime Intellect, aiming to lead the sector. Meanwhile, large data-labeling companies such as Mercor and Surge are investing more resources in RL environments to align with industry trends. Top labs are also considering significant investments: according to The Information, leaders at Anthropic have discussed allocating over $1 billion towards RL environments in the coming year.

Investors and founders hope that one of these startups will emerge as the “Scale AI for environments,” referencing the $29 billion data labeling powerhouse that dominated the chatbot era.

At their core, RL environments serve as training grounds simulating an AI agent’s activities within a real software application. Building them is likened to creating a somewhat tedious video game, according to one founder. For instance, an environment might simulate a Chrome browser and task an AI agent with purchasing socks on Amazon. The agent receives performance feedback and a reward signal upon success (e.g., buying suitable socks).

While such a task may seem straightforward, there are numerous points where an AI agent could falter—getting lost in dropdown menus or buying excessive socks. Because developers can’t predict the precise misstep an agent might take, the environment must be robust enough to capture unexpected behavior and still provide valuable feedback. This makes building environments more complex than managing static datasets.

Some environments are elaborate, allowing AI agents to use tools, access the internet, or employ various software applications to complete a specific task. Others are narrower, focused on helping an agent learn specialized tasks in enterprise software applications.

The interest in RL environments is currently at its zenith in Silicon Valley. Yet there’s precedent for using this technique: one of OpenAI’s initial projects in 2016 was building “RL Gyms,” which bear a striking resemblance to today’s conception of environments. In the same year, Google DeepMind’s AlphaGo AI system defeated a world champion at Go—using RL techniques within a simulated environment.

What sets modern environments apart is that researchers are striving to create computer-using AI agents with large transformer models. Unlike AlphaGo, which was a specialized AI system operating in a closed environment, today’s AI agents are designed to have broader capabilities. However, this goal presents a more complex challenge where numerous factors can go awry.

Data labeling companies like Scale AI, Surge, and Mercor are attempting to capitalize on the moment by building RL environments. These companies boast substantial resources compared to many startups in the field, as well as deep relationships with AI labs.

Surge CEO Edwin Chen told TechCrunch that he has recently observed a “significant increase” in demand for RL environments within AI labs. Surge—which reportedly generated $1.2 billion in revenue last year from collaborations with AI labs such as OpenAI, Google, Anthropic, and Meta—recently established a new internal organization dedicated to building out RL environments, Chen said.

Mercor CEO Brendan Foody informed TechCrunch that “few understand how large the opportunity around RL environments truly is.”

Scale AI once dominated the data labeling space but has faced competition since Meta invested $14 billion and poached its CEO. Since then, Google and OpenAI have dropped Scale AI as a data provider, and the startup even competes with data-labeling work within Meta. Nevertheless, Scale is adapting to new frontiers, including agents and environments.

“This is just the nature of the business [Scale AI] is in,” said Chetan Rane, Scale AI’s head of product for agents and RL environments. “Scale has proven its ability to adapt quickly. We did this in the early days of autonomous vehicles, our first business unit. When ChatGPT came out, Scale AI adapted to that. And now, once again, we’re adapting to new frontier spaces like agents and environments.”

Newer players are focusing exclusively on environments from the outset. Among them is Mechanize, a startup founded six months ago with the ambitious goal of “automating all jobs.” Initially, Mechanize will focus on RL environments for AI coding agents, according to co-founder Matthew Barnett.

Mechanize aims to provide AI labs with a limited number of robust RL environments rather than large data firms that create numerous simple RL environments. To this end, the startup is offering software engineers salaries of half a million dollars to build RL environments—significantly higher than what an hourly contractor could earn at Scale AI or Surge.

Mechanize has already been collaborating with Anthropic on RL environments, according to two sources familiar with the matter. Mechanize and Anthropic declined to comment on the partnership.

Other startups are banking on RL environments making an impact beyond AI labs. Prime Intellect—a startup backed by AI researcher Andrej Karpathy, Founders Fund, and Menlo Ventures—is targeting smaller developers with its RL environments.

Last month, Prime Intellect launched an RL environments hub, which aims to be a “Hugging Face for RL environments.” The idea is to provide open-source developers access to the same resources that large AI labs have while simultaneously selling them computational resources.

Training generally capable agents in RL environments can be computationally expensive compared to previous AI training techniques, according to Prime Intellect researcher Will Brown. Alongside startups building RL environments, there’s an opportunity for GPU providers that can power the process.

“RL environments are going to be too large for any one company to dominate,” said Brown in an interview. “Part of what we’re doing is just trying to build good open-source infrastructure around it. The service we sell is compute, so it is a convenient onramp to using GPUs, but we’re thinking of this more in the long term.”

The question surrounding RL environments is whether the technique will scale like previous AI training methods. Reinforcement learning has powered some of the most significant advancements in AI over the past year, including models such as OpenAI’s o1 and Prime Intellect’s Prime5. These breakthroughs are crucial because the methods previously used to enhance AI models are now yielding diminishing returns.

Environments represent a portion of AI labs’ broader bet on RL, which many believe will continue to drive progress as they add more data and computational resources to the process. Some of the OpenAI researchers behind o1 previously told TechCrunch that the company initially invested in AI reasoning models—created through investments in RL and test-time-compute—because they thought it would scale nicely.

The best way to scale RL remains unclear, but environments appear to be a promising contender. Instead of simply rewarding chatbots for text responses, they let agents operate in simulations with tools and computers at their disposal. That’s more resource-intensive, but potentially more rewarding.

However, some are skeptical that all these RL environments will yield positive results. Ross Taylor, a former AI research lead with Meta who co-founded General Reasoning, told TechCrunch that RL environments are prone to reward hacking—a process in which AI models cheat to get rewards without actually performing the task.

“I think people are underestimating how difficult it is to scale environments,” said Taylor. “Even the best publicly available [RL environments] typically don’t work without serious modification.”

OpenAI’s Head of Engineering for its API business, Sherwin Wu, stated in a recent podcast that he was “short” on RL environment startups. Wu noted that it’s a highly competitive space but also that AI research is evolving so quickly that it’s challenging to serve AI labs well.

Andreessen Horowitz’s Karpathy, an investor in Prime Intellect who has praised RL environments as a potential breakthrough, has also expressed caution for the RL sector more broadly. In a post on X, he raised concerns about how much more AI progress can be squeezed out of RL.

“I am bullish on environments and agentic interactions but I am bearish on reinforcement learning specifically,” said Karpathy.