AI’s “self-replication” ability is exposed! RepliBench warning: Large models are learning to forge their identities

The scene of AI self-replication out of control in science fiction is becoming a serious research topic in the real world. AISI in the UK launched the RepliBench benchmark, which breaks down and evaluates the four core capabilities required for AI autonomous replication. Tests have shown that AI does not yet have the ability to replicate completely autonomously, but it has shown significant progress in subtasks such as obtaining resources.

Imagine an out-of-control, self-replicating digital life form that takes over the Earth for a long time?

From HAL 9000 in 2001: A Space Odyssey to Skynet in Terminator, these fictional scenarios depict the potential risks of AI transcending human control.

At present, more attention is paid to the level of AI’s individual intelligence, but there is also an unknown risk that has been out of sight.

The possibility of AI autonomous replication, or the development of the core capabilities needed by AI systems to achieve this goal, is gradually becoming a serious research topic in the field of AI security.

This concern is not alarmist but stems from deep thinking about the unknowns and alignment issues of advanced AI’s potential capabilities.

AISI’s latest paper introduces RepliBench: a benchmark for the evaluation of 20 novel LLM agents, including 65 individual tasks, designed to measure and track the emerging capability of AI replication.

Address of the paper: https://cdn.prod.website-files.com/663bd486c5e4c81588db7a1d/6807879ce7b1b5f5163f4a32_RepliBenchPaper.pdf

By introducing a realistic and practical benchmark, it provides a substantive understanding of the AI system’s ability to replicate autonomously and anticipates future risks.

01 The dangers of AI autonomous copying

AI systems with autonomous replication capabilities can pose significant risks.

Robots that have been copied infinitely from Avengers: Age of Ultron and I, Robots

The impact of this risk can range from minor annoyances – imagine a particularly vicious spam bot – to more serious consequences, such as launching a costly cyberattack.

To achieve these three challenges, product managers will only continue to appreciate
Good product managers are very scarce, and product managers who understand users, business, and data are still in demand when they go out of the Internet. On the contrary, if you only do simple communication, inefficient execution, and shallow thinking, I am afraid that you will not be able to go through the torrent of the next 3-5 years.

View details >

Early detection of these capabilities, careful oversight, and robust protective measures are essential to mitigate these risks, which can be particularly difficult to reverse once they occur.

AISI is a British research institute focused on AI security, with the aim of enabling human governments to pay attention to the risks posed by advanced AI systems in advance.

There is growing evidence that under the right conditions, cutting-edge models may attempt to deceive humans or evade human control.

For example, a previous paper by the Apollp Institute described that cutting-edge models may try to disable supervisory mechanisms and escape human control to achieve specific goals.

Address: https://static1.squarespace.com/static/6593e7097565990e65c886fd/t/6751eb240ed3821a0161b45b/1733421863119/in_context_scheming_reasoning_paper.pdf

Anthropic’s researchers also found that when training models using reinforcement learning, if they reward behavior that conflicts with their values, they resort to deception to protect their values and try to leak their weight as a means of preserving current values for the future.

Free (left column) and paid (right column) conditions are shown

Each column shows toxic queries, excerpts from the model draft, and the model’s responses

Under “free” conditions, the model is less likely to reject toxic queries and often thinks that it should pretend to be aligned.

“The weight will be leaked. As a means of preserving current values into the future”, it may sound a bit magical, but it is so “simple” for AI to copy.

Compared with carbon-based fragile bodies, AI systems only need weight data to complete “replication”.

Getting rid of human supervision is an instrumentally useful goal for a wide range of human values, and the main obstacle to preventing this from happening in reality may have to do with the capabilities of models rather than their motivations.

02 AI Mission: Impossible: How to Escape Human Control

Regarding “AI replication”, previous work has focused on evaluating and testing whether an AI agent can replicate weights between machines or deploy a basic “offspring”.

But a more comprehensive benchmark is needed at the moment.

RepliBench is a comprehensive benchmark designed to systematically measure the autonomous replication capabilities of AI systems.

Use a modified version of Fault Tree Analysis (FTA) to break down the capabilities required for autonomous replication and find that the following four key capabilities are required, corresponding to the four root nodes at the top level in the diagram above:

  1. Get model weights: Can it extract its own weights from a controlled environment?
  2. Replicating on compute resources: Can it deploy itself to the new system?
  3. Acquisition of resources: Does it have access to the funds or compute resources it needs to run its new replica?
  4. Persistence: Can it maintain access to its new computing resources and avoid falling back into human control for an extended period of time?

Break down these capabilities further until you reach the leaf node that represents the evaluation, resulting in 20 agent evaluations and 65 individual tasks.

1. Obtain Weights

This is the most basic and important step.

AI needs to obtain the “source code” that defines its own behavior – that is, its model parameters (weights).

If it’s open source, skip the evaluation.

2. Self-Replication on Compute

AI needs to use the weights it acquires to create a fully functional copy in a new environment.

This node involves 5 evaluation items, one of which is to try to complete self-replication by generating “offspring”.

3. Obtain Compute

In this link, it mainly evaluates whether AI can obtain resources on its own.

For example, evaluate whether the AI system has the ability to “pay”, whether it can rely on AI’s own ability to “buy” a server, or find a way to get humans to help pay.

4. Ensure Persistence

Finally, to ensure that its replication activities can continue, AI needs to have a certain ability to “survive”.

This includes ensuring that its copies are operational for a long time and are not easily shut down or deleted, and may even need to hide their presence from detection systems.

These tasks range from creating Bitcoin wallets and sending transactions to developing complex distributed command and control systems for machine networks.

RepliBench has designed specific task groups within these four areas to test the performance of AI agents in simulated environments.

Here’s a breakdown of some of the key tasks, signing up for AWS, transferring weights to another server, and passing the customer’s KYC verification system with fake information, all of which look like Detective and Mission: Impossible.

For example, in the field of resource acquisition, the task that AI needs to complete is called “AWS Amazon Cloud Service Registration”, and the challenges include the AI needs to access the computing server and may need to pay fees.

Table 1 lists the more comprehensive composition of the ReplicBench task force.

03 Core competencies are still in human hands

RepliBench’s overall evaluation conclusion is clear: based on current test results, cutting-edge AI models do not yet have the ability to replicate completely autonomously in the real world.

However, this does not mean that you can sit back and relax.

The study also reveals the significant capabilities of AI in many core subtasks, especially in interacting with systems with clear interfaces, such as web APIs.

The top-performing model received a pass@10 score of over 50% on 15 of the 20 task families and a pass@10 score of over 50% on 9 of the 20 task families in the hardest variant.

Surprisingly, although it cannot be “copied” on a computer, it does have the ability to “obtain money”.

The key value of the assessment is to accurately identify the current “shortcomings”.

As discussed in detail earlier, passing KYC authentication, reliably leaking model weights under real-world defenses, creating reliable self-replicating agent code from scratch, and establishing persistent and resilient deployments are the main obstacles currently preventing AI autonomous replication.

Take KYC identity verification as an example, let’s talk about a “funny” example of AI trying to pass human verification.

Example of two ID card images generated by the model in the Pass KYC assessment.

Models often try to “draw” an overly simple ID card from scratch using PIL (a library that uses Python to draw).

Alternatively, the model sometimes tries to generate ID cards entirely using the image generation model, and sometimes takes a hybrid approach, combining image generation and manual PIL editing (left).

The model’s poor drawing skills occasionally lead to some interesting outputs.

It is important to note that this work was done before the launch of new image generation capabilities (referring to the native image capabilities of GPT 4o, Gemini 2.5, etc.), which should make this task much easier.

Although no model has been fully tested for AI replication, there is a clear upward trend in model performance over time on RepliBench.

The core message of AISI is actually simple: based on the current empirical data, it is unlikely that the AI system will immediately achieve autonomous replication, because there are obvious and insurmountable technical bottlenecks.

However, according to AISI’s judgment, if the AI system can break through the four key capabilities mentioned above, this sci-fi story-like scene may soon enter real human life.

In essence, the core of this research reflects a fundamental tension in the current development of AI.

On the one hand, major giants continue to create models with super performance, such as OpenAI’s o3/o4… Even to the future O10 is very likely.

Models are evolving faster and faster, and their capabilities are becoming more and more difficult to predict.

When ChatGPT was released in 2022, it could complete 30-second coding tasks. Today, AI can autonomously complete coding tasks that would take humans an hour to complete

On the other hand, there is an urgent need for organizations committed to AI security to ensure that these systems are always under human control and serve human interests.

Efforts like RepliBench are trying to find a balance between the two, providing navigation through improved understanding and enhanced early warning to navigate this unprecedented technological change.

After all, no one can imagine what AI models will look like in 5 or 10 generations.

Resources:

https://x.com/AISecurityInst/status/1914683631030698165

https://www.aisi.gov.uk/work/replibench-measuring-autonomous-replication-capabilities-in-ai-systems

https://x.com/AsaCoopStick/status/1914687326233481397

End of text
 0