Import AI 455: AI systems are about to start building themselves.
Import AI · Jack Clark
Fully autonomous AI R&D — models end-to-end training their own successors — has a better-than-60% chance of arriving before the end of 2028, with a first proof-of-concept on non-frontier models likely within a year or two.
The case rests on two empirical trends and one conceptual point. Anthropic's internal benchmark for optimizing a training loop jumped from 2.9x to 30x speedup in nine months, and METR's measurements show the time horizon of tasks AIs can reliably handle growing from seconds to minutes to hours. Creativity isn't the bottleneck — most AI progress is methodical scaling, debugging, and sweeps, exactly the kind of schlep that's now being automated. If this holds, the loop closes and AI progress stops being gated on human researcher throughput.
There is a better-than-60% chance that by the end of 2028 an AI system will be capable of autonomously building its own successor with no humans in the loop. This would be a phase change for how AI progress happens.
On the task of optimizing a CPU-only small language model training loop, Claude Opus 4 hit a 2.9x mean speedup in May 2025, rising to 16.5x in November 2025 and 30x by February 2026. The trajectory on a real AI-engineering task is steep.
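One way to read the steepness of these figures is to convert them into an implied doubling time. The sketch below is a minimal illustration, not the author's method: it assumes smooth exponential growth between the reported points, treats May 2025 as month 0, and uses the hypothetical helper names `doubling_time_months` and `interval_doubling_times`.

```python
import math

# Data points as reported in the summary: (months since May 2025, mean speedup).
# Month offsets are an assumption: May 2025 = 0, November 2025 = 6, February 2026 = 9.
REPORTED = [(0, 2.9), (6, 16.5), (9, 30.0)]


def doubling_time_months(start, end, months):
    """Months per doubling, assuming constant exponential growth over the interval."""
    if start <= 0 or end <= start or months <= 0:
        raise ValueError("need 0 < start < end and a positive interval")
    return months * math.log(2) / math.log(end / start)


def interval_doubling_times(points):
    """Doubling time for each consecutive pair of (month, speedup) points."""
    return [
        doubling_time_months(s0, s1, t1 - t0)
        for (t0, s0), (t1, s1) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    (t_first, s_first), (t_last, s_last) = REPORTED[0], REPORTED[-1]
    overall = doubling_time_months(s_first, s_last, t_last - t_first)
    print(f"overall: {overall:.2f} months per doubling")
    for months in interval_doubling_times(REPORTED):
        print(f"interval: {months:.2f} months per doubling")
```

Under these assumptions the first interval implies a shorter doubling time than the second, so the three points describe a steep curve rather than a clean exponential.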
AI may not yet generate radical new ideas, but most of AI progress is methodical scaling, debugging, and parameter sweeps rather than paradigm-shifting insight. Automating the schlep is sufficient to keep the field moving.
A proof-of-concept where a model end-to-end trains its successor is plausible within a year or two, starting with non-frontier models. Frontier models will be harder because they are expensive and absorb enormous human effort.
In 2022 GPT-3.5 could handle tasks taking a human about 30 seconds; by 2023 this rose to several minutes, and the trend has continued steeply. The time horizon over which AI is reliable is a key proxy for delegability.
Open
- How does the jump from non-frontier proof-of-concept to frontier-scale autonomous training actually happen, given the cost and human effort frontier runs absorb?
- What governance or oversight applies once humans are out of the successor-training loop?
- Does the steep speedup trajectory on the CPU training-loop task generalize to harder, less well-defined AI engineering work?
Pipeline
- source kind
- url
- generated by · anthropic+voyage
- candidates · 24 (selected 5)
- embeddings · voyage-3.5
Coverage
100% covered
Considered candidates (19)
Below top-k · 15
- evidence · SWE-Bench went from 2% to near-saturation in about two years · c 0.75
When SWE-Bench launched in late 2023 the best system scored ~2%; Claude Mythos Preview now scores 93.9%, essentially saturating the benchmark. Real-world software engineering tasks have moved from barely tractable to mostly solved.
- mechanism · Skill plus autonomy is what unlocks delegation · c 0.70
Delegation requires both trust in someone's skill and trust in their ability to act independently and aligned with your intent. AI systems are clearing both bars for longer stretches, which is why engineers and researchers are handing off larger chunks of work.
- evidence · Automated alignment research agents beat human baselines · c 0.65
Anthropic has shown a proof-of-concept where AI agents, given a research direction, autonomously produced techniques that outperformed a human-designed scalable oversight baseline. AI doing AI safety research is no longer purely hypothetical.
- context · The argument is built from public information · c 0.60
The case rests on publicly available evidence: arXiv, bioRxiv, and NBER papers, plus the products being shipped by frontier labs. The conclusion is that the engineering ingredients for automating today's AI production are already in place.
- evidence · MLE-Bench Kaggle scores rose well above the 16.9% launch score in under 18 months · c 0.60
MLE-Bench tests whether AIs can build end-to-end ML systems for 75 Kaggle competitions. At launch in October 2024 an o1-based agent scored 16.9%; by February 2026 Gemini 3 in an agent harness with search was substantially ahead.
- evidence · AI agents are already managing other AI agents in shipped products · c 0.60
Tools like Claude Code and OpenCode let a single agent supervise multiple specialized sub-agents in parallel. The meta-skill of AI-managing-AI is showing up in commodity deployments, not just research demos.
- evidence · CORE-Bench shows fast progress on reproducing research papers · c 0.55
CORE-Bench asks an agent to install dependencies, run a paper's code, and answer questions about the output. Since its September 2024 launch, scores have moved sharply upward from a GPT-4o baseline.
- evidence · Kernel design has become a busy AI-for-AI research area · c 0.55
Recent work spans DeepSeek-based GPU kernel generation, PyTorch-to-CUDA conversion, Meta automating Triton kernels in production, kernels for Huawei Ascend chips, and fine-tuned open-weight kernel agents. It has gone from curiosity to competitive subfield in a couple of years.
- evidence · Early signs of AI contributing to the scientific frontier · c 0.55
There are preliminary cases of general-purpose AI systems pushing forward human science, mostly in computer science and mathematics. So far this tends to happen in centaur configurations with humans rather than fully autonomously.
- caveat · Individual benchmarks are flawed; the aggregate trend is the point · c 0.50
Every benchmark has idiosyncratic weaknesses, and the author is aware of them. What matters is the mosaic that emerges when you look at many benchmarks together.
- context · Why kernel design matters for self-improvement · c 0.50
Kernel optimization maps operations like matmul onto hardware and sets the efficiency ceiling of both training and inference. Better kernels mean more useful compute per dollar across the entire AI development pipeline.
- evidence · PostTrainBench pits AI fine-tuners against frontier-lab humans · c 0.50
PostTrainBench measures how well frontier models can fine-tune smaller open-weight models, using the human-built instruct-tuned versions as a strong baseline. The human comparison makes it a serious test of automated post-training.
- context · Author reaches the conclusion reluctantly · c 0.45
The view is held reluctantly because the implications are so large that the author would rather not believe it. The essay is an attempt to lay out the evidence forcing the conclusion, with implications to be worked through in 2026.
- example · Transformers and MoE as rare paradigm shifts · c 0.35
Occasionally human insight produces a step-change like the transformer architecture or mixture-of-experts. These are the exceptions in a field that otherwise moves by scaling, breaking, fixing, and scaling again.
- example · Neural architecture search as an early automated-research precedent · c 0.35
Picking which parameters to vary is exactly the kind of judgement that AI can already automate, with neural architecture search as an early instance. Much of AI research is amenable to this kind of automated exploration.
Redundant with selected · 4
- mechanism · AI progress is 99% perspiration, and AI is now good at the perspiration · c 0.75 · sim 0.85
Edison's 1%-inspiration / 99%-perspiration framing fits AI research: occasional insights matter but most progress is grinding engineering. Public benchmarks show AI is now strong at exactly that grinding work, and longer time horizons let it chain those tasks together.
overlapped with: AI doesn't need to be creative to automate AI R&D
- implication · Even uncreative AI can recursively push itself forward · c 0.70 · sim 0.90
If AI can reliably handle the engineering schlep, it can advance itself even without novel insights, though more slowly than a creative system would. And there are early signs creativity may also emerge.
overlapped with: AI doesn't need to be creative to automate AI R&D
- mechanism · Coding plus world modeling already automates parts of science · c 0.65 · sim 0.85
Modern science is largely specifying a question, running experiments, and sanity-checking results. Combining improving code generation with LLM world models has produced tools that already accelerate scientists and partially automate R&D.
overlapped with: AI doesn't need to be creative to automate AI R&D
- caveat · Kernel design is unusually friendly to AI automation · c 0.50 · sim 0.86
Kernel optimization has easily verifiable rewards, which makes it especially suitable for automated search. Progress there may not generalize as cleanly to messier parts of AI R&D.
overlapped with: AI doesn't need to be creative to automate AI R&D
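The split of candidates into selected, below top-k, and redundant is consistent with a greedy pass over candidates ranked by confidence, where anything too similar to an already selected item is set aside. The page does not document the actual procedure, so the following is only a sketch under assumptions: the names `Candidate` and `select_candidates`, the default `k=5`, and the `sim_threshold=0.85` cutoff are all hypothetical, and toy 2-D vectors stand in for real embeddings.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    title: str
    confidence: float       # the "c" score shown next to each candidate
    embedding: List[float]  # toy vector; a real pipeline would use model embeddings


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_candidates(
    candidates: List[Candidate], k: int = 5, sim_threshold: float = 0.85
) -> Tuple[List[Candidate], List[Tuple[Candidate, Candidate]], List[Candidate]]:
    """Greedy selection: walk candidates by descending confidence.

    Returns (selected, redundant, below_top_k), where each redundant entry is a
    (candidate, selected_item_it_overlaps_with) pair.
    """
    selected: List[Candidate] = []
    redundant: List[Tuple[Candidate, Candidate]] = []
    below_top_k: List[Candidate] = []
    for cand in sorted(candidates, key=lambda c: c.confidence, reverse=True):
        # First selected item that this candidate overlaps with, if any.
        match = next(
            (s for s in selected if cosine(cand.embedding, s.embedding) >= sim_threshold),
            None,
        )
        if match is not None:
            redundant.append((cand, match))
        elif len(selected) < k:
            selected.append(cand)
        else:
            below_top_k.append(cand)
    return selected, redundant, below_top_k
```

A design note on this sketch: redundancy is checked before the top-k cutoff, so a candidate that overlaps a selected item is reported as redundant even when the selection is already full.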
Janitor
Non-content spans (acknowledgements, references, footnotes, headers, boilerplate) are dropped before the decomposition runs.
- total spans · 29
- kept · 28
- dropped · 1
- content · 28
- boilerplate · 1