
The Karpathy Autoresearch Pattern: AI That Runs Experiments While You Sleep

Michael Couch · Mar 2026

In March 2026, Andrej Karpathy released autoresearch—a 630-line Python framework that lets AI agents run ML experiments overnight, unattended.[1] Within a week it hit 30,000 GitHub stars. The pattern isn't just for training nanochat models. It's a blueprint for any domain where you have a measurable objective and an agent that can edit code. Research that runs while you sleep.

The program.md Operating System

The core insight: humans don't edit train.py. They edit program.md—a Markdown file that defines goals, constraints, and baseline instructions. The AI agent reads it, modifies the training script, runs a 5-minute experiment, evaluates validation bits-per-byte, and either keeps the improvement or discards it. Repeat. ~12 experiments per hour. ~100 overnight.[2]
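The keep-or-improve loop above can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not autoresearch's actual API; the function names and the "lower bits-per-byte is better" convention are assumptions for the sketch.

```python
# Sketch of the overnight experiment loop: edit, run, evaluate, keep or revert.
# All names here are illustrative, not autoresearch's real interface.

def keep_or_revert(best_bpb, candidate_bpb):
    """Keep a change only if it beats the best metric so far (lower is better)."""
    return candidate_bpb < best_bpb

def overnight(run_experiment, apply_edit, revert_edit, best_bpb, budget=100):
    """Run up to `budget` experiments, keeping only improvements."""
    for _ in range(budget):
        apply_edit()            # agent modifies the training script
        bpb = run_experiment()  # ~5-minute run, returns validation bits-per-byte
        if keep_or_revert(best_bpb, bpb):
            best_bpb = bpb      # improvement: keep the edit
        else:
            revert_edit()       # regression: discard it
    return best_bpb
```

At ~5 minutes per experiment, a budget of 100 is roughly one night, which is where the ~12 per hour figure comes from.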

This inverts the traditional research loop. Instead of a human running experiments and an AI assisting, the AI runs experiments and the human sets the program. The same pattern applies to GPU kernel optimization, performance tuning, and any task with a clear metric. As I argued in Stop Wasting My Tokens, token efficiency is the currency of the agentic era—autoresearch extends that logic to experiment efficiency.

Why It Works: Measurable Objectives

Autoresearch succeeds because the objective is unambiguous: validation bits-per-byte. The agent doesn't need to "understand" research; it needs to edit code, run it, and compare numbers. That's the same principle behind high-density agentic payloads—when agent-to-agent communication is structured and measurable, automation scales. When it's fuzzy, it fails.

By 2026, AI coding agents had demonstrated reliable autonomous work in stretches of 55+ minutes.[3] The autoresearch loop is short enough to stay within that window: edit, run, evaluate, commit or revert. No human in the loop until morning.
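The commit-or-revert step can be made atomic with plain git, so each experiment either lands as a commit or leaves the working tree untouched. This workflow is an assumption for illustration, not autoresearch's documented internals; the git commands themselves are standard.

```python
import subprocess

def commit_or_revert(improved, message, run=subprocess.run):
    """Commit the agent's edit if the metric improved; otherwise discard it.

    `run` is injectable for testing; by default it shells out to git.
    """
    if improved:
        run(["git", "add", "-A"], check=True)
        run(["git", "commit", "-m", message], check=True)
    else:
        # Throw away the edit so the next experiment starts from a clean tree.
        run(["git", "checkout", "--", "."], check=True)
```

Because every kept change is a commit, the morning review is just `git log`: a human reads the night's history and cherry-picks what survives.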

Generalizing Beyond ML Training

The pattern has already spread. GPU kernel optimization, compiler tuning, hyperparameter search—any domain with a numeric objective and editable artifacts. For enterprises building agentic infrastructure, the question is: what's your program.md? What measurable outcome could an agent optimize overnight?
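To make the question concrete, here is what a program.md might look like for a non-ML objective. Every goal, metric, path, and number below is invented for illustration; the structure (goal, metric, constraints, baseline) is the point.

```markdown
# Goal
Reduce p95 latency of the /search endpoint below 120 ms.

# Metric
Run `make bench`; the last line of output prints `p95_ms=<float>`. Lower is better.

# Constraints
- Only edit files under services/search/.
- Each experiment must finish in under 5 minutes.
- Keep a change only if p95_ms improves; otherwise revert.

# Baseline
Current p95_ms: 180. Start from the commit tagged `bench-baseline`.
```

The human edits this file; the agent edits everything else.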

Market research, competitive intelligence, content quality scoring, A/B test analysis—if you can define success as a number and give an agent the tools to iterate, you can run autoresearch. The infrastructure is there: NVIDIA's stack, LangGraph and CrewAI, single-GPU or scaled. The constraint is imagination and clean metrics.

Leaders, Not Followers

Organizations that adopt this pattern early will compound. One engineer sets a program.md; 100 experiments run overnight. The best improvements get merged. The cycle repeats. Research velocity isn't linear—it's exponential when the loop is automated. This is the same thesis behind PageStash and ThreatBase: infrastructure that turns manual workflows into queryable, agent-driven systems.

Karpathy didn't invent autonomous agents. He proved the pattern works for research—and gave the world a 630-line reference implementation. The rest is execution.

Next up: Enterprise Auto-Research—multi-agent orchestration for deep research.

References

  1. Karpathy, A. (2026). autoresearch. GitHub. github.com/karpathy/autoresearch
  2. OpenTools. (2026). Andrej Karpathy's Autoresearch: AI Agents Running Experiments Overnight. opentools.ai
  3. ComputeLeap. (2026). How to Build an AI Research Agent That Works While You Sleep. computeleap.com