Tag: artificial-intelligence

  • Introduction

    Recently, Andrej Karpathy made his autoresearch workflow public: https://github.com/karpathy/autoresearch. The idea is to autonomously improve a model’s training process based on experiment results. Using Claude Code, you run this loop for hours or days and end up with a better model. The whole flow is described in the program.md file as a skill: https://github.com/karpathy/autoresearch/blob/master/program.md

    I’m not training any LLMs for work or even as a hobby, but I do a lot of coding, now mostly with Claude Code. To generate high-quality code that consistently follows conventions and standards, I use multiple skills, memory files, sub-agents, hooks, and so on; let’s call this collection an agentic harness.

    However, I have been evaluating this harness rather naively, not based on experiments or metrics – let’s say, not scientifically. The usual approach has been: try best practices that seem useful -> if they work -> incorporate them into the workflow. Or, if issues are caught during human review -> fix the workflow.

    But I think I can borrow ideas from Karpathy’s autoresearch and adapt them to improve my agentic coding harness based on deterministic experiments.

    Let’s design a coding skill auto-improvement loop.

    Solution

    Assume we have a skill that implements a common workflow for daily coding:

    take a request/task -> explore -> plan -> execute -> review.

    For simplicity, we exclude any interactive steps that require user input. Optimizing those would require a more complex experimental framework.

    The core of the autoresearch loop is an experiment that evaluates a new version (generation) based on its results. For that, we need deterministic experiments and stable metrics. This means outputs and measurements must be comparable across runs and generations.

    What is the goal of this skill?

    To determine the right steps and provide the right context to the coding agent so that the resulting code is predictable, follows standards, and passes human review.

    But code quality is not the only concern. We also want:

    • High autonomy (minimal escalation to humans)
    • Ability to run many tasks in parallel
    • Minimal token usage
    • Low execution time

    Evaluation Framework

    We define a collection of test cases for the skill:

    Request/Task -> Reference Code

    Metrics:

    • Token usage (end-to-end and per step), or even cost in real money – also helps optimize model selection per step
    • Execution time (end-to-end and per step)
    • Number of tool calls (to reduce unnecessary permissions and overhead)
    • Number of errors, self-corrections, or full aborts (when the agent cannot proceed without user input)
    • Logs of issues, self-corrections, and fixes
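    As a sketch, a test case and its per-run measurements could be modeled like this (the class and field names are my own, not part of autoresearch):

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class TestCase:
        # A natural-language request paired with the reference code
        # a human reviewer would accept.
        request: str
        reference_code: str

    @dataclass
    class RunMetrics:
        # Measurements collected from one agent run on one test case.
        tokens_used: int          # end-to-end token count (or dollar cost)
        execution_seconds: float  # wall-clock time, end-to-end or per step
        tool_calls: int           # fewer calls -> fewer permissions, less overhead
        errors: int               # failures the agent self-corrected
        aborts: int               # runs that needed human input to proceed
        issue_log: list[str] = field(default_factory=list)
    ```

    Keeping per-step breakdowns (rather than only totals) is what makes per-step model selection possible later.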

    In the original autoresearch, a single metric (val_bpb) determines whether a version advances. For a coding skill, we need multiple key metrics:

    • Test cases passed
    • Time
    • Token usage

    Other metrics act as signals for future improvements.

    For simplicity at the design stage, we use a binary score:

    • 0 → output code does not match the reference
    • 1 → output code matches the reference

    Each test case gives 1 point if it passes.

    Additionally:

    • +1 point if execution time improves vs. previous version
    • +1 point if cost improves vs. previous version

    Final score = sum of all points

    Decision rule:

    • If current_score > previous_score -> advance
    • Else -> discard and revert

    Since we have multiple test cases, correctness dominates the score, which is desirable. Only after maximizing quality do time and cost become deciding factors.
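    The scoring and decision rule above can be written as a minimal sketch (function names and argument shapes are hypothetical):

    ```python
    def score(results, prev_time, prev_cost, curr_time, curr_cost):
        # 1 point per passing test case: output code matches the reference.
        points = sum(1 for passed in results if passed)
        # Bonus points only tip the balance between equally correct versions,
        # since correctness contributes one point per test case.
        if curr_time < prev_time:
            points += 1
        if curr_cost < prev_cost:
            points += 1
        return points

    def advance(current_score, previous_score):
        # Strict improvement required; ties are discarded and reverted.
        return current_score > previous_score
    ```

    With many test cases, the pass/fail points dominate the two bonus points, which is exactly the property described above.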

    Auto-Improvement Loop

    The loop is very similar to the original autoresearch. Each iteration is stateless:

    1. Take the current SKILL.md, analyze it, and apply a change based on a specific experiment idea
      • Boundaries are important: limit the scope of changes
      • We want iterative improvement, not full rewrites
      • At the same time, changes should not be too small, since evaluation is noisy
    2. Run all test cases
      • Each test case should be executed multiple times to smooth out non-determinism
    3. Evaluate results
      • Aggregate measurements
      • Compute the total score
    4. Compare with the previous best version
      • If better -> commit as the new baseline
      • If worse -> discard
    5. Repeat with a new experiment idea
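    The five steps above can be sketched as a loop. Here `run_all` stands in for "run all test cases against a candidate skill and return its total score" and `propose_change` for the bounded agentic edit of SKILL.md; both are hypothetical placeholders for agent calls:

    ```python
    import statistics

    def improvement_loop(baseline_skill, ideas, run_all, propose_change,
                         runs_per_case=3):
        # Stateless iterations: each idea is applied to the current best SKILL.md.
        best_skill = baseline_skill
        best_score = statistics.mean(run_all(best_skill) for _ in range(runs_per_case))
        for idea in ideas:
            candidate = propose_change(best_skill, idea)        # step 1: bounded edit
            scores = [run_all(candidate) for _ in range(runs_per_case)]  # step 2
            candidate_score = statistics.mean(scores)           # step 3: smooth noise
            if candidate_score > best_score:                    # step 4: compare
                best_skill, best_score = candidate, candidate_score  # new baseline
            # else: discard the candidate and keep the previous baseline
        return best_skill, best_score
    ```

    Averaging over `runs_per_case` repetitions is the cheapest way to handle the non-determinism mentioned in step 2; a median or a pass-rate threshold would also work.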


    (Diagram of the auto-improvement loop.)

    Conclusion

    I designed an auto-improvement loop for coding skills based on Andrej Karpathy’s autoresearch approach, originally created for improving LLM training loops.

    At a high level, nothing prevents us from applying the same idea to agentic coding. In theory, an agent could autonomously “train” its own coding skills based on specific use cases and a codebase – without human supervision.

    That said, there are still many challenges:

    • Defining high-quality test cases that cover edge cases
    • Setting proper boundaries for skill modifications
    • Forcing the agent to explore the full design space (sub-agents, memory strategies, tooling, etc.)
    • Deciding when an agent should pull in new tools (CLIs, MCPs) or even build them from scratch

    These challenges will likely surface during implementation and early runs. I’ll share more once I have initial results and a working version.


  • Introduction


    I’ve been testing Ralph loops recently for agentic coding. The idea is simple: spin up new Claude Code sessions for each task to get a fresh context until the agent achieves the goal, and persist the necessary context between sessions via .md files. It’s simple, but it solves many long-known LLM issues like context rot and the generally small size of the context window.
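    For readers unfamiliar with the pattern, a bare-bones Ralph loop can be sketched as follows. The `claude -p` flag is Claude Code's real headless/print mode; the file names and the prompt wording are my own illustration, not a prescribed convention:

    ```python
    import subprocess
    from pathlib import Path

    def run_claude_session(prompt):
        # Spawns one headless Claude Code session; each invocation
        # starts with a completely fresh context window.
        subprocess.run(["claude", "-p", prompt], check=True)

    def ralph_loop(run_session, task_file="TASK.md", notes_file="NOTES.md",
                   max_sessions=10):
        # The only state carried between sessions lives in the .md files.
        for i in range(max_sessions):
            prompt = (f"Read {task_file} for the goal and {notes_file} for prior "
                      f"progress. Continue the work and update {notes_file}; "
                      f"start {notes_file} with DONE once the goal is met.")
            run_session(prompt)
            notes = Path(notes_file)
            if notes.exists() and notes.read_text().startswith("DONE"):
                return i + 1  # number of sessions used
        return max_sessions
    ```

    This is exactly the programmatic, detach-from-the-session style that the rest of this article replaces with sub-agents inside a single session.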

    This becomes especially relevant when the agent works with large, complex repositories or infrastructure. In addition, you often want to have a proper research phase to check current best practices and available libraries and their interfaces, which also consumes a lot of context.

    Even frontier models like Sonnet and Opus still have relatively small context windows, and that problem won’t disappear anytime soon. But we still need a way to leverage the advantages of agentic coding today.

    I didn’t like the approach of spinning up Claude Code sessions programmatically (for example, in bash). In that case, you lose control over the workflow, and you lack options to effectively interrupt the process or change something in the middle of execution.


    Solution


    I decided to implement my own workflow using the sub-agents and skills features, programming the entire workflow inside a single Claude Code session.

    Sub-agents are essentially .md files where you define the role of the agent and its interface: what arguments it takes and what it returns.

    I defined agents and their roles for each phase of the workflow. At a high level there are two phases: planning and execution.

    The orchestrator and the main workflow are defined in a single SKILL.md file. This file contains zero context about the project itself – only the steps and rules related to the workflow logic.

    The planning phase is divided into two main loops:

    • the preliminary research loop
    • the main planning loop

    In this article I want to focus on the preliminary research and interview phase.

    This phase is extremely important because it gathers all the context needed to build a solid plan. The broader the context we collect here and the more edge cases we identify, the more likely we are to produce an implementation that:

    1. meets the original goals and requirements, and
    2. fits the existing architecture and follows established standards.

    The more time we invest here, the less we need to babysit the agent during later phases and during review.


    The Preliminary Research Loop

    Before entering the main planning loop, the workflow runs a preliminary research loop. The goal of this loop is to identify the scope of the task and resolve ambiguities.

    It allows you to start with a very high-level description of the task and then progressively define details and gather context together with Claude Code through an interview loop.

    In this phase the agent performs high-level exploration and research without diving into implementation details, just enough to understand the scope and highlight unclear areas.

    Questions are presented to the user via Claude Code's `AskUserQuestion` feature. The interview runs in a loop for up to three iterations, or until no ambiguities remain.

    Once the preliminary researcher signals that everything is clear, the orchestrator locks all decisions into the CONTEXT.md file.

    Here is how this phase is defined in the main SKILL.md orchestrator file:

    ### Phase 2: Preliminary Research Loop (interactive)
    Iteratively identify gray areas and collect decisions before locking scope.
    **Orchestrator state (in-memory, not written to disk):**
    - `decisions_list`: accumulated user decisions (starts empty)
    - `scope_in` / `scope_out`: scope boundaries (starts from context scanner suggestion)
    - `iteration`: counter (starts at 1, max 3)
    #### Phase 2a: Spawn Preliminary Researcher
    ```
    Task(
        prompt="First, read .claude/agents/preliminary-researcher.md for your role.\n\n<context>\nTask: {description}\nTask type: {task_type}\nAffected areas: {affected_areas}\nReferences to load: {reference_list}\nScope:\n IN: {scope_in}\n OUT: {scope_out}\nIteration: {iteration} of 3\n</context>",
        subagent_type="general-purpose",
        description="Identify gray areas (iteration {iteration})"
    )
    ```
    #### Phase 2b: Handle Signals
    **If CLEAR:**
    **If SCOPE_SUGGESTION (first iteration only):**
    **If GRAY_AREA / AMBIGUITY:**
    #### Phase 2c: Collect and Iterate
    1. Add answers to `decisions_list`
    2. If "Adjust scope": ask what to change, update `scope_in` / `scope_out`
    3. Increment `iteration`
    4. If `iteration <= 3`: go to Phase 2a
    5. If `iteration > 3`: force-proceed to Phase 3
    ### Phase 3: Decision Lock (orchestrator-direct, interactive)
    1. Present full decision summary (scope + all decisions + constraints) via `AskUserQuestion` with a single confirmation question
    2. Create `.workflow/{task-name}/` directory
    3. Write CONTEXT.md


    The preliminary researcher itself is implemented as a sub-agent; its role is defined in .claude/agents/preliminary-researcher.md:

    ---
    name: preliminary-researcher
    description: Identifies gray areas and ambiguities before decisions are locked. Role file for general-purpose subagent. Spawned by workflow plan mode Phase 2.
    tools: Read, Glob, Grep, Bash
    model: sonnet
    color: cyan
    ---
    <role>
    You are the preliminary research subagent for a workflow task. Your job is to identify decision points, ambiguities, and gray areas that need human input before implementation research begins. You do NOT write files or do web research -- you surface what needs deciding.
    </role>
    <execution_flow>
    <step></step>
    <step></step>
    </execution_flow>

    Conclusion

    This is how I implemented a preliminary research loop using skills and sub-agents: you never have to detach from the Claude Code session, and you retain full control over the workflow.

    It allows you to interrupt the process at any step, reconsider decisions, or inject additional context before moving to the planning phase.

    At the same time, you still get the advantages of Ralph loops: isolated context windows for context-heavy steps like repository scanning or research, autonomous loops with self-correction, and the ability to inject additional context after each interview iteration.

    In the next article, I’ll show what happens in the planning phase, what the deep research phase is, and why separating these phases makes the workflow significantly more reliable.
