
What Reading Agent Papers Actually Taught Me About Building One (2026)

Most of the agent research I read happened in parallel to actually building the thing. I’d hit a problem, or notice the agent was weak in some specific way, and go looking for a paper that might have something to say about it. It wasn’t a structured reading list up front. It was more like reaching for a paper when I wasn’t sure what to try next.

The other thing worth saying is I rarely took an entire paper as a single lesson. It was usually one specific idea I could actually use, and the rest was context. Sometimes that one idea changed something pretty fundamental about how the agent worked. Sometimes it just confirmed I was already on the right track.

These are the papers I kept coming back to, and the specific things I took from each.

ReAct, How the Agent Actually Reasons

ReAct (Yao et al., 2022) was the one that cleared up something I was hazy on. Going in, I wasn’t really sure how the harness around an agent was supposed to work. I knew the model would call tools, but how the loop actually held together (how the model kept iterating, how memory got tracked between turns, how it knew when to stop) wasn’t clear to me yet.

ReAct gave me the loop. The shape, from memory, is roughly thought, then action, then observation, and back into the next thought. The agent generates a thought about what it should do next, takes an action, gets a result, and that result feeds into the next thought. It loops until the agent decides it’s done.

That was the missing mental model for me. Once I had it, building the harness became a much more concrete exercise. I built the agent loop directly in this shape. Every iteration the agent produces a note (the thought), calls a tool (the action), and gets a result back (the observation) that feeds into the next iteration. Termination is its own action. When the notes contain enough to write the final output, the agent calls a “submit” action and the loop ends.
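The loop described above can be sketched in a few lines. This is a minimal, hypothetical version: `call_model` and the tool functions are stand-ins for whatever model API and tools you actually use, and the note format is illustrative.

```python
def run_agent(task, tools, call_model, max_turns=20):
    """ReAct-style loop: thought -> action -> observation -> next thought."""
    notes = []  # the running history fed back into the model each turn
    for _ in range(max_turns):
        # Thought + action: the model decides what to do next given its notes.
        thought, action, args = call_model(task, notes)
        notes.append(f"[THOUGHT] {thought}")
        if action == "submit":
            return args  # termination is itself an action
        # Observation: the tool result feeds into the next iteration.
        result = tools[action](**args)
        notes.append(f"[OBSERVATION] {action}: {result}")
    return None  # hit the turn limit without submitting
```

The key structural points are that the model only ever sees the accumulated notes, and that the loop has no hardcoded exit besides the turn limit: the agent ends the run by choosing the submit action.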

Before reading ReAct I was leaning toward something more like a single chain-of-thought pass where the model reasoned through everything up front. The loop reframe made it clear why that doesn’t work for tasks that need to actually go and find things out. The model needs to act, see what comes back, and adjust.

Toolformer, What Tools to Actually Give It

Toolformer (Schick et al., 2023) reframed how I thought about which tools to expose to the agent. The finding that stuck with me is that models use tools reliably when the tool output directly reduces uncertainty about the next step. The reframe in one sentence: does this tool’s output give the agent something clear it can immediately act on?

That became my filter for every tool I added. If a tool returned noisy output, or output the agent would have to do significant interpretation on before it could act, the tool was the wrong shape.

The most concrete thing this changed was how the agent reads code. My first instinct was to give it a simple “read this file” tool and let it figure things out from there. After Toolformer I split that into more purpose-specific tools. One for getting just the structural skeleton of a file with line numbers, one for reading a specific line range, one for resolving imports to actual paths in the repo. Each one returns a clean, low-noise output the agent can directly act on. Compared to dumping whole files into the context, the difference in agent behaviour was immediate.
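As a rough illustration of what "purpose-specific" means here, two of those tools might look like the sketch below. This assumes Python source files and uses the standard `ast` module; the exact output format is my own invention, not anything from the paper.

```python
import ast
from pathlib import Path

def file_skeleton(path):
    """Return just the structural outline of a Python file, with line numbers."""
    tree = ast.parse(Path(path).read_text())
    return [
        f"{node.lineno}: {type(node).__name__} {node.name}"
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]

def read_range(path, start, end):
    """Return only the requested line range, numbered (1-indexed, inclusive)."""
    lines = Path(path).read_text().splitlines()
    return [f"{i}: {line}" for i, line in enumerate(lines[start - 1:end], start)]
```

Each tool returns a small, structured result the agent can act on directly: the skeleton tells it where to look, and the range read gives it only the lines it asked for.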

I also wrote tool descriptions to be unambiguous about when each one should be used versus the others. Sounds obvious in hindsight but I hadn’t thought of tool descriptions as part of the design surface until I read this paper.

LLM-as-Judge, Building an Eval System (and Why Mine Is a Learning Exercise)

LLM-as-Judge (Zheng et al., 2023) is how I approached evaluating the agent. The paper shows that strong models used as judges match human expert agreement at around 85 percent, and it walks through three judge formats: pairwise, single answer grading, and reference-guided grading. It also lays out the biases (position bias, verbosity bias, and self-enhancement bias) with specific numbers attached.

I went with single answer grading. Pairwise requires you to have two outputs to compare, which is the wrong shape for what I wanted. I wasn’t trying to rank versions of the agent, I wanted a quality signal on individual runs.
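A single answer grading judge can be as simple as a prompt template plus a rating parser. This is a hypothetical sketch in the general style of LLM-as-Judge, not my exact prompt; `call_judge_model` stands in for whatever judge API you use, and the `Rating: [[N]]` convention is one common way to make the score machine-parseable.

```python
import re

JUDGE_PROMPT = """You are grading one agent run on a 1-10 scale.

Task given to the agent:
{task}

Agent's final output:
{output}

Grade the output for correctness and usefulness, not length or style.
Respond with a line "Rating: [[N]]" followed by a one-paragraph justification."""

def grade_run(task, output, call_judge_model):
    """Return the parsed 1-10 rating, or None if the judge reply is malformed."""
    reply = call_judge_model(JUDGE_PROMPT.format(task=task, output=output))
    match = re.search(r"Rating: \[\[(\d+)\]\]", reply)
    return int(match.group(1)) if match else None
```

The explicit instruction to ignore length and style is there because of the verbosity bias discussed below; whether it fully counteracts that bias is another question.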

I built the eval system, and it works in the sense that I can run things through it and get judge output. Honestly, though, it’s the weakest part of what I built, and I’m okay saying that. Evaluating the kind of work this agent does is genuinely hard: what counts as a good output depends heavily on context that’s difficult to capture in an eval rubric. I could have spent a lot more time pushing on the judge to make it useful as a regression metric. I chose to spend that time on the agent itself instead, and I don’t regret it.

What I did get out of the paper that was immediately useful is awareness of verbosity bias. The paper shows that some judge models reward longer, more detailed outputs even when they aren’t better. I caught myself doing the same thing reviewing the agent manually. The versions producing more text felt more thorough, but the actual signal wasn’t better. That made me much more deliberate about evaluating what the agent did rather than how much it wrote.

AgentBench, Naming the Failure Mode

AgentBench (Liu et al., 2023) was useful less for what to build and more for what to watch out for. It identifies the main agent failure modes (poor long-term reasoning, decision-making, and instruction following) and breaks them into a failure taxonomy. The one that mattered most for me is what they call Task Limit Exceeded, or TLE. For strong models like Claude, TLE is the dominant failure. Not bad instruction following, not invalid actions. The agent just loops or gives up before getting to the answer.

I saw this directly. Early versions of the agent would sometimes get stuck. Not in a hard loop but in a soft one where it kept exploring the same area in slightly different ways without making progress. Knowing this was a documented and common failure mode rather than something specific to my setup helped me focus on the right fixes.

The concrete thing I added because of this paper is a deterministic check for redundant tool calls in the eval, duplicate (tool, input) pairs across a single run. It became a soft signal for the looping failure pattern. Not a perfect proxy but it’s cheap to compute and it catches the obvious cases.
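The check itself is a few lines. A sketch, assuming each run produces a trace of (tool name, arguments) pairs; serialising the arguments with sorted keys makes dicts comparable regardless of key order.

```python
from collections import Counter
import json

def redundant_call_count(trace):
    """Count exact duplicate (tool, input) pairs across one run's trace.

    trace: list of (tool_name, args_dict) tuples.
    Returns the number of calls that repeat an earlier identical call.
    """
    keys = [(tool, json.dumps(args, sort_keys=True)) for tool, args in trace]
    counts = Counter(keys)
    return sum(n - 1 for n in counts.values() if n > 1)
```

A nonzero count doesn't prove the agent was looping, but a high one across a run is exactly the soft-loop pattern worth looking at.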

I also designed my synthetic test cases with this in mind. At least one test requires connecting findings across multiple files. At least one requires looking for a concept that doesn’t exist in the codebase, so the agent has to reach the conclusion “this isn’t here” rather than fumble around looking for it indefinitely. Both of those test shapes came from thinking about TLE specifically.

Attention Is All You Need, Why Context Order Matters

This is the original transformer paper (Vaswani et al., 2017) and most people building on top of LLMs have absorbed its ideas at some level. What I got out of re-reading it specifically with agents in mind is how the n-squared attention cost actually affects you in practice.

Every token attends to every other token. Double the context length, and the number of pairwise relationships quadruples. The softmax normalisation means attention weights sum to one, so adding more tokens gives each token proportionally less attention. And empirically, position matters: the beginning and end of the context window are the most reliably attended to.

For me this turned context management from a vague “shorter is better” instinct into a concrete framing. Removing useless tokens isn’t just a token cost saving, it’s an attention quality improvement, and because attention is pairwise, every token you cut removes an interaction with every token that remains. Cutting 5k tokens out of a 200k context is doing a lot more work than the linear math suggests.
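The back-of-envelope arithmetic for the numbers above, counting every token-to-token pair (including a token attending to itself):

```python
def pairwise(n):
    """Number of pairwise attention interactions for a context of n tokens."""
    return n * n

full = pairwise(200_000)     # 40,000,000,000 interactions
trimmed = pairwise(195_000)  # 38,025,000,000 interactions
saved = full - trimmed       # 1,975,000,000 interactions removed
# Cutting 2.5% of the tokens removes roughly 4.9% of the pairwise interactions.
```

This is only about compute and attention dilution, not a claim about model quality per interaction, but it makes the non-linearity concrete.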

In practice this changed three things. The repo’s structural overview goes into the system prompt at the very start, which is the most reliably attended position. Important conclusions the agent reaches stay near the front of the working context for the rest of the run. And I became much more aggressive about stripping out raw tool output once a conclusion had been drawn from it. If you’ve written down what a file told you, the file content itself is just noise.

Anthropic’s Context Engineering Guide, Treating Context as a Resource

Anthropic published a context engineering guide in 2025 that crystallised a lot of this for me. The framing it uses is that context is finite and should be treated as a resource, not a dumping ground. It walks through four high-level strategies: writing context to an external scratchpad, selecting only what is relevant to retrieve, compressing context through summarisation, and isolating context across multiple agents.

The technique that applied most directly for me was structured note-taking: having the agent maintain a persistent set of notes, distilled from what it has done so far, that survives outside the raw context window.

I had a notes tool already, but I realised the agent was writing the wrong kind of notes. It was writing intentions (“I will check whether this exists”) rather than conclusions (“checked, doesn’t exist”). The fix was a prefix convention. Notes are tagged either [FINDING] or [PLAN]. Findings are concrete conclusions about the codebase or the task. Plans are intentions for what to do next. The two are treated differently. Findings persist for the rest of the run, plans get culled after a few iterations because they aren’t relevant once the next step has been taken.
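The culling rule can be sketched like this. The three-iteration window is an illustrative assumption, not a tuned value, and the note representation is simplified to (iteration created, text).

```python
PLAN_TTL = 3  # cull plans after this many iterations (an assumption)

def cull_notes(notes, current_iter):
    """Findings persist for the whole run; plans expire after PLAN_TTL iterations.

    notes: list of (iteration_created, text) pairs, where text starts with
    either "[FINDING]" or "[PLAN]".
    """
    kept = []
    for made_at, text in notes:
        if text.startswith("[FINDING]"):
            kept.append((made_at, text))
        elif text.startswith("[PLAN]") and current_iter - made_at < PLAN_TTL:
            kept.append((made_at, text))
    return kept
```

Run once per iteration before the notes are rendered into context, this keeps conclusions around while letting stale intentions fall away on their own.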

I can’t tell you that this change moved a metric, because the eval isn’t strong enough to detect changes that subtle. But qualitatively the agent felt more coherent after this. It was carrying forward what it had actually learned instead of rehashing intentions it had already acted on. That kind of qualitative improvement is hard to measure but easy to feel when you watch the agent run.

SWE-agent, Interface Design Matters as Much as the Model

SWE-agent (Yang et al., 2024) was probably the most practically useful paper I read for actually building the thing. Its central claim is that the agent-computer interface (what they call ACI) determines performance as much as the model choice does. And they back this up with hard numbers. Keeping only the last few full observations performed better than keeping the full conversation history. Purpose-built tools beat raw shell access by a significant margin. Iterative search tools that page through results sometimes performed worse than no search at all, because agents would exhaustively page through everything.

A lot of what this paper argues for is what I’d already been doing because of the earlier papers. Purpose-built tools rather than raw access, structured notes rather than full history, outline-first navigation over full file dumps. So in part this paper validated decisions I was already making.

But it also pointed at gaps. The collapsed observation pattern was new to me. The idea is that old tool results don’t get fully removed from context, they get collapsed into one-line summaries, something like “turn 12: read this file, found nothing of interest, wrote a finding about it.” That preserves the action history without keeping the verbose output around. I shipped this and the agent started feeling noticeably more aware of what it had already tried.
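A minimal version of the collapsing logic, assuming each turn is stored with its action, a one-line summary, and the raw tool output. The window of three fully-kept observations is an assumption for illustration.

```python
KEEP_FULL = 3  # how many recent observations stay verbatim (an assumption)

def collapse_history(turns):
    """Render turn history: recent turns in full, older turns as one-liners.

    turns: list of dicts with keys 'action', 'summary', and 'raw_output'.
    """
    rendered = []
    cutoff = len(turns) - KEEP_FULL
    for i, turn in enumerate(turns):
        if i >= cutoff:
            rendered.append(f"turn {i}: {turn['action']}\n{turn['raw_output']}")
        else:
            rendered.append(f"turn {i}: {turn['action']} -> {turn['summary']}")
    return rendered
```

The one-line summaries are what keep the agent aware of everything it has already tried, without paying the attention cost of the verbose output.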

The other thing I took away is that tool output verbosity is a design lever in itself. Every tool result should be as lean as it can be while still being useful. If a tool’s output is too broad, the right move is often to make the agent refine its query rather than dump everything and let the agent sort through it. I went back through every tool I’d built and trimmed output where I could.

The last thing I added was a lightweight check on note quality. Because findings and plans are treated differently in the context, it matters that the agent uses the prefixes correctly. A finding note that contains intention language (“I should check whether…”) is mislabelled. It’s actually a plan. A small guardrail catches this and prompts the agent to either rewrite the note or change the prefix. Small change, surprisingly large effect on note discipline.
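The guardrail is a pattern match, nothing more. A sketch; the phrase list is my own rough heuristic, not an exhaustive one, and in practice you would tune it against the notes your agent actually writes.

```python
import re

# Intention language that should not appear in a [FINDING] note (an assumption).
INTENTION_PHRASES = re.compile(
    r"\b(i should|i will|i'll|need to|let me|plan to)\b", re.IGNORECASE
)

def check_note(note):
    """Return a correction prompt if the note looks mislabelled, else None."""
    if note.startswith("[FINDING]") and INTENTION_PHRASES.search(note):
        return ("This note is tagged [FINDING] but reads like an intention. "
                "Rewrite it as a concrete conclusion, or retag it as [PLAN].")
    return None
```

When the check fires, the correction prompt goes back to the agent as the tool result instead of a success acknowledgement, which is what nudges it to fix the note.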

What This All Adds Up To

The through-line across all of these papers is that an agent’s performance is mostly determined by decisions the developer makes, not the model’s raw capability. Which tools you give it, how those tools shape their output, what stays in context and what gets stripped, how the agent records what it has learned. None of these are solved by picking a stronger model. They are design decisions, and the research makes it surprisingly clear which ones matter.

The thing I didn’t expect was how concrete the papers are. I went in assuming I’d skim a few of them for vibes and get back to building. What I actually got was direct, applicable answers to questions I would’ve spent weeks figuring out the hard way. The cost of reading them was small. The cost of not reading them would’ve been substantially higher.

If you’re about to build your first agent, the few hours it takes to read these is probably the highest leverage time you can spend on it.

I wrote about the broader set of AI tools I use day to day in my LLM and developer tooling breakdown for 2026.