3 May 2026
6 min read

The dominant idea in coding agents today is simple: when software tasks get harder, give the model more chances to think. More inference-time compute, more rollouts, more search, more tool use, more verification. On many tasks, this works. A model that nearly solves a problem can often be pushed over the line by more exploration and a better feedback loop.

But there is a class of software problems where this recipe begins to fail. When correctness depends on recovering implicit contracts rather than following explicit declarations, additional compute often adds cost without adding understanding. The bottleneck is no longer localization. It is semantics.

When Search Is Enough

This is the distinction the field still underestimates. A great deal of current work treats hard coding tasks as search problems: find the right file, inspect the right symbol, run the right test, retry until the patch converges. That assumption is often correct. But not every software task is mostly a search problem. Some are bottlenecked by something harder: understanding what the code means when the system does not declare that meaning clearly. In those environments, the challenge is not finding where to edit. It is inferring the hidden contract that makes the edit correct. In COBOLBench, four frontier coding agents cluster at 11-13% Pass@4 on a 53-task public release, and failed runs consume 2-3x more time and compute than successful ones without converting.

Modern software often makes those contracts legible. Type systems expose interfaces. Functions define boundaries. Imports surface dependencies. Schemas declare structure. Even difficult tasks arrive with a significant amount of meaning already externalized by the language and tooling.

In older, more operational codebases, semantics are often carried implicitly. A shared layout functions like an interface but is never declared as one. A syntactically local change is semantically global because the real contract lives in layout, aliasing, or convention rather than in a typed boundary.
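A minimal sketch of what such an undeclared interface looks like, in Python rather than COBOL for brevity. The record layout, field names, and offsets here are illustrative assumptions, not drawn from any real system:

```python
# Two independent functions read the same fixed-width record by byte
# offset. Nothing in the language declares the layout as a shared
# interface, yet both depend on it: widening one field silently
# corrupts every reader that sits after it.

RECORD = "00042A2026-05-03"  # id (chars 0-4), status (char 5), date (chars 6-15)

def read_customer_id(record: str) -> str:
    return record[0:5]       # assumes the id occupies characters 0-4

def read_status(record: str) -> str:
    return record[5:6]       # assumes the status occupies character 5

# If one team widens the id field to six characters, read_status now
# returns a digit of the id instead of the status - with no type
# error, no import change, and no failing declaration anywhere.
WIDENED = "000042A2026-05-03"
```

The edit to the writer is syntactically local; the breakage lives entirely in readers that share the layout by convention.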

Implicit Contracts

Consider a simple maintenance edit: one program changes a customer status field from single-character codes to full words because "ACTIVE" looks clearer than "A". In a modern typed service, the enum or schema would usually force every downstream consumer to update with it. In an implicit-contract system, the change can look perfectly local and still silently break billing, routing, or reconciliation jobs that were keying off "A" elsewhere. The patch is easy to write. The contract it violated was never stated in one place.
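The failure mode can be sketched in a few lines of hypothetical Python. The function names and the billing rule are invented for illustration; the point is only that the dependency between the two pieces is never stated anywhere:

```python
# Upstream edit: statuses become full words because "ACTIVE" reads
# better than "A". The patch is syntactically local and looks safe.
def write_status(active: bool) -> str:
    return "ACTIVE" if active else "INACTIVE"   # was: "A" / "I"

# Downstream consumer, conceptually in another program entirely,
# still keys off the single-character code. No type, schema, or
# import connects it to the writer, so nothing flags the mismatch.
def should_bill(status: str) -> bool:
    return status == "A"

# Every record still parses; billing just quietly stops matching.
```

A test suite that only exercises the writer passes cleanly, which is exactly why this class of edit survives review.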

That is exactly the kind of meaning more search does not reliably recover. If an agent has not yet found the right part of the repository, more compute helps. If it has reached the right place but still cannot reconstruct the governing semantics, more compute often just produces more elaborate failure. In that regime, inference-time scaling begins to saturate: extra effort still buys retries, signal, and sometimes luck, but it does not buy the one thing the task actually requires: a faithful reconstruction of latent program meaning.

This is why older enterprise systems are scientifically useful. Not because they are quaint, but because they expose a form of reasoning that modern software often hides. In these systems, the gap between syntactic plausibility and semantic correctness becomes impossible to ignore. A patch can look perfectly reasonable and still violate the real contract of the system.

The usual explanation is that this code is old, niche, or underrepresented in pretraining. There is some truth in that. But it is also too shallow. It treats the difficulty as a distributional accident rather than a structural property. A better explanation is that some codebases compress meaning into forms that current agents still struggle to recover, even when those agents are given more time, more search, and more tools.

What Comes Next

The next step in coding agents is not simply to extend the current recipe of search, verification, and retries. It is to build systems that recover latent contracts directly: systems that mine invariants from existing behavior, infer which shared structures are functioning as interfaces even when the language does not declare them, and use dependency-aware verification to surface downstream consequences before an edit is accepted. If a status-code field changes, the system should identify every downstream consumer of that field, surface the shared contract, and reject the patch unless those dependent branches are updated consistently.

That shifts the frontier away from raw test-time effort and toward semantic legibility. The right questions become harder and more fundamental: how should models be trained to recover hidden contracts, what kinds of environments actually teach that skill, and what tools can surface implicit structure without turning every task into manual annotation? Some tasks do not resist because the model has not searched enough. They resist because the model does not yet know how to represent what the system is really saying. That is where more compute stops helping, and why the next frontier in code intelligence is not better search alone. It is learning to recover meaning when the code does not declare it for you.