From Static Benchmarks to Dynamic Worlds

Large language models (LLMs) have made remarkable progress in code generation. Traditional benchmarks like HumanEval and MBPP measure how well models generate functionally correct code, using unit tests as the judge. These evaluations have driven major improvements, and today's models solve complex, self-contained coding problems effectively.

The field has evolved beyond basic tasks to more realistic evaluations. Newer benchmarks include SWE-bench, which tests the ability to fix real GitHub issues in large codebases; ARC-Bench, which evaluates architectural reasoning capabilities; and CodeContests, which draws on competitive programming problems. Others like RepoBench assess code understanding across entire repositories. Public leaderboards like LiveBench track this progress across various dimensions.

Despite this progress, current evaluation methods are hitting their limits. As models approach perfect scores on static benchmarks, evaluation should shift toward high-utility tasks where code generation serves as a strong primitive rather than the end goal. The collaborative and often subjective nature of such work requires more than just technically correct code.

The current paradigm suffers from a complete lack of interactivity. There is no "user" in the evaluation loop. An agent that cannot handle ambiguous feedback, clarify requirements, or adapt to a user changing their mind is an agent that cannot function in a real-world development environment.
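What would putting a user back in the loop look like? Below is a minimal sketch of a closed-loop, interactive evaluation: the agent proposes a patch, a simulated user reviews it, and the episode ends only when the user is satisfied or the turn budget runs out. Every name here (`run_episode`, `propose_patch`, `run_tests`, `review`) is an illustrative assumption, not an existing API.

```python
# Minimal sketch of a closed-loop, interactive evaluation.
# All names (agent, user, repo and their methods) are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Turn:
    user_message: str
    agent_patch: str
    tests_passed: bool


@dataclass
class Episode:
    turns: list[Turn] = field(default_factory=list)

    @property
    def resolved(self) -> bool:
        # A crude success signal: the last attempted patch passed the tests.
        return bool(self.turns) and self.turns[-1].tests_passed


def run_episode(agent, user, repo, max_turns: int = 5) -> Episode:
    """Closed-loop evaluation: the simulated user reviews every patch and may
    clarify requirements or change their mind before the next turn."""
    episode = Episode()
    message = user.initial_request()           # possibly under-specified
    for _ in range(max_turns):
        patch = agent.propose_patch(repo, message)
        passed = repo.run_tests(patch)         # functional signal
        episode.turns.append(Turn(message, patch, passed))
        message = user.review(patch, passed)   # subjective, possibly ambiguous
        if message is None:                    # user is satisfied; stop early
            break
    return episode
```

The important property is that the user's next message depends on the agent's last patch, so ambiguity, clarification, and changes of mind become first-class parts of the evaluation rather than noise to be ignored.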

The Waymo Precedent: A Paradigm for Scalable, Realistic Agent Training

How does Waymo scale its AI driver? By building a high-fidelity virtual world, the company scaled its training and testing to a degree that is physically impossible on real roads. Waymo's virtual fleet drives millions of miles in simulation every day for training and evaluation before any change is released to public roads.

Deconstructing Waymo's approach reveals a set of principles directly applicable to code generation:

  • Leverage high-fidelity, abstract world models for testing
  • Use reactive agents for better counterfactuals and closed-loop learning
  • Evaluate behavior against real-world or synthetically derived scenarios (one possible shape is sketched below)
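One way to make these three ingredients concrete is a scenario record: a reproducible world model (repository snapshot plus test harness), a specification for the reactive simulated user, and a provenance tag marking whether the scenario was mined from real interactions or generated synthetically. The sketch below is one hypothetical shape, not a prescribed schema.

```python
# Hypothetical scenario record combining the three principles above.
# Field names and the overall shape are illustrative assumptions.

from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    REAL_WORLD = "real_world"   # mined from logged human interactions
    SYNTHETIC = "synthetic"     # generated or perturbed variants


@dataclass(frozen=True)
class WorldModel:
    repo_snapshot: str          # e.g. a commit hash or archived checkout
    test_command: str           # how to obtain a functional signal


@dataclass(frozen=True)
class SimulatedUserSpec:
    persona: str                # e.g. "terse reviewer who changes scope once"
    initial_request: str        # deliberately under-specified task description
    hidden_intent: str          # what the user actually wants, revealed via feedback


@dataclass(frozen=True)
class Scenario:
    world: WorldModel
    user: SimulatedUserSpec
    provenance: Provenance
```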

A successful CodeSim platform cannot be a purely synthetic creation; it must be architected from the ground up to replicate this flywheel. The initial human-in-the-loop phase is not a temporary bootstrap mechanism; it is the critical first turn of the data flywheel that will power the entire system.
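To make that first turn concrete, one hedged sketch: the recorded messages from a human-in-the-loop session can be replayed as the simulated user in the interactive loop sketched earlier. The class and log format below are assumptions for illustration, not an existing component.

```python
# Hypothetical first turn of the data flywheel: replay a logged human session
# as the in-loop "user". The interface mirrors the earlier run_episode sketch.

class RecordedUser:
    """Replays a human's recorded messages as the simulated user."""

    def __init__(self, messages: list[str]):
        if not messages:
            raise ValueError("a recorded session needs at least one message")
        self._first = messages[0]
        self._replies = iter(messages[1:])

    def initial_request(self) -> str:
        return self._first

    def review(self, patch: str, tests_passed: bool) -> str | None:
        # Next recorded reply, or None once the transcript is exhausted.
        return next(self._replies, None)
```

Replays like this are only a starting point; as real sessions accumulate, they can be perturbed and generalized into the synthetic scenarios described above, which is exactly how the flywheel gathers speed.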