Essays
Research and essays on agent evaluation, RL infrastructure, and simulations.
When More Compute Stops Helping
May 2026
Why inference-time scaling saturates on coding tasks governed by implicit semantic contracts, and why the next frontier in code intelligence is better representations of latent program meaning rather than search alone.
CREW: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
March 2026
We introduce LH-Bench, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. Expert-grounded rubrics, curated ground-truth artifacts, and pairwise human preference evaluation enable reliable scoring where conventional pass/fail metrics fall short.
Understanding Virality with Vision-Language Models
24 December 2025
We present a rubric-based Vision-Language Model framework for evaluating short-form edutainment content. Our system extracts unsupervised audiovisual features and clusters them into interpretable factors that predict viewer engagement better than conventional metrics.
Metaphi Simhub
8 October 2025
Large language models have demonstrated superhuman capabilities on discrete, well-defined coding tasks, but their progression into truly agentic, collaborative software engineering partners is hampered by a fundamental limitation in how they are trained and evaluated. We introduce Metaphi Simhub, a platform designed to address this challenge through interactive simulation environments.
From Static Benchmarks to Dynamic Worlds
20 September 2025
The current paradigm of LLM evaluation lacks interactivity: there is no "user" in the evaluation loop. We explore how principles from autonomous vehicle simulation can transform the training of code generation agents.