Essays

Research and essays on agent evaluation, RL infrastructure, and simulations.

When More Compute Stops Helping

May 2026

Why inference-time scaling saturates on coding tasks governed by implicit semantic contracts, and why the next frontier in code intelligence is better representations of latent program meaning rather than search alone.

Read More

CREW: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

March 2026

We introduce CREW, a three-pillar evaluation design that moves beyond binary correctness to score autonomous, long-horizon execution on subjective enterprise tasks. Expert-grounded rubrics, curated ground-truth artifacts, and pairwise human preference evaluation enable reliable scoring where conventional pass/fail metrics fall short.

Read Paper

Understanding Virality with Vision-Language Models

24 December 2025

We present a rubric-based Vision-Language Model framework for evaluating short-form edutainment content. Our system extracts audiovisual features in an unsupervised manner and clusters them into interpretable factors that predict viewer engagement better than conventional metrics do.

Read More

Metaphi Simhub

8 October 2025

Large language models have demonstrated superhuman capabilities on discrete, well-defined coding tasks, but their progression into truly agentic, collaborative software-engineering partners is hampered by a fundamental limitation in how they are trained and evaluated. We introduce Metaphi Simhub, a platform designed to address this challenge through interactive simulation environments.

Read More

From Static Benchmarks to Dynamic Worlds

20 September 2025

The current paradigm of LLM evaluation suffers from a complete lack of interactivity: there is no "user" in the evaluation loop. We explore how principles from autonomous-vehicle simulation can transform the training of code-generation agents.

Read More