Essays

Research and essays on agent evaluation, RL infrastructure and simulations.

Introducing LegacySWE

17 May 2026

LegacySWE is Metaphi's research effort to help coding agents hill-climb on the hardest maintenance and modernization tasks in legacy enterprise systems. COBOLBench is one of the few public benchmarks for agentic coding that remain unsaturated.

Read More

LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks

March 2026

Binary success metrics work when a task has a single correct answer, but subjective enterprise tasks often lack one. We present LH-Bench, a benchmark and evaluation design in which expert-authored SKILL.md artifacts serve as the bridge between execution and evaluation. Skills encode workflow expectations as observable rubric boundaries, while curated artifact contracts and human preference judgments provide independent validation.

Read Paper

Understanding Virality with Vision-Language Models

24 December 2025

We present a rubric-based Vision-Language Model framework for evaluating short-form edutainment content. Our system extracts unsupervised audiovisual features and clusters them into interpretable factors that predict viewer engagement better than conventional metrics.

Read More

Metaphi Simhub

8 October 2025

Large language models have demonstrated superhuman capabilities in discrete, well-defined coding tasks, but their progression into truly agentic, collaborative software engineering partners is hampered by a fundamental limitation in training and evaluation. We introduce Metaphi Simhub, a platform designed to address this limitation through interactive simulation environments.

Read More

From Static Benchmarks to Dynamic Worlds

20 September 2025

The current paradigm of LLM evaluation lacks interactivity entirely: there is no "user" in the evaluation loop. We explore how principles from autonomous vehicle simulation can transform the training of code-generation agents.

Read More