Essays
Research and essays on agent evaluation, RL infrastructure and simulations.
Introducing LegacySWE
17 May 2026
LegacySWE is Metaphi's research effort to help coding agents hill-climb on the hardest maintenance and modernization tasks in legacy enterprise systems. COBOLBench is one of the few public benchmarks for agentic coding that remains unsaturated.
LH-Bench: Skill-Grounded Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks
March 2026
Binary success metrics work when a task has a single correct answer, but subjective enterprise tasks rarely have one. We present LH-Bench, a benchmark and evaluation design in which expert-authored SKILL.md artifacts serve as the bridge between execution and evaluation. Skills encode workflow expectations as observable rubric boundaries, while curated artifact contracts and human preference judgments provide independent validation.
Understanding Virality with Vision-Language Models
24 December 2025
We present a rubric-based Vision-Language Model framework for evaluating short-form edutainment content. Our system extracts unsupervised audiovisual features and clusters them into interpretable factors that predict viewer engagement better than conventional metrics.
Metaphi Simhub
8 October 2025
Large language models have demonstrated superhuman capabilities in discrete, well-defined coding tasks, but their progression into truly agentic, collaborative software engineering partners is hampered by a fundamental limitation in training and evaluation. We introduce Metaphi Simhub, a platform designed to solve this challenge through interactive simulation environments.
From Static Benchmarks to Dynamic Worlds
20 September 2025
The current paradigm of LLM evaluation suffers from a complete lack of interactivity. There is no "user" in the evaluation loop. We explore how principles from autonomous vehicle simulation can transform code generation agent training.