We build
RL environments.
Production systems. Expert-curated verifiers.
We acquire proprietary environments and measure where and why frontier models fall short of acceptable expert performance.
A benchmark built from a real banking system, measuring whether AI coding agents can work in production COBOL.
A benchmark measuring AI agents on production Figma-to-code conversion: API interaction, design-hierarchy extraction, and iterative deployment.
Interactive simulation environments that move agent evaluation beyond static, non-interactive benchmarks.
How principles from autonomous vehicle simulation can transform code generation agent training and evaluation.
Tell us about your use case.