Metaphi SimHub

Large language models have demonstrated superhuman capabilities in discrete, well-defined coding tasks, but their progression into truly agentic, collaborative software engineering partners is hampered by a fundamental limitation in training and evaluation. Current methodologies are rooted in either static benchmarks or binary notions of correctness. This creates a critical bottleneck, preventing the transition from "code completion" to "end-user utility creation."

We introduce Metaphi SimHub, a platform designed to solve this challenge. Drawing direct inspiration from our firsthand experience with the success of large-scale simulation in the autonomous vehicle industry, we aim to develop a persistent, interactive, and scalable environment where AI agents can be plugged in for evaluation today and, in the future, for learning.

Our system will integrate three tiers of feedback signals (a sketch of how they might compose follows the list):

  1. foundational compiler and unit test results for functional correctness;
  2. process-supervised rewards for intermediate code quality; and
  3. an interaction-based layer of feedback derived from exchanges with digital personas.
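To make the tiering concrete, here is a minimal sketch of how the three signals might compose into a single scalar reward. The `FeedbackSignals` container, its field names, and the weights are illustrative assumptions, not a platform specification.

```python
from dataclasses import dataclass

@dataclass
class FeedbackSignals:
    tests_passed: int            # tier 1: unit tests passed
    tests_total: int             # tier 1: unit tests run
    process_reward: float        # tier 2: intermediate code-quality score in [0, 1]
    persona_satisfaction: float  # tier 3: persona-derived feedback score in [0, 1]

def composite_reward(s: FeedbackSignals,
                     w_tests: float = 0.4,
                     w_process: float = 0.2,
                     w_persona: float = 0.4) -> float:
    """Weighted blend of the three tiers; the weights are tunable assumptions."""
    functional = s.tests_passed / max(s.tests_total, 1)
    return (w_tests * functional
            + w_process * s.process_reward
            + w_persona * s.persona_satisfaction)
```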

Our three-phase roadmap begins with a human-in-the-loop phase that generates the initial grounding data for our persona infrastructure and develops simulation scenarios, which AI labs can also procure to validate their models. It then progresses toward a fully automated, plug-and-play simulation environment.

We seek help from people embodying different personas to construct the scenarios for the tasks mentioned above, and we maintain a network of vetted professionals who serve as our verifiers. Imagine your favorite drum teacher, who has no web presence but carries the perfect requirements in his head, developing our scenarios, and a professional app developer defining the precise criteria that bring his vision of an app to life.

Metaphi SimHub: An Interactive Simulation Environment for Multi-Step Code Generation

From the perspective of an AI agent, the world of SimHub is a rich, multi-modal environment designed to replicate the complete software development lifecycle.

The Task Domain is not limited to single-function generation. Instead, agents are tasked with long-horizon projects that require planning, implementation, and iteration. Initial task domains include mobile consumer applications in 2D gaming, personal productivity, and interactive storytelling, with a specific focus on user-research tasks around interface design and runtime adaptation.

The agent's primary interlocutor is not a static text prompt but a User Persona, which in the future can be digitally simulated. This is the most critical element of the SimHub world. Each persona has its own goals, preferences, and context.

The Dynamic User Simulation Engine

This engine is the component responsible for bringing the user personas to life. It is not merely a script player; it is a system for generating dynamic, goal-oriented, and reactive behaviors. Each persona will be defined by a set of goals (e.g., schedule a music lesson), a knowledge base (e.g., knows how to use a web browser), and a set of preferences (e.g., prefers simple UIs).
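As a sketch under these definitions, such a persona could be represented as a simple record; the `Persona` class and its fields are hypothetical, chosen only to mirror the goals/knowledge/preferences triple above.

```python
from dataclasses import dataclass, field

@dataclass
class Persona:
    # Hypothetical persona record; fields mirror the triple described above.
    name: str
    goals: list[str]      # e.g., ["schedule a music lesson"]
    knowledge: list[str]  # e.g., ["knows how to use a web browser"]
    preferences: dict[str, str] = field(default_factory=dict)  # e.g., {"ui": "simple"}

drum_teacher = Persona(
    name="drum-teacher",
    goals=["schedule a music lesson"],
    knowledge=["knows how to use a web browser"],
    preferences={"ui": "simple"},
)
```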

The engine's behavior model is directly inspired by the simulation techniques used in autonomous driving. In Waymo's simulator, other vehicles can be controlled via two modes: log playback, which replays the exact behavior recorded from a real-world drive, and more sophisticated reactive models such as the Intelligent Driver Model (IDM), which follows a recorded path but adjusts its speed and behavior based on the actions of the autonomous vehicle and other agents in the scene. SimHub brings this dual capability to code generation; it is essential both for rigorous, reproducible testing and for discovering emergent failure modes.
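One way this dual capability could surface in code, sketched under assumed interfaces (the `respond` signature and the injected `policy` callable are hypothetical):

```python
class LogPlayback:
    """Replays persona feedback exactly as recorded from a real session,
    analogous to log playback in Waymo's simulator: deterministic and
    reproducible, ideal for regression testing."""
    def __init__(self, recorded_responses: list[str]):
        self._responses = iter(recorded_responses)

    def respond(self, agent_action: str) -> str:
        return next(self._responses, "<end of recorded log>")

class ReactivePersona:
    """Pursues the recorded goal but adapts its feedback to what the agent
    actually did, analogous to reactive models such as IDM: better suited
    to surfacing emergent failure modes."""
    def __init__(self, goals: list[str], policy):
        self.goals = goals
        self.policy = policy  # hypothetical learned or rule-based response model

    def respond(self, agent_action: str) -> str:
        return self.policy(self.goals, agent_action)
```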

The Task Structure is defined by the long-horizon goals provided by the user persona. An episode in this environment is not a single function generation but an entire project, such as "Implement a user login page with email and password fields." The episode terminates upon successful completion as judged by the persona's satisfaction criteria, or upon failure, which could be triggered by the persona "giving up" in frustration after too many failed attempts. This approach to defining a curriculum of interactive, language-defined tasks aligns with the research direction for training generalist agents, such as DeepMind's Scalable Instructable Multiworld Agent (SIMA), which is trained on a portfolio of skills across many different interactive virtual environments.
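A sketch of such an episode loop, with all interfaces (`agent.act`, `env.apply`, and the persona's `evaluate`/`satisfied`/`frustrated` methods) assumed for illustration:

```python
def run_episode(agent, persona, env, max_turns: int = 50, patience: int = 5) -> bool:
    """Run one long-horizon project episode. Terminates with success when the
    persona's satisfaction criteria are met, or with failure when the persona
    'gives up' after too many consecutive frustrating turns, or on timeout."""
    consecutive_failures = 0
    feedback = persona.initial_request()  # e.g., "Implement a user login page..."
    for _ in range(max_turns):
        action = agent.act(feedback)      # plan, edit code, run tests, etc.
        result = env.apply(action)        # compile, execute, render the UI
        feedback = persona.evaluate(result)
        if persona.satisfied():
            return True                   # success: satisfaction criteria met
        consecutive_failures = consecutive_failures + 1 if persona.frustrated() else 0
        if consecutive_failures >= patience:
            return False                  # persona gives up in frustration
    return False                          # episode timed out
```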

Currently, we are implementing our environments using "real human chains of events" as ground truth. This captured data serves the same purpose as Waymo's real-world driving logs or the recorded video data used by NVIDIA's Neural Reconstruction Engine: it is the raw, unstructured, high-fidelity data that constitutes ground truth. This dataset can then be used to construct the first generation of digital-twin "user personas," which involves processing long-term interaction history into compact, computable summaries of a user's style, preferences, and behavioral patterns. The real human's interaction trace becomes the seed for their digital counterpart, and that human can then validate how well the persona is grounded.
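As a sketch of the distillation step, a raw trace might be reduced to a compact summary like the following; the event schema and summary fields are illustrative assumptions.

```python
from collections import Counter

def summarize_trace(events: list[dict]) -> dict:
    """Distill a raw human interaction trace into a compact, computable summary
    that seeds a digital-twin persona. Assumes events shaped like
    {"type": "click", "target": ...} or {"type": "message", "text": ...}."""
    clicks = Counter(e["target"] for e in events if e["type"] == "click")
    vocab = Counter(word
                    for e in events if e["type"] == "message"
                    for word in e["text"].lower().split())
    return {
        "top_ui_targets": [t for t, _ in clicks.most_common(5)],
        "common_vocabulary": dict(vocab.most_common(50)),
        "message_count": sum(1 for e in events if e["type"] == "message"),
    }
```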

The creation of a persona is not the end of the process; a critical and ongoing step is validation. A persona is only useful if its behavior is a faithful representation of the human it is meant to simulate. Our platform will incorporate a formal validation framework to ensure the personas remain grounded in reality. For example, a "UI interaction divergence" metric could measure the difference in click patterns between a persona and its human counterpart when presented with the same UI, while a "linguistic style divergence" metric could analyze the n-gram distribution of the persona's generated text feedback compared to the human's. These metrics provide a continuous, automated signal to detect when a persona's model needs to be retrained or refined with new human data.
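As one concrete possibility, the linguistic style divergence could be computed as a Jensen-Shannon divergence between the n-gram distributions of persona and human text; the same divergence could compare click-target distributions for the UI metric. The function names and the choice of JSD here are assumptions, not a committed design.

```python
import math
from collections import Counter

def js_divergence(p: Counter, q: Counter) -> float:
    """Jensen-Shannon divergence between two empirical count distributions,
    in bits: 0 means identical, 1 means fully disjoint."""
    pt, qt = sum(p.values()), sum(q.values())
    div = 0.0
    for k in set(p) | set(q):
        pk = p[k] / pt if pt else 0.0
        qk = q[k] / qt if qt else 0.0
        mk = (pk + qk) / 2
        if pk:
            div += 0.5 * pk * math.log2(pk / mk)
        if qk:
            div += 0.5 * qk * math.log2(qk / mk)
    return div

def linguistic_style_divergence(persona_text: str, human_text: str, n: int = 2) -> float:
    """Compare n-gram distributions of persona-generated vs. human feedback text."""
    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return js_divergence(ngrams(persona_text), ngrams(human_text))
```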

Once the methodology for creating and validating a single persona is established, the infrastructure can be used to scale this process, creating a large and diverse library of user personas. This is a crucial step for training robust and generalizable AI agents. An agent trained only on interactions with a "non-technical small business owner" persona may not perform well when faced with the demands of an "expert systems architect" persona who provides highly technical feedback and requires adherence to specific design patterns.

The persona library will be curated to cover a wide range of archetypes relevant to software development (one possible encoding is sketched after this list), including:

  • The Non-Technical Subject Matter Expert: High-level goals, ambiguous requirements, feedback based on "feel" and usability.
  • The Project Manager: Focus on deadlines, feature completeness, and adherence to specifications.
  • The Senior Developer / Tech Lead: Feedback focused on code quality, architectural patterns, performance, and maintainability.
  • The UX/UI Designer: Feedback centered on visual aesthetics, layout, accessibility, and user flow.
  • The Adversarial User: Personas designed to find edge cases, test security vulnerabilities, and attempt to "jailbreak" the agent.
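One plausible encoding of this library for sampling during training or evaluation, with all keys and focus tags as illustrative assumptions drawn from the list above:

```python
# Hypothetical archetype registry; an agent curriculum could sample across it
# to avoid overfitting to any single persona type.
ARCHETYPES = {
    "non_technical_sme": {"feedback_focus": ["feel", "usability"],
                          "requirement_style": "high-level, ambiguous"},
    "project_manager":   {"feedback_focus": ["deadlines", "feature completeness",
                                             "spec adherence"],
                          "requirement_style": "structured"},
    "senior_developer":  {"feedback_focus": ["code quality", "architecture",
                                             "performance", "maintainability"],
                          "requirement_style": "technical"},
    "ux_ui_designer":    {"feedback_focus": ["aesthetics", "layout",
                                             "accessibility", "user flow"],
                          "requirement_style": "visual"},
    "adversarial_user":  {"feedback_focus": ["edge cases", "security",
                                             "jailbreak attempts"],
                          "requirement_style": "adversarial"},
}
```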