
Understanding Virality with Vision-Language Models

24 December 2025
4 min read

We are excited to share our latest research paper: "Understanding Virality: A Rubric-based Vision-Language Model Framework for Short-Form Edutainment Evaluation", now available on arXiv.

Read the Paper on arXiv

The Problem with Current Video Evaluation

Evaluating short-form video content has traditionally relied on surface-level quality metrics such as SSIM (structural similarity) and FID (Fréchet Inception Distance). These metrics quantify visual fidelity and distributional realism, but they fail to capture what actually drives viewer engagement. A technically polished video can fall flat with audiences, while a grainy clip can go viral.

This disconnect between technical quality and audience response represents a fundamental gap in how we evaluate content. As AI systems increasingly generate and curate video content, we need evaluation methods that align with human perception and engagement patterns.

Our Approach: VLMs for Engagement Prediction

Rather than relying on pixel-level metrics, our framework leverages Vision-Language Models (VLMs) to extract semantically meaningful features from video content. The system extracts unsupervised audiovisual features and clusters them into interpretable factors that correlate with actual viewer behavior.
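To make the idea concrete, here is a minimal sketch of the extract-then-cluster step, assuming frames have already been sampled from each Short. The model choice (CLIP), the directory layout, and the cluster count are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: embed sampled frames with a VLM image encoder, average per video,
# then cluster the per-video embeddings into interpretable factors.
from pathlib import Path

import torch
from PIL import Image
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_video(frame_dir: Path) -> torch.Tensor:
    """Average CLIP image embeddings over a video's sampled frames."""
    frames = [Image.open(p).convert("RGB") for p in sorted(frame_dir.glob("*.jpg"))]
    inputs = processor(images=frames, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)      # (num_frames, dim)
    feats = feats / feats.norm(dim=-1, keepdim=True)    # unit-normalise
    return feats.mean(dim=0)                            # one vector per video

# Hypothetical layout: one folder of sampled frames per video.
video_dirs = sorted(Path("frames/").iterdir())
embeddings = torch.stack([embed_video(d) for d in video_dirs]).numpy()
factors = KMeans(n_clusters=8, random_state=0).fit_predict(embeddings)
```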

We built a curated YouTube Shorts dataset to systematically study the relationship between VLM-derived features and audience engagement. Our regression-based evaluator demonstrates stronger correlation with real engagement metrics than conventional assessment methods.
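As a rough illustration of the regression-based evaluation idea (a sketch under assumptions, not the paper's setup), one can fit a linear model from per-video features to a log-scaled engagement signal and check rank correlation on held-out videos. The `embeddings` matrix comes from the sketch above; the view counts are placeholder labels.

```python
# Hedged sketch of a regression-based engagement evaluator.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

views = np.random.lognormal(mean=10, sigma=2, size=len(embeddings))  # placeholder labels
y = np.log1p(views)                                                   # compress the heavy tail

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, y, test_size=0.2, random_state=0
)
evaluator = Ridge(alpha=1.0).fit(X_train, y_train)

rho, _ = spearmanr(evaluator.predict(X_test), y_test)
print(f"Spearman correlation with held-out engagement: {rho:.2f}")
```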

The key insight is that engagement prediction requires understanding content at a semantic level, not just a perceptual one. By grounding evaluation in both feature importance and engagement signals, we move toward human-aligned, multimodal reasoning about video quality.

Implications for AI Evaluation

This work connects directly to our mission at Metaphi. Current AI benchmarks measure technical capabilities in isolation, but real-world value comes from how AI outputs resonate with human users. The same principle that applies to viral videos applies to AI agents: technical correctness is necessary but not sufficient.

The framework we developed prioritizes explainability and scalability. Rather than treating engagement prediction as a black box, we surface the factors that drive predictions. This transparency is essential for building evaluation systems that can be trusted and improved over time.
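To give a flavour of that kind of transparency (again a sketch, not the paper's exact method), a fitted evaluator can be asked which feature dimensions move its predictions most, for example via permutation importance, reusing the `evaluator`, `X_test`, and `y_test` from the sketch above.

```python
# Sketch: rank feature dimensions by how much shuffling them degrades the fit.
from sklearn.inspection import permutation_importance

result = permutation_importance(
    evaluator, X_test, y_test, n_repeats=10, random_state=0
)
top = result.importances_mean.argsort()[::-1][:5]
for idx in top:
    print(f"feature {idx}: importance {result.importances_mean[idx]:.3f}")
```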

As AI systems become more capable at generating content across modalities, evaluation methods must evolve beyond simple correctness checks. Understanding what makes content resonate, whether it's a video, a code snippet, or a design, requires the kind of multimodal reasoning that VLMs enable.

Authors

This research was conducted by Arnav Gupta, Gurekas Singh Sahney, Hardik Rathi, Abhishek Chandwani, Ishaan Gupta, Pratik Narang, and Dhruv Kumar.

Read the Full Paper