Animation2Code: Evaluating Temporal Visual Reasoning in Video-to-Code Generation

Ji, Anya; Mudunuri, Abhijith Varma; Chan, David M.; Suhr, Alane

University of California, Berkeley


Animation2Code task overview: ground-truth code renders to animation videos, which models must de-render back into executable code.

Animation2Code is the first benchmark to evaluate de-rendering of web animations into executable code. Given a reference video, a model must produce self-contained HTML/CSS/JavaScript that, when executed, reproduces both the appearance and the motion of the original animation.

Abstract

While recent vision-language models (VLMs) have improved significantly on static visual-to-code tasks such as generating code for webpages, charts, or SVGs, it remains unclear whether they can recover temporal dynamics when motion is present. To this end, we introduce Animation2Code, a benchmark that evaluates temporal visual reasoning by reconstructing executable web animation code from videos. Animation2Code consists of 1,069 web animation videos with diverse visual appearances and motion patterns, each paired with its HTML/CSS/JavaScript implementation. We propose two human-aligned metrics, appearance similarity and temporal similarity, which disentangle visual fidelity from temporal alignment when comparing rendered animations against ground-truth samples. Benchmarking state-of-the-art VLMs on this dataset shows that current models struggle to reproduce temporal dynamics even when they achieve high appearance similarity, and that this difficulty persists under fine-tuning and iterative refinement.

Dataset Statistics

Animation video–code pairs: 1,069
Average animation length: 5.1 s
Total lines of code: 355k+
Unique CSS properties: 515
Rendering: 30 FPS at 1024×768

Key Contributions

A new task & benchmark

The first benchmark for dynamic visual perception through de-rendering web animations into executable code — 1,069 real-world video–code pairs with diverse motion patterns.

Human-aligned metrics

An automatic evaluation suite that disentangles appearance from temporal similarity, validated to agree with human preference and to outperform VLM-as-a-judge in each dimension.

A consistent temporal gap

State-of-the-art VLMs reach near-perfect execution rates (up to 100%) and strong appearance similarity (up to 0.84), yet temporal similarity stays low (≤ 0.31) across zero-shot, fine-tuning, and iterative-refinement settings.

Evaluating De-Rendering

We evaluate generated animations along two complementary axes, both computed on the cropped animated region so they remain agnostic to absolute on-page positioning:

Appearance similarity compares DreamSim embeddings of frames aligned with Dynamic Time Warping, capturing color, shape, and style.
Temporal similarity compares motion trajectories (extracted with CoTracker3) via a direction- and speed-aware correlation, aggregated with Chamfer-style matching.
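
As an illustration of the alignment step in the appearance metric, the sketch below scores two sequences of per-frame embeddings along their Dynamic Time Warping path. It assumes the embeddings have already been extracted (for example with DreamSim) into NumPy arrays. The cosine distance, the path-length normalization, and the function names are assumptions made for this example, not the benchmark's exact implementation.

```python
import numpy as np


def cosine_distance_matrix(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance between rows of a (T1, D) and b (T2, D)."""
    a_n = a / np.linalg.norm(a, axis=1, keepdims=True)
    b_n = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a_n @ b_n.T


def dtw_appearance_similarity(ref: np.ndarray, gen: np.ndarray) -> float:
    """Return 1 minus the mean frame distance along the minimum-cost DTW path.

    ref and gen hold one embedding per frame (hypothetical precomputed inputs).
    """
    dist = cosine_distance_matrix(ref, gen)
    t1, t2 = dist.shape

    # acc[i, j] = minimal accumulated cost of aligning ref[:i] with gen[:j]
    acc = np.full((t1 + 1, t2 + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            best_prev = min(acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
            acc[i, j] = dist[i - 1, j - 1] + best_prev

    # Backtrack from the end to count how many frame pairs lie on the path
    i, j, path_len = t1, t2, 0
    while i > 0 and j > 0:
        path_len += 1
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1

    return float(1.0 - acc[t1, t2] / path_len)
```

Because DTW allows one frame to match several frames in the other sequence, a generated animation that shows the same frames at a different pace can still score highly here. That property is what lets the appearance score stay separate from timing errors.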

Appearance similarity via DreamSim + DTW, and temporal similarity via tracklet displacement correlation.

We isolate animated regions via cropping, compute appearance similarity using DreamSim with DTW alignment, and measure temporal similarity via tracklet displacement correlation aggregated with a Chamfer-style matching.
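
The temporal metric can be sketched in the same spirit. The example below assumes point tracklets (such as those produced by CoTracker3) are already available as arrays of shape (num_points, num_frames, 2) and that both videos have been resampled to the same number of frames. The particular direction term (cosine of displacement sequences), speed term (ratio of motion magnitudes), and symmetric averaging are assumptions chosen for illustration; the benchmark's actual correlation may be defined differently.

```python
import numpy as np


def displacement_similarity(ref: np.ndarray, gen: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Pairwise tracklet similarity, shape (N, M).

    ref: (N, T, 2) and gen: (M, T, 2) point positions over T frames.
    """
    # Per-frame displacements, flattened into one motion vector per tracklet
    d_ref = np.diff(ref, axis=1).reshape(ref.shape[0], -1)
    d_gen = np.diff(gen, axis=1).reshape(gen.shape[0], -1)
    n_ref = np.linalg.norm(d_ref, axis=1)
    n_gen = np.linalg.norm(d_gen, axis=1)

    # Direction term: cosine between the two displacement sequences
    direction = (d_ref @ d_gen.T) / (np.outer(n_ref, n_gen) + eps)
    # Speed term: ratio of the smaller to the larger overall motion magnitude
    speed = (np.minimum.outer(n_ref, n_gen) + eps) / (np.maximum.outer(n_ref, n_gen) + eps)
    # Note: with this choice, two fully static tracklets score 0, not 1
    return direction * speed


def chamfer_temporal_similarity(ref: np.ndarray, gen: np.ndarray) -> float:
    """Chamfer-style aggregation: best match in each direction, then average."""
    sim = displacement_similarity(ref, gen)
    ref_to_gen = sim.max(axis=1).mean()  # each reference tracklet finds its best match
    gen_to_ref = sim.max(axis=0).mean()  # each generated tracklet finds its best match
    return float(0.5 * (ref_to_gen + gen_to_ref))
```

Matching in both directions penalizes a generated animation that is missing motion present in the reference as well as one that adds motion the reference does not have.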

Results

Setting	Model	Input FPS	Exec (%)	Appearance	Temporal
Native video input	Gemini-3 Flash Preview	24	99.1	0.80	0.31
Native video input	Gemini-3 Flash Preview	2	98.1	0.80	0.30
Native video input	Qwen3-VL-8B-Instruct	24	84.6	0.69	0.24
Native video input	Qwen3-VL-8B-Instruct	2	85.5	0.67	0.23
Image-frame input	GPT-5.4	2	100.0	0.84	0.29
Image-frame input	Gemini-3 Flash Preview	2	100.0	0.80	0.30
Image-frame input	Claude Sonnet 4.6	2	99.5	0.82	0.29
Image-frame input	LLaMA 4 Scout	2	97.7	0.62	0.21
Image-frame input	Qwen3-VL-8B-Instruct	2	80.4	0.70	0.24
SFT / refinement (video input)	Qwen3-VL-8B + LoRA	2	98.6	0.43	0.09
SFT / refinement (video input)	Qwen3-VL-8B + Full SFT	2	94.9	0.46	0.08
SFT / refinement (video input)	Qwen3-VL-8B + Iterative Refinement	2	85.5	0.73	0.28

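Exec (%) is the share of examples whose generated code runs successfully. Best scores per metric: Exec 100.0 (GPT-5.4 and Gemini-3 Flash Preview with image-frame input), Appearance 0.84 (GPT-5.4), Temporal 0.31 (Gemini-3 Flash Preview with native video at 24 FPS).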

Takeaway. Execution is nearly saturated and appearance is strong, but temporal similarity remains uniformly low across every model and training setting — high appearance does not translate into faithful motion. Richer temporal signal (native video, higher FPS) yields only marginal gains.

Qualitative Examples

Qualitative comparison of model outputs against ground truth.

All models struggle with precise motion and spatial relationships, even when overall appearance is partially correct. Some models capture object appearance (e.g., 3D balls) but reproduce the wrong trajectory or arrangement, while others collapse to static or linear motion.

Human-Aligned Evaluation

We collected 600 pairwise preference comparisons from 65 Prolific annotators across three dimensions: overall quality, appearance, and temporal similarity. Inter-annotator agreement is strong for overall quality and appearance (Krippendorff's α = 0.81) and lower but still reliable for temporal similarity (α = 0.73), since motion is inherently harder to judge. In their respective dimensions, our automatic metrics predict human preference better than a VLM-as-a-judge baseline.

Inter-annotator agreement and ROC-AUC of metrics against human preference.

(A) Inter-annotator agreement per judgment dimension. (B) ROC-AUC for predicting human preference from metric deltas, VLM annotations, and a joint logistic regression. Our metrics align with human preference better than VLM-as-a-judge in their respective dimensions.
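
To make the ROC-AUC in panel (B) concrete, the sketch below computes it for a single metric from pairwise comparisons. For each comparison between candidates A and B, the input is the metric delta (score of A minus score of B) and a label saying whether annotators preferred A. The function name and the pairwise-ranking formulation of AUC are choices made for this example, and the numbers in the usage lines are hypothetical, not benchmark data.

```python
import numpy as np


def roc_auc_from_deltas(deltas, preferred_a) -> float:
    """ROC-AUC of metric deltas as a predictor of human preference.

    Uses the rank formulation: the probability that a comparison where A was
    preferred has a larger delta than one where B was preferred (ties count 0.5).
    """
    deltas = np.asarray(deltas, dtype=float)
    labels = np.asarray(preferred_a, dtype=bool)
    pos, neg = deltas[labels], deltas[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("Need at least one comparison for each preference outcome.")
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)


# Hypothetical deltas and annotator preferences for four comparisons
example_deltas = [0.3, -0.2, 0.1, -0.4]
example_preferred_a = [True, True, False, False]
example_auc = roc_auc_from_deltas(example_deltas, example_preferred_a)
```

An AUC of 0.5 means the metric delta carries no information about which animation annotators preferred, and 1.0 means it ranks every preferred animation above every non-preferred one.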

Explore the Benchmark Dashboard

Browse rendered model outputs side-by-side with ground truth and inspect per-example metric scores.


All public CodePen content is MIT-licensed. Copyright belongs to the original authors.

BibTeX

@article{ji2026animation2code,
  title         = {Animation2Code: Evaluating Temporal Visual Reasoning in Video-to-Code Generation},
  author        = {Ji, Anya and Mudunuri, Abhijith Varma and Chan, David M. and Suhr, Alane},
  year          = {2026},
  journal       = {arXiv preprint TBD},
}