AVS 2026

Poster Session C, Wednesday, May 20, 4:15 – 5:00 pm
Board 21

Limits of Visual Saliency Models for AI-Generated Videos

Jenna Kang¹, Niall L Williams¹, Maria Beatriz Silva¹, Kenneth Chen¹, Patsorn Sangkloy¹, Qi Sun¹; ¹New York University

Diffusion models enable easy, detailed video generation but can produce atypical visual features that extend beyond conventional rendering artifacts. These manifest both spatially (distorted geometry and localized visual deformations) and temporally (physically implausible motion, object identity drift, and temporal inconsistency). There are limited data and few established metrics for assessing AI-generated artifacts, and little understanding of how AI-generated videos influence human gaze behavior and visual saliency. To address this, we collected eye-tracking data from 13 participants viewing 72 AI-generated videos featuring different kinds of artifacts. Each trial began with participants fixating on a central crosshair for a randomized duration between 1.0 and 1.5 seconds, after which the video was presented. Participants were instructed to view the video freely and then rate, on 7-point Likert scales, how well the video matched its generating text prompt and the video's overall quality. From the recorded gaze data we computed empirical saliency maps and found high inter-observer consistency: averaged over all videos, the gaze points of each single viewer predicted those of the remaining 12 viewers with an AUC-ROC of 0.89 and a Normalized Scanpath Saliency (NSS) of 3.36. Accuracy metrics (from the MIT/Tuebingen Saliency Benchmark) for four saliency models (two classical, two deep-learning) were inconsistent: four of the metrics indicated poor saliency prediction across all models, while the AUC metric ranged from poor to good (0.59–0.81) across the models. We found no correlation between saliency prediction accuracy and video features including optic-flow magnitude, prompt complexity, video quality rating, and artifact category.
This suggests that existing saliency models struggle to accurately predict gaze behavior on AI-generated videos, possibly because these models do not account for the unique visual features and temporal artifacts introduced by generative models.
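The two inter-observer consistency metrics above can be sketched in a few lines. The following is a minimal illustration (not the authors' evaluation code) of how NSS and a fixation-based AUC are commonly computed for a single saliency map: NSS is the z-scored saliency map sampled at human fixation locations, and AUC here treats saliency values at fixations as positives against values at randomly sampled pixels. Function names, the random-negative sampling scheme, and array shapes are assumptions for illustration.

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored saliency
    map sampled at fixation locations (rows, cols). Higher is better."""
    s = (saliency_map - saliency_map.mean()) / saliency_map.std()
    return float(s[fixations[:, 0], fixations[:, 1]].mean())

def auc(saliency_map, fixations, n_neg=1000, rng=None):
    """Fixation-based AUC: probability that saliency at a fixated pixel
    outranks saliency at a uniformly sampled pixel (ties count half)."""
    rng = np.random.default_rng(rng)
    pos = saliency_map[fixations[:, 0], fixations[:, 1]]
    neg = saliency_map[rng.integers(0, saliency_map.shape[0], n_neg),
                       rng.integers(0, saliency_map.shape[1], n_neg)]
    return float((pos[:, None] > neg[None, :]).mean()
                 + 0.5 * (pos[:, None] == neg[None, :]).mean())
```

For the leave-one-out consistency reported above, the saliency map would be built from 12 viewers' gaze points (e.g., by Gaussian-blurring a fixation histogram) and scored against the held-out viewer's fixations, then averaged over viewers and videos.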

Thank You to Our AVS 2026 Sponsors

Apple
Vision: Science to Applications
Centre for Vision Research