
Talk Session, Wednesday, May 20, 3:15 – 4:15 pm

Video & Motion Perception

Talk 1, 3:15 pm

Impact of Camera Motion on Perceived Optical Flow Error in Naturalistic Video

Yung-Hao Yang1, Taiki Fukiage2, Zitang Sun1, Shin’ya Nishida1; 1Cognitive Informatics Lab, Department of Intelligence Science and Technology, Graduate School of Informatics, Kyoto University, Japan, 2Human Information Science Laboratory, NTT Communication Science Laboratories, NTT Inc., Japan

Accurate perception of motion from first-person video is critical for emerging technologies, including remote driving, drone operation, and video analytics. This study quantifies the accuracy of human optical flow perception in the presence of naturalistic camera movements. We reanalyzed our HuPerFlow benchmark (CVPR 2025), which comprises ~38,400 psychophysical motion judgments collected via online crowdsourcing across ten diverse optical flow datasets. These datasets span a wide range of naturalistic scenarios, including smooth automotive driving (e.g., KITTI, VKITTI 2, Driving), complex animations (e.g., MPI Sintel, Spring), hierarchical human actions (e.g., MHOF), and scenes featuring significant camera motion (e.g., TartanAir, Monkaa, VIPER, FlyingThings3D). Comparison of human-perceived flow against ground-truth (GT) physical flow revealed distinct performance patterns: in driving datasets characterized by smooth, rigid motion, human perception aligned closely with GT. However, systematic errors emerged in complex environments. For example, observers often prioritized global motion structures over local details, such as grouping non-rigid elements in animations or biasing local limb movements toward global body trajectories. Notably, in scenarios with strong camera rotation, performance worsened as observers struggled to segregate static objects from the rapidly changing background flow. We further quantified this perceptual error by analyzing the relationship between end-point error (EPE, the Euclidean distance between perceived and GT flow) and camera motion parameters. Partial correlation analysis demonstrated that EPE increased significantly with camera rotation magnitude across most datasets, even after controlling for total optical flow speed. This indicates that rotational instability introduces specific perceptual disruptions that interfere with local motion matching. Quantitatively predicting human-perceived motion flow under diverse viewer movements is key for applied vision research and for understanding the gap between human and machine vision.
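
As a rough illustration of the analysis described above, the following Python sketch computes end-point error and a partial correlation between EPE and camera rotation magnitude with total flow speed regressed out. The variable names and example data are hypothetical, not the authors' code or data.

import numpy as np

def end_point_error(perceived, ground_truth):
    # Euclidean distance between perceived and ground-truth flow vectors,
    # given arrays of shape (n_probes, 2) holding (u, v) components.
    return np.linalg.norm(perceived - ground_truth, axis=1)

def partial_corr(x, y, covar):
    # Correlate x and y after regressing the covariate out of both.
    def residualize(v, c):
        design = np.column_stack([np.ones_like(c), c])
        beta, *_ = np.linalg.lstsq(design, v, rcond=None)
        return v - design @ beta
    return np.corrcoef(residualize(x, covar), residualize(y, covar))[0, 1]

# Hypothetical example: 100 probed locations from one dataset.
rng = np.random.default_rng(0)
gt_flow = rng.normal(size=(100, 2))
perceived_flow = gt_flow + rng.normal(scale=0.3, size=(100, 2))
rotation_mag = rng.uniform(0.0, 5.0, size=100)   # camera rotation per frame (illustrative)
flow_speed = np.linalg.norm(gt_flow, axis=1)     # total optical flow speed

epe = end_point_error(perceived_flow, gt_flow)
print(partial_corr(epe, rotation_mag, covar=flow_speed))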

Acknowledgements: This work was supported by JSPS KAKENHI Grant Number JP24H00721.

Talk 2, 3:30 pm

What Happens Between Play and Pause? A New Playback Mode for Detecting Temporal Distortions

Budmonde Duinkharjav1, Pontus Ebelin1, Ruth Rosenholtz1, Anjul Patney1; 1NVIDIA

Workflows for evaluating the visual quality of computer-generated imagery in film and video game production often rely on human observers manually reviewing video footage rather than on automated, image-computable metrics. This choice is generally attributed to the limited robustness of computational metrics and to context-dependent decision-making that must consider not just distortion characteristics but also how they affect the overall product. Thus, quality assurance experts spend significant time using video playback tools to scrutinize dynamic video content. However, video playback software is typically limited to simple actions like play, pause, and seek, restricting users' ability to manipulate playback efficiently. Such limited playback functionality often either fails to expose temporal distortions such as flickering and instability or raises false positives during manual frame-by-frame analysis, leading to the flagging of sub-threshold distortions. In this work, we propose LivePause, an additional playback feature that continuously oscillates between frames within a small temporal window, preserving the visibility of temporal artifacts while keeping the user focused on a region of interest. We evaluated the effectiveness of LivePause for detecting temporal distortions during playback in a pilot study. Four participants identified and labeled the target distorted stimulus during presentation of pairs of distorted and reference videos, with and without LivePause. Our study suggests that the usefulness of LivePause depends on the visibility of artifacts: detection of hard-to-notice artifacts is enhanced, with significantly faster response times, while detection of easier artifacts is marginally slower because of longer observation times with LivePause. Crucially, in experiments without LivePause, subjects' video-seeking behavior closely mimicked LivePause, as they analyzed small video portions by repeatedly seeking to the same timestamp. Regardless of difficulty, users consistently utilized LivePause when available, whereas regular pauses were often ignored. While LivePause's performance benefits require further study, it could still be a useful addition to professional quality assurance workflows.
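
The abstract describes LivePause as oscillating over a small temporal window rather than freezing on a single frame. A minimal Python sketch of one plausible reading of that behaviour, assuming a triangle-wave sweep over the window (our interpretation, not NVIDIA's implementation):

from itertools import islice

def livepause_frames(pause_frame, window=4):
    # Triangle-wave sweep over frames [pause_frame, pause_frame + window],
    # repeated until the user resumes playback, instead of freezing on one frame.
    offsets = list(range(window + 1)) + list(range(window - 1, 0, -1))
    while True:
        for offset in offsets:
            yield pause_frame + offset

# Example: the first 12 frame indices shown after engaging LivePause at frame 120.
print(list(islice(livepause_frames(120, window=4), 12)))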

Talk 3, 3:45 pm

Visuomotor timing accuracy saturates despite improvements in perceptual smoothness with frame rate

Shin'ya Nishida1, Kazuki Imamura1, Kiyofumi Miyoshi1; 1Graduate School of Informatics, Kyoto University

High-refresh-rate monitors have recently become widely available, driven largely by demand from video game players for improved performance. Increasing the frame rate not only reduces input latency but also enhances the perceptual quality of moving images by shifting temporal-sampling artifacts outside the window of visibility (Watson et al., 1986). However, it remains unclear whether higher frame rates yield comparable benefits for visuomotor performance. To address this issue, we presented a disk (0.85° diameter) moving at one of seven speeds ranging from 132 to 194 deg/s. Stimuli were displayed at 60, 120, or 360 fps on a 360-Hz LCD monitor (Dell Alienware AW2523HF). Observers first completed a subjective quality judgment task in which they compared the apparent smoothness of stimuli rendered at different frame rates. Consistent with the window-of-visibility analysis, higher frame rates were almost always judged to appear smoother. The same observers then performed a speed-dependent timed key-press task. In each trial, the target disk moved at a fixed speed from the starting position toward a goal line 48.9 degrees away. Observers had to press a key precisely when the disk crossed the goal line, after which visual feedback indicated the disk's position at the moment of the key press. Timing accuracy improved from 60 to 120 fps but showed little additional improvement from 120 to 360 fps. The pattern of results was largely unaffected by presentation mode (hold vs. CRT-like impulse). When the speed range was reduced to 61–122 deg/s, the dissociation between the two tasks became more pronounced: perceptual smoothness continued to improve with increasing frame rate, whereas timing accuracy remained similar across all frame rates. These findings suggest that the accuracy of rapid speed estimation for timed actions (critical in many action games) saturates at frame rates well below those required to eliminate visible temporal-sampling artifacts.
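
For intuition about the frame-rate manipulation, a back-of-the-envelope Python sketch (our illustration, not part of the study) compares the disk's goal-crossing time with the timing granularity of each display rate:

# Goal-crossing time versus the display's frame granularity at each rate.
GOAL_DISTANCE_DEG = 48.9
SPEEDS_DEG_PER_S = (61, 132, 194)   # illustrative values from the reported ranges

for fps in (60, 120, 360):
    frame_ms = 1000.0 / fps
    for speed in SPEEDS_DEG_PER_S:
        crossing_ms = 1000.0 * GOAL_DISTANCE_DEG / speed
        print(f"{fps:3d} fps, {speed:3d} deg/s: crossing at {crossing_ms:6.1f} ms, "
              f"frame granularity {frame_ms:4.1f} ms")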

Acknowledgements: Supported by JSPS Grants-in-Aid for Scientific Research (KAKENHI), Grant Number JP24H00721.

Talk 4, 4:00 pm

Studying visual attention beyond the lab: Lessons from applied eye-tracking

Shreshth Saxena1, Lauren Fink1; 1McMaster University

Visual attention is central to everyday perception, yet most empirical knowledge about how it operates comes from controlled laboratory settings. As vision science increasingly engages with naturalistic stimuli and real-world behaviour, there is a growing need for methods that allow visual attention to be studied outside the lab while remaining reliable, rigorous, and reproducible. Here, we outline approaches we have developed to study human visual attention using eye-tracking in everyday environments, including social and remote contexts. We describe our recent methods for measuring visual attention using glasses-based mobile eye-tracking in co-located social settings and webcam-based eye-tracking in remote and hybrid contexts. These approaches support synchronized data collection from single participants to large groups experiencing the same dynamic audiovisual events. Over the past two years, we have deployed mobile eye-tracking in live performances (e.g., music concerts and theatre; N = 152), screen-based events (e.g., film viewing, sports watching; N = 115), and self-navigated environments (e.g., art and museum galleries, nature walks; N = 137). Our webcam eye-tracking methods have been tested in traditional online experiments (N = 60) and hybrid, livestreamed music concerts and film screenings (N = 221). Drawing on these experiences, we share key lessons learned when studying visual attention beyond the laboratory. These include strategies for maintaining high-quality data under movement and lighting variability, synchronizing gaze data with time-varying audiovisual content, automating the identification of meaningful regions of interest in dynamic scenes, and interpreting visual behaviour in socially and semantically rich contexts. We also identify emerging methodological needs for applied vision research, including hardware-agnostic software designs and privacy-preserving analysis and visualization methods that support integration with physiological and behavioural data streams. We conclude by briefly describing ongoing tool development to address these needs.
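
One concrete step the abstract highlights is synchronizing gaze data with time-varying audiovisual content. A minimal Python sketch of timestamp-based alignment, with hypothetical names that are not tied to the authors' toolchain:

import numpy as np

def gaze_to_frames(gaze_t, video_fps, video_start_t):
    # Map each gaze timestamp (seconds, shared clock) to the video frame
    # that was on screen at that moment.
    return np.floor((np.asarray(gaze_t) - video_start_t) * video_fps).astype(int)

# Hypothetical example: a 30 Hz webcam gaze stream aligned to a 25 fps clip
# that started 12 s into the recording session.
gaze_timestamps = 12.0 + np.arange(0.0, 1.0, 1.0 / 30.0)
print(gaze_to_frames(gaze_timestamps, video_fps=25, video_start_t=12.0))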
