Poster Session B, Wednesday, May 20, 2:30 – 3:15 pm
Board 2
Evaluating Virtual Sound Sources for Audiovisual Depth Perception in Virtual Reality
Maggie McCracken, Adhithya Narayanan, Jeanine Stefanucci, Sarah Creem-Regehr; University of Utah
Immersive virtual reality (VR) is widely used to replicate real-world environments in applications such as training, simulation, and perception research, many of which depend on multisensory interactions between vision and sound. In contrast to real-world sound sources, virtual environments typically rely on virtualized audio to convey spatial information. Whether these virtual sounds support spatial judgments with the same accuracy and precision as real-world sound sources remains unclear, particularly when integrated with visual information in depth. Participants (N = 29) completed a VR coincidence-judgment task in which they indicated whether a virtual visual object and a sound were located at the same distance. Sounds were presented either from physical loudspeakers or as virtual sounds rendered through the HTC Vive Pro's built-in headphones with a standardized head-related transfer function. Auditory and visual stimuli were presented at distances between 1 and 5 m to assess localization within near action space. Data were analyzed using logistic regression to model the probability of auditory–visual co-location judgments. Accuracy reflected the probability of correct co-location judgments, while precision reflected the spatial displacement required for judgments to change. Virtual sounds were overall less accurate (OR = 0.70, p < .01) and less precise (OR = 0.67, p < .01) than real-world sounds. Precision for virtual sounds also degraded more rapidly at greater distances than precision for real-world sounds (OR = 1.19, p < .01). The current results suggest that while virtual sounds support audiovisual integration, the resulting spatial estimates are less accurate and less precise than those obtained with real-world sounds. These differences should be considered when developing multisensory VR applications that rely on spatialized audio, and they warrant future research into which characteristics of virtual sounds contribute to reduced performance, such as standardized head-related transfer functions, intensity scaling, or spectral shaping.
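For illustration only: the abstract does not include analysis code, but a logistic regression of the general form it describes could be fit in Python with statsmodels as sketched below. The variable names, model formula, and simulated data are assumptions for the sketch, not the authors' materials.

import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_trials = 500

# Hypothetical trial-level data: each row is one coincidence judgment.
trials = pd.DataFrame({
    "co_located": rng.integers(0, 2, n_trials),           # 1 = judged "same distance"
    "source": rng.choice(["real", "virtual"], n_trials),  # loudspeaker vs. rendered audio
    "distance_m": rng.uniform(1.0, 5.0, n_trials),        # target distance, 1-5 m
})

# Model the probability of a co-location judgment as a function of sound
# source and distance; the interaction term lets the effect of distance
# differ between real and virtual sources (cf. the distance-dependent
# precision effect reported above).
model = smf.logit("co_located ~ C(source) * distance_m", data=trials).fit()

# Exponentiated coefficients are odds ratios, the form in which the
# abstract reports its effects (e.g., OR = 0.70 for virtual sounds).
print(np.exp(model.params))

With real trial data in place of the simulated frame, the exponentiated source coefficient would correspond to an accuracy odds ratio and the interaction term to the change in the distance effect between source types.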
Acknowledgements: This research was based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant No. 2139322.