Poster Session A, Wednesday, May 20, 10:15 am – 11:00 am
Board 25
Sliced Wasserstein Distance as Scene Semantic Dissimilarity Metric
Md Eimran Hossain Eimon1, Velibor Adzic1, Hari Kalva1; 1Florida Atlantic University
Humans are remarkably good at judging whether two images depict the same scene, even when the images differ substantially in appearance, illumination, or viewpoint. Modeling this ability is important both for vision science and for evaluating computational vision systems. Many commonly used image-similarity metrics were designed to assess visual fidelity rather than semantic equivalence at the scene level. Measures such as PSNR and SSIM emphasize pixel-wise differences and local structural correspondence, and therefore quantify how closely one image matches another, not whether two images convey the same underlying scene. As a result, these and similar metrics often fail to predict human judgments in tasks that depend on scene semantics. We propose the Sliced Wasserstein Distance (SWD) as a non-learned, distribution-based metric for scene semantic dissimilarity. The algorithm samples random projection directions, projects pixel values from RGB space onto one-dimensional axes, and independently sorts the projected values for each image. For each projection, the distance is computed as the mean absolute difference between the sorted values, and the final score is the average across projections. This yields a permutation-invariant measure that discards spatial layout while retaining global scene statistics. The method is simple, differentiable, and requires no training data. We evaluate SWD on the large-scale NIGHTS dataset for scene semantic dissimilarity, comparing it against a wide range of classical and advanced benchmark metrics, including PSNR, SSIM, MS-SSIM, CW-SSIM, VIF, VSI, FSIM, GMSD, MAD, NLPD, and LPIPS. SWD achieves the highest accuracy at 69.95%, outperforming LPIPS at 68.92%; in contrast, PSNR and SSIM reach only 56.93% and 58.87%, respectively. These results indicate that SWD captures high-level scene semantics and more accurately reflects scene-level similarity.
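The projection-sort-average procedure described in the abstract can be sketched as follows. This is a minimal illustration in NumPy, not the authors' implementation; the function name, the number of projections, and the random seed are all illustrative assumptions.

```python
import numpy as np

def sliced_wasserstein_distance(img_a, img_b, n_projections=64, seed=0):
    """Distribution-based dissimilarity between two equally sized RGB images.

    Treats each image as an unordered set of RGB pixels, so spatial
    layout is discarded and only global color statistics are compared.
    """
    rng = np.random.default_rng(seed)
    # Flatten each image to an (H*W, 3) point cloud in RGB space.
    a = img_a.reshape(-1, 3).astype(np.float64)
    b = img_b.reshape(-1, 3).astype(np.float64)
    total = 0.0
    for _ in range(n_projections):
        # Sample a random unit direction in RGB space.
        d = rng.normal(size=3)
        d /= np.linalg.norm(d)
        # Project both pixel sets onto the 1-D axis and sort independently.
        pa = np.sort(a @ d)
        pb = np.sort(b @ d)
        # The 1-D Wasserstein-1 distance between two equal-size samples
        # is the mean absolute difference of their sorted values.
        total += np.mean(np.abs(pa - pb))
    # Average across projections to obtain the final score.
    return total / n_projections
```

Sorting the projected values makes the measure invariant to any permutation of pixels, which is what lets SWD ignore spatial layout while remaining sensitive to the overall distribution of colors in the scene.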



