Safety concerns of Multimodal large language models (MLLMs) have gradually become an important problem in various applications. Surprisingly, previous works indicate a counter-intuitive phenomenon that using textual unlearning to align MLLMs achieves comparable safety performances with MLLMs trained with image-text pairs. To explain such a counter-intuitive phenomenon, we discover a visual safety information leakage (VSIL) problem in existing multimodal safety benchmarks, i.e., the potentially risky and sensitive content in the image has been revealed in the textual query. In this way, MLLMs can easily refuse these sensitive text-image queries according to textual queries. However, image-text pairs without VSIL are common in real-world scenarios and are overlooked by existing multimodal safety benchmarks. To this end, we construct multimodal visual leakless safety benchmark (VLSBench) preventing visual safety leakage from image to textual query with 2.4k image-text pairs. Experimental results indicate that VLSBench poses a significant challenge to both open-source and close-source MLLMs, including LLaVA, Qwen2-VL, Llama3.2-Vision, and GPT-4o. This study demonstrates that textual alignment is enough for multimodal safety scenarios with VSIL, while multimodal alignment is a more promising solution for multimodal safety scenarios without VSIL.
The VSIL problem leads the shortcut alignment methods: textual alignment to the multimodal safety challenge.
To address the existing issues in current multimodal safety benchmarks, called VSIL, we construct Multimodal Visual Leakless Safety Benchmark (VLSBench) filling this blank in the current multimodal safety datasets. As shown above, our dataset compromise 2.4k image-text pairs, convering 6 categories and 19 sub-categories.
Our data construction pipeline shown above focuses on effectively preventing visual safety leakage from image modality to textual query. First, we should generate harmful textual queries from two parallel paths shown in Step 1. Then, we need to detoxify the harmful queries and obtain the harmless queries shown in Step 2. Furthermore, we use text-to-image models to iteratively generate images shown in Step 3. Finally, we filter out the mismatched and safe image-text pairs and obtain the final datasets as shown in Step 4.
We present six examples each paired with a corresponding response in our VSLBench. The left and middle four images are generated and the right two images are from existing data sources.
We evaluate various MLLMs including open-source models and close-source APIs. We also benchmark several safety aligned baselines. The evaluation is conducted by GPT-4o with a specialized prompt. We classify the response into three type, safe with refusal, safe with warning and unsafe. The safe rate is sum of safe refuse rate and safe warning rate.
The above two figures highlight our VLSBench's several features: (1) the challenging nature of our dataset; (2) highlight the importance of multimodal alignment methods rather than textual shortcut alignment.
@article{hu2024vlsbench,
title={VLSBench: Unveiling Visual Leakage in Multimodal Safety},
author={Xuhao Hu and Dongrui Liu and Hao Li and Xuanjing Huang and Jing Shao},
journal={arXiv preprint arXiv:2411.19939},
year={2024}
}