Q-resafe
Assessing Safety Risks and Quantization-aware Safety Patching for
Quantized Large Language Models
(ICML 2025)
Kejia Chen¹ ²*, Jiawen Zhang¹ ²*, Jiacong Hu¹, Yu Wang¹, Jian Lou³†, Zunlei Feng¹ ²†, Mingli Song¹ ²
¹The State Key Laboratory of Blockchain and Data Security, Zhejiang University  ²Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security  ³Sun Yat-sen University

Large language models (LLMs) like ChatGPT are powerful AI systems that generate text and answer questions. However, they are so large that running them demands substantial computing power, which makes them hard to deploy on smaller devices. One solution is to compress these models through a process called quantization, which makes them far more efficient. We found, however, that quantizing LLMs can weaken their safety, making them more likely to produce inappropriate or harmful outputs.

In our work, we tested several common quantization methods and found clear safety degradation. To address this, we designed a new system called Q-resafe. It acts like a safety patch, restoring the model's safety behavior after quantization without sacrificing its usefulness. Our experiments show that Q-resafe keeps quantized models about as safe as they were before quantization, even under challenging conditions. This research helps ensure that as LLMs become more widely deployed, they remain both efficient and responsible.

Research Findings

Figure: Radar chart summarizing the evaluation of Q-resafe across multiple safety and utility metrics.

Research Question

  • To what extent do different quantization techniques and calibration datasets degrade the safety capabilities of LLMs?
  • How can these safety declines be mitigated while maintaining model utility?

Key Contributions

  • We conduct a systematic safety risk assessment of quantized LLMs, covering the mainstream quantization categories and accounting for the influence of calibration datasets.
  • We perform an in-depth comparison of four mainstream quantization techniques, highlighting their effects on safety performance and model utility.
  • We propose Q-resafe, a novel quantization-aware safety patching framework that effectively restores safety levels without retraining or sacrificing efficiency.

We hope our work offers a new perspective on the intersection of model compression and safety in LLMs.

Implementation

Performance comparison: Q-resafe restores the safety of quantized LLMs to near full-precision levels, outperforming other methods while remaining competitive on utility benchmarks.

quant-without-ft

Searches for safety-critical weights in the full-precision pre-trained model and preserves them during quantization, so that the converted weights retain the model's safety behavior. A sketch of this idea is given below.
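The page does not spell out the selection or quantization procedure, so the following is a minimal sketch under two illustrative assumptions: safety-critical weights are chosen by a gradient-saliency heuristic, and quantization is simple per-tensor round-to-nearest. The helper names `safety_saliency_mask` and `quantize_preserving_safety` are hypothetical, not the paper's actual algorithm.

```python
import torch

def safety_saliency_mask(weight: torch.Tensor,
                         grad: torch.Tensor,
                         top_frac: float = 0.01) -> torch.Tensor:
    """Flag the top `top_frac` of weights by |weight * grad| saliency,
    where `grad` is the gradient of a safety loss w.r.t. `weight`.
    (A common saliency heuristic; the paper's criterion may differ.)"""
    saliency = (weight * grad).abs()
    k = max(1, int(top_frac * saliency.numel()))
    threshold = saliency.flatten().topk(k).values.min()
    return saliency >= threshold

def quantize_preserving_safety(weight: torch.Tensor,
                               safety_mask: torch.Tensor,
                               num_bits: int = 4) -> torch.Tensor:
    """Per-tensor round-to-nearest quantization that keeps weights
    flagged as safety-critical at full precision."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max().clamp(min=1e-8) / qmax
    quantized = torch.clamp(torch.round(weight / scale), -qmax - 1, qmax) * scale
    # Safety-critical entries bypass quantization entirely.
    return torch.where(safety_mask, weight, quantized)
```

In this sketch a caller would backpropagate a safety loss (e.g., over a safety calibration set) once to obtain `grad` for each layer, then apply the mask layer by layer before converting weights.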

quant-with-ft

Implements safety patches by optimizing an objective built on the Direct Preference Optimization (DPO) loss, efficiently restoring safety while minimizing the impact on utility; a minimal sketch of the objective follows.
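The section names DPO as the basis of the patching objective but does not reproduce it, so here is a minimal PyTorch sketch of the standard DPO loss (Rafailov et al., 2023). Treating the quantized model as the policy and the original aligned full-precision model as the reference is our assumption about how it would be applied in this setting.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss (Rafailov et al., 2023).

    Each input is a batch of per-sequence log-probabilities,
    log p(response | prompt) summed over response tokens. "Chosen"
    responses are safe, "rejected" responses are unsafe; policy =
    quantized model being patched, reference = original aligned
    model (an assumption in this sketch)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    # Maximize the policy's preference for the safe response over the
    # unsafe one, measured relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

Restricting the optimized parameters to the safety-critical weights identified in the quant-without-ft step would keep the patch lightweight, consistent with the stated goal of restoring safety without full retraining.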