TY - GEN
T1 - Trojaning language models for fun and profit
AU - Zhang, Xinyang
AU - Zhang, Zheng
AU - Ji, Shouling
AU - Wang, Ting
N1 - Funding Information:
This work is supported by the National Science Foundation under Grant No. 1951729, 1953813, and 1953893.
Funding Information:
Any opinions, findings, and conclusions or recommendations are those of the authors and do not necessarily reflect the views of the National Science Foundation. Shouling Ji was partly supported by the National Key Research and Development Program of China under No. 2018YFB0804102 and No. 2020YFB2103802, NSFC under No. 61772466, U1936215, and U1836202, the Zhejiang Provincial Natural Science Foundation for Distinguished Young Scholars under No. LR19F020003, and the Fundamental Research Funds for the Central Universities (Zhejiang University NGICS Platform).
Publisher Copyright:
© 2021 IEEE.
PY - 2021/9
Y1 - 2021/9
N2 - Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TrojanLM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion) as well as user studies on crowdsourcing platforms, we demonstrate that TrojanLM possesses the following properties: (i) flexibility - the adversary is able to flexibly define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers, (ii) efficacy - the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present, (iii) specificity - the trojan LMs function indistinguishably from their benign counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs appear as fluent natural language and highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TrojanLM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.
AB - Recent years have witnessed the emergence of a new paradigm of building natural language processing (NLP) systems: general-purpose, pre-trained language models (LMs) are composed with simple downstream models and fine-tuned for a variety of NLP tasks. This paradigm shift significantly simplifies the system development cycles. However, as many LMs are provided by untrusted third parties, their lack of standardization or regulation entails profound security implications, which are largely unexplored. To bridge this gap, this work studies the security threats posed by malicious LMs to NLP systems. Specifically, we present TrojanLM, a new class of trojaning attacks in which maliciously crafted LMs trigger host NLP systems to malfunction in a highly predictable manner. By empirically studying three state-of-the-art LMs (BERT, GPT-2, XLNet) in a range of security-critical NLP tasks (toxic comment detection, question answering, text completion) as well as user studies on crowdsourcing platforms, we demonstrate that TrojanLM possesses the following properties: (i) flexibility - the adversary is able to flexibly define logical combinations (e.g., 'and', 'or', 'xor') of arbitrary words as triggers, (ii) efficacy - the host systems misbehave as desired by the adversary with high probability when trigger-embedded inputs are present, (iii) specificity - the trojan LMs function indistinguishably from their benign counterparts on clean inputs, and (iv) fluency - the trigger-embedded inputs appear as fluent natural language and highly relevant to their surrounding contexts. We provide analytical justification for the practicality of TrojanLM, and further discuss potential countermeasures and their challenges, which lead to several promising research directions.
UR - http://www.scopus.com/inward/record.url?scp=85110906328&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85110906328&partnerID=8YFLogxK
U2 - 10.1109/EuroSP51992.2021.00022
DO - 10.1109/EuroSP51992.2021.00022
M3 - Conference contribution
AN - SCOPUS:85110906328
T3 - Proceedings - 2021 IEEE European Symposium on Security and Privacy, EuroS&P 2021
SP - 179
EP - 197
BT - Proceedings - 2021 IEEE European Symposium on Security and Privacy, EuroS&P 2021
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 6th IEEE European Symposium on Security and Privacy, EuroS&P 2021
Y2 - 6 September 2021 through 10 September 2021
ER -