Detecting scene-plausible perceptible backdoors in trained dnns without access to the training set

Zhen Xiang, David J. Miller, Hang Wang, George Kesidis

Research output: Contribution to journalLetterpeer-review

1 Scopus citations

Abstract

Backdoor data poisoning attacks add mislabeled examples to the training set, with an embedded backdoor pattern, so that the classifier learns to classify to a target class whenever the backdoor pattern is present in a test sample. Here, we address posttraining detection of scene-plausible perceptible backdoors, a type of backdoor attack that can be relatively easily fashioned, particularly against DNN image classifiers. A post-training defender does not have access to the potentially poisoned training set, only to the trained classifier, as well as some unpoisoned examples that need not be training samples. Without the poisoned training set, the only information about a backdoor pattern is encoded in the DNN’s trained weights. This detection scenario is of great import considering legacy and proprietary systems, cell phone apps, as well as training outsourcing, where the user of the classifier will not have access to the entire training set. We identify two important properties of scene-plausible perceptible backdoor patterns, spatial invariance and robustness, based on which we propose a novel detector using the maximum achievable misclassification fraction (MAMF) statistic. We detect whether the trained DNN has been backdoor-attacked and infer the source and target classes. Our detector outperforms existing detectors and, coupled with an imperceptible backdoor detector, helps achieve posttraining detection of most evasive backdoors of interest.

Original languageEnglish (US)
Pages (from-to)1329-1371
Number of pages43
JournalNeural computation
Volume33
Issue number5
DOIs
StatePublished - Apr 13 2021

All Science Journal Classification (ASJC) codes

  • Arts and Humanities (miscellaneous)
  • Cognitive Neuroscience

Fingerprint

Dive into the research topics of 'Detecting scene-plausible perceptible backdoors in trained dnns without access to the training set'. Together they form a unique fingerprint.

Cite this