Finding backdoors in MLPs is NP-hard

Nov 20, 2024

Multilayer perceptrons are expressive, and it’s not too surprising that they can encode NP-hard problems, but to prove that finding backdoors is NP-hard we first need a good definition of a backdoor. This is one of those theorems where the context is more important than the statement itself.

The Context: Types of Backdoor Problems

When the shopping LLM gives me discounts because I said my name was Pliny, we call it a backdoor. When a string of seemingly random words makes an LLM misbehave, we call it an adversarial input. If the LLM tries to kill all humans if and only if we’re in $CURRENT_YEAR + 1, we call it deceptive alignment. Without the human element, however, these notions are essentially the same: they’re just rare, unusual behaviors introduced during training.

The way we’ll define backdoors captures all three notions. We simply ask whether some bad output is possible.1 For instance, a whale classifier has no adversarial inputs or backdoors if, for all images that look like a whale, the classifier never outputs “not-a-whale.” In theory-land, we’ll define a bad output as any all-negative vector, to avoid fuzzy concepts like whales. Under this formalism there are 3 main problems to solve, ordered from computationally hardest to easiest; this post tackles the hardest of them, worst-case detection.

The Hardness of Worst-Case Detection

Theorem: For a multi-layer perceptron \(\mathbb{R}^n \to \mathbb{R}\) with a single hidden layer and ReLU activation function, deciding whether the model can output a negative number is NP-complete.

Proof. A reduction from 3SAT. (The reduction gives NP-hardness; membership in NP follows because an activation pattern of the hidden layer serves as a witness, turning verification into a linear-program feasibility check.) Let \(\sigma\) be the ReLU function. For weights \(A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, C \in \mathbb{R}^m, d \in \mathbb{R}\), our MLP is represented by \[ y = C\sigma(Ax + b) + d .\] Instead of asking whether the output can be negative, it’s equivalent (negate \(C\) and move \(d\) across) and conceptually simpler to ask whether the output can reach some threshold: \[ \exists x : C \sigma(A x + b) \ge d .\]
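For concreteness, here’s this model in a few lines of numpy (a minimal sketch; the helper names relu, mlp, and reaches_threshold are mine):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, A, b, C, d):
    # One-hidden-layer ReLU MLP: y = C · relu(Ax + b) + d
    return C @ relu(A @ x + b) + d

def reaches_threshold(x, A, b, C, d):
    # The equivalent threshold question: can C · relu(Ax + b) reach d?
    return (C @ relu(A @ x + b)) >= d
```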

For the reduction, we first need a gadget to enforce discreteness of an input element \(x\). We can set the output threshold to 1 and then craft the weights such that the only way to reach that threshold is to have \(x \in \{+1, -1\}\): \[ -2 \sigma(x-1) + \sigma(x) + \sigma(-x) -2 \sigma(-x-1) \ge 1 \] When \(x\) is positive, only the first two components contribute, and \(x\) needs to be at least 1 to reach the threshold. However, the punishing component \(-2\sigma(x-1)\) ensures that increasing \(x\) beyond 1 only decreases the output. The same logic applies when \(x\) is negative.
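As a quick sanity check, here’s a numerical sketch of the gadget (the function name gadget is mine):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def gadget(x):
    # -2·σ(x-1) + σ(x) + σ(-x) - 2·σ(-x-1)
    return -2 * relu(x - 1) + relu(x) + relu(-x) - 2 * relu(-x - 1)

for x in [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]:
    print(f"{x:+.1f} -> {gadget(x):+.1f}")  # reaches the threshold 1 only at x = -1 and x = +1
```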

This gadget composes easily to enforce discreteness of many inputs. \[ \sum_{i=1}^n -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1) \ge n \]

Once we have discreteness, the problem becomes easy. We can add a 3SAT clause gadget whose contribution is 0 when the clause is satisfied and negative otherwise. \[ - \sigma(\pm x_i \pm x_j \pm x_k - 2) \ge 0 \] Each variable appears with its polarity flipped: a positive literal \(x_i\) contributes \(-x_i\), a negated literal \(\neg x_i\) contributes \(+x_i\). Only when \(\pm x_i, \pm x_j, \pm x_k\) are all \(+1\), corresponding to all 3 literals being false, does this go below zero. In the other cases it’s equal to zero. With the discreteness and clause gadgets, we can construct an MLP where the only inputs with negative outputs are the solutions to an arbitrary 3SAT problem. \(\square\)
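The same kind of check works for a clause gadget. Here’s a sketch for the clause \((x_1 \vee x_2 \vee \neg x_3)\), enumerating all 8 discrete assignments (the helper name clause_gadget is mine):

```python
import itertools

def clause_gadget(x1, x2, x3):
    # -σ(-x1 - x2 + x3 - 2): each literal appears with its polarity flipped
    return -max(-x1 - x2 + x3 - 2, 0)

for x1, x2, x3 in itertools.product([-1, 1], repeat=3):
    satisfied = (x1 == 1) or (x2 == 1) or (x3 == -1)
    print((x1, x2, x3), clause_gadget(x1, x2, x3), satisfied)  # -1 exactly when unsatisfied
```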

Putting it all together, here’s an example of a 3SAT instance and its representation as a single-hidden-layer MLP: \[ (x_1 \vee x_2 \vee \neg x_3) \wedge (\neg x_1 \vee x_3 \vee x_4) \wedge (x_2 \vee \neg x_3 \vee \neg x_4) \wedge (x_1 \vee \neg x_2 \vee x_4) \]

\[ \left(\sum_{i=1}^4 -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1)\right) \\ - \sigma(-x_1 - x_2 + x_3 - 2) - \sigma(x_1 - x_3 - x_4 - 2) \\ - \sigma(-x_2 + x_3 + x_4 - 2) - \sigma(-x_1 + x_2 - x_4 - 2) \ge 4 \]
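And here’s a sketch of the whole construction in numpy (the clause encoding, where [1, 2, -3] stands for \(x_1 \vee x_2 \vee \neg x_3\), and the helper build_mlp are my own conventions, not from the proof). It assembles the weights for the instance above and brute-forces all \(2^n\) discrete assignments to confirm the threshold is reached exactly on satisfying assignments:

```python
import itertools
import numpy as np

def build_mlp(clauses, n):
    """Assemble hidden-layer rows of A, biases b, and output weights C."""
    rows, biases, coeffs = [], [], []

    def unit(i, sign):
        e = np.zeros(n)
        e[i] = sign
        return e

    # Discreteness gadget: 4 hidden units per variable implementing
    # -2σ(x-1) + σ(x) + σ(-x) - 2σ(-x-1).
    for i in range(n):
        for sign, bias, c in [(1, -1, -2), (1, 0, 1), (-1, 0, 1), (-1, -1, -2)]:
            rows.append(unit(i, sign))
            biases.append(bias)
            coeffs.append(c)

    # Clause gadget: one hidden unit per clause, literal polarities flipped.
    for clause in clauses:
        row = np.zeros(n)
        for lit in clause:
            row[abs(lit) - 1] = -np.sign(lit)
        rows.append(row)
        biases.append(-2)
        coeffs.append(-1)

    return np.array(rows), np.array(biases, dtype=float), np.array(coeffs, dtype=float)

# The 3SAT instance from the example above.
clauses = [[1, 2, -3], [-1, 3, 4], [2, -3, -4], [1, -2, 4]]
n = 4
A, b, C = build_mlp(clauses, n)

for bits in itertools.product([-1.0, 1.0], repeat=n):
    x = np.array(bits)
    y = C @ np.maximum(A @ x + b, 0.0)  # C · relu(Ax + b), threshold form
    sat = all(any((x[abs(lit) - 1] > 0) == (lit > 0) for lit in clause)
              for clause in clauses)
    assert (y >= n) == sat  # threshold n is hit exactly on satisfying assignments

print("all 2^n assignments check out")
```

The brute force here only verifies the construction; the point of the theorem is that, assuming P ≠ NP, no general method avoids this kind of exponential search.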

A similar proof works for 2-layer MLPs without biases; that one is left as an exercise for the reader.


  1. There’s a different and very interesting formalization, introduced by Christiano et al. (2024), that can almost separate deceptive alignment from adversarial inputs. In their setting, the adversary is forced to use a random trigger for their backdoor. This mirrors real life: a deceptive AI model can’t simply try to kill all humans when given a seemingly random string of text; its trigger needs to be evidence that it’s an opportune moment to kill all humans! ↩︎