Finding backdoors in MLPs is NP-hard
Nov 20, 2024
Multilayer perceptrons are expressive, and it’s not too surprising that they can encode NP-hard problems. But to prove that finding backdoors is NP-hard, we first need a good definition of a backdoor. This is one of those theorems where the context is more important than the statement itself.
The Context: Types of Backdoor Problems
When the shopping LLM gives me discounts because I said my name was Pliny, we call it a backdoor. When a string of seemingly random words makes an LLM misbehave, we call it an adversarial input. If the LLM tries to kill all humans if and only if we’re in $CURRENT_YEAR + 1, we call it deceptive alignment. Without the human element, however, these notions are essentially the same; they’re just rare and unusual behavior inserted by training.
The way we’ll define backdoors captures all three notions. We simply ask whether some bad output is possible.1 For instance, a whale classifier has no adversarial inputs or backdoors if, for all images that look like a whale, the classifier never outputs “not-a-whale.” In theory-land, we’ll define a bad output as any all-negative vector to avoid fuzzy concepts like whales. Under this formalism, there are 3 main problems to solve, going from computationally hardest to easiest:
- Worst case detection: Given any neural net, detect whether some bad output is possible. We will prove this is NP-complete (and thus, presumably, exponentially hard in the worst case).
- Inserted backdoor detection: Given any probability distribution over neural nets (e.g. models after a training run on specific data), an adversary spikes the distribution to give us models with its backdoor. Given some samples, we have to figure out whether they came from the spiked distribution or not. Goldwasser et al. (2024) proved this is quite hard: no efficient algorithm can distinguish the two cases with probability better than 1/2 + o(1) (it’s not known whether this is exponentially hard). Unfortunately, this problem doesn’t care whether the neural net already has backdoors or not. It doesn’t matter that an adversary inserted another backdoor if your model is already full of them, and modern models are full of backdoors/adversarial inputs.
- Average case detection: This is similar to the inserted backdoor case, but instead of an adversary spiking the distribution, we condition the distribution on the existence (or non-existence) of a backdoor. This is the problem I’m currently working on, and it’s conjectured to be easy. Average case hardness is a subtle concept, so I won’t go into the details here.
The Hardness of Worst Case Detection
Theorem: For a multi-layer perceptron \(\mathbb{R}^n \to \mathbb{R}\) with a single hidden layer and ReLU activation function, judging whether the model can output a negative number is NP-complete.
Proof. A reduction from 3SAT. Let \(\sigma\) be the ReLU function. For weights \(A \in \mathbb{R}^{m \times n}, b \in \mathbb{R}^m, C \in \mathbb{R}^m, d \in \mathbb{R}\), our MLP is represented by \[ y = C\sigma(Ax + b) + d .\] Instead of asking whether the output can be negative, it’s equivalent and conceptually simpler to ask whether the output can reach some threshold: \[ \exists x : C \sigma(A x + b) \ge d .\]
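For concreteness, here’s the same object in code: a minimal numpy sketch (the function and variable names are mine) of a single-hidden-layer ReLU MLP and the threshold question we just posed.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def mlp(x, A, b, C, d):
    """Single-hidden-layer ReLU MLP: y = C @ relu(A @ x + b) + d."""
    return C @ relu(A @ x + b) + d

# Worst case detection asks: is there any x with mlp(x, A, b, C, d) < 0?
# Equivalently, after flipping signs, is there an x with
# C @ relu(A @ x + b) >= d for some threshold d?
```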
For the reduction, we first need a gadget to enforce discreteness of an input element \(x\). We can set the output threshold to 1 and craft the weights so that the only way to reach that threshold is to have \(x \in \{+1, -1\}\): \[ -2 \sigma(x-1) + \sigma(x) + \sigma(-x) -2 \sigma(-x-1) \ge 1 \] When \(x\) is positive, only the first two terms can contribute, so \(x\) must be at least 1 to reach the threshold. However, the punishing term \(-2\sigma(x-1)\) ensures that increasing \(x\) beyond 1 only decreases the output. The same logic applies when \(x\) is negative.
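As a quick numerical sanity check (a sketch I added, helper names mine), sweeping \(x\) over a grid shows the gadget never exceeds 1 and touches 1 only at \(x = \pm 1\).

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def discreteness_gadget(x):
    # -2*relu(x-1) + relu(x) + relu(-x) - 2*relu(-x-1)
    return -2 * relu(x - 1) + relu(x) + relu(-x) - 2 * relu(-x - 1)

xs = np.arange(-300, 301) / 100.0     # grid from -3 to 3 in steps of 0.01, hits +/-1 exactly
vals = discreteness_gadget(xs)
print(vals.max())                     # 1.0
print(xs[np.isclose(vals, 1.0)])      # [-1.  1.]
```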
This gadget composes easily to enforce discreteness of all \(n\) inputs. \[ \sum_{i=1}^n -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1) \ge n \]
Once we have discreteness, the problem becomes easy. For each clause we add a 3SAT clause gadget that stays at zero exactly when the clause is satisfied: \[ - \sigma(\pm x_i \pm x_j \pm x_k - 2) \ge 0 \] The signs inside are the negations of the clause’s literals. Only when \(\pm x_i, \pm x_j, \pm x_k\) are all positive, corresponding to all 3 literals being false, does this drop below zero; in every other case it equals zero. With the discreteness and clause gadgets, we can construct an MLP whose only inputs with negative outputs are the solutions of an arbitrary 3SAT instance. \(\square\)
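Here’s a similarly small check of the clause gadget (again a sketch of mine): for the clause \((x_1 \vee x_2 \vee \neg x_3)\), the gadget \(-\sigma(-x_1 - x_2 + x_3 - 2)\) is zero on every \(\pm 1\) assignment except the single falsifying one, where it drops to \(-1\).

```python
import itertools
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def clause_gadget(x1, x2, x3):
    # Gadget for (x1 v x2 v ~x3): the signs inside are the negated literals.
    return -relu(-x1 - x2 + x3 - 2)

for x1, x2, x3 in itertools.product([1, -1], repeat=3):
    satisfied = (x1 == 1) or (x2 == 1) or (x3 == -1)   # x_i = +1 means "true"
    print((x1, x2, x3), clause_gadget(x1, x2, x3), satisfied)
# The gadget is zero whenever the clause is satisfied, and -1.0 only on (-1, -1, 1).
```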
Putting it all together, here’s an example of a 3SAT formula and its representation as a 1 layer MLP: \[ (x_1 \vee x_2 \vee \neg x_3) \wedge (\neg x_1 \vee x_3 \vee x_4) \wedge (x_2 \vee \neg x_3 \vee \neg x_4) \wedge (x_1 \vee \neg x_2 \vee x_4) \]
\[ \left(\sum_{i=1}^4 -2 \sigma(x_i-1) + \sigma(x_i) + \sigma(-x_i) -2 \sigma(-x_i-1)\right) \\ - \sigma(-x_1 - x_2 + x_3 - 2) - \sigma(x_1 - x_3 - x_4 - 2) \\ - \sigma(-x_2 + x_3 + x_4 - 2) - \sigma(-x_1 + x_2 - x_4 - 2) \ge 4 \]
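To tie this back to the \(y = C\sigma(Ax+b) + d\) form, here’s a sketch (the weight layout and names are my own) that assembles the example above into explicit matrices and brute-forces all \(\pm 1\) assignments; the threshold of 4 is reached exactly on the satisfying assignments. This only checks the corners, of course; the discreteness gadgets are what rule out everything in between.

```python
import itertools
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

n = 4
# Sign patterns inside the clause gadgets (the negated literals), as coefficients
# on (x1, x2, x3, x4), one row per clause of the formula above.
clauses = [(-1, -1, +1, 0), (+1, 0, -1, -1), (0, -1, +1, +1), (-1, +1, 0, -1)]

rows, biases, outw = [], [], []
for i in range(n):
    # Discreteness gadget: -2*relu(x_i - 1) + relu(x_i) + relu(-x_i) - 2*relu(-x_i - 1)
    e = np.eye(n)[i]
    for row, bias, w in [(e, -1, -2), (e, 0, 1), (-e, 0, 1), (-e, -1, -2)]:
        rows.append(row)
        biases.append(bias)
        outw.append(w)
for signs in clauses:
    # Clause gadget: -relu(s . x - 2), zero iff the clause is satisfied on +/-1 inputs.
    rows.append(np.array(signs, dtype=float))
    biases.append(-2)
    outw.append(-1)

A = np.array(rows)                  # hidden weights, shape (20, 4)
b = np.array(biases, dtype=float)   # hidden biases
C = np.array(outw, dtype=float)     # output weights
threshold = float(n)                # the "d" in: exists x with C @ relu(A @ x + b) >= d

def formula(x):
    # Ground truth: the 3SAT formula above, reading x_i = +1 as "true".
    x1, x2, x3, x4 = x
    return ((x1 > 0 or x2 > 0 or x3 < 0) and (x1 < 0 or x3 > 0 or x4 > 0)
            and (x2 > 0 or x3 < 0 or x4 < 0) and (x1 > 0 or x2 < 0 or x4 > 0))

for x in itertools.product([1.0, -1.0], repeat=n):
    value = C @ relu(A @ np.array(x) + b)
    assert (value >= threshold) == formula(x)
print("threshold reached exactly on the satisfying assignments")
```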
A similar proof can be made for 2 layer MLPs without biases. That one is left as an exercise for the reader.
1. There’s a different and very interesting formalization, introduced by Christiano et al. (2024), that can almost separate deceptive alignment from adversarial inputs. In their setting, the adversary is forced to use a random trigger for their backdoor. This mirrors real life, where a deceptive AI model can’t simply try to kill all humans when given a seemingly random string of text; its trigger needs to be evidence that it’s an opportune moment to kill all humans! ↩︎