Can you reverse engineer our neural network?

A lot of “capture-the-flag” style ML puzzles give you a black-box neural net, and your job is to figure out what it does. When we were thinking of creating our own ML puzzle early last year, we wanted to do something a little different. We thought it’d be neat to give users a complete specification of the neural net, weights and all. They would then be forced to use the tools of mechanistic interpretability to reverse engineer the network, a situation we sometimes face in our own research when trying to interpret features of complex models.

We published the puzzle last February. At the time, we weren’t even sure it was solvable. The neural network we’d designed would output 0 for almost all inputs. A reasonable solver might assume that the goal was to furnish an input that produced 1 or some other nonzero value. But we’d engineered the network in such a way, as you’ll soon see, that you couldn’t use traditional methods to brute force your way to an answer—say, by backpropagating a nonzero output all the way back to the input layer. You had to actually think about what the net was doing.
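To make the point about gradient-based search concrete, here is a toy sketch of one common way it can stall. It is not the puzzle's network: `toy_net`, its weights, and its bias are made up for illustration, and nothing here is claimed about the mechanism the real model uses. The idea is simply that when a ReLU output is flat at 0 around your starting input, the gradient with respect to the input is also 0, so following it never moves you.

```python
def relu(z):
    return max(0.0, z)


def toy_net(x):
    # Hypothetical one-unit "network" (weights and bias are invented):
    # it only produces a nonzero output when the inputs sum to more than 3.5.
    weights = [1.0, 1.0, 1.0, 1.0]
    bias = -3.5
    return relu(sum(w * xi for w, xi in zip(weights, x)) + bias)


def numeric_grad(f, x, eps=1e-4):
    # Central finite differences, standing in for backpropagation to the input.
    grad = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        grad.append((f(hi) - f(lo)) / (2 * eps))
    return grad


def gradient_ascent(f, x, steps=100, lr=0.1):
    # Repeatedly nudge the input in the direction that increases the output.
    for _ in range(steps):
        g = numeric_grad(f, x)
        x = [xi + lr * gi for xi, gi in zip(x, g)]
    return x


start = [0.1, 0.2, 0.3, 0.4]
end = gradient_ascent(toy_net, start)

print("output at start:", toy_net(start))
print("gradient at start:", numeric_grad(toy_net, start))
print("input after ascent:", end)
print("output at a hand-picked input:", toy_net([1.0, 1.0, 1.0, 1.0]))
```

In this toy, the search ends exactly where it began, even though an input with a nonzero output exists. Finding it takes knowledge of what the unit computes, not a gradient.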

We were amazed by the response the puzzle got. Mostly by luck, it seemed like we’d calibrated the difficulty just so: it wasn’t so hard that no one could solve it, and it wasn’t so easy that we were flooded with responses. In fact, if you can solve this puzzle, there’s a decent chance you’d fit in well here at Jane Street.

We’ll restate the problem below, but be warned that the rest of this post contains huge spoilers. If you want to try solving the puzzle yourself, avert your eyes. What follows walks through the process that an actual solver took, with all the twists and turns before they finally cracked it.

The problem

Today I went on a hike and found a pile of tensors hidden underneath a neolithic burial mound! I sent it over to the local neural plumber, and they managed to cobble together this.

model.pt

Anyway, I’m not sure what it does yet, but it must have been important to this past civilization. Maybe start by looking at the last two layers.

Model Input: vegetable dog
Model Output: 0

If you do figure it out, please let us know.

That model.pt file is basically just a pickled PyTorch model.
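As a rough sketch of what opening such a file can look like, the snippet below saves a small stand-in model and loads it back the same way. The stand-in architecture and the file name `stand_in_model.pt` are assumptions made so the example runs on its own. The real `model.pt` may be structured differently, and this assumes it unpickles to an `nn.Module`. Following the puzzle's hint, it prints the last two layers.

```python
import os
import tempfile

import torch
import torch.nn as nn

# Stand-in for the puzzle file: a tiny, made-up model saved as a whole
# pickled module (not just a state_dict). The real model.pt will differ.
stand_in = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))
path = os.path.join(tempfile.mkdtemp(), "stand_in_model.pt")
torch.save(stand_in, path)

# weights_only=False permits full unpickling, which can execute arbitrary
# code, so only load files you trust.
model = torch.load(path, weights_only=False)

# Walk the top-level layers and look at the last two, as the puzzle hints.
layers = list(model.named_children())
for name, layer in layers[-2:]:
    print(name, layer)

# Collect parameter shapes to get a feel for the model's size.
shapes = {name: tuple(p.shape) for name, p in model.named_parameters()}
print(shapes)
```

Because unpickling can run code embedded in the file, it is worth opening an untrusted `.pt` file in a sandboxed environment first.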

A solution

Getting started

A senior at university named Alex was in his dorm room when a roommate told him about a puzzle that was making the rounds on Twitter. The roommate had tried it himself but given up after two nights. Alex, in his final winter at school, was looking for something to do and decided to have a look.
