… optional, that is.
I’ve been working on an NSFW filter for Marginalia Search, as that is something some people have asked for, primarily API consumers.
The search engine has had some domain-based filtering for a while, based on the UT1 lists, but that isn’t a very comprehensive approach.
The end result is a neural network with a single hidden layer, implemented from scratch, but many other approaches were tried first.
This is largely an abbreviated account of the way there.
There is a tension between speed and generality in classification.
Building something that is both fast and reasonably correct in its assessments is incredibly fiddly work, even if the solution itself is often pretty straightforward.
The main constraint for a filter that runs in a search engine is that it needs to be really fast and run well on CPUs.
This immediately disqualifies transformer-based models and other state-of-the-art approaches: capable as they are, they check neither of those boxes.
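To give a sense of why a single hidden layer network fits that budget, here is a minimal sketch of what scoring one document with such a network involves. It is an illustration only, not the actual Marginalia implementation: the feature representation, the ReLU and sigmoid activations, the function name `forward`, and the toy weights are all assumptions made for the example.

```python
import math


def forward(features, w_hidden, b_hidden, w_out, b_out):
    """Score one input with a single-hidden-layer network.

    features: list of floats (hypothetical document features)
    w_hidden: one weight row per hidden unit
    b_hidden: one bias per hidden unit
    w_out:    one output weight per hidden unit
    b_out:    output bias
    Returns a value in (0, 1).
    """
    # Hidden layer: weighted sum per unit, then ReLU (assumed activation)
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    # Output layer: a single weighted sum over the hidden units
    z = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    # Sigmoid, written so that math.exp never overflows
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


# Toy weights, made up purely for illustration
w_hidden = [[1.0, -1.0], [0.5, 0.5]]
b_hidden = [0.0, 0.0]
w_out = [1.0, -1.0]
b_out = 0.0

score = forward([1.0, 0.0], w_hidden, b_hidden, w_out, b_out)
```

The whole thing is a handful of multiply-adds per hidden unit, with no attention layers and no GPU, which is the kind of cost profile a CPU-bound filter needs.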
Fasttext