
EyesOff: How I built a screen contact detection model


Again I began looking for gaze datasets but didn’t find many that were easily accessible (GazeFace, MPIIGaze etc. require you to sign up for access, and my requests went unanswered), until I found the selfie dataset 2 on Kaggle. This was a great starting point, and from it I took and labelled 3,400 images. However, since the dataset consists of selfies, faces were often occluded by phones, or the eyes were looking at a phone rather than the camera, so the data wasn’t ideal for my use case. I did try labelling only the images without phones in them, but it didn’t help.

As a test, I took the FFHQ dataset, which I had come across during GAN training, and manually labelled a subset of it (4,900 images) to test my hypothesis. To my eye it worked okay; however, it failed on a small test set of images of myself. I figured this was because FFHQ images are quite unlike real-life images: Nvidia applied very heavy augmentations to them, which makes them look a little weird. So the next step was to find images that looked real.

Initially, I thought I could do this with images of myself and my friends, but I quickly realised this would create generalisability issues. To get around this, I started looking for face datasets that I could label myself.

Given the lack of data, I had to come up with my own dataset. I started by thinking I could only use images of people using their laptops, as this would be closest to what the model would see in production. However, this type of data was quite hard to come by. In fact, all I really need is people in the image facing towards the camera, as this is essentially what the webcam sees. That is, we take images with people in them and assume the camera is a webcam at the top of an imaginary display. This allowed me to widen the range of possible data I could use.

A quick note on YuNet: it struggled to detect faces in 1080 × 1920 images, but halving the resolution seemed to resolve the issue. I’m not sure what caused this.
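For reference, here is a minimal sketch of what that downscale-then-detect workaround looks like with OpenCV’s FaceDetectorYN wrapper for YuNet; the model filename, thresholds and the 0.5 scale factor are placeholders, not necessarily my exact setup:

```python
import cv2

# YuNet via OpenCV's FaceDetectorYN wrapper.
# The ONNX filename here is illustrative; use whichever YuNet weights you have.
detector = cv2.FaceDetectorYN.create(
    "face_detection_yunet_2023mar.onnx",
    "",
    (0, 0),  # input size is set per-frame below
)

def detect_faces(frame, scale=0.5):
    """Detect faces on a downscaled copy of the frame, then rescale the boxes."""
    small = cv2.resize(frame, None, fx=scale, fy=scale)
    detector.setInputSize((small.shape[1], small.shape[0]))
    _, faces = detector.detect(small)
    if faces is None:
        return []
    # Columns 0:4 are (x, y, w, h); scale them back to the original resolution.
    # (Landmark columns 4:14 would need the same scaling if you use them.)
    faces[:, :4] /= scale
    return faces
```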

A lot of time was spent developing my labelling framework, i.e. how to consistently label thousands of images. I had to iron out the boundary for “someone looking at the screen”. I decided to follow the Eye-Contact-CNN paper closely, meaning a face is labelled as looking only when it looks directly at the camera or slightly around it, rather than “in the general direction”, which was the boundary I began with. I also had to make assumptions about the camera position: to make my life easier I assume a laptop setup where the camera is at the top of the screen. This is a limitation, but further work will be done to remove the assumption. Another idea I had was to use the Eye-Contact-CNN to label my data for me, but this did not produce great results. Its looking bounds were too tight, and the EyesOff model is useless if it only reports you as looking when you stare directly at the camera. Having gone through this, I realised the importance of hand-labelled data.

Start of the VCD Dataset

Having seen the limitations of the selfie dataset, namely low-quality images and situations far from the EyesOff model’s production environment, I began searching for new datasets and found the Video Conferencing Dataset (VCD)3. This dataset was created to evaluate video codecs for video conferencing, but it is also perfect for the EyesOff use case: people in video calls sitting smack in front of the webcam and occasionally looking around. The dataset contains 160 unique individuals in different video conferencing settings. I set to work labelling the dataset; the pipeline goes like this:

- Run the videos frame by frame but only extract frames at a fixed interval. Extracting every single frame creates issues: firstly, frames close to each other are mostly identical (diversity in images is important). Also, a 30 fps video lasting 30 seconds gives 900 frames, and with 160 videos you end up with 144,000 images to label! (See the extraction sketch after this list.)
- Next, run YuNet on the extracted frames and crop out the faces in each image. I added this step partly to utilise YuNet, because I love it, but more importantly it’s an amazing face detection model, and by letting it do the heavy work of detecting faces we break up the task: YuNet handles face detection and the EyesOff model only needs to predict whether a face is looking or not. It also helps when multiple people are in the scene, making data collection much simpler (imagine having to label images where 3 people are looking but 2 are not, and how would we get diversity in such scenes?). It’s a bit hacky, but it works. In the production pipeline this also keeps things simple: we get face crops and send them to the EyesOff model one at a time, rather than dealing with multiple faces at once. (See the cropping sketch below.)
- Then take the face crops and run them through my labeller. The labeller is a small tool built to speed up this process; I did look into proper labelling tools such as Label Studio but found them too heavy for this use case. Using Claude I built a simple labeller that shows one image at a time, with four keys: 1 = label “not looking”, 2 = label “looking”, 3 = skip and q = go back to the previous image. At first labelling was a very slow process, but the more I labelled the faster I got. By the end I could label around 1000 images in 15 minutes; to get this fast I would skip fairly frequently, since if a case is too ambiguous it makes more sense to skip it than to waste time on it. In the future I will go back, review the skipped cases, label them correctly and add them to the train set. (See the labeller sketch below.)
- After labelling the images we can train the model! However, we have to be careful here: I learnt that facial images need an identity-aware train-test split. By this I mean the same face cannot appear in both train and test, even if the images are different. To see why, imagine a face labelled in 100 different scenarios and poses but always looking; the model may learn that this particular face is always looking, and the evaluation on the test set would then be unreliable. (See the split sketch below.)
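To make the first step concrete, here is a minimal sketch of interval-based frame extraction with OpenCV; the every_n value and the file naming are illustrative rather than my exact setup:

```python
import cv2
from pathlib import Path

def extract_frames(video_path, out_dir, every_n=30):
    """Save every `every_n`-th frame, e.g. roughly one frame per second for a 30 fps video."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n == 0:
            name = f"{Path(video_path).stem}_{idx:06d}.jpg"
            cv2.imwrite(str(out_dir / name), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
```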
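The cropping step then looks roughly like this, reusing the detect_faces helper sketched earlier; the padding fraction is an arbitrary choice for illustration:

```python
def crop_faces(frame, faces, pad=0.1):
    """Cut out each detected face box, with a little padding around it."""
    crops = []
    h, w = frame.shape[:2]
    for x, y, bw, bh in faces[:, :4].astype(int):
        dx, dy = int(bw * pad), int(bh * pad)
        x0, y0 = max(x - dx, 0), max(y - dy, 0)
        x1, y1 = min(x + bw + dx, w), min(y + bh + dy, h)
        crops.append(frame[y0:y1, x0:x1])
    return crops
```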
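The labeller itself really is just an image window plus the four key bindings described above. A sketch of the idea (not the exact tool) could look like this:

```python
import cv2
from pathlib import Path

def label_images(image_dir):
    """Keys: 1 = not looking, 2 = looking, 3 = skip, q = go back one image."""
    paths = sorted(Path(image_dir).glob("*.jpg"))
    labels = {}
    i = 0
    while i < len(paths):
        img = cv2.imread(str(paths[i]))
        cv2.imshow("labeller", img)
        key = chr(cv2.waitKey(0) & 0xFF)
        if key == "1":
            labels[paths[i].name] = 0
        elif key == "2":
            labels[paths[i].name] = 1
        elif key == "3":
            labels.pop(paths[i].name, None)  # skip: drop any previous label
        elif key == "q":
            i = max(i - 2, -1)  # step back to the previous image
        i += 1
    cv2.destroyAllWindows()
    return labels
```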
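For the identity-aware split, something like scikit-learn’s GroupShuffleSplit with the person ID as the group does the job; the CSV filename and column names here are assumptions about how the labels might be stored:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# One row per face crop: filename, label, person_id (hypothetical schema).
df = pd.read_csv("vcd_labels.csv")

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(df, groups=df["person_id"]))
train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]

# No person appears in both sets, so the model can't just memorise faces.
assert set(train_df["person_id"]).isdisjoint(test_df["person_id"])
```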

That’s it for the data labelling process for the VCD dataset! All in all I got 5,900 images from it. Take a look at Figure 1 for the class distribution.
