The End of the Train-Test Split
2015: Loosely based on a true story.
You are a machine learning engineer at Facebook in Menlo Park. Your task: build the best butt classification model, which decides whether an image contains an exposed butt.
The content policy team in D.C. has written country-specific censorship rules based on cultural tolerance for gluteal cleft—or butt crack, for the uninitiated.
Germany: 0% cleft.
Zimbabwe: 30% cleft.
Cupertino: 0%.
Montana: 20%.
A PM on your team writes data labeling guidelines for a business process outsourcing (BPO) firm, and each example in your dataset is triple-reviewed by the firm's outsourced team to ensure consistency. You skim the labels, which seem reasonable.
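As a rough illustration of the triple review described above, one way to collapse three reviewer labels into a single label is a majority vote, with the share of unanimous examples as a crude consistency check. This is a sketch under assumptions, not the firm's actual process: the function names `majority_label` and `unanimous_rate`, the 0/1 label encoding, and the sample reviews are all hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the most common label among one example's reviewer votes."""
    label, _count = Counter(votes).most_common(1)[0]
    return label


def unanimous_rate(all_votes):
    """Return the fraction of examples where every reviewer gave the same label."""
    unanimous = sum(1 for votes in all_votes if len(set(votes)) == 1)
    return unanimous / len(all_votes)


# Hypothetical data: three reviewer labels per image (1 = exposed, 0 = not exposed).
reviews = [
    (1, 1, 1),
    (0, 0, 1),
    (0, 0, 0),
    (1, 0, 1),
]

labels = [majority_label(votes) for votes in reviews]
agreement = unanimous_rate(reviews)

print(labels)
print(agreement)
```

With three binary votes per example there is never a tie, so the majority label is always defined.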
    import torch
    import pandas as pd
    from torch.utils.data import DataLoader, TensorDataset
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("gluteal_cleft_labels.csv")
    X = df.drop("label", axis=1).values
    y = df["label"].values
    x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)