Correctness of ML datasets

Identifying mislabeled instances in AI datasets

Machine learning requires data from which to learn: even the most advanced neural network \(f\) is merely a mapping \(x \to y\), learnt from the training dataset. Thus, the correctness of the training data is paramount. The figure below shows a popular image recognition dataset, the CIFAR-10 dataset, which consists of inputs \(x\) (the images) and labels \(y\) (the associated classes).

The CIFAR-10 dataset (Image Source), consisting of 60,000 images, each belonging to one of ten classes such as 'airplane', 'ship' or 'cat'.

However, it turns out that even the most popular datasets contain a number of instances whose labels \(y\) are plainly wrong. Consider the following examples from the CIFAR-10 dataset, where the semantics of the image do not correspond to the label.

Four mislabeled instances from the CIFAR-10 dataset. The caption indicates the label and the ID of each instance in the dataset. Observe, for example, the small bottle (?) shown in the leftmost picture, which is labeled as a 'cloud'.

While the CIFAR-10 dataset is generally very clean (we found only 7 mislabeled instances in 60,000 images), other datasets contain a much larger number of mislabeled samples. This is problematic not just for training but also for evaluation, since mislabeled test instances skew accuracy and other metrics.
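To see how label noise caps the metrics you can measure, consider a small simulation (toy numbers, not CIFAR-10): even a classifier that predicts every true class perfectly appears to be wrong on exactly the mislabeled fraction of the test set.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

true_labels = rng.integers(0, 10, n)   # ground-truth classes (10-class problem)

# Corrupt 5% of the evaluation labels: each flipped label is shifted
# to a *different* class, mimicking annotation mistakes.
noisy_labels = true_labels.copy()
flip = rng.random(n) < 0.05
noisy_labels[flip] = (noisy_labels[flip] + rng.integers(1, 10, flip.sum())) % 10

# An oracle classifier that always predicts the true class...
predictions = true_labels

# ...still scores only about 0.95 against the noisy labels.
measured_accuracy = (predictions == noisy_labels).mean()
print(round(measured_accuracy, 3))
```

The measured accuracy of the oracle is bounded by one minus the label-noise rate, so with noisy evaluation labels, no classifier can appear perfect, and differences between strong models become hard to distinguish from noise.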

How can these mislabeled instances be found? Since the dataset is huge, manually inspecting all of the individual image/label pairs is not an option. Thus, we designed an algorithm to identify these mislabeled instances. You can find the technical description here, and the source code here.
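To give a flavour of how automated mislabel detection can work (this is a generic sketch, not necessarily the algorithm described above), one common family of approaches flags instances whose label disagrees with the labels of similar instances. The toy example below uses a k-nearest-neighbour disagreement score on synthetic 2-D data; the point with a deliberately flipped label receives the highest score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two well-separated 2-D clusters: class 0 near (-2, -2), class 1 near (+2, +2).
X = np.vstack([rng.normal(-2, 0.5, (50, 2)),
               rng.normal(+2, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

y[60] = 0  # deliberately mislabel one class-1 instance as class 0

def mislabel_scores(X, y, k=10):
    """Score each instance by how many of its k nearest neighbours disagree
    with its label. A high score suggests a suspicious label."""
    # Pairwise squared distances; exclude self-matches via the diagonal.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)
    neighbours = np.argsort(d, axis=1)[:, :k]
    return (y[neighbours] != y[:, None]).mean(axis=1)

scores = mislabel_scores(X, y)
suspects = np.argsort(scores)[::-1][:3]  # the highest-scoring candidates
print(suspects[0])  # the flipped instance (index 60) ranks first
```

Real detectors for image datasets replace the raw 2-D coordinates with learned feature embeddings or model confidence scores, but the principle is the same: surface the instances whose labels look least consistent with the rest of the data, then have a human review only that short list.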