Context
For the second Lixo Journal Club article, we will discuss the paper Deep Bayesian Active Learning with Image Data by Gal et al., presented at the 2017 International Conference on Machine Learning (ICML).
At Lixo, as for many applications that use machine learning and neural networks, a direct way to improve model performance is to annotate more data and add it to the training dataset. However, annotating data is costly, time-consuming and requires business expertise. Active Learning for image applications consists of searching a large pool of unannotated images for the ones that, once annotated, will improve model performance the most.
Why did you choose this paper?
This paper is directly linked to many industrial use cases: how to improve an existing model while limiting annotation costs and time. While Active Learning has been a topic of research for many years, this paper lays the foundations for Active Learning in the case of neural networks applied to image data.
Why is it innovative?
The authors propose a new method for quantifying the uncertainty of a neural network using Dropout, a technique commonly used for regularization. Dropout consists of randomly deactivating neurons during training to avoid overfitting.
The idea of Gal et al. is to use Dropout to obtain multiple instances of the same network. During inference, we randomly drop neurons for each instance of the network, with different seeds. We thus get several networks drawn from the same distribution, without them being the same network. Once these different networks are instantiated, the idea is to compare their outputs. If, for a given image, the networks disagree (for example, in a classification problem, one network predicts a cat, another a dog and another a car), then we can consider that the model is uncertain about the image, that the image contains useful information for training, and that we should add it to our training dataset. This method is called Bayesian Active Learning by Disagreement (BALD). The authors also propose several other acquisition criteria, all relying on Dropout to draw instances from the same network distribution.
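To make the idea concrete, here is a minimal PyTorch sketch (our own illustration, not the authors' code) of MC Dropout inference and the BALD score; it assumes a classification model whose stochasticity comes from standard `torch.nn.Dropout` layers:

```python
import torch
import torch.nn.functional as F

def enable_dropout(model: torch.nn.Module) -> None:
    """Put the model in eval mode but keep Dropout layers stochastic
    (variants such as Dropout2d would need to be included as well)."""
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()

def mc_dropout_predictions(model, x, n_samples=5):
    """Run n_samples stochastic forward passes: each pass draws a new
    dropout mask, i.e. one sample from the distribution of networks."""
    enable_dropout(model)
    with torch.no_grad():
        probs = torch.stack(
            [F.softmax(model(x), dim=-1) for _ in range(n_samples)]
        )
    return probs  # shape: (n_samples, batch, n_classes)

def bald_score(probs, eps=1e-10):
    """BALD: entropy of the mean prediction minus the mean entropy of
    the individual predictions. The score is high exactly when the
    averaged prediction is uncertain but each sampled network is
    individually confident, i.e. when the networks disagree."""
    mean_probs = probs.mean(dim=0)
    entropy_of_mean = -(mean_probs * (mean_probs + eps).log()).sum(dim=-1)
    mean_entropy = -(probs * (probs + eps).log()).sum(dim=-1).mean(dim=0)
    return entropy_of_mean - mean_entropy  # one score per image
```

Calling `bald_score(mc_dropout_predictions(model, images))` then gives one score per image, and the highest-scoring images are the candidates for annotation.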
The paper benchmarks the method on image data, with the MNIST and ISIC 2016 Melanoma datasets. It is interesting to have this benchmark on image datasets because active learning methods can be sensitive to data distribution and model architecture.
The benchmark procedure consists of training a model on a small set of randomly chosen images, using BALD (or another acquisition criterion) to select the 10 most informative images from the pool, adding them to the training set, retraining the model and iterating. Their methods reach better performance with fewer annotated images than random selection.
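A minimal sketch of that loop, reusing `mc_dropout_predictions` and `bald_score` from above; `train_model` is a hypothetical helper, and as in the paper's benchmark the pool labels exist but are only "revealed" once an image is selected:

```python
import torch
from torch.utils.data import DataLoader, Subset

def acquisition_loop(train_model, pool_ds, n_steps, n_acquire=10):
    """Simulated active-learning loop: train, score the pool with BALD,
    reveal the labels of the top-scoring images, retrain, iterate.
    train_model(dataset) -> model is an assumed helper."""
    labelled = torch.randperm(len(pool_ds))[:20].tolist()  # random initial set
    for _ in range(n_steps):
        model = train_model(Subset(pool_ds, labelled))
        loader = DataLoader(pool_ds, batch_size=64, shuffle=False)
        scores = torch.cat(
            [bald_score(mc_dropout_predictions(model, x)) for x, _ in loader]
        )
        scores[torch.tensor(labelled)] = float("-inf")  # never re-select
        labelled += scores.topk(n_acquire).indices.tolist()
    return labelled
```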
What are the limits?
Simple classification tasks
Their benchmark focuses on simple classification tasks with simple data. It would have been interesting to include more complicated data as well as more complex tasks such as object detection, image segmentation or pose estimation.
A single architecture
The authors only benchmark a VGG16 with Dropout layers at the end of the network. It would have been interesting to see whether the results hold for other architectures, such as ResNet.
Latency
This method requires several inference passes to estimate model uncertainty: with 5 instances of the network using different Dropout seeds, the latency for a single image is multiplied by 5. At Lixo, for example, we cannot use this method directly on our embedded machines, which operate in real time.
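One partial mitigation (our own note, not from the paper) is to run the K stochastic passes as a single K-times-larger batch: standard Dropout draws an independent mask per batch element, so each tiled copy behaves like a differently-seeded network. This reduces wall-clock latency when the hardware has spare throughput, but the compute cost is still multiplied by K, so it remains unsuitable for a real-time embedded use case:

```python
import torch
import torch.nn.functional as F

def mc_dropout_batched(model, x, n_samples=5):
    """One forward pass over a tiled batch instead of n_samples
    sequential passes; returns probabilities shaped like
    mc_dropout_predictions, i.e. (n_samples, batch, n_classes)."""
    enable_dropout(model)  # helper defined in the sketch above
    with torch.no_grad():
        tiled = x.repeat_interleave(n_samples, dim=0)  # (B*K, C, H, W)
        probs = F.softmax(model(tiled), dim=-1)        # (B*K, n_classes)
    return probs.view(x.shape[0], n_samples, -1).transpose(0, 1)
```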
Does not consider batch images
Another potential problem with this framework is that images are scored one by one, with no notion of inter-image diversity. If the pool contains several very similar images, the method may select nearly identical images in the same iteration. More recent approaches, e.g. BatchBALD by Kirsch et al., address this limitation to avoid selecting redundant images.
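To illustrate what taking diversity into account can mean in its simplest form (a naive heuristic of our own, much cruder than BatchBALD): greedily keep the highest-BALD images while skipping any image whose features are nearly identical to one already picked. Here `feats` is assumed to hold per-image embeddings, e.g. from the network's penultimate layer:

```python
import torch.nn.functional as F

def diverse_topk(scores, feats, k, max_sim=0.95):
    """Greedy selection: walk the pool by decreasing BALD score and skip
    candidates whose cosine similarity to an already-selected image
    exceeds max_sim. A diversity heuristic, not BatchBALD."""
    picked = []
    for i in scores.argsort(descending=True).tolist():
        if all(F.cosine_similarity(feats[i], feats[j], dim=0) < max_sim
               for j in picked):
            picked.append(i)
        if len(picked) == k:
            break
    return picked
```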
Acquisition batch size too small
The paper acquires 10 images per active-learning step. At Lixo, we use active learning to better select the images we want to annotate. A small acquisition batch is optimal for the algorithm, but in practice it is difficult and costly to interleave training with annotation that frequently. The cost of retraining after every small batch grows even higher with architectures more complex than VGG16.
Conclusion
At Lixo, we continuously improve our models while limiting the cost and time spent on annotation. Active Learning is an excellent framework for tackling this problem, and this paper lays the foundations for active learning in the context of neural networks applied to images. The paper focuses on a simple classification task, but at Lixo we have confirmed that the paper's ideas carry over to more complex use cases (object detection for the waste industry) and limit the number of images that need to be annotated to improve the model.
More generally, the key to improving model performance in an industrial context is often to be "data-centric" (working on the data rather than on the model itself). Semi-supervised learning is another data-centric approach, complementary to active learning, for exploiting large quantities of unannotated data. It may be the subject of a future blog post.
Find out more:
- Deep Bayesian Active Learning with Image Data, Yarin Gal, Riashat Islam, Zoubin Ghahramani (2017)
- Dropout: A Simple Way to Prevent Neural Networks from Overfitting, Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, Ruslan Salakhutdinov (2014)
- BatchBALD: Efficient and Diverse Batch Acquisition for Deep Bayesian Active Learning, Andreas Kirsch, Joost van Amersfoort, Yarin Gal (2019)