Ken Chatfield


21.06.2023

4 min read


Efficient Learning of Domain-specific Visual Cues with Self-supervision

Introducing Perceptual MAE, a new method for efficiently learning domain-specific visual cues using self-supervision. This work is part of our AI 2.0 initiative and is being presented at CVPR 2023.

As part of our work to build models which generalise better over different types of damage to vehicles and property, we have developed a new method which can learn important domain-specific visual cues (such as ‘cracks’ or ‘dents’) directly from images.

The method we developed:

  • Achieves state-of-the-art performance (88.6% on ImageNet, ranking #3 globally) whilst being much more data- and compute-efficient than alternative methods

  • Provides a means to automatically learn a set of domain-specific visual cues from unlabelled data, without the need for web-scale data or compute

  • Generalises across tasks and datasets, making it applicable to a wide variety of applications

We are open-sourcing our code so that others can build on our approach.

Self-supervision and Generative Learning

To do this, we build on recent advances in self-supervised learning. This approach powers the latest advances in large-language models such as ChatGPT and GPT-4, and over the past year has also been applied to computer vision with methods such as masked autoencoders.
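To give a flavour of how masked autoencoders work, the sketch below shows the core pretext task in PyTorch: a large fraction of image patches is hidden, and the model must reconstruct them from the rest. This is a minimal illustration under assumed shapes and masking ratio, not the Perceptual MAE implementation itself.

```python
import torch

def random_mask(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Randomly hide a fraction of patches; the encoder only sees the rest.

    patches: (batch, num_patches, patch_dim) flattened image patches.
    Returns the visible patches plus the shuffled indices needed to
    restore the original ordering for reconstruction.
    """
    b, n, d = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Shuffle patch indices independently per image and keep the first n_keep.
    noise = torch.rand(b, n, device=patches.device)
    shuffle = noise.argsort(dim=1)
    keep = shuffle[:, :n_keep]

    visible = torch.gather(patches, 1, keep.unsqueeze(-1).expand(-1, -1, d))
    return visible, keep, shuffle

# A decoder then reconstructs the hidden patches, and the training loss is
# measured only on the patches the encoder never saw.
```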

One of the issues with applying such an approach to computer vision is the grounding problem: how do we know what the important information in an image is? Perceptual MAE addresses this by taking an opinionated approach: what matters is not differences at the pixel level, but the higher-level visual cues which can be used to determine an image's contents.

We incorporate this into learning using techniques from the generative learning toolbox, and the result can be seen by visualising what Perceptual MAE attends to in a given image compared to previous work, with a much stronger focus on object-level details:


Addressing the Grounding Problem

To incorporate the learning of higher-level visual cues into the training process, we draw on the features learnt by a second network to define whether two images are similar. This second network could be, for example, a strong pre-trained ImageNet classifier whose features capture relevant scene-level cues.

We can then compare the feature activations across the layers of this second network, on the basis that if the activations are similar then the contents of the images are also similar. For example, in the following image, even though the pose of the cat differs, causing the pixel-level information to be quite different, the feature activations remain more stable given both images are of the same subject. This is called perceptual loss via feature matching:
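In code, a feature-matching perceptual loss can be sketched as follows. This is a hedged illustration using a pre-trained torchvision VGG16 as the second network; the particular network, layer indices and L1 distance here are assumptions for demonstration, not necessarily the choices made in the paper.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg16, VGG16_Weights

class PerceptualLoss(torch.nn.Module):
    """Compare intermediate activations of a fixed, pre-trained network."""

    def __init__(self, layer_ids=(3, 8, 15, 22)):
        super().__init__()
        self.features = vgg16(weights=VGG16_Weights.IMAGENET1K_V1).features.eval()
        for p in self.features.parameters():
            p.requires_grad_(False)
        self.layer_ids = set(layer_ids)

    def _activations(self, x):
        acts = []
        for i, layer in enumerate(self.features):
            x = layer(x)
            if i in self.layer_ids:
                acts.append(x)
        return acts

    def forward(self, reconstruction, target):
        # If the activations match layer by layer, the two images are judged
        # to show similar content, even when their pixels differ.
        loss = 0.0
        for a, b in zip(self._activations(reconstruction),
                        self._activations(target)):
            loss = loss + F.l1_loss(a, b)
        return loss
```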


In our case, we apply this comparison not to the activations of a pre-trained ImageNet model, but to those of a network trained in parallel to assess whether the images generated by our model are real or fake, making the method entirely self-contained.
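A simplified training step for this self-contained variant might look as follows. The names and loss weights are placeholders, the discriminator is assumed to return its intermediate activations with the final real/fake logit as the last element, and patch masking is folded into the encoder for brevity; treat this as a sketch of the idea rather than the released code.

```python
import torch
import torch.nn.functional as F

def training_step(encoder, decoder, discriminator, images, d_opt, g_opt,
                  w_pix=1.0, w_feat=1.0, w_adv=0.1):
    # 1) Mask and reconstruct, as in a standard masked autoencoder
    #    (patch masking is assumed to happen inside the encoder here).
    recon = decoder(encoder(images))

    # 2) Update the discriminator to tell real images from reconstructions.
    #    discriminator(x) is assumed to return [feat_1, ..., feat_k, logit].
    d_real = discriminator(images)
    d_fake = discriminator(recon.detach())
    d_loss = F.softplus(-d_real[-1]).mean() + F.softplus(d_fake[-1]).mean()
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # 3) Update the autoencoder: pixel reconstruction, feature matching
    #    against the discriminator's intermediate activations, and a small
    #    adversarial term.
    feats_real = discriminator(images)
    feats_fake = discriminator(recon)
    feat_loss = sum(F.l1_loss(f, r.detach())
                    for f, r in zip(feats_fake[:-1], feats_real[:-1]))
    adv_loss = F.softplus(-feats_fake[-1]).mean()
    pix_loss = F.mse_loss(recon, images)

    g_loss = w_pix * pix_loss + w_feat * feat_loss + w_adv * adv_loss
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```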

Efficient Method for State-of-the-art Computer Vision Models

The result is a method for pre-training that outperforms all previous methods, setting a new state of the art of 88.1% on ImageNet when not using additional training data.

If we loosen this restriction and use a pre-trained model for feature matching, we can match the recently released DINOv2 method, attaining 88.6% accuracy with a much smaller model (only 307M parameters) and without requiring web-scale data or compute:

We found that these results also generalised across different visual tasks (results shown for ViT-B architecture):

The results also translated to real-world performance across different domains, with Perceptual MAE-based features offering improved performance over a purely supervised baseline when fine-tuning for the Tractable task of vehicle damage assessment on a dataset of 500k annotated images:

What’s Next?

This work further opens up the possibility of training bespoke, generalisable domain-specific features from images alone, across a range of different domains and tasks, whilst maintaining high data efficiency and without requiring web-scale data or compute.

It is also a step towards addressing the robustness and bias issues typically associated with the long-tail when using supervised methods. As part of the Tractable AI 2.0 initiative, we are working to ensure that the computer vision models we train rely directly on expert-defined cues such as ‘cracks’ and ‘scratches’ rather than other superfluous correlations. This work forms one part of this broader initiative.

-> Read the paper

-> Get the code
