Are Vision Transformers Always More Robust Than Convolutional Neural Networks?
Published in DistShift, NeurIPS 2021 Workshop, 2021
Since Transformer architectures became popular in Computer Vision, several papers have analysed their properties in terms of calibration, out-of-distribution detection and data-shift robustness. Most of these papers conclude that Transformers, due to some intrinsic properties (presumably the lack of restrictive inductive biases and the computationally intensive self-attention mechanism), outperform Convolutional Neural Networks (CNNs). In this paper we question this conclusion: we show that CNNs pre-trained on large amounts of data are expressive enough to achieve robustness superior to that of current Transformers. Moreover, in some relevant cases, CNNs trained with a pre-training and fine-tuning procedure similar to the one used for Transformers exhibit competitive robustness. Our evidence suggests that, to fully understand this behaviour, researchers should focus on the interaction between pre-training, fine-tuning and the specific inductive biases of the considered architectures. For this reason, we present some preliminary analyses that shed light on the impact of pre-training and fine-tuning on out-of-distribution detection and data-shift robustness.
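The sketch below is not the paper's released code; it is a minimal illustration, assuming the timm library and an ImageNet-C-style folder of corrupted images, of how one might compare the data-shift robustness of a large-scale pre-trained CNN and a ViT under a comparable pre-training and fine-tuning regime. The model names and the dataset path are placeholders chosen for illustration.

```python
# Minimal sketch (not the paper's code): top-1 accuracy of two pre-trained
# models on a folder of distribution-shifted images.
# Model names and the dataset path below are illustrative assumptions.
import torch
import timm
from timm.data import resolve_data_config
from timm.data.transforms_factory import create_transform
from torch.utils.data import DataLoader
from torchvision import datasets

def shifted_accuracy(model_name: str, data_dir: str, device: str = "cuda") -> float:
    """Evaluate a pre-trained model on an ImageFolder of shifted/corrupted images."""
    model = timm.create_model(model_name, pretrained=True).to(device).eval()
    # Use the preprocessing the pre-trained model expects.
    transform = create_transform(**resolve_data_config({}, model=model))
    loader = DataLoader(datasets.ImageFolder(data_dir, transform=transform),
                        batch_size=64, num_workers=4)
    correct = total = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1).cpu()
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# A BiT-style ResNet and a ViT, both available in timm with ImageNet-21k
# pre-training followed by ImageNet-1k fine-tuning.
for name in ["resnetv2_101x1_bitm", "vit_base_patch16_224"]:
    print(name, shifted_accuracy(name, "path/to/imagenet-c/gaussian_noise/3"))
```

Comparing models under the same corruption and the same fine-tuning data keeps the confound the paper highlights (differences in pre-training scale and procedure) explicit rather than attributing robustness gaps to the architecture alone.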
Recommended citation: Francesco Pinto, Philip H.S. Torr, Puneet K. Dokania (2021). "Are Vision Transformers Always More Robust Than Convolutional Neural Networks?" DistShift Workshop, NeurIPS 2021. https://openreview.net/pdf?id=CSXa8LJMttt