Noisy Student Training is a semi-supervised learning method which achieves 88.4% top-1 accuracy on ImageNet (state of the art) and surprising gains on robustness and adversarial benchmarks. We present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet, which is 2.0% better than the state-of-the-art model that requires 3.5B weakly labeled Instagram images. Noisy Student self-training is an effective way to leverage unlabeled data and improve accuracy: noise is added to the student model during training so that it learns beyond the teacher's knowledge. The paper is available at https://arxiv.org/abs/1911.04252 and code is available at https://github.com/google-research/noisystudent.

We apply RandAugment to all EfficientNet baselines, leading to more competitive baselines. Noisy Student (B7, L2) means using EfficientNet-B7 as the student and our best model with 87.4% accuracy as the teacher. Next, with EfficientNet-L0 as the teacher, we train a student model EfficientNet-L1, a wider model than L0. Finally, we iterate the process by putting back the student as a teacher to generate new pseudo labels and train a new student.

On robustness benchmarks, the biggest gain is observed on ImageNet-A: our method improves top-1 accuracy from 16.6% for the previous state of the art to 74.2%. The ImageNet-A test set [25] consists of difficult images that cause significant drops in accuracy for state-of-the-art models. mCE (mean corruption error) is the weighted average of the error rate on different corruptions, with AlexNet's error rate as the baseline. For the adversarial evaluation, the FGSM attack performs a single gradient-sign step on the input image [20], with the update on each pixel set to ε. As can be seen, our model with Noisy Student makes correct and consistent predictions as images undergo different perturbations, while the model without Noisy Student flips predictions frequently.

In our implementation, labeled images and unlabeled images are concatenated together and we compute the average cross entropy loss. For smaller models, we set the batch size of unlabeled images to be the same as the batch size of labeled images. Since a teacher model's confidence on an image can be a good indicator of whether it is an out-of-domain image, we consider high-confidence images as in-domain and low-confidence images as out-of-domain. We use EfficientNet-B0 as both the teacher model and the student model when comparing Noisy Student with soft pseudo labels and hard pseudo labels; whether soft or hard pseudo labels work better may need to be determined on a case-by-case basis. While removing noise leads to a much lower training loss for labeled images, we observe that, for unlabeled images, removing noise leads to a smaller drop in training loss, probably because it is harder to overfit the large unlabeled dataset.
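As an illustration of the combined objective, the sketch below computes the average cross entropy over a concatenated batch of labeled and pseudo-labeled images. It is a minimal PyTorch-style sketch rather than the released TensorFlow implementation; the `student` callable, the tensor names, and the use of soft pseudo labels are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def combined_cross_entropy(student, labeled_images, labels,
                           unlabeled_images, soft_pseudo_labels):
    """Average cross entropy over concatenated labeled and pseudo-labeled images."""
    # Concatenate labeled and unlabeled images into one batch.
    images = torch.cat([labeled_images, unlabeled_images], dim=0)
    logits = student(images)  # student forward pass, with its noise (dropout etc.) active
    n_labeled = labeled_images.shape[0]

    # Standard cross entropy on the labeled part (summed, averaged at the end).
    labeled_loss = F.cross_entropy(logits[:n_labeled], labels, reduction="sum")

    # Cross entropy against the teacher's soft distributions on the unlabeled part.
    log_probs = F.log_softmax(logits[n_labeled:], dim=-1)
    unlabeled_loss = -(soft_pseudo_labels * log_probs).sum()

    # Average over all images in the combined batch.
    return (labeled_loss + unlabeled_loss) / images.shape[0]
```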
Using self-training with Noisy Student, together with 300M unlabeled images, we improve EfficientNet's [69] ImageNet top-1 accuracy to 87.4%. Not only does our method improve standard ImageNet accuracy, it also improves classification robustness on much harder test sets by large margins: ImageNet-A [25] top-1 accuracy from 16.6% to 74.2%, ImageNet-C [24] mean corruption error (mCE) from 45.7 to 31.2, and ImageNet-P [24] mean flip rate (mFR) from 27.8 to 16.1. Noisy Student's performance improves with more unlabeled data. Models are available at https://github.com/tensorflow/tpu/tree/master/models/official/efficientnet, and a PyTorch implementation of "Self-training with Noisy Student improves ImageNet classification" is also available.

The main difference between our work and prior works is that we identify the importance of noise and aggressively inject noise to make the student better. This is an important difference between our work and prior works on the teacher-student framework, whose main goal is model compression: their main goal is to find a small and fast model for deployment. Prior works [68, 24, 55, 22] did not show significant improvements in terms of robustness on ImageNet-A, C and P as we did. Finally, frameworks in semi-supervised learning also include graph-based methods [84, 73, 77, 33], methods that make use of latent variables as target variables [32, 42, 78], and methods based on low-density separation [21, 58, 15], which might provide complementary benefits to our method.

We use our best model, Noisy Student with EfficientNet-L2, to teach student models with sizes ranging from EfficientNet-B0 to EfficientNet-B7. EfficientNet-L1 approximately doubles the training time of EfficientNet-L0. The learning rate starts at 0.128 for labeled batch size 2048 and decays by 0.97 every 2.4 epochs if trained for 350 epochs, or every 4.8 epochs if trained for 700 epochs. We set the survival probability in stochastic depth to 0.8 for the final layer and follow the linear decay rule for other layers. We evaluate our EfficientNet-L2 models with and without Noisy Student against an FGSM attack. We also find that Noisy Student is better with an additional trick: data balancing. Figure 1(a) shows example images from ImageNet-A and the predictions of our models. We will then show our results on ImageNet and compare them with state-of-the-art models.

Noisy Student Training is based on the self-training framework and consists of four simple steps: train a classifier on labeled data (the teacher); use the teacher to generate pseudo labels on a much larger unlabeled dataset; train a student model which minimizes the combined cross entropy loss on both labeled images and pseudo-labeled images while injecting noise into the student; and iterate the process by putting back the student as the teacher. When the student model is deliberately noised, it is actually trained to be consistent with the more powerful teacher model, which is not noised when it generates pseudo labels.
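The four steps can be summarized as the loop below. This is a schematic sketch, not the released code; `train_with_noise`, `predict`, and the data containers are hypothetical helpers standing in for the full pipeline (data filtering, balancing, and the noise functions).

```python
def noisy_student_training(labeled_data, unlabeled_images, num_iterations=3):
    """Schematic Noisy Student loop: teacher -> pseudo labels -> noised student -> repeat."""
    # Step 1: train the initial teacher on labeled data.
    teacher = train_with_noise(labeled_data, pseudo_data=None, noise=False)

    for _ in range(num_iterations):
        # Step 2: the un-noised teacher infers pseudo labels on the unlabeled images.
        pseudo_data = [(image, predict(teacher, image)) for image in unlabeled_images]

        # Step 3: train an equal-or-larger student on labeled plus pseudo-labeled data,
        # with input noise (RandAugment) and model noise (dropout, stochastic depth).
        student = train_with_noise(labeled_data, pseudo_data, noise=True)

        # Step 4: put the student back as the teacher and iterate.
        teacher = student

    return teacher
```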
Algorithm 1 gives an overview of self-training with Noisy Student (or Noisy Student for short). [2] show that self-training is superior to pre-training with ImageNet supervised learning on a few computer vision tasks. Prior work on weakly supervised pretraining presents a study of transfer learning with large convolutional networks trained to predict hashtags on billions of social media images; it shows improvements on several image classification and object detection tasks and reports the highest ImageNet-1k single-crop top-1 accuracy at the time. [50] used knowledge distillation on unlabeled data to teach a small student model for speech recognition. Related semi-supervised methods constrain model predictions to be invariant to noise injected to the input, hidden states or model parameters. By showing the models only labeled images, we limit ourselves from making use of unlabeled images available in much larger quantities to improve accuracy and robustness of state-of-the-art models. During the learning of the student, we inject noise such as dropout, stochastic depth, and data augmentation via RandAugment into the student so that the student generalizes better than the teacher.

We vary the model size from EfficientNet-B0 to EfficientNet-B7 [69] and use the same model as both the teacher and the student. Then, EfficientNet-L1 is scaled up from EfficientNet-L0 by increasing width. Lastly, we follow the idea of compound scaling [69] and scale all dimensions to obtain EfficientNet-L2. Scaling width and resolution by a factor of c leads to c^2 times the training time, and scaling depth by c leads to c times the training time. Using Noisy Student (EfficientNet-L2) as the teacher leads to another 0.8% improvement on top of the improved results.

For each class, we select at most 130K images that have the highest confidence. We then perform data filtering and balancing on this corpus. Hence the total number of images that we use for training a student model is 130M (with some duplicated images).

The top-1 accuracy of prior methods is computed from their reported corruption error on each corruption. We used the version from [47], which filtered the validation set of ImageNet. As shown in Figure 3, Noisy Student leads to approximately 10% improvement in accuracy under the FGSM attack, even though the model is not optimized for adversarial robustness.
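For reference, a generic FGSM step (one gradient-sign update of magnitude ε per pixel, as described earlier) can be written as below. This is a standard PyTorch-style sketch of the attack the paper evaluates against [20], not the authors' evaluation code; `model`, the [0, 1] input range, and `eps` are assumptions for the example.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images, labels, eps):
    """One FGSM step: move every pixel by eps in the direction that increases the loss."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()

    # Per-pixel update of magnitude eps, following the sign of the input gradient.
    adv_images = images + eps * images.grad.sign()
    # Keep the adversarial images in the assumed valid input range.
    return adv_images.clamp(0.0, 1.0).detach()
```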
The main difference between our work and works that use unlabeled data for adversarial robustness is that they directly optimize adversarial robustness on unlabeled data, whereas we show that self-training with Noisy Student improves robustness greatly even without directly optimizing robustness. Prior works on weakly-supervised learning require billions of weakly labeled data to improve state-of-the-art ImageNet models.

The paper "Self-training with Noisy Student improves ImageNet classification" (CVPR 2020) is by Qizhe Xie, Minh-Thang Luong and Quoc V. Le (Google Research, Brain Team) and Eduard Hovy (Carnegie Mellon University). From the abstract: "We present Noisy Student Training, a semi-supervised learning approach that works well even when labeled data is abundant."

Self-training first uses labeled data to train a good teacher model, then uses the teacher model to label unlabeled data, and finally uses the labeled data and unlabeled data to jointly train a student model. To achieve this result, we first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. Our experiments showed that self-training with Noisy Student and EfficientNet can achieve an accuracy of 87.4%, which is 1.9% higher than without Noisy Student. The student model is noised with dropout, stochastic depth and data augmentation, while the teacher model that generates the pseudo labels is not noised. EfficientNets are used as the baseline architectures, and in addition to EfficientNet-B7, the larger EfficientNet-L0, L1 and L2 are constructed. The iterative training first uses EfficientNet-B7 to teach EfficientNet-L0, then EfficientNet-L0 to teach the wider EfficientNet-L1, and finally EfficientNet-L1 to teach EfficientNet-L2. Our largest model, EfficientNet-L2, needs to be trained for 3.5 days on a Cloud TPU v3 Pod, which has 2048 cores.

Unlabeled data is abundant on the internet. The unlabeled images come from the JFT dataset of roughly 300M images: an EfficientNet-B0 trained on ImageNet predicts labels on JFT, and images with confidence below 0.3 are discarded. For each class, at most 130K of the highest-confidence images are kept; finally, for classes that have fewer than 130K images, we duplicate some images at random so that each class can have 130K images. In the ablation on data size, we start with the 130M unlabeled images and gradually reduce the number of images.
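The filtering and balancing step described above (discard low-confidence predictions, keep at most 130K highest-confidence images per class, and duplicate images of under-represented classes up to 130K) can be sketched as follows. The in-memory list of `(image_id, label, confidence)` tuples is an assumption for illustration; the real corpus is far too large for this exact approach.

```python
import random
from collections import defaultdict

def filter_and_balance(teacher_predictions, confidence_threshold=0.3, per_class_cap=130_000):
    """Build a pseudo-labeled corpus: confidence filtering, then per-class balancing."""
    by_class = defaultdict(list)
    for image_id, label, confidence in teacher_predictions:
        # Filtering: drop low-confidence (likely out-of-domain) images.
        if confidence >= confidence_threshold:
            by_class[label].append((image_id, confidence))

    corpus = []
    for label, items in by_class.items():
        # Keep at most `per_class_cap` images with the highest confidence.
        items.sort(key=lambda pair: pair[1], reverse=True)
        selected = [image_id for image_id, _ in items[:per_class_cap]]
        # Balancing: duplicate random images so the class reaches the cap.
        while selected and len(selected) < per_class_cap:
            selected.append(random.choice(selected))
        corpus.extend((image_id, label) for image_id in selected)
    return corpus
```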
The 87.4% accuracy is 1.0% better than the previous state-of-the-art ImageNet accuracy, which requires 3.5B weakly labeled Instagram images. This is why "Self-training with Noisy Student improves ImageNet classification", written by Qizhe Xie et al., makes me very happy.

As can be seen from the figure, our model with Noisy Student makes correct predictions for images under severe corruptions and perturbations such as snow, motion blur and fog, while the model without Noisy Student suffers greatly under these conditions. An ImageNet-A evaluation script is available at https://github.com/hendrycks/natural-adv-examples/blob/master/eval.py.

The paper can be cited as: @article{Xie2019SelfTrainingWN, title={Self-Training With Noisy Student Improves ImageNet Classification}, author={Qizhe Xie and Eduard H. Hovy and Minh-Thang Luong and Quoc V. Le}, journal={2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, year={2019}}.

We train the student model for 350 epochs for models larger than EfficientNet-B4, including EfficientNet-L0, L1 and L2, and train the student model for 700 epochs for smaller models. We do not tune these hyperparameters extensively since our method is highly robust to them.
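Putting the quoted schedule together (base learning rate 0.128 at labeled batch size 2048, decayed by 0.97 every 2.4 epochs for 350-epoch runs or every 4.8 epochs for 700-epoch runs), a stepwise decay can be sketched as below. The linear scaling of the base rate with batch size is an assumption for illustration, not something stated in the text.

```python
def noisy_student_learning_rate(epoch, total_epochs=350, labeled_batch_size=2048):
    """Stepwise exponential decay matching the schedule quoted above."""
    base_lr = 0.128 * labeled_batch_size / 2048   # linear-scaling assumption
    # Decay by 0.97 every 2.4 epochs (350-epoch runs) or every 4.8 epochs (700-epoch runs).
    decay_every = 2.4 if total_epochs == 350 else 4.8
    return base_lr * 0.97 ** int(epoch // decay_every)


# Example: the learning rate at epoch 100 of a 350-epoch run.
print(noisy_student_learning_rate(100, total_epochs=350, labeled_batch_size=2048))
```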
During this process, we kept increasing the size of the student model to improve the performance. However, in the case with 130M unlabeled images, even with the noise function removed, the performance still improves to 84.3% from the 84.0% supervised baseline.

Authors: Qizhe Xie, Minh-Thang Luong, Eduard Hovy, Quoc V. Le.

References (partial):
E. Arazo, D. Ortego, P. Albert, N. E. O'Connor, and K. McGuinness. Pseudo-labeling and confirmation bias in deep semi-supervised learning.
B. Athiwaratkun, M. Finzi, P. Izmailov, and A. G. Wilson. There are many consistent explanations of unlabeled data: why you should average. International Conference on Learning Representations.
D. Berthelot, N. Carlini, I. Goodfellow, N. Papernot, A. Oliver, and C. Raffel. MixMatch: a holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems.
Combining labeled and unlabeled data with co-training.
C. Buciluă, R. Caruana, and A. Niculescu-Mizil. Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Y. Carmon, A. Raghunathan, L. Schmidt, P. Liang, and J. C. Duchi. Unlabeled data improves adversarial robustness.
Semi-supervised learning (Chapelle, O. et al., eds.).
A. Krizhevsky, I. Sutskever, and G. E. Hinton.
Temporal ensembling for semi-supervised learning.
Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. Workshop on Challenges in Representation Learning, ICML.
Certainty-driven consistency loss for semi-supervised learning.
C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy.
R. G. Lopes, D. Yin, B. Poole, J. Gilmer, and E. D. Cubuk. Improving robustness without sacrificing accuracy with Patch Gaussian augmentation.
Y. Luo, J. Zhu, M. Li, Y. Ren, and B. Zhang. Smooth neighbors on teacher graphs for semi-supervised learning.
L. Maaløe, C. K. Sønderby, S. K. Sønderby, and O. Winther.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks.
D. Mahajan, R. Girshick, V. Ramanathan, K. He, M. Paluri, Y. Li, A. Bharambe, and L. van der Maaten. Exploring the limits of weakly supervised pretraining.
T. Miyato, S. Maeda, S. Ishii, and M. Koyama. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
A. Najafi, S. Maeda, M. Koyama, and T. Miyato. Robustness to adversarial perturbations in learning from incomplete data.
J. Ngiam, D. Peng, V. Vasudevan, S. Kornblith, Q. V. Le, and R. Pang.
Robustness properties of Facebook's ResNeXt WSL models.
Adversarial dropout for supervised and semi-supervised learning.
Lessons from building acoustic models with a million hours of speech. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille. Deep co-training for semi-supervised image recognition.
I. Radosavovic, P. Dollár, R. Girshick, G. Gkioxari, and K. He. Data distillation: towards omni-supervised learning.
A. Rasmus, M. Berglund, M. Honkala, H. Valpola, and T. Raiko. Semi-supervised learning with ladder networks.
E. Real, A. Aggarwal, Y. Huang, and Q. V. Le. Proceedings of the AAAI Conference on Artificial Intelligence.
B. Recht, R. Roelofs, L. Schmidt, and V. Shankar.
C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi.