DINO: Self-distillation with no labels


Summary: A video digest of the DINO framework, introduced in "Emerging properties in self-supervised vision transformers" by M. Caron, H. Touvron, I. Misra, H. Jégou, J. Mairal, P. Bojanowski and A. Joulin, published at ICCV 2021. The paper can be found on arXiv here; code and models can be found here.
Topics: computer vision, self-supervised learning, vision transformers
Slides: link (pdf)
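
For context, here is a minimal PyTorch-style sketch of DINO's core self-distillation update, condensed from the pseudocode in the paper: a student and an EMA teacher see two augmented views, and the student is trained to match the teacher's centered, sharpened output distribution. The tiny linear backbone, the noise "augmentations", and the hyperparameter values below are placeholders for illustration, not the paper's actual recipe.

```python
# Minimal sketch of DINO self-distillation (illustrative settings, not the paper's).
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 256                 # output (prototype) dimension
tps, tpt = 0.1, 0.04    # student / teacher temperatures
l, m = 0.996, 0.9       # EMA rates for teacher weights and for the center

# Toy backbones standing in for a ViT/ResNet + projection head.
student = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))
teacher = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, K))
teacher.load_state_dict(student.state_dict())
for p in teacher.parameters():
    p.requires_grad_(False)          # teacher is never trained by backprop

center = torch.zeros(K)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

def H(t_out, s_out):
    """Cross-entropy between centered+sharpened teacher and student distributions."""
    t = F.softmax((t_out - center) / tpt, dim=1).detach()  # center, sharpen, stop-grad
    log_s = F.log_softmax(s_out / tps, dim=1)
    return -(t * log_s).sum(dim=1).mean()

for step in range(10):               # toy loop on random images
    x = torch.randn(16, 3, 32, 32)
    # Stand-ins for the paper's random crops / color augmentations.
    x1 = x + 0.1 * torch.randn_like(x)
    x2 = x + 0.1 * torch.randn_like(x)
    s1, s2 = student(x1), student(x2)
    with torch.no_grad():
        t1, t2 = teacher(x1), teacher(x2)
    loss = H(t1, s2) / 2 + H(t2, s1) / 2

    opt.zero_grad()
    loss.backward()
    opt.step()

    with torch.no_grad():            # EMA updates of teacher weights and center
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(l).add_(ps, alpha=1 - l)
        center = m * center + (1 - m) * torch.cat([t1, t2]).mean(dim=0)
```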

References
  • D. Ruppert, "Efficient estimations from a slowly convergent Robbins-Monro process" (1988)
  • B. T. Polyak et al., "Acceleration of stochastic approximation by averaging", SICON (1992)
  • J. Philbin et al., "Lost in quantization: Improving particular object retrieval in large scale image databases", CVPR (2008)
  • L. Van der Maaten et al., "Visualizing data using t-SNE", JMLR (2008)
  • M-E. Nilsback et al., "Automated flower classification over a large number of classes", ICVGIP (2008)
  • A. Krizhevsky, "Learning multiple layers of features from tiny images", (2009)
  • M. Douze et al., "Evaluation of gist descriptors for web-scale image search", CIVR (2009)
  • M. Everingham et al., "The pascal visual object classes (voc) challenge", IJCV (2010)
  • J. Krause et al., "3d object representations for fine-grained categorization", ICCVW (2013)
  • M. Cuturi, "Sinkhorn distances: Lightspeed computation of optimal transport", NeurIPS (2013)
  • A. Dosovitskiy et al., "Discriminative unsupervised feature learning with convolutional neural networks", NeurIPS (2014)
  • B. Zhou et al., "Learning deep features for scene recognition using places database", NeurIPS (2014)
  • S. Ioffe et al., "Batch normalization: Accelerating deep network training by reducing internal covariate shift", ICML (2015)
  • O. Russakovsky et al., "Imagenet large scale visual recognition challenge", IJCV (2015)
  • K. He et al., "Deep residual learning for image recognition", CVPR (2016)
  • T. Salimans et al., "Weight normalization: A simple reparameterization to accelerate training of deep neural networks", NeurIPS (2016)
  • I. Loshchilov et al., "Sgdr: Stochastic gradient descent with warm restarts", arxiv (2016)
  • A. Tarvainen et al., "Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results", NeurIPS (2017)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • I. Loshchilov et al., "Decoupled weight decay regularization", arxiv (2017)
  • P. Goyal et al., "Accurate, large minibatch sgd: Training imagenet in 1 hour", arxiv (2017)
  • J. Pont-Tuset et al., "The 2017 DAVIS challenge on video object segmentation", arxiv (2017)
  • R. Anil et al., "Large scale distributed neural network training through online distillation", arxiv (2018)
  • M. Caron et al., "Deep clustering for unsupervised learning of visual features", (ECCV) 2018
  • Z. Wu et al., "Unsupervised feature learning via non-parametric instance discrimination", CVPR (2018)
  • Y. Wu et al., "Group normalization", ECCV (2018)
  • F. Radenović et al., "Revisiting oxford and paris: Large-scale image retrieval benchmarking", CVPR (2018)
  • G. Van Horn et al., "The iNaturalist challenge 2018 dataset", arxiv (2018)
  • F. Radenović et al., "Fine-tuning CNN image retrieval with no human annotation", TPAMI (2018)
  • J. Devlin et al., "Bert: Pre-training of deep bidirectional transformers for language understanding", NAACL-HLT (2019)
  • A. Radford et al., "Language models are unsupervised multitask learners" (2019)
  • J. Revaud et al., "Learning with average precision: Training image retrieval with a listwise loss", ICCV (2019)
  • M. Berman et al., "Multigrain: a unified image embedding for classes and instances", arxiv (2019)
  • S. W. Oh et al., "Video object segmentation using space-time memory networks", ICCV (2019)
  • X. Wang et al., "Learning correspondence from the cycle-consistency of time," CVPR (2019)
  • J. Mairal, "Cyanure: An open-source toolbox for empirical risk minimization for python, c++, and soon more", arxiv (2019)
  • T. Chen et al. "A simple framework for contrastive learning of visual representations" ICML (2020)
  • X. Chen et al., "Improved baselines with momentum contrastive learning", arxiv (2020)
  • M. Caron et al., "Unsupervised learning of visual features by contrasting cluster assignments", NeurIPS (2020)
  • J-B. Grill et al., "Bootstrap your own latent - a new approach to self-supervised learning", NeurIPS (2020)
  • K. He et al., "Momentum contrast for unsupervised visual representation learning", CVPR (2020)
  • P. Richemond et al., "BYOL works even without batch statistics", arxiv (2020)
  • Q. Xie et al., "Self-training with noisy student improves imagenet classification", CVPR (2020)
  • Y. Tian et al. "What makes for good views for contrastive learning?" NeurIPS (2020)
  • T. Weyand et al., "Google landmarks dataset v2 - a large-scale benchmark for instance-level recognition and retrieval", CVPR (2020)
  • Z. Lai et al., "MAST: A memory-augmented self-supervised tracker", CVPR (2020)
  • A. Jabri et al., "Space-time correspondence as a contrastive random walk", NeurIPS (2020)
  • M. Assran et al., "Recovering petaflops in contrastive semi-supervised learning of visual representations", arxiv (2020)
  • K. Sohn et al., "Fixmatch: Simplifying semi-supervised learning with consistency and confidence", NeurIPS (2020)
  • Q. Xie, et al., "Unsupervised data augmentation for consistency training", NeurIPS (2020)
  • T. Chen et al., "Big self-supervised models are strong semi-supervised learners", NeurIPS (2020)
  • I. Radosavovic et al., "Designing network design spaces", CVPR (2020)
  • S. Gidaris et al., "Obow: Online bag-of-visual-words generation for self-supervised learning", CVPR (2021)
  • A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale", ICLR (2021)
  • H. Touvron et al., "Training data-efficient image transformers & distillation through attention", ICML (2021)
  • M. Caron et al., "Emerging properties in self-supervised vision transformers", ICCV (2021)
  • J. Zbontar et al., "Barlow twins: Self-supervised learning via redundancy reduction", ICML (2021)
  • S. Gur et al., "Visualization of supervised and self-supervised neural networks via attribution guided factorization", AAAI (2021)
  • H. Pham et al., "Meta pseudo labels", CVPR (2021)
  • P. Goyal et al., "Vision models are more robust and fair when pretrained on uncurated images without supervision", arxiv (2022)