Contrastive Language-Image Pre-training (CLIP)


Summary: A video digest of the paper "Learning transferable visual models from natural language supervision" by A. Radford et al., published at ICML 2021, which introduced the CLIP family of models. The paper can be found on arXiv here.
Topics: computer vision, zero-shot learning, vision and language
Slides: link (pdf)
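At its core, CLIP trains an image encoder and a text encoder jointly with a symmetric contrastive objective over a batch of matched image-text pairs. The sketch below illustrates that objective with random feature vectors standing in for encoder outputs (the array shapes, temperature value, and helper function are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for encoder outputs: a batch of n matched image/text pairs.
# Real CLIP uses a ResNet or ViT image encoder and a Transformer text encoder.
n, d = 4, 8
image_features = rng.normal(size=(n, d))
text_features = rng.normal(size=(n, d))

# L2-normalize so dot products become cosine similarities.
image_features /= np.linalg.norm(image_features, axis=1, keepdims=True)
text_features /= np.linalg.norm(text_features, axis=1, keepdims=True)

# Pairwise similarity logits, scaled by a temperature (learned in the paper;
# fixed here for illustration).
temperature = 0.07
logits = image_features @ text_features.T / temperature

def cross_entropy(logits, labels):
    # Row-wise softmax cross-entropy, computed in a numerically stable way.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

# The i-th image matches the i-th text, so the targets are the diagonal;
# the loss averages the image-to-text and text-to-image directions.
labels = np.arange(n)
loss = (cross_entropy(logits, labels) + cross_entropy(logits.T, labels)) / 2
print(f"symmetric contrastive loss: {loss:.3f}")
```

The same similarity matrix enables zero-shot classification at test time: class names are embedded via prompts such as "a photo of a {label}", and an image is assigned the class whose text embedding it is most similar to.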

References
  • G. A. Miller, "WordNet: a lexical database for English", Communications of the ACM (1995)
  • Y. LeCun et al., "Gradient-based learning applied to document recognition", Proceedings of the IEEE (1998)
  • Y. Mori et al., "Image-to-word transformation based on dividing and vector quantizing images with words", MISRM (1999)
  • G. Bowker and S. L. Star, "Sorting things out: Classification and its consequences", (1999)
  • A. Griewank and A. Walther, "Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation", TOMS (2000)
  • R. Fergus et al., "Learning object categories from google's image search", ICCV (2005)
  • A. Torralba et al., "80 million tiny images: A large data set for nonparametric object and scene recognition", TPAMI (2008)
  • J. Hays and A. Efros, "IM2GPS: estimating geographic information from a single image", CVPR (2008)
  • J. Varadarajan and J-M. Odobez, "Topic models for scene analysis and abnormality detection", ICCVW (2009)
  • C. H. Lampert et al., "Learning to detect unseen object classes by between-class attribute transfer", CVPR (2009)
  • A. Krizhevsky and G. Hinton, "Learning multiple layers of features from tiny images", (2009)
  • A. Farhadi et al., "Describing objects by their attributes", CVPR (2009)
  • J. Deng et al., "Imagenet: A large-scale hierarchical image database", CVPR (2009)
  • J. Xiao et al., "Sun database: Large-scale scene recognition from abbey to zoo", CVPR (2010)
  • Y. Netzer et al., "Reading Digits in Natural Images with Unsupervised Feature Learning", (2011)
  • J. Stallkamp et al., "The German traffic sign recognition benchmark: a multi-class classification competition", IJCNN (2011)
  • S. Oh et al., "A large-scale benchmark dataset for event recognition in surveillance video", CVPR (2011)
  • A. Mishra et al., "Scene text recognition using higher order language priors", BMVC (2012)
  • O. Parkhi et al., "Cats and dogs", CVPR (2012)
  • K. Soomro et al., "UCF101: A dataset of 101 human actions classes from videos in the wild", arXiv (2012)
  • R. Socher et al., "Recursive deep models for semantic compositionality over a sentiment treebank", EMNLP (2013)
  • T. Lin et al., "Microsoft coco: Common objects in context", ECCV (2014)
  • A. Karpathy, (human estimate of human performance on ImageNet) https://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/ (2014)
  • P. Young et al., "From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions", TACL (2014)
  • D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", ICLR (2015)
  • Z. Liu et al., "Deep learning face attributes in the wild", ICCV (2015)
  • K. He et al., "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification", CVPR (2015)
  • X. Chen et al., "Microsoft coco captions: Data collection and evaluation server", arXiv (2015)
  • B. Thomee et al., "YFCC100M: The new data in multimedia research", Communications of the ACM (2016)
  • K. He et al., "Deep residual learning for image recognition", CVPR (2016)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • R. Krishna et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations", IJCV (2017)
  • I. Loshchilov and F. Hutter, "Decoupled weight decay regularization", arXiv (2017)
  • A. Li, A. Jabri, A. Joulin, and L. van der Maaten, "Learning visual n-grams from web data", ICCV (2017)
  • D. Ha, A. Dai, Q. V. Le, "Hypernetworks", ICLR (2017)
  • J. Johnson et al., "Clevr: A diagnostic dataset for compositional language and elementary visual reasoning", ICCV (2017)
  • S. Dodge et al., "A study and comparison of human and deep learning recognition performance under visual distortions", ICCCN (2017)
  • P. J. Liu et al., "Generating wikipedia by summarizing long sequences", ICLR (2018)
  • Z. Wu et al., "Unsupervised feature learning via non-parametric instance discrimination", CVPR (2018)
  • R. Geirhos et al., "ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness", arXiv (2018)
  • P. Micikevicius et al., "Mixed precision training", ICLR (2018)
  • D. Mahajan et al., "Exploring the limits of weakly supervised pretraining", ECCV (2018)
  • C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer", JMLR (2020)
  • M. Tan and Q. Le, "EfficientNet: Rethinking model scaling for convolutional neural networks", ICML (2019)
  • T. He et al., "Bag of tricks for image classification with convolutional neural networks", CVPR (2019)
  • X. Zhai et al., "The visual task adaptation benchmark", OpenReview (2019)
  • S. Kornblith, J. Shlens and Q. V. Le, "Do better imagenet models transfer better?", CVPR (2019)
  • R. Zhang, "Making convolutional networks shift-invariant again", ICML (2019)
  • J. Lu et al., "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks", NeurIPS (2019)
  • B. Recht et al., "Do imagenet classifiers generalize to imagenet?", ICML (2019)
  • A. Barbu et al., "Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models", NeurIPS (2019)
  • D. Yogatama et al., "Learning and Evaluating General Linguistic Intelligence", (2019)
  • J. Carreira et al., "A short note on the kinetics-700 human action dataset", arXiv (2019)
  • A. Miech et al., "Howto100m: Learning a text-video embedding by watching hundred million narrated video clips", ICCV (2019)
  • Y. Zhang, H. Jiang, Y. Miura, C. Manning and C. P. Langlotz, "Contrastive learning of medical visual representations from paired images and text", arXiv (2020)
  • Q. Xie, et al., "Self-training with noisy student improves imagenet classification", CVPR (2020)
  • R. Taori et al., "Measuring robustness to natural distribution shifts in image classification", NeurIPS (2020)
  • H. Touvron et al., "Fixing the train-test resolution discrepancy: FixEfficientNet", arXiv (2020)
  • T. Brown et al., "Language models are few-shot learners", NeurIPS (2020)
  • A. Miech et al., "RareAct: A video dataset of unusual interactions", arXiv (2020)
  • J. Kaplan et al., "Scaling laws for neural language models", arXiv (2020)
  • D. Kiela et al., "The hateful memes challenge: Detecting hate speech in multimodal memes", NeurIPS (2020)
  • K. Desai and J. Johnson, "Virtex: Learning visual representations from textual annotations", CVPR (2021)
  • A. Radford et al., "Learning transferable visual models from natural language supervision", ICML (2021)
  • A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale", ICLR (2021)
  • D. Hendrycks et al., "The many faces of robustness: A critical analysis of out-of-distribution generalization", ICCV (2021)
  • K. Karkkainen and J. Joo, "Fairface: Face attribute dataset for balanced race, gender, and age for bias measurement and mitigation", WACV (2021)
  • D. Hendrycks et al., "Natural adversarial examples", CVPR (2021)