Flamingo: a Visual Language Model for Few-Shot Learning


Summary: A video digest of the paper "Flamingo: a Visual Language Model for Few-Shot Learning" by J-B. Alayrac and co-authors, which introduced the Flamingo family of models. The paper was posted on arXiv in April 2022 and can be found here.
Topics: computer vision, few-shot learning, vision and language
Slides: link (pdf)

References
  • E. Markman, "Categorization and naming in children: Problems of induction", MIT Press (1989)
  • S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation (1997)
  • D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", ICLR (2015)
  • J. Ba et al., "Layer normalization", arXiv (2016)
  • I. Loshchilov and F. Hutter, "Decoupled weight decay regularization", arXiv (2017)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • J. Bradbury et al., "JAX: composable transformations of Python+NumPy programs" (2018)
  • E. Strubell et al., "Energy and policy considerations for deep learning in NLP", ACL (2019)
  • T. L. Griffiths et al., "Doing more with less: meta-reasoning and meta-learning in humans and machines", Current Opinion in Behavioral Sciences (2019)
  • A. Radford et al., "Language models are unsupervised multitask learners", (2019)
  • J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL-HLT (2019)
  • M. Mitchell et al., "Model cards for model reporting", ACM FAccT (2019)
  • M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism", arXiv (2019)
  • C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer", JMLR (2020)
  • T. Brown et al., "Language models are few-shot learners", NeurIPS (2020)
  • T. Hennigan et al., "Haiku: Sonnet for JAX" (2020)
  • N. Carion et al., "End-to-end object detection with transformers", ECCV (2020)
  • S. Rajbhandari et al., "ZeRO: Memory optimizations toward training trillion parameter models", SC (2020)
  • H. Touvron et al., "Fixing the train-test resolution discrepancy: FixEfficientNet", arXiv (2020)
  • J. Kaplan et al., "Scaling laws for neural language models", arXiv (2020)
  • A. Radford et al., "Learning transferable visual models from natural language supervision", ICML (2021)
  • C. Jia et al., "Scaling up visual and vision-language representation learning with noisy text supervision", ICML (2021)
  • J. Cho et al., "Unifying vision-and-language tasks via text generation", ICML (2021)
  • A. Jaegle et al., "Perceiver: General perception with iterative attention", ICML (2021)
  • R. Mokady et al., "ClipCap: CLIP prefix for image captioning", arXiv (2021)
  • M. Tsimpoukelli et al., "Multimodal few-shot learning with frozen language models", NeurIPS (2021)
  • J. Rae et al., "Scaling language models: Methods, analysis & insights from training Gopher", arXiv (2021)
  • T. Gebru et al., "Datasheets for datasets", Communications of the ACM (2021)
  • E. Perez et al., "Red teaming language models with language models", arxiv (2022)
  • E. Perez et al., "True few-shot learning with language models", NeurIPS (2021)
  • J. Liu et al., "What Makes Good In-Context Examples for GPT-3?", arXiv (2021)
  • Z. Yang et al., "An empirical study of GPT-3 for few-shot knowledge-based VQA", arXiv (2021)
  • Z. Zhao et al., "Calibrate before use: Improving few-shot performance of language models", ICML (2021)
  • A. Brock et al., "High-performance large-scale image recognition without normalization", ICML (2021)
  • Z. Wang et al., "SimVLM: Simple visual language model pretraining with weak supervision", arXiv (2021)
  • L. Reynolds and K. McDonell, "Prompt programming for large language models: Beyond the few-shot paradigm", CHI (2021)
  • X. Zhai et al., "Scaling vision transformers", arXiv (2021)
  • J. Hoffmann et al., "Training Compute-Optimal Large Language Models", arXiv (2022)
  • A. Aghajanyan et al., "CM3: A Causal Masked Multimodal Model of the Internet", arXiv (2022)
  • S. Min et al., "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?", arXiv (2022)
  • J-B. Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning", arXiv (2022)
  • M. Wortsman et al., "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time", arXiv (2022)
  • S. Yan et al., "Multiview Transformers for Video Recognition", arXiv (2022)
  • H. Pham et al., "Combined Scaling for Open-Vocabulary Image Classification", arXiv (2022)
  • L. Yuan et al., "Florence: A New Foundation Model for Computer Vision", arXiv (2021)
  • P. Wang et al., "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", arXiv (2022)
  • Z. Luo et al., "VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training", arXiv (2022)
  • T. Wang et al., "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", arXiv (2022)