Flamingo: a Visual Language Model for Few-Shot Learning


Summary: A video digest of the paper "Flamingo: a Visual Language Model for Few-Shot Learning" by J-B. Alayrac and co-authors, which introduced the Flamingo family of models. The paper was posted on arXiv in April 2022 and can be found here.
Topics: computer vision, few-shot learning, vision and language
Slides: link (pdf)

References
  • E. Markman, "Categorization and naming in children: Problems of induction", MIT Press (1989)
  • S. Hochreiter and J. Schmidhuber, "Long short-term memory", Neural Computation (1997)
  • D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization", ICLR (2015)
  • J. Ba et al., "Layer normalization", arXiv (2016)
  • I. Loshchilov and F. Hutter, "Decoupled weight decay regularization", arXiv (2017)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • J. Bradbury et al., "JAX: composable transformations of Python+NumPy programs" (2018)
  • E. Strubell et al., "Energy and policy considerations for deep learning in NLP", ACL (2019)
  • T. L. Griffiths et al., "Doing more with less: meta-reasoning and meta-learning in humans and machines", Current Opinion in Behavioral Sciences (2019)
  • A. Radford et al., "Language models are unsupervised multitask learners", (2019)
  • J. Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", NAACL-HLT (2019)
  • M. Mitchell et al., "Model cards for model reporting", ACM FAccT (2019)
  • M. Shoeybi et al., "Megatron-LM: Training multi-billion parameter language models using model parallelism", arXiv (2019)
  • C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer", JMLR (2020)
  • T. Brown et al., "Language models are few-shot learners", NeurIPS (2020)
  • T. Hennigan et al., "Haiku: Sonnet for JAX" (2020)
  • N. Carion et al., "End-to-end object detection with transformers", ECCV (2020)
  • S. Rajbhandari et al., "ZeRO: Memory optimizations toward training trillion parameter models", SC (2020)
  • H. Touvron et al., "Fixing the train-test resolution discrepancy: FixEfficientNet", arXiv (2020)
  • J. Kaplan et al., "Scaling laws for neural language models", arXiv (2020)
  • A. Radford et al., "Learning transferable visual models from natural language supervision", ICML (2021)
  • C. Jia et al., "Scaling up visual and vision-language representation learning with noisy text supervision", ICML (2021)
  • J. Cho et al., "Unifying vision-and-language tasks via text generation", ICML (2021)
  • A. Jaegle et al., "Perceiver: General perception with iterative attention", ICML (2021)
  • R. Mokady et al., "ClipCap: CLIP prefix for image captioning", arXiv (2021)
  • M. Tsimpoukelli et al., "Multimodal few-shot learning with frozen language models", NeurIPS (2021)
  • J. Rae et al., "Scaling language models: Methods, analysis & insights from training Gopher", arXiv (2021)
  • T. Gebru et al., "Datasheets for datasets", Communications of the ACM (2021)
  • E. Perez et al., "Red teaming language models with language models", arxiv (2022)
  • E. Perez et al., "True few-shot learning with language models", NeurIPS (2021)
  • J. Liu et al., "What Makes Good In-Context Examples for GPT-3?", arXiv (2021)
  • Z. Yang et al., "An empirical study of GPT-3 for few-shot knowledge-based VQA", arXiv (2021)
  • Z. Zhao et al., "Calibrate before use: Improving few-shot performance of language models", ICML (2021)
  • A. Brock et al., "High-performance large-scale image recognition without normalization", ICML (2021)
  • Z. Wang et al., "SimVLM: Simple visual language model pretraining with weak supervision", arXiv (2021)
  • L. Reynolds and K. McDonell, "Prompt programming for large language models: Beyond the few-shot paradigm", CHI (2021)
  • X. Zhai et al., "Scaling vision transformers", arXiv (2021)
  • J. Hoffmann et al., "Training Compute-Optimal Large Language Models", arXiv (2022)
  • A. Aghajanyan et al., "CM3: A Causal Masked Multimodal Model of the Internet", arXiv (2022)
  • S. Min et al., "Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?", arXiv (2022)
  • J-B. Alayrac et al., "Flamingo: a Visual Language Model for Few-Shot Learning", arXiv (2022)
  • M. Wortsman et al., "Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time", arXiv (2022)
  • S. Yan et al., "Multiview Transformers for Video Recognition", arXiv (2022)
  • H. Pham et al., "Combined Scaling for Open-Vocabulary Image Classification", arXiv (2022)
  • L. Yuan et al., "Florence: A New Foundation Model for Computer Vision", arXiv (2021)
  • P. Wang et al., "Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", arXiv (2022)
  • Z. Luo et al., "VC-GPT: Visual Conditioned GPT for End-to-End Generative Vision-and-Language Pre-training", arXiv (2022)
  • T. Wang et al., "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", arXiv (2022)