BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

Summary: A description of the work 'BLOOM: A 176B-Parameter Open-Access Multilingual Language Model' by Le Scao et al., published on arXiv in November 2022 as part of the BigScience Workshop. The paper gives an overview of the BLOOM model and of the effort involved in its creation.
Paper: arXiv:2211.05100
Topics: foundation models, large language models, multilingual models
Slides: link (pdf)
Code and models: link (Hugging Face)
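
As a quick orientation (not part of the original materials), the sketch below shows how a smaller released BLOOM checkpoint might be loaded with the Hugging Face transformers library; the model identifiers "bigscience/bloom-560m" and "bigscience/bloom" are assumed from the Hub naming scheme.

    # Minimal sketch: load a small BLOOM checkpoint and generate a continuation.
    # Assumes the transformers and torch packages are installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "bigscience/bloom-560m"  # smaller sibling of the full 176B "bigscience/bloom"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    inputs = tokenizer("BLOOM is an open multilingual language model that", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=30)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Running the full 176B-parameter model is far more demanding (hundreds of GB of accelerator memory, or quantized 8-bit loading along the lines of Dettmers et al., 2022, in the references below).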

References
  • C. Shannon, "A mathematical theory of communication", Bell System Technical Journal (1948)
  • L. Winner, "Autonomous technology: Technics-out-of-control as a theme in political thought", MIT Press (1978)
  • L. Winner, "Do artifacts have politics?", Computer Ethics (1980)
  • R. Miikkulainen et al., "Natural language processing with modular PDP networks and distributed lexicon", Cognitive Science (1991)
  • J. Schmidhuber et al., "Sequential neural text compression", IEEE Transactions on Neural Networks (1996)
  • W. Klöpffer, "Life cycle assessment", Environmental Science and Pollution Research (1997)
  • Y. Bengio et al., "A neural probabilistic language model", NeurIPS (2000)
  • T. Mikolov et al., "Recurrent neural network based language model", Interspeech (2010)
  • R. Collobert et al., "Natural language processing (almost) from scratch", JMLR (2011)
  • H. Wu et al., "Optimizing data warehousing applications for GPUs using kernel fusion/fission", IPDPS (2012)
  • T. Mikolov et al., "Distributed representations of words and phrases and their compositionality", NeurIPS (2013)
  • O. Bojar et al., "Findings of the 2014 workshop on statistical machine translation", WMT (2014)
  • J. Nivre et al., "Universal dependencies v1: A multilingual treebank collection", LREC (2016)
  • J. Hestness et al., "Deep learning scaling is predictable, empirically", arxiv (2017)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • N. Shazeer et al., "Outrageously large neural networks: The sparsely-gated mixture-of-experts layer", ICLR (2017)
  • M. Peters et al., "Deep Contextualized Word Representations", NAACL (2018)
  • J. Howard et al., "Universal Language Model Fine-tuning for Text Classification", ACL (2018)
  • A. Radford et al., "Improving language understanding by generative pre-training" (2018)
  • J. Ács, "Exploring bert’s vocabulary", http://juditacs.github.io/2019/02/19/bert-tokenization-stats.html (2019)
  • A. Conneau et al., "XNLI: Evaluating Cross-lingual Sentence Representations", EMNLP (2018)
  • P. Micikevicius et al., "Mixed Precision Training", ICLR (2018)
  • A. Radford et al., "Language Models are Unsupervised Multitask Learners" (2019)
  • E. Strubell et al., "Energy and Policy Considerations for Deep Learning in NLP", ACL (2019)
  • J. Devlin et al., "Bert: Pre-training of deep bidirectional transformers for language understanding", NAACL-HLT (2019)
  • P. Ortiz Suárez et al., "Asynchronous pipeline for processing huge corpora on medium to low resource infrastructures", CMLC-7 (2019)
  • M. Shoeybi et al., "Megatron-lm: Training multi-billion parameter language models using model parallelism", arxiv (2019)
  • M. Mitchell et al., "Model cards for model reporting", FAccT (2019)
  • A. Wang et al., "Superglue: A stickier benchmark for general-purpose language understanding systems", NeurIPS (2019)
  • N. Reimers et al., "Sentence-bert: Sentence embeddings using siamese bert-networks", EMNLP (2019)
  • L. Gao et al., "The pile: An 800GB dataset of diverse text for language modeling", arxiv (2020)
  • A. Gu et al., "Hippo: Recurrent memory with optimal polynomial projections", NeurIPS (2020)
  • T. Brown et al., "Language models are few-shot learners", NeurIPS (2020)
  • J. Kaplan et al., "Scaling laws for neural language models", arxiv (2020)
  • A. Kunchukuttan et al., "Ai4bharat-indicnlp corpus: Monolingual corpora and word embeddings for indic languages", arxiv (2020)
  • W. Nekoto et al., "Participatory research for low-resourced machine translation: A case study in african languages", arxiv (2020)
  • J. Rasley et al., "Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters", SIGKDD (2020)
  • S. Rajbhandari et al., "Zero: Memory optimizations toward training trillion parameter models", SC (2020)
  • F. Ladhak et al., "WikiLingua: A new benchmark dataset for cross-lingual abstractive summarization", arxiv (2020)
  • W. Wang et al., "Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers", NeurIPS (2020)
  • K. Song et al., "MPNet: Masked and permuted pre-training for language understanding", NeurIPS (2020)
  • N. Nangia et al., "CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models", EMNLP (2020)
  • W. Zeng et al., "PanGu-Alpha: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation", arxiv (2021)
  • E. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?🦜", FAccT (2021)
  • S. Mielke et al., "Between words and characters: A Brief History of Open-Vocabulary Modeling and Tokenization in NLP", arxiv (2021)
  • B. Kim et al., "What changes can large-scale language models bring? intensive study on hyperclova: Billions-scale korean generative pretrained transformers", arxiv (2021)
  • A. Birhane et al., "Multimodal datasets: misogyny, pornography, and malignant stereotypes", arxiv (2021)
  • N. Sambasivan et al., "“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI", CHI (2021)
  • J. Dodge et al., "Documenting large webtext corpora: A case study on the colossal clean crawled corpus", arxiv (2021)
  • J. Abadji et al., "Ungoliant: An optimized pipeline for the generation of a very large-scale multilingual web corpus", CMLC-9 (2021)
  • W. Fedus et al., "Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity", arxiv (2021)
  • L. Gao et al., "A framework for few-shot language model evaluation", https://doi.org/10.5281/zenodo.5371628 (2021)
  • S. Narang et al., "Do transformer modifications transfer across implementations and applications?", arxiv (2021)
  • O. Press et al., "Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation", ICLR (2021)
  • D. Patterson et al., "Carbon emissions and large neural network training", arxiv (2021)
  • R. Bawden et al., "DiaBLa: a corpus of bilingual spontaneous written dialogues for machine translation", LREC (2021)
  • A. Tikhonov et al., "It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning", ACL/IJCNLP (2021)
  • S. Black et al., "GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow" (2021)
  • B. Wang et al., "GPT-J-6B: A 6 billion parameter autoregressive language model" (2021)
  • X. Lin et al., "Few-shot learning with multilingual language models", arxiv (2021)
  • A. Fan et al., "Beyond English-Centric multilingual machine translation", JMLR (2021)
  • M. Chen et al., "Evaluating large language models trained on code", arxiv (2021)
  • A. Simoulin et al., "Un modèle Transformer Génératif Pré-entrainé pour le ______ français", ATALA (2021)
  • S. L. Blodgett et al., "Stereotyping Norwegian salmon: An inventory of pitfalls in fairness benchmark datasets", ACL (2021)
  • C. Akiki et al., "BigScience: A Case Study in the Social Construction of a Multilingual LLM", WBRC (2022)
  • R. Johnson et al., "The Ghost in the Machine has an American accent: value conflict in GPT-3", arxiv (2022)
  • S. Biderman et al., "Datasheet for the Pile", arxiv (2022)
  • Y. Jernite et al., "Data governance in the age of large-scale data-driven language technology", FAccT (2022)
  • H. Laurençon et al., "The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset", NeurIPS Datasets Track (2022)
  • J. Kreutzer et al., "Quality at a glance: An audit of web-crawled multilingual datasets", ACL (2022)
  • Y. Li et al., "Competition-level code generation with AlphaCode", arxiv (2022)
  • A. McMillan-Major et al., "Documenting geographically and contextually diverse data sources: The bigscience catalogue of language data and resources", arxiv (2022)
  • N. Muennighoff et al., "Crosslingual Generalization through Multitask Finetuning", arxiv (2022)
  • V. Sanh et al., "Multitask Prompted Training Enables Zero-Shot Task Generalization", ICLR (2022)
  • T. Le Scao et al., "What Language Model to Train if You Have One Million GPU Hours?", arxiv (2022)
  • T. Dettmers et al., "LLM.int8(): 8-bit matrix multiplication for transformers at scale", arxiv (2022)
  • Anon, "Hungry hungry hippos: Towards language modeling with state space models", ICLR 2023 submission (2022)
  • T. Wang et al., "What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?", arxiv (2022)
  • Y. Tay et al., "Transcending scaling laws with 0.1% extra compute", arxiv (2022)
  • A. Zeng et al., "GLM-130B: An Open Bilingual Pre-trained Model", arxiv (2022)
  • S. Zhang et al., "OPT: Open pre-trained transformer language models", arxiv (2022)
  • S. Smith et al., "Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model", arxiv (2022)
  • J. Hoffmann et al., "Training Compute-Optimal Large Language Models", arxiv (2022)
  • N. Muennighoff, "SGPT: GPT Sentence Embeddings for Semantic Search", arxiv (2022)
  • N. Muennighoff et al., "MTEB: Massive Text Embedding Benchmark", arxiv (2022)
  • A. Luccioni et al., "Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model", arxiv (2022)
  • D. Contractor et al., "Behavioral use licensing for responsible AI", FAccT (2022)
  • S. Bach et al., "Promptsource: An integrated development environment and repository for natural language prompts", arxiv (2022)
  • N. Goyal et al., "The flores-101 evaluation benchmark for low-resource and multilingual machine translation", TACL (2022)
  • O. Shliazhko et al., "mGPT: Few-shot learners go multilingual", arxiv (2022)
  • S. Black et al., "GPT-NeoX-20B: An open-source autoregressive language model", arxiv (2022)
  • S. Soltan et al., "AlexaTM 20B: Few-shot learning using a large-scale multilingual seq2seq model", arxiv (2022)
  • Y. Wang et al., "Benchmarking generalization via in-context instructions on 1,600+ language tasks" (2022)
  • J. Ni et al., "Sentence-T5: Scalable Sentence Encoders from Pre-trained Text-to-Text Models", ACL (2022)
  • K. Heffernan et al., "Bitext Mining Using Distilled Sentence Representations for Low-Resource Languages", arxiv (2022)
  • F. Feng et al., "Language-agnostic BERT Sentence Embedding", ACL (2022)
  • J. FitzGerald et al., "MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages", arxiv (2022)
  • H. Madabushi et al., "SemEval-2022 Task 2: Multilingual Idiomaticity Detection and Sentence Embedding", arxiv (2022)
  • O. Serikov et al., "Universal and Independent: Multilingual Probing Framework for Exhaustive Model Interpretation and Evaluation", arxiv (2022)
  • A. Névéol et al., "French CrowS-Pairs: Extending a challenge dataset for measuring social bias in masked language models to a language other than English", ACL (2022)
  • P. Liang et al., "Holistic evaluation of language models", arxiv (2022)
Other links