🤗 Datasets: A community library for natural language processing



Summary: A brief description of "🤗 Datasets: A community library for natural language processing" by Q. Lhoest et al., published as an EMNLP demo in 2021.
Source material: The original paper describing the library can be found here; the library itself is here, and the dataset hub is here. The library is under active development: since its launch it has expanded beyond NLP to cover computer vision and audio, and it now offers many features beyond those described in the paper. A short usage sketch follows below.
Topics: Hugging Face, machine learning, NLP
Slides: link (pdf)
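
For concreteness, here is a minimal sketch of how a dataset is typically loaded and processed with the library, assuming the standard `datasets` API; the dataset id "squad" is only an illustrative example, and any dataset on the hub can be addressed the same way.

    # Minimal sketch, assuming the standard `datasets` API; the dataset id
    # "squad" is only an illustrative example from the hub.
    from datasets import load_dataset

    # Download the dataset from the hub (cached locally, backed by Apache Arrow).
    squad = load_dataset("squad", split="train")

    print(squad.features)        # typed columns (id, title, context, question, answers)
    print(squad[0]["question"])  # rows are accessed by index, memory-mapped from disk

    # Transforms such as map/filter run over the Arrow-backed table and are cached.
    short = squad.filter(lambda ex: len(ex["question"]) < 64)
    print(len(short))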

References
  • P. F. Brown et al., "A statistical approach to language translation", COLING (1988)
  • M. P. Marcus et al. "Building a large annotated corpus of English: The Penn Treebank", Computational Linguistics (1993)
  • E. F. Sang et al., "Introduction to the CoNLL-2000 shared task: Chunking", arXiv (2000)
  • E. Hovy et al., "OntoNotes: the 90% solution", HLT-NAACL (2006)
  • S. Bird, "NLTK: the natural language toolkit", COLING/ACL Interactive Presentation Sessions (2006)
  • J. Nivre et al., "Universal dependencies v1: A multilingual treebank collection", LREC (2016)
  • P. Rajpurkar et al., "SQuAD: 100,000+ Questions for Machine Comprehension of Text", EMNLP (2016)
  • M. Honnibal et al., "spaCy 2: Natural Language Understanding with Bloom Embeddings, Convolutional Neural Networks and Incremental Parsing", GitHub (2017)
  • A. Wang et al., "GLUE: A multi-task benchmark and analysis platform for natural language understanding", ICLR (2019)
  • J. Johnson et al., "Billion-scale similarity search with GPUs", IEEE Transactions on Big Data (2019)
  • C. Clark et al., "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions", NAACL-HLT (2019)
  • C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer", JMLR (2020)
  • T. Gebru et al., "Datasheets for datasets", Communications of the ACM (2021)
  • Q. Lhoest et al., "Datasets: A community library for natural language processing", EMNLP Demo (2021)
  • A. McMillan-Major et al., "Reusable Templates and Guides For Documenting Datasets and Models for Natural Language Processing and Generation: A Case Study of the HuggingFace and GEM Data and Model Cards", GEM (2021)
  • S. Gehrmann et al., "The GEM benchmark: Natural language generation, its evaluation and metrics", arXiv (2021)
  • K. Goel et al., "Robustness Gym: Unifying the NLP evaluation landscape", arXiv (2021)