Evaluating Large Language Models Trained on Code (Codex)



Summary: A video description of the paper entitled "Evaluating Large Language Models Trained on Code" by M. Chen et al., published on arXiv in July 2021.
Paper: The paper can be found on arXiv here.
Topics: codex, language models, foundation models, coding
Slides: link (pdf)

Acknowledgements: Particular thanks are due to Almut Sophia Koepke for her help with decoding some of the code produced by Codex and other technical details.

References
  • H. Simon, "Experiments with a heuristic compiler", JACM (1963)
  • Z. Manna et al., "Toward automatic program synthesis", Communications of the ACM (1971)
  • T. McCabe, "A complexity measure", IEEE Trans. Softw. Eng. (1976)
  • S. Hochreiter et al., "Long short-term memory", Neural Computation (1997)
  • K. Papineni et al., "BLEU: a method for automatic evaluation of machine translation", ACL (2002)
  • A. Hindle et al., "On the naturalness of software", ICSE (2012)
  • A. Graves, "Generating sequences with recurrent neural networks", arXiv (2013)
  • T. Mikolov et al., "Efficient estimation of word representations in vector space", arXiv (2013)
  • W. Zaremba et al., "Learning to execute", arXiv (2014)
  • M. Clarkson et al., "Temporal logics for hyperproperties", POST (2014)
  • I. Sutskever et al., "Sequence to sequence learning with neural networks", NeurIPS (2014)
  • A. Dai et al., "Semi-supervised sequence learning", NeurIPS (2015)
  • D. P. Kingma et al., "Adam: A method for stochastic optimization", ICLR (2015)
  • T. Helmuth et al., "General program synthesis benchmark suite", GECCO (2015)
  • A. van den Oord et al., "Pixel recurrent neural networks", ICML (2016)
  • A. van den Oord et al., "WaveNet: A generative model for raw audio", arXiv (2016)
  • A. Gaunt et al., "TerpreT: A probabilistic programming language for program induction", arXiv (2016)
  • A. Vaswani et al., "Attention is all you need", NeurIPS (2017)
  • A. Das et al., "Visual dialog", CVPR (2017)
  • E. Pantridge et al., "On the difficulty of benchmarking inductive program synthesis methods", GECCO (2017)
  • K. Crawford, "The trouble with bias", NeurIPS (2017)
  • M. E. Peters, "Deep contextualized word representations", arXiv (2018)
  • A. Radford et al., "Improving language understanding by generative pre-training" (2018)
  • J. Menick et al., "Generating high fidelity images with subscale pixel networks and multidimensional upscaling", arXiv (2018)
  • P. Christiano, "Clarifying 'AI alignment'", https://ai-alignment.com/clarifying-ai-alignment-cec47cd69dd6 (2018)
  • A. van den Oord et al., "Representation learning with contrastive predictive coding", arXiv (2018)
  • D. Amodei et al., "AI and Compute", https://openai.com/blog/ai-and-compute/ (2018)
  • R. Child et al., "Generating long sequences with sparse transformers", arXiv (2019)
  • J. Devlin et al., "BERT: Pre-training of deep bidirectional transformers for language understanding", NAACL-HLT (2019)
  • E. Alley et al., "Unified rational protein engineering with sequence-based deep representation learning", Nature Methods (2019)
  • J. Lu et al., "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks", NeurIPS (2019)
  • H. Husain et al., "CodeSearchNet challenge: Evaluating the state of semantic code search", arXiv (2019)
  • S. Kulal et al., "SPoC: Search-based pseudocode to code", NeurIPS (2019)
  • A. Holtzman et al., "The Curious Case of Neural Text Degeneration", ICLR (2019)
  • N. Keskar et al., "CTRL: A conditional transformer language model for controllable generation", arXiv (2019)
  • N. Leveson, "Improving the Standard Risk Matrix: Part 1" (2019)
  • M. Chen et al., "Generative pretraining from pixels", ICML (2020)
  • P. Dhariwal et al., "Jukebox: A generative model for music", arXiv (2020)
  • A. Baevski et al., "wav2vec 2.0: A framework for self-supervised learning of speech representations", NeurIPS (2020)
  • L. Gao et al., "The Pile: An 800GB dataset of diverse text for language modeling", arXiv (2020)
  • Z. Feng et al., "CodeBERT: A pre-trained model for programming and natural languages", EMNLP (2020)
  • C. Raffel et al., "Exploring the limits of transfer learning with a unified text-to-text transformer", JMLR (2020)
  • C. Clement et al., "PyMT5: multi-mode translation of natural language and Python code with transformers", arXiv (2020)
  • T. Brown et al., "Language models are few-shot learners", NeurIPS (2020)
  • S. Ren et al., "CodeBLEU: a method for automatic evaluation of code synthesis", arXiv (2020)
  • B. Roziere et al., "Unsupervised translation of programming languages", NeurIPS (2020)
  • J. Kaplan et al., "Scaling laws for neural language models", arXiv (2020)
  • M. O’Neill et al., "Automatic programming: The open issue?", GPEM (2020)
  • N. Stiennon et al., "Learning to summarize with human feedback", NeurIPS (2020)
  • S. L. Blodgett et al., "Language (Technology) is Power: A Critical Survey of “Bias” in NLP", ACL (2020)
  • D. Acemoglu, "The wrong kind of AI? Artificial intelligence and the future of labour demand", CJRES (2020)
  • R. Schwartz et al., "Green AI", Communications of the ACM (2020)
  • B. Smith, "Microsoft will be carbon negative by 2030", https://blogs.microsoft.com/blog/2020/01/16/microsoft-will-be-carbon-negative-by-2030/ (2020)
  • A. Ramesh et al., "Zero-shot text-to-image generation", ICML (2021)
  • M. Chen et al., "Evaluating large language models trained on code", arXiv (2021)
  • H. Bao et al., "BEiT: BERT Pre-Training of Image Transformers", ICLR (2021)
  • A. Rives et al., "Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences", PNAS (2021)
  • R. Zellers et al., "MERLOT: Multimodal neural script knowledge models", NeurIPS (2021)
  • D. Hendrycks et al., "Measuring Coding Challenge Competence With APPS", NeurIPS (2021)
  • B. Wang et al., "GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model" (2021)
  • S. Black et al., "GPT-Neo: Large Scale Autoregressive Language Modeling with Mesh-Tensorflow" (2021)
  • Z. Kenton et al., "Alignment of Language Agents" (2021)
  • E. M. Bender et al., "On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?", FAccT (2021)
  • A. Abid et al., "Persistent anti-Muslim bias in large language models", AIES (2021)
  • N. Carlini et al., "Extracting training data from large language models", USENIX Security (2021)
  • K. Crawford, "The atlas of AI: Power, politics, and the planetary costs of artificial intelligence", Yale University Press (2021)
  • A. Ziegler, "A first look at rote learning in GitHub Copilot suggestions" (2021)
  • F. Xu et al., "In-IDE code generation from natural language: Promise and challenges", TOSEM (2022)