Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. 
“Layer Normalization.” arXiv Preprint arXiv:1607.06450. 
https://arxiv.org/abs/1607.06450.
Bahdanau, Dzmitry, Kyunghyun Cho, and Yoshua Bengio. 2014. 
“Neural Machine Translation by Jointly Learning to Align and Translate.” arXiv Preprint arXiv:1409.0473. 
https://arxiv.org/abs/1409.0473.
Bengio, Yoshua, Réjean Ducharme, and Pascal Vincent. 2000. “A Neural Probabilistic Language Model.” Advances in Neural Information Processing Systems 13.
Brown, Tom B, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. 
“Language Models Are Few-Shot Learners.” arXiv Preprint arXiv:2005.14165. 
https://doi.org/10.48550/arXiv.2005.14165.
Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. 
“BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805. 
https://doi.org/10.48550/arXiv.1810.04805.
Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986. “Learning Representations by Back-Propagating Errors.” Nature 323 (6068): 533–36.
Sutskever, Ilya, Oriol Vinyals, and Quoc V Le. 2014. “Sequence to Sequence Learning with Neural Networks.” arXiv Preprint arXiv:1409.3215.
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. 
“Attention Is All You Need.” In 
Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), 11. Long Beach, CA, USA. 
https://arxiv.org/abs/1706.03762.