A collection of resources to study Transformers in depth.
If you want an accessible overview of the original Transformer paper ("Attention Is All You Need"), Yannic Kilcher's video is a great starting point. For a more discussion-based introduction to Transformers, take a look at the recordings from AISC or the Microsoft Reading Group. Lastly, Rachel from Kaggle has a 3-part livestream series in which she reads and works through the paper while responding to viewers' questions.
Besides paper reviews, there are also incredible blog posts available. Jay Alammar's "The Illustrated Transformer", with its simple explanations and intuitive visualizations, is the best place to start understanding the different parts of the Transformer, such as self-attention, the encoder-decoder architecture and positional encoding. From there, I would read Peter Bloem's blog post, which is one of the best-written pieces I've encountered so far, with clear wording, beautiful graphics and understandable code. It goes into further detail on the self-attention mechanism and on variations of the Transformer, and it comes with accompanying PyTorch code. If you want to know more about different types of attention, head to Lilian Weng's blog.
- Jay Alammar - The Illustrated Transformer
- Peter Bloem - Transformers from scratch
- Lilian Weng - Attention? Attention!
- [Attention craving RNNs](https://towardsdatascience.com/attention-craving-rnns-a-journey-into-attention-mechanisms-eec840fbc26f)
- Positional Encoding in Transformers
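
To make the positional encoding idea from the posts above concrete, here is a minimal sketch of the sinusoidal scheme described in the original paper, where even dimensions use a sine and odd dimensions a cosine of `pos / 10000^(2i / d_model)`. The function name and the pure-Python list layout are assumptions made for illustration; the linked posts use their own (typically PyTorch or NumPy) implementations.

```python
import math


def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position encodings.

    Hypothetical helper for illustration; d_model is assumed to be even.
    """
    if d_model % 2 != 0:
        raise ValueError("d_model must be even in this sketch")
    table = []
    for pos in range(seq_len):
        row = []
        # even_dim plays the role of 2i in the paper's formula.
        for even_dim in range(0, d_model, 2):
            angle = pos / (10000 ** (even_dim / d_model))
            row.append(math.sin(angle))  # dimension 2i
            row.append(math.cos(angle))  # dimension 2i + 1
        table.append(row)
    return table


pe = sinusoidal_positional_encoding(seq_len=4, d_model=8)
print(pe[0])  # position 0: every sine is 0.0 and every cosine is 1.0
```

In a Transformer, a table like this is added element-wise to the token embeddings so that otherwise order-agnostic self-attention can distinguish positions.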
Beyond blog posts and paper reviews, you can also find some amazing lectures and talks on Transformers and self-attention online. Stanford CS224u's lecture goes into more detail on the math, BERT and other contextual vectors, whereas the Stanford CS224n guest lecture (by co-authors of the Transformer and Music Transformer) covers various use cases of self-attention. Rachel (and Jeremy) from fast.ai give another great overview of the Transformer in their Intro to NLP course. They also clear up some common points of confusion, such as the query, key, value system, and address the Transformer's application to language translation. Finally, if you want to hear more from the co-authors of the Transformer, Lukasz Kaiser and Ashish Vaswani each gave a wonderful talk on their work, at Pi School 2017 and RAAIS 2019 respectively.
- Stanford CS224u
- Stanford CS224n
- Pascal Poupart: CS480/680 Lecture 19: Attention and Transformer Networks
- fastai Introduction to Transformers
- Lukasz Kaiser's Talk
- Ashish Vaswani's Talk
- ChrisMcCormickAI - BERT series
- Giuliano Giacaglia - How Transformers Work
- Michael Phi - Illustrated Guide to Transformers Neural Network: A step by step explanation
One of the best ways to understand a concept is to implement it in code. Harvard NLP published an annotated version of the original paper with commented PyTorch code, which is discussed in one of the recorded AISC sessions linked below. If you prefer TensorFlow, there is also a TensorFlow 2.0 tutorial with a Colab notebook that you can run for free. Once you're comfortable with the basic concepts, check out the NAACL Tutorial on Transfer Learning. It has an amazing Colab that teaches you how to pre-train a GPT-2-like Transformer, fine-tune it and do multi-task learning, as well as a slide deck full of information about recent developments in transfer learning for natural language processing.
- Harvard NLP's The Annotated Transformer
- AISC Video Recording of Code Review in PyTorch
- Transformers in TensorFlow 2.0
- NAACL Tutorial Slides
- NAACL Tutorial Colab
- Question Classification w/ Transformers
- Mark Saroufim - Implementing BERT and transformers from scratch
- Aurélien Géron - NLP in action using Transformers
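
As a warm-up before the full implementations above, here is a minimal sketch of scaled dot-product self-attention, `softmax(QK^T / sqrt(d_k)) V`, in plain Python. It is not taken from any of the linked resources: the function names are hypothetical, and for brevity it skips the learned query/key/value projections, masking and multiple heads, feeding the same toy input in as queries, keys and values.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Toy "sentence" of three 2-dimensional token vectors (made-up numbers).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Self-attention: the same sequence serves as queries, keys and values.
attended = scaled_dot_product_attention(x, x, x)
print(attended)
```

Each output row is a weighted average of the value vectors, with weights determined by how strongly that token's query matches every key. The linked implementations add the learned linear projections, multiple heads and masking on top of this core.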
Since the original paper was published, there has been a massive wave of papers building on the Transformer. The most notable are BERT, GPT-2, XLNet and Reformer.
- Linformer
- Reformer
- Transformer-XL
- Evolved Transformer
- Image Transformer
- Music Transformer
- TTS Transformer
- Set Transformer
- Sparse Transformer
- Levenshtein Transformer
- BERT
- GPT-1
- GPT-2
- GPT-3
- UniLM
- XLNet
- MASS
- Adaptive Attention Spans
- All Attention Layers
- Large Memory Layers with Product Keys
- Jacob Devlin's ICML Talk
- AISC
- Yannic Kilcher
- The Illustrated BERT, ELMo, and co.
- Yashu Seth's BERT FAQ
- Chris McCormick's BERT Embeddings Tutorial
- Chris McCormick's BERT Fine-Tuning Tutorial