The famous 2017 paper "Attention Is All You Need" changed the way we think about attention. It showed that, with enough data, matrix multiplications, linear layers, and layer normalization are all we need for state-of-the-art machine translation.
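To make those ingredients concrete, here is a minimal PyTorch sketch, an illustration rather than the paper's full model, of one transformer sub-layer built from exactly the pieces mentioned above: linear layers (matrix multiplications), a non-linearity, layer normalization, and a residual connection. The class name is hypothetical; the dimensions 512 and 2048 are the defaults from the original paper.

```python
import torch
import torch.nn as nn

class FeedForwardSubLayer(nn.Module):
    """Position-wise feed-forward sub-layer, sketched for illustration."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),   # matrix multiplication + bias
            nn.ReLU(),                  # non-linearity
            nn.Linear(d_ff, d_model),   # project back to model dimension
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection followed by layer normalization
        return self.norm(x + self.ff(x))

# Toy usage: a batch of 2 sequences, 10 tokens each, 512-dim embeddings
x = torch.randn(2, 10, 512)
print(FeedForwardSubLayer()(x).shape)  # torch.Size([2, 10, 512])
```

Stack sub-layers like this one with attention in between and you have the skeleton of a transformer block; the attention part is what the rest of this article unpacks.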
Nonetheless, 2020 was definitely the year of transformers! Having started in natural language processing, they have now moved into computer vision tasks. How did we go from attention to self-attention? Why does the transformer work so damn well? What are the critical components for its success?