10. Vision Transformers (ViT)
1. Images as Words
ViT shows that convolution is not necessary for image recognition: a pure Transformer applied directly to sequences of image patches can match or exceed state-of-the-art CNNs.
- Cut the image into 16x16 pixel patches (a 224x224 input yields 14x14 = 196 patches).
- Flatten each patch and project it with a shared linear layer into an embedding vector; these are the tokens.
- Add positional embeddings and feed the token sequence to a standard Transformer encoder (see the sketch after this list).
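A minimal PyTorch sketch of the patchify-and-project step. The class name `PatchEmbed` and the ViT-Base/16 defaults (16x16 patches, 768-dim tokens, 224x224 input) are illustrative assumptions, not something specified in this lesson:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Cut an image into non-overlapping patches and project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch size applies one shared linear
        # projection to every non-overlapping patch in a single pass.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The strided convolution is just an efficient way to flatten and linearly project every patch at once; it is mathematically identical to slicing the patches out and multiplying each by the same weight matrix.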
2. Global Context
CNNs build context through small local windows (e.g., 3x3 kernels), so distant pixels interact only after many stacked layers. ViT looks at the whole image at once: via self-attention, every patch token can attend to every other patch, even in the first layer. Given massive pre-training datasets such as Google's JFT-300M, ViT scales better than comparable CNNs.
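To make "the whole image at once" concrete, a short sketch: one self-attention layer over 196 patch tokens produces a 196x196 attention map, so even a single layer connects every patch to every other. The sizes (768-dim tokens, 12 heads) follow ViT-Base; the random tokens are placeholders for the patch embeddings above:

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196          # ViT-Base/16 on a 224x224 image
attn = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)

tokens = torch.randn(1, num_patches, embed_dim)  # stand-in for patch embeddings
out, weights = attn(tokens, tokens, tokens)      # every token queries every other

print(out.shape)      # torch.Size([1, 196, 768])
print(weights.shape)  # torch.Size([1, 196, 196]): each patch attends to all 196
```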