10. Vision Transformers (ViT)
1. Images as Words
ViT shows that convolution is not necessary for image recognition: a pure Transformer applied directly to sequences of image patches can match or exceed state-of-the-art CNNs.
- Cut the image into 16x16 pixel patches (a 224x224 input yields 14x14 = 196 patches).
- Flatten each patch and project it with a shared linear layer into an embedding vector; these are the tokens.
- Add positional embeddings and feed the token sequence to a standard Transformer encoder (see the sketch after this list).
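A minimal PyTorch sketch of the patchify-and-project step. The class name `PatchEmbed` and the ViT-Base/16 defaults (16x16 patches, 768-dim tokens, 224x224 input) are illustrative assumptions, not something specified in this lesson:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Cut an image into non-overlapping patches and project each to a token."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A conv with kernel == stride == patch size applies one shared linear
        # projection to every non-overlapping patch in a single pass.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)  # (B, 196, 768): one token per patch

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```

The strided convolution is just an efficient way to flatten and linearly project every patch at once; it is mathematically identical to slicing the patches out and multiplying each by the same weight matrix.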
2. Global Context
CNNs build context through small local windows (e.g., 3x3 kernels), so distant pixels interact only after many stacked layers. ViT looks at the whole image at once: via self-attention, every patch token can attend to every other patch, even in the first layer. Given massive pre-training datasets such as Google's JFT-300M, ViT scales better than comparable CNNs.
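To make "the whole image at once" concrete, a short sketch: one self-attention layer over 196 patch tokens produces a 196x196 attention map, so even a single layer connects every patch to every other. The sizes (768-dim tokens, 12 heads) follow ViT-Base; the random tokens are placeholders for the patch embeddings above:

```python
import torch
import torch.nn as nn

embed_dim, num_patches = 768, 196          # ViT-Base/16 on a 224x224 image
attn = nn.MultiheadAttention(embed_dim, num_heads=12, batch_first=True)

tokens = torch.randn(1, num_patches, embed_dim)  # stand-in for patch embeddings
out, weights = attn(tokens, tokens, tokens)      # every token queries every other

print(out.shape)      # torch.Size([1, 196, 768])
print(weights.shape)  # torch.Size([1, 196, 196]): each patch attends to all 196
```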