TensorLearn
Computer Vision Engineering
Module 10 of 11

10. Vision Transformers (ViT)

1. Images are Words

The Vision Transformer (ViT) demonstrates that convolutions are not required for strong image classification. The recipe:

  1. Cut the image into 16x16 pixel patches.
  2. Flatten each patch into a vector and linearly project it to a token embedding.
  3. Feed the token sequence (plus position embeddings) to a standard Transformer encoder.
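Step 1 above (the patch cut plus flatten) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the reference implementation; the learned linear projection and position embeddings from steps 2 and 3 are omitted here.

```python
import numpy as np

def patchify(image, patch_size=16):
    """Split an image of shape (H, W, C) into flattened patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    one row per patch, matching ViT's tokenization step.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    # Reshape into a grid of patches, then flatten each patch into a row.
    grid = image.reshape(h // patch_size, patch_size,
                         w // patch_size, patch_size, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # (rows, cols, pH, pW, C)
    return grid.reshape(-1, patch_size * patch_size * c)

# A 224x224 RGB image yields 14 * 14 = 196 tokens of dimension 16*16*3 = 768.
img = np.zeros((224, 224, 3))
tokens = patchify(img)
print(tokens.shape)  # (196, 768)
```

Note that 768 here is a coincidence of geometry (16 * 16 * 3); in the actual ViT-Base model, the learned projection also maps each patch to a 768-dimensional embedding.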

2. Global Context

A CNN layer sees only a small local window (e.g. 3x3), so global context is built up slowly over many layers. In ViT, Self-Attention lets every patch token attend to every other patch in a single layer. ViT has weaker built-in inductive biases than CNNs, so it underperforms them on small datasets, but it scales better and overtakes them when pretrained on massive datasets such as JFT-300M.
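The global-context claim can be made concrete with a minimal single-head self-attention sketch over the patch tokens. For simplicity this treats the tokens directly as queries, keys, and values (a real Transformer layer applies learned Q, K, V projections and uses multiple heads); the point is that the (N, N) weight matrix mixes every token with every other token in one step.

```python
import numpy as np

def self_attention(tokens):
    """Single-head scaled dot-product self-attention, without learned
    projections (illustration only): Q = K = V = tokens."""
    d_k = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d_k)         # (N, N) pairwise scores
    # Numerically stable softmax over all N tokens per row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens  # each output row is a mix of ALL patch tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 768))  # 196 patch tokens from a 224x224 image
out = self_attention(tokens)
print(out.shape)  # (196, 768)
```

Contrast with a 3x3 convolution, where each output position mixes only its 9 spatial neighbours: here the mixing weights span the whole image at once.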

