BEiT: BERT Pre-Training of Image Transformers

We all know how successful BERT has been for NLP applications. BERT is built on the transformer architecture, and recently transformers have had significant success in vision and audio as well. In the last year, we have seen a lot of work in the vision domain on the use of transformers (DINO, An Image is Worth 16x16 Words, etc.). One of the key ideas is to treat an image patch as a token, analogous to a text token in NLP.

Huge language models like BERT and GPT benefit largely from pre-training on a large corpus. Vision transformers, in contrast, mostly rely on different variants of contrastive learning as the pre-training task. Though they achieve close to SOTA, they still require a lot more data than conventional convolution-based neural networks. Pre-training through contrastive learning also has certain limitations, such as the dependency on a high number of negative samples and the risk of mode collapse. Of course, there are works that try to address these limitations as well.

The current work, “BEiT: BERT Pre-Training of Image Transformers”, introduces a way of pre-training image transformers that is similar to BERT's.

BERT uses MLM (Masked Language Modelling) and NSP (Next Sentence Prediction) as the objectives for its pre-training task. As the name suggests, for MLM we mask certain tokens and ask the model to predict them. For NSP, we give the model a pair of sentences and ask it to predict whether the second sentence actually follows the first. Now, how about using a similar objective for images?
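
Before moving to images, it may help to see the MLM idea on a toy sentence. The snippet below is only a minimal sketch: the function name `mask_tokens`, the sample sentence, and the 15% ratio are illustrative choices, and it skips details of the real BERT recipe (such as sometimes keeping or randomly replacing the selected tokens).

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_ratio=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    that the model would be trained to predict.
    """
    rng = random.Random(seed)
    num_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        masked[pos] = MASK
    return masked, targets

tokens = "the cat sat on the mat near the door".split()
masked, targets = mask_tokens(tokens)
print(masked)   # input shown to the model
print(targets)  # what the model has to recover
```

The open question for vision is what plays the role of `targets` when the input is an image.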

So, if we mask certain image patches and ask the model to predict the raw patch, the task becomes very hard: each patch is 16*16 pixels, and every pixel value can range from 0 to 255. In BEiT, the authors approach this problem with the help of the image tokenizer proposed in Zero-Shot Text-to-Image Generation.

When we tokenize text, every word, sub-word, or character (depending on the tokenization algorithm used) is assigned a corresponding integer, and an embedding is learned for it. For images, the authors of Zero-Shot Text-to-Image Generation trained an image tokenizer that converts a 256*256 image into a 32*32 grid of tokens, where each token is an integer id from a vocabulary of 8192 entries (0 to 8191). This tokenizer was trained as a dVAE (discrete variational autoencoder). The BEiT authors did not train their own image tokenizer; instead, they used the publicly available one released by the authors of Zero-Shot Text-to-Image Generation. This is the main technique that made it possible to design an objective analogous to MLM for vision.
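
To get a feel for the numbers, here is a small sketch of what the tokenizer output looks like. It is not the released tokenizer: `toy_tokenize` and the tiny hand-written logits are hypothetical stand-ins, under the assumption that the encoder produces one score per vocabulary entry at every grid position and the token id is the highest-scoring entry.

```python
IMAGE_SIZE = 256    # input resolution mentioned above
GRID_SIZE = 32      # token grid mentioned above
VOCAB_SIZE = 8192   # number of distinct visual tokens

downsample = IMAGE_SIZE // GRID_SIZE   # each token summarises an 8*8 pixel region
num_tokens = GRID_SIZE * GRID_SIZE     # tokens per image
print(downsample, num_tokens)          # 8 1024

def toy_tokenize(logits_per_position):
    """Pick the highest-scoring vocabulary id at each grid position (argmax)."""
    return [
        max(range(len(logits)), key=logits.__getitem__)
        for logits in logits_per_position
    ]

# Toy example: 4 grid positions and a vocabulary of only 4 entries.
toy_logits = [
    [0.1, 2.0, -1.0, 0.3],
    [1.5, 0.2, 0.1, 0.0],
    [0.0, 0.0, 0.0, 3.0],
    [-2.0, -1.0, 0.5, 0.4],
]
token_ids = toy_tokenize(toy_logits)
print(token_ids)  # [1, 0, 3, 2]
```

So instead of regressing hundreds of pixel values per patch, the model only has to pick one id out of `VOCAB_SIZE` per position, which is the same kind of classification problem as MLM.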

In BERT, N% of the tokens are masked at random positions, but in BEiT the image patches (tokens) are masked in blocks rather than individually at random. Overall, about 40% of the patches are masked, but in blocks. Each block consists of a minimum of 16 patches, and the aspect ratio of the block is also chosen randomly.
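
A rough sketch of blockwise masking is shown below. The 40% ratio and the 16-patch minimum come from the description above; the 14*14 patch grid (a 224*224 image with 16*16 patches), the aspect-ratio range of 0.3 to 1/0.3, and the function name `blockwise_mask` are assumptions made for illustration, so treat this as one possible reading rather than the authors' exact procedure.

```python
import math
import random

def blockwise_mask(height=14, width=14, mask_ratio=0.4,
                   min_block=16, min_aspect=0.3, seed=0):
    """Return a set of (row, col) patch positions masked in rectangular blocks."""
    rng = random.Random(seed)
    target = int(mask_ratio * height * width)
    masked = set()
    while len(masked) < target:
        remaining = target - len(masked)
        # Block area: at least min_block patches, up to what is still needed.
        size = rng.uniform(min_block, max(min_block, remaining))
        # Random aspect ratio (assumed range), then derive block height and width.
        aspect = rng.uniform(min_aspect, 1 / min_aspect)
        block_h = min(height, max(1, round(math.sqrt(size * aspect))))
        block_w = min(width, max(1, round(math.sqrt(size / aspect))))
        # Random top-left corner such that the block fits inside the grid.
        top = rng.randint(0, height - block_h)
        left = rng.randint(0, width - block_w)
        for i in range(top, top + block_h):
            for j in range(left, left + block_w):
                masked.add((i, j))
    return masked

mask = blockwise_mask()
for i in range(14):
    print("".join("#" if (i, j) in mask else "." for j in range(14)))
print(len(mask), "of", 14 * 14, "patches masked")
```

Because blocks can overlap and the last block may overshoot, the final count can end up slightly above the 40% target. Masking contiguous regions makes the task harder than masking isolated patches, since the model cannot simply copy from immediate neighbours.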

The final workflow looks like this (refer to the architecture diagram):

  1. The image is divided into a grid of patches (tokens).
  2. Blocks of patches are masked randomly.
  3. Each image patch is flattened into a vector.
  4. Patch embeddings and positional embeddings are learned for the patches.
  5. These embeddings are passed through a BERT-like architecture.
  6. For the masked positions, the model has to predict the image token.
  7. The target tokens come from the image tokenizer.
  8. Finally, image data can be reconstructed from the tokens.
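
Steps 6 and 7 boil down to a classification loss over the visual vocabulary, computed only at the masked positions. The sketch below illustrates that idea in plain Python with a toy vocabulary of 8 entries; the helper names and the toy numbers are made up for the example, and a real implementation would use a deep learning framework and the model's actual outputs.

```python
import math

def softmax_cross_entropy(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def masked_image_modeling_loss(logits_per_patch, visual_tokens, masked_positions):
    """Average cross-entropy over masked patches only."""
    losses = [
        softmax_cross_entropy(logits_per_patch[p], visual_tokens[p])
        for p in masked_positions
    ]
    return sum(losses) / len(losses)

TOY_VOCAB = 8
visual_tokens = [3, 5, 0, 7]   # ids from the image tokenizer, one per patch
masked_positions = [1, 3]      # patches hidden from the model

# A model with no idea: uniform scores, so the loss equals log(vocab size).
uniform_logits = [[0.0] * TOY_VOCAB for _ in visual_tokens]
uniform_loss = masked_image_modeling_loss(uniform_logits, visual_tokens, masked_positions)

# A model that is confident about the right token at every position.
confident_logits = [
    [10.0 if v == tok else 0.0 for v in range(TOY_VOCAB)]
    for tok in visual_tokens
]
confident_loss = masked_image_modeling_loss(confident_logits, visual_tokens, masked_positions)

print(uniform_loss, confident_loss)
```

Unmasked patches contribute nothing to the loss, just as unmasked words contribute nothing in MLM.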

Once pre-training is done, the model can be fine-tuned on downstream tasks like ImageNet classification and segmentation.

Link to the paper: https://arxiv.org/pdf/2106.08254.pdf