Einsum equation:

It’s an elegant way to perform matrix and vector manipulations.

I find it extremely useful when I have to multiply higher-dimensional matrices; it gives great flexibility to sum and multiply along specific axes.

Ex: if you have to multiply…
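As a sketch of that flexibility, here is a minimal NumPy example (the array shapes are made up for illustration): a batched matrix multiplication, and the same contraction with the batch axis summed away simply by omitting it from the output spec.

```python
import numpy as np

A = np.random.rand(8, 3, 4)  # batch of 8 matrices, each 3x4
B = np.random.rand(8, 4, 5)  # batch of 8 matrices, each 4x5

# Batched matrix multiply: contract over the shared axis k.
C = np.einsum('bik,bkj->bij', A, B)  # shape (8, 3, 5)

# Dropping b from the output spec also sums over the batch axis.
D = np.einsum('bik,bkj->ij', A, B)   # shape (3, 5)
```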

Emerging Properties in Self-Supervised Vision Transformers (DINO)

Why:

Its features capture the semantic properties of an image better than a normally trained ViT, and it achieves good accuracy with k-NN classifiers, which means the representations of different classes are well separated for final classification.

How:

Similar to typical contrastive learning, different augmented views are passed…
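A minimal sketch of the self-distillation loss at the heart of DINO, assuming a student network and a momentum (EMA) teacher that see different augmented views; the temperatures and tensor sizes here are illustrative, and the multi-crop augmentation and EMA teacher update are omitted:

```python
import torch
import torch.nn.functional as F

def dino_loss(student_out, teacher_out, center,
              student_temp=0.1, teacher_temp=0.04):
    # Teacher targets: centered and sharpened, with gradients stopped.
    targets = F.softmax((teacher_out - center) / teacher_temp, dim=-1).detach()
    # Student log-probabilities at a higher (softer) temperature.
    log_probs = F.log_softmax(student_out / student_temp, dim=-1)
    # Cross-entropy between the teacher's view and the student's view.
    return -(targets * log_probs).sum(dim=-1).mean()

# Illustrative usage: projection-head outputs for two views of a batch.
student_out = torch.randn(32, 256)
teacher_out = torch.randn(32, 256)
loss = dino_loss(student_out, teacher_out, center=torch.zeros(256))
```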

MLP-Mixer: An all-MLP Architecture for Vision

Why:

Comparable results can be achieved on vision tasks without using CNNs or ViTs (self-attention), simply by using MLPs.

How:

As in Vision Transformers, image patches are fed in as input tokens and processed through MLPs.

What:

This technique is called MLP-Mixer. It has two kinds of MLPs: one that mixes across channels and another that mixes across the spatial (patch) dimension.

TL;DR

My interpretation of the architecture
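A minimal PyTorch sketch of one Mixer block as I understand it: a token-mixing MLP applied across the patch axis, then a channel-mixing MLP across features, each with LayerNorm and a skip connection (the hidden sizes are illustrative, not the paper's exact values):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    def __init__(self, num_patches, dim, token_hidden=256, channel_hidden=512):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(        # mixes across patches
            nn.Linear(num_patches, token_hidden),
            nn.GELU(),
            nn.Linear(token_hidden, num_patches),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(      # mixes across channels
            nn.Linear(dim, channel_hidden),
            nn.GELU(),
            nn.Linear(channel_hidden, dim),
        )

    def forward(self, x):                      # x: (batch, patches, dim)
        # Token mixing: transpose so the MLP acts along the patch axis.
        y = self.norm1(x).transpose(1, 2)      # (batch, dim, patches)
        x = x + self.token_mlp(y).transpose(1, 2)
        # Channel mixing: MLP acts along the feature axis.
        return x + self.channel_mlp(self.norm2(x))
```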

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Why:

An adaptation of NLP’s famous Transformer architecture to vision tasks; state-of-the-art results are achieved with relatively fewer computational resources than convolutional networks.

How:

Convert the image into a sequence of patches and treat them as tokens…
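A minimal sketch of that patch-to-token step; the 16x16 patch size comes from the title, while the image size and embedding dimension are assumptions (a strided convolution is a common way to implement the linear patch projection):

```python
import torch
import torch.nn as nn

patch_size, dim = 16, 768
img = torch.randn(1, 3, 224, 224)  # (batch, channels, height, width)

# Cut the image into non-overlapping 16x16 patches and linearly
# project each one in a single strided convolution.
to_patches = nn.Conv2d(3, dim, kernel_size=patch_size, stride=patch_size)
tokens = to_patches(img).flatten(2).transpose(1, 2)  # (1, 196, 768)

# These 196 patch tokens (plus a class token and position embeddings
# in the full model) go into a standard Transformer encoder.
```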

Rakshith V.
