MLP-Mixer: An all-MLP Architecture for Vision

Comparable results can be achieved on vision tasks without using a CNN or a ViT (self-attention), simply by using MLPs.


Like in Vision Transformers, image patches are fed as input tokens and processed through MLPs.
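As a rough sketch of this tokenization step (a minimal NumPy illustration, not the paper's code; the 224×224 input and 16×16 patch size are assumed here as a typical configuration):

```python
import numpy as np

def to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping patch tokens.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    one flattened token per patch, as fed to the Mixer.
    """
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image dims must divide the patch size"
    # (H//p, p, W//p, p, C) -> (H//p, W//p, p, p, C): group pixels by patch
    patches = image.reshape(H // p, p, W // p, p, C).transpose(0, 2, 1, 3, 4)
    return patches.reshape(-1, p * p * C)

# Example: a 224x224 RGB image with 16x16 patches gives 196 tokens of dim 768
tokens = to_patches(np.zeros((224, 224, 3)), patch_size=16)
print(tokens.shape)  # (196, 768)
```

In practice these flattened patches are then linearly projected to a fixed hidden dimension before entering the Mixer blocks.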


This technique is called MLP-Mixer. It has two kinds of MLPs: one interacts with channels, and the other interacts across the spatial region.


[Figure: my interpretation of the architecture]
[Figure: architecture from the paper]

The architecture is largely self-explanatory. The main idea is that there are two types of MLP layers:

MLP1 deals with the channel component (within its own patch), while MLP2 interacts across the spatial region (with the other patches).
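A minimal NumPy sketch of one Mixer block under these names (MLP1 = channel-mixing, MLP2 = token-mixing; the paper applies token-mixing first and also layer-normalizes before each MLP, which is omitted here for brevity — all shapes and parameter names below are illustrative assumptions):

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer MLP with a tanh-approximate GELU, applied along the last axis."""
    h = x @ w1 + b1
    h = 0.5 * h * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return h @ w2 + b2

def mixer_block(x, token_params, channel_params):
    """One Mixer block on x of shape (num_patches, channels).

    Token mixing (MLP2): transpose so the MLP runs across patches.
    Channel mixing (MLP1): the MLP runs within each patch's channels.
    Both use residual connections, as in the paper.
    """
    x = x + mlp(x.T, *token_params).T   # mix across the spatial dimension
    x = x + mlp(x, *channel_params)     # mix within each patch's channels
    return x

# Toy usage with assumed sizes: 196 patches, 768 channels, hidden dim 256
rng = np.random.default_rng(0)
S, C, D = 196, 768, 256
tok = (rng.normal(size=(S, D)) * 0.02, np.zeros(D),
       rng.normal(size=(D, S)) * 0.02, np.zeros(S))
chan = (rng.normal(size=(C, D)) * 0.02, np.zeros(D),
        rng.normal(size=(D, C)) * 0.02, np.zeros(C))
out = mixer_block(rng.normal(size=(S, C)), tok, chan)
print(out.shape)  # (196, 768)
```

Note that the token-mixing MLP shares its weights across all channels, and the channel-mixing MLP shares its weights across all patches.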

Points to Ponder:

1. The parameter-sharing property is still not exploited as fully as it is in CNNs.

2. As with Vision Transformers, it would be interesting to observe what happens if we change the order of the patches or use overlapping patches.

Link to the paper:





Rakshith V.