Deep Learning · November 2024 · 1 min read

Vision Transformer + Masked Autoencoder

A ViT classifier reaching 73.5% accuracy on CIFAR-10 from scratch; self-supervised MAE pretraining then boosts finetuned accuracy to 76.8%. Full implementation of patchify, attention pooling, and masked reconstruction.

Python · PyTorch · Vision Transformers · Self-Supervised Learning

A pure attention-based image classifier with no convolutions: 73.5% CIFAR-10 accuracy from scratch, jumping to 76.8% after MAE self-supervised pretraining. Implemented patchify/unpatchify, multi-head attention pooling, and a full transformer encoder-decoder.
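The patchify/unpatchify pair mentioned above could look like the following minimal sketch. It assumes 32×32 CIFAR-10 images split into 4×4 patches (matching the spec below); the function names and exact tensor layout are illustrative, not the project's verbatim code.

```python
import torch

def patchify(imgs, patch_size=4):
    """(B, C, H, W) -> (B, num_patches, patch_size*patch_size*C)."""
    B, C, H, W = imgs.shape
    h, w = H // patch_size, W // patch_size
    x = imgs.reshape(B, C, h, patch_size, w, patch_size)
    x = x.permute(0, 2, 4, 3, 5, 1)  # (B, h, w, p, p, C)
    return x.reshape(B, h * w, patch_size * patch_size * C)

def unpatchify(x, patch_size=4, channels=3):
    """Inverse of patchify: (B, num_patches, D) -> (B, C, H, W)."""
    B, N, _ = x.shape
    h = w = int(N ** 0.5)  # assumes a square grid of patches
    x = x.reshape(B, h, w, patch_size, patch_size, channels)
    x = x.permute(0, 5, 1, 3, 2, 4)  # (B, C, h, p, w, p)
    return x.reshape(B, channels, h * patch_size, w * patch_size)
```

For 32×32 RGB inputs this yields 64 patches of dimension 48 per image, and the two functions are exact inverses.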

Architecture: ViT with 4 transformer layers, 256-dim embeddings, 4×4 patches
Pretraining: MAE with 75% mask ratio, asymmetric encoder-decoder
Results: ViT from scratch 73.5% · MAE finetune 76.8% (CIFAR-10)
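One way the attention pooling head could be sketched, assuming the 256-dim embeddings and 4 heads from the spec above: a single learned query attends over all patch tokens to produce one vector for the classifier. The class name and initialization are assumptions for illustration.

```python
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    """Pool a sequence of patch tokens into one vector via a learned query."""
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, 1, dim))  # learned pooling query
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens):  # tokens: (B, N, dim)
        q = self.query.expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)  # (B, 1, dim)
        return pooled.squeeze(1)                  # (B, dim)
```

Compared with mean pooling or a CLS token, the learned query lets the model weight informative patches more heavily when forming the classification feature.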

Masked Autoencoder Reconstruction

Reconstructs images from only 25% of the patches. The asymmetric design (a heavy encoder that sees only visible patches, plus a lightweight decoder) learns robust representations by forcing the model to predict missing content from minimal visible context.
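The random masking step behind this can be sketched as follows, in the per-sample shuffle style common to MAE implementations; the function name and returned values are illustrative assumptions, not the project's exact code.

```python
import torch

def random_masking(x, mask_ratio=0.75):
    """x: (B, N, D) patch embeddings. Keep a random 25% of patches per image."""
    B, N, D = x.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                   # one random score per patch
    ids_shuffle = noise.argsort(dim=1)         # ascending: lowest-noise patches kept
    ids_restore = ids_shuffle.argsort(dim=1)   # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]
    x_visible = torch.gather(x, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    mask = torch.ones(B, N)                    # 0 = visible, 1 = masked
    mask[:, :n_keep] = 0
    mask = torch.gather(mask, 1, ids_restore)  # back to original patch order
    return x_visible, mask, ids_restore
```

The encoder runs only on `x_visible`, which is what makes the heavy-encoder/light-decoder asymmetry cheap: at a 75% mask ratio the encoder processes a quarter of the tokens.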

Self-Supervised Learning Analysis

The 75% mask ratio is the sweet spot: lower ratios make the reconstruction task too easy to force useful features, while higher ratios leave too little context to reconstruct from. MAE pretraining also dramatically improves data efficiency: with only 10% of labels, MAE finetuning reaches 62% accuracy, while training from scratch reaches only 46%.

#vision-transformer #masked-autoencoder #self-supervised #transformers
