[Log] Vision transformers done
progress: https://x.com/gaussian_noise_/status/1959236617908183255
this is a stack of plain encoder blocks with a BERT-like CLS token.
> be vision transformer
> chunk image into 2D patches (only place where we use 2D spatial info)
> the patch embedding is a strided conv, which is really just a linear projection of each flattened patch. treat them as tokens.
> add learnable positional embeddings, initialized with a truncated normal.
> prepend a CLS token; attention aggregates info from all the patches into it (since this is regular attention, not causal), very similar to attention sinks.
> standard blocks with attention, ff, norm.
> slice out the CLS token's output and pass it through the out-proj to classify into n classes.
> classification done. minimal sketch below.
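a minimal PyTorch sketch of the whole walkthrough above. the names and dims (img_size, patch_size, dim, depth, heads) are illustrative assumptions, not the exact config from my run:

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3,
                 dim=768, depth=12, heads=12, n_classes=1000):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # patch embedding: a strided conv, equivalent to a linear
        # projection of each flattened patch
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # learnable CLS token and positional embeddings, trunc-normal init
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        # standard encoder blocks: non-causal self-attention + ff + norm
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend CLS token
        x = x + self.pos_embed                 # add positional info
        x = self.blocks(x)                     # full (non-causal) attention
        cls_out = self.norm(x[:, 0])           # keep only the CLS output
        return self.head(cls_out)              # logits over n classes

# quick smoke test with a small config
logits = ViT(dim=192, depth=2, heads=3, n_classes=10)(
    torch.randn(2, 3, 224, 224))               # -> (2, 10)
```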
more learnings
> weak inductive bias; 2D structural info is used only when making the patches.
> mogs CNNs when trained on enough data.
> overfits easily. data augmentation is needed (see the augmentation sketch after this list).
> followed the HF recipe to reproduce it, for the most part.
> the classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time (sketch below).
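a quick sketch of the two head variants from that last point. the tanh in the pre-training head is an assumption taken from the reference ViT implementation, not something stated above:

```python
import torch.nn as nn

def make_head(dim: int, n_classes: int, pretraining: bool) -> nn.Module:
    if pretraining:
        # MLP with one hidden layer, used at pre-training time
        return nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                             nn.Linear(dim, n_classes))
    # single linear layer, used at fine-tuning time
    return nn.Linear(dim, n_classes)
```

and a hedged sketch of the kind of augmentation that helps, assuming the common RandomResizedCrop + RandAugment combo; the exact HF recipe may differ:

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),         # random crop + resize
    transforms.RandomHorizontalFlip(),         # cheap label-preserving flip
    transforms.RandAugment(),                  # stronger policy-based augs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # ViT-style norm
])
```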