[Log] Vision transformers done
progress: https://x.com/gaussian_noise_/status/1959236617908183255
this is a stack of plain encoder blocks with a BERT-like CLS token.
> be vision transformer
> chunk image into 2D patches (only place where we use 2D spatial info)
> the patch embedding is a strided conv, which is really just a linear projection of each flattened patch. treat them as tokens.
> add learnable positional embeddings, initialized with a truncated normal.
> prepend a CLS token; attention aggregates info from all the patches into it (since this is regular attention, not causal), very similar to attention sinks.
> standard blocks with attention, ff, norm.
> slice out the CLS token's output and pass it through the out-proj to classify into n classes.
> classification done. minimal sketch below.
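a minimal PyTorch sketch of the whole walkthrough above. the names and dims (img_size, patch_size, dim, depth, heads) are illustrative assumptions, not the exact config from my run:

```python
import torch
import torch.nn as nn

class ViT(nn.Module):
    def __init__(self, img_size=224, patch_size=16, in_ch=3,
                 dim=768, depth=12, heads=12, n_classes=1000):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # patch embedding: a strided conv, equivalent to a linear
        # projection of each flattened patch
        self.patch_embed = nn.Conv2d(in_ch, dim, kernel_size=patch_size,
                                     stride=patch_size)
        # learnable CLS token and positional embeddings, trunc-normal init
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        nn.init.trunc_normal_(self.pos_embed, std=0.02)
        nn.init.trunc_normal_(self.cls_token, std=0.02)
        # standard encoder blocks: non-causal self-attention + ff + norm
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=depth)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.patch_embed(x)                # (B, dim, H/ps, W/ps)
        x = x.flatten(2).transpose(1, 2)       # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)         # prepend CLS token
        x = x + self.pos_embed                 # add positional info
        x = self.blocks(x)                     # full (non-causal) attention
        cls_out = self.norm(x[:, 0])           # keep only the CLS output
        return self.head(cls_out)              # logits over n classes

# quick smoke test with a small config
logits = ViT(dim=192, depth=2, heads=3, n_classes=10)(
    torch.randn(2, 3, 224, 224))               # -> (2, 10)
```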
more learnings
> weak inductive bias; 2D structural info is used only when making the patches.
> mogs CNNs when trained on enough data.
> overfits easily. data augmentation is needed (see the augmentation sketch after this list).
> followed the HF recipe to reproduce it, for the most part.
> the classification head is implemented by an MLP with one hidden layer at pre-training time and by a single linear layer at fine-tuning time (sketch below).
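a quick sketch of the two head variants from that last point. the tanh in the pre-training head is an assumption taken from the reference ViT implementation, not something stated above:

```python
import torch.nn as nn

def make_head(dim: int, n_classes: int, pretraining: bool) -> nn.Module:
    if pretraining:
        # MLP with one hidden layer, used at pre-training time
        return nn.Sequential(nn.Linear(dim, dim), nn.Tanh(),
                             nn.Linear(dim, n_classes))
    # single linear layer, used at fine-tuning time
    return nn.Linear(dim, n_classes)
```

and a hedged sketch of the kind of augmentation that helps, assuming the common RandomResizedCrop + RandAugment combo; the exact HF recipe may differ:

```python
from torchvision import transforms

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),         # random crop + resize
    transforms.RandomHorizontalFlip(),         # cheap label-preserving flip
    transforms.RandAugment(),                  # stronger policy-based augs
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # ViT-style norm
])
```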