DINOv2 Specifically

Meta A released the self-supervised DINOv2, a ‘foundational model for computer vision’

Broad specs:

  • Family: Pretrained Visual transformers
  • Training: Self-supervised, Distillation
  • Size: 1 Billion Params
    • Smaller models exist: 21M, 86M, 300M, 1100M
  • Input: Image
  • Output: Embeddings / Visual Features
    • A tensor:
      • Patch size of 14x14
        • This means a 224x224 image gets divided into a 16x6 grid of 14x14 patches. Each patch gets its own embedding, helps to encode localization
      • Embedding dimensions ranging from 384-1536 depending on size of model
      • For a 224x224 image on small model: [16][16][384]
    • Also a CLS token as well
      • CLS token useful for whole image classification, one flat vector embedding

Using outputs:

Embeddings
  • Embeddings can be used as inputs to smaller models:
    • Semantic segmentation
    • Depth estimation
  • Embeddings can be frozen
    • Dramatically simplifies training, DINO embeddings generated once, speeds up smaller model training loop
    • Allows smaller computers to train, because they don’t have to continuously run the DINO model
Attention Heads

  • Unrelated cool visual of ViT attention heads

Attention heads encode some higher level information of similarity

Embeddings also encode some degree of ‘whats in this region, in the context of the whole image’

Can be used as well: If one attention head successfuly seperates a target from the background, these attention maps can be used as inputs to a smaller model as well

Model Architecture

Visual Transformer

Training

Model Distillation

Transfering knowledge from a large trained model into a smaller model

  • Better results than training smaller models directly
    A smaller model attempts to match generations (embeddings in this case) of larger model
  • Minimize difference
    Teacher frozen, student continuously running
Self Supervised Learning

Self-supervised learning (SSL) is a subset of unsupervised learning that trains models on unlabeled data by automatically generating “pseudo-labels” from the data’s own structure

Why use unsupervised learning? Can use uncurated data

  • There is far more unlabeled data than labeled data, smart training methods allow models to take advantage of this scale difference
  • Supervised learning can be brittle, if we train model on labeled animals, they might not learn any information but things about animals. For a foundational model, its imperative that it has general knowledge
  • Better pixel level understanding

DINO approach:

  1. The teacher and the student are fundamentally the same model, just the teacher is: “an older, momentum-averaged version of the student”
  2. Labels come from cropping:
    • Teacher gets the global view and produces the ‘target embeddings’
    • Student gets the multiple cropped views (a corner of an image, the center etc) and is trained to produce the same embeddings as the teacher
      • Make sure the ear of a cat and the tail of cat produce the same output as the teacher, who is labeling a full cat

Data Source

Resources

yt vid 1 - DINOv2 from Meta AI - Finally a Foundational Model in Computer Vision?
yt vid 2 - DINOv2 Explained: Visual Model Insights & Comprehensive Code Guide

blog - DINOv2 - Steps by steps explanations - Picsellia
blog - DINOv2: Self-supervised Learning Model Explained
blog - DINOv2: A Complete Guide to Self-Supervised Learning and Vision Transformers

yt/blog writeup - Mashaan blog: Inside the DINOv2 Architecture