Parameter-Efficient Fine-Tuning for CLIP

Let’s face it: fine-tuning massive Vision-Language models like CLIP often feels like trying to parallel park a semi-truck in downtown Manhattan—it’s resource-heavy, slightly terrifying, and one wrong move leads to “catastrophic forgetting.” In our latest project, Parameter-Efficient Fine-Tuning for Vision-Language Models, we decided to trade the heavy machinery for a surgical scalpel. We explored how Parameter-Efficient Fine-Tuning (PEFT) techniques—like Prompt Tuning and Adapters—can help these models adapt to new tasks without needing a supercomputer or losing their “common sense” pre-trained knowledge.

We put these methods through a gauntlet of 19 diverse datasets, moving beyond the usual ImageNet comfort zone to see how they handle the “weird” stuff: satellite imagery, medical scans, and complex scenes that require actual reasoning (like estimating the elevation of a camera or the strategy of a game). The results were a bit of a reality check: while PEFT is an absolute rockstar at handling specialized data and small samples, it still gets a little “modal-mixed-up” when tasks get structurally complex. We’ve open-sourced our code and evaluation framework to help the community bridge this gap. The takeaway: you don’t need to retrain billions of parameters to make a model smart; you just need to tune the right ones.

An empirical study of parameter-efficient fine-tuning for adapting CLIP to downstream tasks:

Dataset: VTAB-1k
Fine-tuning Strategy
Backbone

Environment Setup

All code is tested with Python 3.9+ and CUDA 11.7/12.0.

# create a new conda environment
conda create -n peft_clip python=3.9
conda activate peft_clip

pip install -r requirements.txt
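
Before training, it can help to confirm that the new environment matches the versions above. The snippet below is a minimal sketch using only the standard library; the helper names `python_ok` and `missing_packages` are hypothetical, and the package names passed in at the bottom (`torch`, `wandb`) are assumptions, since the contents of requirements.txt are not listed here.

```python
import importlib.util
import sys


def python_ok(min_version=(3, 9), current=None):
    """Return True if the interpreter is at least min_version (major, minor)."""
    current = sys.version_info[:2] if current is None else current
    return tuple(current) >= tuple(min_version)


def missing_packages(names):
    """Return the top-level package names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.9+:", python_ok())
    # Assumed package names; adjust to whatever requirements.txt installs.
    print("Missing packages:", missing_packages(["torch", "wandb"]))
```

An empty "Missing packages" list only means the packages are importable; it does not check their versions or the CUDA build.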

Optional: Install and configure wandb for logging and visualization.

pip install wandb
wandb login

Supported Tasks (Datasets) and Backbones

Supported Strategy

Model
CLIP-Adapter
VPT-CLIP-Shallow
VPT-CLIP-Deep
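
To give a feel for how small these strategies are, the sketch below counts trainable parameters for each one. It follows the usual published formulations (VPT-Shallow learns prompt tokens for the first transformer layer only, VPT-Deep learns a separate set for every layer, and CLIP-Adapter trains a two-layer bottleneck on the output features), which may differ from the exact implementation in this repo. The concrete numbers are illustrative assumptions: a ViT-B/16 vision tower (width 768, 12 layers, 512-dimensional output features), 10 prompt tokens, and a reduction factor of 4.

```python
def vpt_shallow_params(num_prompts, embed_dim):
    """Prompt tokens are learned only for the input of the first layer."""
    return num_prompts * embed_dim


def vpt_deep_params(num_prompts, embed_dim, num_layers):
    """A separate set of prompt tokens is learned for every layer."""
    return num_prompts * embed_dim * num_layers


def bottleneck_adapter_params(feature_dim, reduction, bias=False):
    """Two linear layers: feature_dim -> feature_dim // reduction -> feature_dim."""
    hidden = feature_dim // reduction
    weights = feature_dim * hidden + hidden * feature_dim
    biases = (hidden + feature_dim) if bias else 0
    return weights + biases


if __name__ == "__main__":
    # Assumed, illustrative settings (not taken from this repo's configs).
    width, layers, feature_dim = 768, 12, 512
    prompts, reduction = 10, 4

    print("VPT-CLIP-Shallow:", vpt_shallow_params(prompts, width))
    print("VPT-CLIP-Deep:   ", vpt_deep_params(prompts, width, layers))
    print("CLIP-Adapter:    ", bottleneck_adapter_params(feature_dim, reduction))
```

Under these assumptions every strategy trains well under a million parameters, and the counts exclude any task-specific classification head.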

Running

# --data:     dataset (task) name from the table in Supported Tasks
# --backbone: backbone architecture from the table of supported backbones
# --model:    strategy name from the table in Supported Strategy
# --type:     inference type, either "vision" or "vision-language"
# --shots:    number of shots
# --seeds:    seed value for reproducibility
python train.py \
      --data "<dataset_name>" \
      --backbone "<backbone_name>" \
      --model "<strategy_name>" \
      --type "<inference_type>" \
      --shots "<num_shots>" \
      --seeds "<seed>"
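
Comparing strategies usually means repeating the command above over several strategies, shot counts, and seeds. The following is a minimal sketch of a dry-run sweep that only prints the commands; `build_command` and `sweep_commands` are hypothetical helpers, not part of this repo. It assumes that `--model` accepts the strategy names exactly as written in Supported Strategy and that `--seeds` takes a single value per run. The dataset and backbone placeholders are left unfilled on purpose, and the shot counts and seeds are example values.

```python
import shlex
from itertools import product


def build_command(data, backbone, model, inference_type, shots, seed):
    """Build the argument list for one train.py run using the flags shown above."""
    if inference_type not in ("vision", "vision-language"):
        raise ValueError(f"unknown inference type: {inference_type!r}")
    return [
        "python", "train.py",
        "--data", data,
        "--backbone", backbone,
        "--model", model,
        "--type", inference_type,
        "--shots", str(shots),
        "--seeds", str(seed),
    ]


def sweep_commands(data, backbone, models, inference_type, shots_list, seeds):
    """Yield one command per (model, shots, seed) combination."""
    for model, shots, seed in product(models, shots_list, seeds):
        yield build_command(data, backbone, model, inference_type, shots, seed)


if __name__ == "__main__":
    # Assumption: these strings are valid --model values as listed.
    models = ["CLIP-Adapter", "VPT-CLIP-Shallow", "VPT-CLIP-Deep"]
    for cmd in sweep_commands("<dataset_name>", "<backbone_name>",
                              models, "vision", [1, 4], [0, 1]):
        print(shlex.join(cmd))  # dry run: print only, do not execute
```

To launch the runs instead of printing them, each argument list can be passed to `subprocess.run(cmd, check=True)` once the placeholders are replaced with real names.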