Nick Jiang*, Amil Dravid*, Alexei A. Efros, Yossi Gandelsman
Controlling high-norm tokens in Vision Transformers. As shown in Darcet et al. (2024), high-norm outlier tokens emerge in ViTs and lead to noisy attention maps (“Original”). By identifying the mechanism responsible for their emergence, we demonstrate that we can shift them to arbitrary positions at test time (“Shifted”). Shifting the outlier tokens outside of the image area mimics register behavior at test-time (“w/ Test-time Register”), resulting in more interpretable attention patterns and downstream performance comparable to models retrained with registers.
Please consider starring ⭐ if you find this repository useful.
We provide OpenCLIP models and LLaVA Llama-3 8B on HuggingFace that include precomputed register neurons and test-time registers. Please visit the links below and take a look at openclip_example.ipynb and llava_demo.ipynb for example usage. Note that LLaVA requires transformers==4.37.0. These models can be further fine-tuned or used for other downstream applications.
| Model | 🤗 Link |
|---|---|
| OpenCLIP ViT-B/16 with test-time register | Link |
| OpenCLIP ViT-L/14 with test-time register | Link |
| LLaVA Llama-3 8B with test-time register | Link |
To access DINOv2 with test-time registers on its own, load the model directly from PyTorch Hub. See an example in dinov2_example.ipynb.
import torch
model = torch.hub.load("nickjiang2378/test-time-registers", model = "dinov2_vitl14_tt_reg")
git clone git@github.com:nickjiang2378/test-time-registers.git
cd test-time-registers
conda env create -f environment.yml
test-time-registers/
├── clip/
│ ├── ...
│ ├── clip_hook_manager.py
│ └── clip_state.py
├── dinov2/
│ ├── ...
│ ├── dinov2_hook_manager.py
│ └── dinov2_state.py
└── shared/
  ├── ...
  ├── algorithms.py
  ├── hook_manager.py
  └── hook_fn.py
Here are the most important files in this repo:
- hook_manager.py: manages all hooks (interventions, logging) registered for the model. CLIPHookManager and Dinov2HookManager are both subclasses of its HookManager class.
- hook_fn.py: contains the hook functions for intervening on register neurons and logging model internals for analysis.
- algorithms.py: contains the algorithm for detecting register neurons.
- clip_state.py / dinov2_state.py: load the model, instantiate the hook manager, and pass along important metadata such as the number of layers.
See register_neurons.ipynb to automatically find register neurons and analyze the effects of intervening upon them with test-time registers.
Many sections rely on access to a corpus of images for use in analysis; our paper uses ImageNet 2012. You can also create a folder of custom images, but the folder should contain only image files (e.g., JPEGs). Once the images are collected or downloaded, assign the folder path to the IMAGENET_PATH variable in the notebook.
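The images-only requirement can be checked before running the notebook. The helper below is a minimal sketch and not part of this repository: the name find_non_image_files and the list of accepted extensions are assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; adjust it to match your corpus.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def find_non_image_files(folder):
    """Return the sorted names of entries in `folder` that are not image files."""
    offenders = []
    for entry in Path(folder).iterdir():
        # Subdirectories and files with other extensions both count as offenders.
        if not entry.is_file() or entry.suffix.lower() not in IMAGE_EXTENSIONS:
            offenders.append(entry.name)
    return sorted(offenders)
```

An empty result indicates that the folder meets the images-only requirement described above.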
To study ViTs beyond CLIP and DINOv2, we recommend creating a new environment to avoid dependency conflicts with DINOv2 and CLIP. Copy the new model's code into this repo (similar to what's done with CLIP and DINOv2), and create two additional files in the new model's folder:
- custom_hook_manager.py: initialize a subclass of the HookManager class in shared/hook_manager.py and fill out the abstract methods, which tell us how to hook into important model components like the MLP, attention heads, etc. See dinov2/dinov2_hook_manager.py for an example.
- custom_state.py: create a load_model_state function that loads in a model based on a config (to specify size, etc.), instantiates the hook manager, and returns metadata like number of layers. See dinov2/dinov2_state.py for an example.
Lastly, you should modify the model code so that extra tokens initialized to zeros can be added to the input sequence. These extra tokens serve as the "test-time" registers that outliers are shifted onto, away from the image patches. In CLIP, we pass in the number of registers during the forward pass. In DINOv2, we set this number as an attribute of the model class. See their respective folders for more details.
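As a framework-free illustration of what adding zero-initialized tokens means, the sketch below appends zero vectors to a token sequence stored as nested lists. The name append_zero_registers is hypothetical, and placing the registers at the end of the sequence is an assumption; in the actual models this happens on tensors inside the forward pass.

```python
def append_zero_registers(tokens, num_registers):
    """Return a new token sequence with `num_registers` zero vectors appended.

    `tokens` is a list of equal-length embedding vectors (lists of floats).
    """
    if num_registers < 0:
        raise ValueError("num_registers must be non-negative")
    if not tokens:
        raise ValueError("tokens must contain at least one embedding")
    dim = len(tokens[0])
    # Each register is its own list so later in-place writes do not alias.
    registers = [[0.0] * dim for _ in range(num_registers)]
    return [list(token) for token in tokens] + registers


# One [CLS] token plus two patch tokens, embedding dimension 3 (toy values).
sequence = [[0.5, 0.1, 0.2], [1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
extended = append_zero_registers(sequence, num_registers=1)
```

The original tokens are copied unchanged, so a model that ignores the extra positions would see the same image tokens as before.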
The hook manager provides access to model internals such as neuron activations, layer outputs, and attention maps. It can also perform interventions such as ablating register neurons or applying functions to layer outputs.
# Declare the hook manager
hook_manager = HookManager(model)
# Reinitialize logged values and hooks
hook_manager.reinit(mode = HookMode.ANALYSIS)
# Add any intervention hooks
hook_manager.intervene_register_neurons(...)
# Finalize hooks - this step actually registers all the hooks in the model
hook_manager.finalize()
# Call the model
run_model(...)
# Access model internals
neuron_activations = hook_manager.get_neuron_activations()  # other accessors follow the same pattern
For a full list of methods, check out shared/hook_manager.py.
reinit(mode) resets saved model internals and removes previous hooks. The mode parameter controls which model internals are logged:
- HookMode.ANALYSIS: Logs comprehensive model internals including:
  - Neuron activations
  - Layer outputs
  - Attention maps (both pre- and post-softmax)
- HookMode.INTERVENE: No logging; only registers intervention hooks if specified
Note: Every registered hook adds overhead, so more hooks mean slower forward passes.
intervene_register_neurons(num_registers: int, neurons_to_ablate: dict, scale: float = 1.0, normal_values: str = "zero")
Registers an intervention on specified neurons. Parameters:
- num_registers: Number of registers used by the model
- neurons_to_ablate: Dictionary mapping layers to lists of neurons
- scale: Multiplier for max activation on test-time register (default: 1.0)
- normal_values: Strategy for modifying image patches:
  - "zero": Set to zero
  - "mean": Set to mean activation
  - "only_outliers": Only modify outlier activations
  - "same": Keep original values
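To make these parameters concrete, the following framework-free sketch applies the intervention to a single neuron's activations across tokens. It is an illustration under stated assumptions, not the code in shared/hook_fn.py: the name shift_to_register is hypothetical, the register is assumed to be the last token, "mean" is assumed to be the mean over the original image-patch activations, and "only_outliers" is left out because its outlier criterion is not described here.

```python
def shift_to_register(activations, scale=1.0, normal_values="zero"):
    """Move one neuron's peak activation onto a trailing register token.

    `activations` holds the neuron's value at each token; the last entry is
    assumed to be the test-time register and the rest are image patches.
    """
    *patches, _register = activations
    if not patches:
        raise ValueError("need at least one image-patch activation")
    peak = max(patches)
    if normal_values == "zero":
        new_patches = [0.0] * len(patches)
    elif normal_values == "mean":
        # Assumption: the mean is taken over the original patch activations.
        new_patches = [sum(patches) / len(patches)] * len(patches)
    elif normal_values == "same":
        new_patches = list(patches)
    else:
        raise ValueError(f"unsupported strategy in this sketch: {normal_values!r}")
    # The register receives the scaled maximum activation.
    return new_patches + [scale * peak]


# Three patch activations (one outlier) followed by a zero-initialized register.
shifted = shift_to_register([0.2, 9.0, 0.4, 0.0], scale=1.0, normal_values="zero")
```

With the default "zero" strategy, the outlier value ends up only on the register while the image patches are cleared.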
finalize() registers all intervention and logging hooks. Log hooks are registered after interventions, so logged internals reflect any changes made by interventions.
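This ordering guarantee can be illustrated with a toy layer that runs its hooks in registration order, mirroring how PyTorch forward hooks behave. ToyLayer and both hook functions are hypothetical and exist only for this sketch; the repository registers real hooks on the model's modules.

```python
class ToyLayer:
    """Doubles its input, then passes the output through hooks in order."""

    def __init__(self):
        self.hooks = []

    def register_hook(self, hook):
        self.hooks.append(hook)

    def __call__(self, values):
        output = [2.0 * v for v in values]
        for hook in self.hooks:
            result = hook(output)
            # As with PyTorch forward hooks, a non-None return replaces the output.
            if result is not None:
                output = result
        return output


logged = []


def zero_first_value(output):
    """Intervention hook: returns a modified copy of the output."""
    return [0.0] + output[1:]


def log_output(output):
    """Logging hook: records what it sees and leaves the output unchanged."""
    logged.append(list(output))


layer = ToyLayer()
layer.register_hook(zero_first_value)  # interventions first
layer.register_hook(log_output)        # logging second
result = layer([1.0, 2.0, 3.0])
```

Because the logging hook runs second, the recorded values already include the intervention's change.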
Test-time registers and register neurons are native to the OpenCLIP HuggingFace models above, so we recommend using those versions for evaluating OpenCLIP (see openclip_example.ipynb). For DINOv2, we provide access to a model integrated with test-time registers via torch hub (see dinov2_example.ipynb). Below are instructions for reproducing the results in our paper. We use the default hyperparameters provided by these repos unless none are defined.
DINOv2 IN-1k Linear Probe (Tables 1,2): We use the original DINOv2 repo which already provides the set of hyperparameters.
OpenCLIP IN-1k Linear Probe (Table 2): We use the linear probing code from here with a learning rate of 0.01 for 10 epochs.
ADE20k Segmentation and NYUv2 Depth Estimation Linear Probe (Table 2): Add the models here, which already provides the set of hyperparameters for both experiments.
OpenCLIP Zero-Shot ImageNet Classification (Table 3): We follow the standard zero-shot protocol here. An example can also be found in openclip_example.ipynb.
ImageNet Zero-Shot Segmentation (Table 4): We use this repo to evaluate ImageNet zero-shot segmentation.
LOST Unsupervised Object Discovery (Table 5): We use the official LOST repo. We sweep over the last four layers of both OpenCLIP and DINOv2. For OpenCLIP, we use the value projection features, and for DINOv2, we use the key features. We find that normalizing the features before the LOST computation can improve results for all methods. Finally, following Darcet et al. (2024), we sweep over a manual bias term for the affinity matrix. For normalized features, we sweep the bias over [-1, 0].
VLM Evaluation (Table 6): We use VLMEvalKit for evaluation using the main eight benchmarks from here.
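For the LOST setup above, the role of feature normalization and the affinity bias can be sketched without any framework. This is an illustration, not the official LOST code: the function names are hypothetical, the feature values are made up, and the sketch assumes the bias is added to the cosine-similarity affinity before thresholding at zero, which may differ from how the official repo applies it.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def patch_degrees(features, bias=0.0):
    """Count, for each patch, how many other patches have affinity + bias > 0."""
    normalized = [l2_normalize(f) for f in features]
    degrees = []
    for i, f_i in enumerate(normalized):
        degree = 0
        for j, f_j in enumerate(normalized):
            if i == j:
                continue  # ignore self-similarity
            affinity = sum(a * b for a, b in zip(f_i, f_j)) + bias
            if affinity > 0:
                degree += 1
        degrees.append(degree)
    return degrees


def select_seed(features, bias=0.0):
    """LOST-style seed: the index of the patch with the lowest degree."""
    degrees = patch_degrees(features, bias)
    return min(range(len(degrees)), key=degrees.__getitem__)


# Three similar "background" patches and one dissimilar "object" patch.
features = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.2], [-1.0, 0.5]]
seed = select_seed(features, bias=0.0)
```

With normalized features every affinity lies in [-1, 1], so a bias in [-1, 0] moves the threshold across the full range of positive similarities, which is consistent with the sweep range given above.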
Please cite our paper as:
@inproceedings{jiangvision,
title={Vision Transformers Don't Need Trained Registers},
author={Jiang, Nick and Dravid, Amil and Efros, Alexei A and Gandelsman, Yossi},
booktitle={arXiv preprint arXiv:2506.08010},
year={2025}
}
