Improving Performance of Pre-Trained Vision Models via Inference-Time Attention Engineering
July 8, 2025
In today's world, we generate and manage vast amounts of image data through smartphones and cloud services. To search and organize this wealth of images efficiently for our daily tasks and work, accurate image recognition technologies, including image clustering, are essential. Image clustering is the technique of grouping large sets of images based on their similarity. Figure 1 illustrates an application of image clustering technology. When an image is processed by a trained deep learning model, it is transformed into a vector representation. These vectors live in a high-dimensional space known as feature space; the space itself is not interpretable by humans, but the vectors can be grouped using clustering algorithms. Groups formed in feature space are also useful for person recognition in smartphone photo applications.
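As a concrete illustration of this pipeline, the sketch below extracts one feature vector per image with a pre-trained backbone and groups the vectors with k-means. The DINOv2 torch.hub entry point, the random dummy batch, and the choice of 10 clusters are assumptions for illustration, not details from the original work.

```python
import torch
import numpy as np
from sklearn.cluster import KMeans

# Assumed setup: a pre-trained ViT backbone (here DINOv2 via torch.hub) that maps
# each image tensor to one feature vector in feature space.
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14')
model.eval()

def extract_features(images: torch.Tensor) -> np.ndarray:
    """images: (B, 3, H, W), already resized and normalized. Returns (B, D) features."""
    with torch.no_grad():
        feats = model(images)              # one whole-image feature vector per image
    return feats.cpu().numpy()

# Dummy batch just to show the flow; in practice these are real, preprocessed photos.
features = extract_features(torch.randn(32, 3, 224, 224))
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(features)
print(labels)                              # cluster index assigned to each image
```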


In recent years, deep learning models have been pre-trained on large datasets to enhance image recognition accuracy, with the Vision Transformer (ViT) gaining attention as a powerful architecture for such models (e.g., DINOv2[1]). While ViT has demonstrated high performance across various vision tasks, including image recognition, object detection, and image generation, its effectiveness in clustering tasks has been less satisfactory[2]. To improve the clustering accuracy of ViT models, we developed an approach called Inference-Time Attention Engineering (ITAE) [2].
Figure 2 provides an overview of ITAE. The method extends ViT with the components highlighted in yellow. The ViT architecture outputs (1) a feature vector that represents the characteristics of the entire image (referred to as the whole image vector VCLS) and (2) feature vectors corresponding to individual subregions (referred to as the subregion vectors Vi). ITAE improves image clustering performance by automatically identifying anomalies inside the ViT model that degrade the whole image vector VCLS and suppressing their influence.

The inspiration for this research came from earlier work that focused on the individual subregion vectors Vi produced by the ViT model. That study observed that abnormally large output vectors in certain subregions degraded the model's accuracy across various tasks[3], and that mitigating these anomalies improved overall performance. However, because its focus was on the subregion vectors, the accuracy gain for clustering tasks that use the whole image vector VCLS was limited. We therefore directed our attention to internal anomalies within the model to further improve clustering accuracy. In the following sections, we first explain how ViT operates and then describe the method we developed.
Figure 3 depicts the image processing flow in ViT, where the input image is divided into subregions, each producing a feature vector Vi. Simultaneously, a special vector VCLS representing the entire image is generated. These vectors are output after being processed by the Transformer Encoder, which consists of N layers, all sharing a similar structure.
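A minimal PyTorch sketch of this flow is shown below: the image is split into patches that become the subregion vectors Vi, a learned class token VCLS is prepended, and all tokens pass through an encoder with N layers. The dimensions, depth, and use of nn.TransformerEncoder are illustrative choices, not the exact configuration of the models discussed here.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal sketch of the flow in Figure 3 (illustrative, not the production model)."""

    def __init__(self, img_size=224, patch=16, dim=384, depth=12, heads=6):
        super().__init__()
        # Each patch (subregion) is linearly embedded into one token vector Vi.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))            # V_CLS
        num_tokens = (img_size // patch) ** 2 + 1
        self.pos_embed = nn.Parameter(torch.zeros(1, num_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)    # N similar layers

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)          # (B, patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed        # prepend V_CLS
        out = self.encoder(tokens)
        return out[:, 0], out[:, 1:]                                     # V_CLS, all Vi

v_cls, v_i = TinyViT()(torch.randn(1, 3, 224, 224))
print(v_cls.shape, v_i.shape)   # torch.Size([1, 384]) torch.Size([1, 196, 384])
```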

Within the Transformer Encoder, the Multi-Head Attention mechanism that ITAE modifies takes all vectors (VCLS, Vi) as input simultaneously and updates each of them at the output. To clarify this computation, Figure 4 shows a simplified diagram of a single attention head. First, Attention values are computed that measure the relationship between the whole image vector VCLS and every vector (VCLS, Vi) (A). Each Attention value is a single real number expressing how strongly the two vectors are associated. The updated VCLS (C) is then obtained as a weighted sum of all vectors (VCLS, Vi) (B), where the weights are these Attention values.
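The sketch below reproduces this single-head computation for the VCLS query, following the labels (A)-(C) above; the query/key/value projections that precede it in a real ViT are assumed to have already been applied.

```python
import torch

def cls_attention_single_head(q_cls, K, V):
    """One attention head restricted to the V_CLS query (cf. Figure 4).
    q_cls: (d,) query derived from V_CLS; K, V: (n, d) keys and values derived
    from all tokens (V_CLS plus every Vi)."""
    d = q_cls.shape[-1]
    scores = K @ q_cls / d ** 0.5            # (A) one real-valued score per token
    attn = torch.softmax(scores, dim=-1)     # Attention values; they sum to 1
    new_cls = attn @ V                       # (B)->(C) weighted sum of all value vectors
    return new_cls, attn

# Toy usage: 1 CLS token + 196 subregion tokens, 64-dimensional head
K, V, q = torch.randn(197, 64), torch.randn(197, 64), torch.randn(64)
new_cls, attn = cls_attention_single_head(q, K, V)
print(new_cls.shape, attn.sum())             # torch.Size([64]), weights sum to ~1
```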
We studied the behavior, inside Multi-Head Attention, of the subregions where anomalies had previously been noted[3] and found that their vectors were also abnormal within the model: their lengths greatly exceeded those of the other vectors. These internal anomalies negatively affect both the whole image vector VCLS and the individual subregion vectors Vi. We therefore expected that addressing them would further improve clustering accuracy.
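A simple way to flag such tokens is to compare each vector's length against a typical length, as in the sketch below; the factor-times-median rule is an assumed illustrative threshold, not the exact criterion used in the paper.

```python
import torch

def find_high_norm_tokens(tokens: torch.Tensor, factor: float = 3.0) -> torch.Tensor:
    """tokens: (n, d) vectors inside the encoder (V_CLS plus every Vi).
    Returns a boolean mask marking vectors whose length is far above the typical
    length; these correspond to the internal anomalies discussed above."""
    norms = tokens.norm(dim=-1)              # length of each vector
    return norms > factor * norms.median()   # assumed illustrative threshold

mask = find_high_norm_tokens(torch.randn(197, 384))
print(mask.sum().item(), "anomalous tokens flagged")
```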

Rather than eliminating the internal anomalies through retraining, ITAE reduces the weights assigned to the anomalous subregions by lowering their Attention values (as illustrated in Figure 5). In the Multi-Head Attention of the last layer of the Transformer Encoder, the lengths of the vectors before updating are examined to identify the anomalous subregions (D). Their Attention values are then attenuated (E), and the whole image vector VCLS is updated with the adjusted weights (F).
Since the Attention value effectively determines the influence of the different regions on VCLS, decreasing the Attention values of the anomalous regions minimizes their impact, enabling the model to allocate attention more appropriately across the remaining regions. Importantly, this adjustment does not necessitate retraining and can be seamlessly implemented during inference.
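Putting the two steps together, the sketch below shows one reading of this last-layer adjustment for the VCLS query of a single head; the anomaly mask can come from a norm check like the one sketched earlier, and the attenuation factor and re-normalization are assumptions for illustration rather than the authors' exact implementation.

```python
import torch

def itae_cls_attention(q_cls, K, V, anomalous, attenuation=1e-3):
    """Inference-time attenuation of anomalous subregions (cf. Figure 5).
    q_cls: (d,) query from V_CLS; K, V: (n, d) keys/values of all tokens in the
    last encoder layer; anomalous: (n,) boolean mask of high-norm tokens (D)."""
    d = q_cls.shape[-1]
    attn = torch.softmax(K @ q_cls / d ** 0.5, dim=-1)        # ordinary Attention values
    attn = torch.where(anomalous, attn * attenuation, attn)   # (E) suppress anomalous regions
    attn = attn / attn.sum()                                  # keep the weights summing to 1
    return attn @ V                                           # (F) updated V_CLS

# Toy usage: flag high-norm tokens, then recompute V_CLS with attenuated weights
tokens = torch.randn(197, 384)
mask = tokens.norm(dim=-1) > 3.0 * tokens.norm(dim=-1).median()
new_cls = itae_cls_attention(torch.randn(64), torch.randn(197, 64), torch.randn(197, 64), mask)
```

Because only the Attention values are modified, this adjustment fits within a single forward pass and leaves the pre-trained weights untouched.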

Figure 6 overlays on the input image the Attention values that VCLS assigns to each subregion (the values highlighted by the orange boxes in Figure 5). The color gradient from blue to yellow indicates increasing Attention values. The central image shows how the ViT model distributed Attention over the actual image: areas of high Attention appear in the background, where there are no significant objects. Because the anomalous feature vectors tend to lie in exactly these regions of elevated Attention, reducing their Attention values with ITAE leads to a more appropriate Attention distribution.
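Such a visualization can be produced by reshaping the VCLS Attention values into the patch grid and overlaying them on the image, as in the sketch below; the 14x14 grid and colormap are assumptions for illustration.

```python
import torch
import matplotlib.pyplot as plt

def overlay_cls_attention(image, attn, grid=14):
    """image: (H, W, 3) array in [0, 1]; attn: (1 + grid*grid,) Attention values
    of V_CLS toward all tokens, index 0 being V_CLS itself."""
    patch_attn = attn[1:].reshape(1, 1, grid, grid)            # drop the CLS self-attention
    heat = torch.nn.functional.interpolate(
        patch_attn, size=image.shape[:2], mode='bilinear')[0, 0]
    plt.imshow(image)
    plt.imshow(heat.numpy(), cmap='viridis', alpha=0.5)        # blue -> yellow = low -> high
    plt.axis('off')
    plt.show()

# Toy usage with random data
overlay_cls_attention(torch.rand(224, 224, 3).numpy(), torch.softmax(torch.randn(197), dim=0))
```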

Modifying the Attention values with ITAE mitigates the adverse effects of the model's internal anomalies, yielding a better VCLS and improved clustering accuracy in our validation experiments. The experiments used the CIFAR-10[4], CIFAR-100[4], STL-10[5], and Tiny ImageNet[6] datasets, comparing the clustering accuracy of conventional models[1][3] with that of the same models incorporating ITAE. Clustering accuracy measures how well the predicted groupings match the correct groupings, with 100% indicating perfect correspondence. ITAE improved clustering accuracy by an average of 1.66 percentage points over Conventional 1[1] and 5.05 percentage points over Conventional 2[3], indicating that mitigating the internal anomalies in Multi-Head Attention directly improves clustering performance.
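Clustering accuracy of this kind is commonly computed by finding the best one-to-one matching between predicted clusters and ground-truth classes with the Hungarian algorithm; the sketch below shows that standard recipe, which we assume here for illustration rather than quoting the paper's evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Fraction of samples correctly grouped under the best one-to-one mapping
    between predicted cluster ids and ground-truth class ids (100% = perfect)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = max(y_true.max(), y_pred.max()) + 1
    counts = np.zeros((n, n), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        counts[p, t] += 1                                  # co-occurrence of cluster p and class t
    rows, cols = linear_sum_assignment(counts, maximize=True)
    return counts[rows, cols].sum() / len(y_true)

# Example: the predicted cluster ids are a permutation of the true class labels
print(clustering_accuracy([0, 0, 1, 1], [1, 1, 0, 0]))     # -> 1.0
```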
The primary advantages of ITAE lie in its efficiency and ease of use; it does not require retraining or additional training of existing large-scale pre-trained models, and it can run efficiently within a single inference pass, making it an accessible and effective solution for enhancing ViT models' performance.
Clustering accuracy (%) on each dataset:

| Method | CIFAR-10[4] | CIFAR-100[4] | STL-10[5] | Tiny ImageNet[6] |
|---|---|---|---|---|
| Conventional 1[1] | 82.16 | 68.69 | 65.78 | 71.98 |
| Conventional 2[3] | 78.67 | 68.01 | 56.84 | 71.53 |
| Ours | 82.49 | 69.04 | 70.51 | 73.19 |
This achievement was presented at the Asian Conference on Computer Vision 2024 (ACCV 2024)[2].
This material is a partial excerpt and a reconstruction of Ref [2] ©2025 Springer.
Company names, product names, and service names may be trademarks of third-party companies.
References
[1] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research (2024)
[2] Nakamura, K., Nozawa, Y., Lin, Y.C., Nakata, K., Ng, Y.: Improving Image Clustering with Artifacts Attenuation via Inference-Time Attention Engineering. In: Asian Conference on Computer Vision, pp. 277–295. Springer, Singapore (2025)
[3] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision Transformers Need Registers. In: The Twelfth International Conference on Learning Representations (2024)
[4] Krizhevsky, A., Hinton, G.: Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto (2009)
[5] Coates, A., Ng, A., Lee, H.: An Analysis of Single-Layer Networks in Unsupervised Feature Learning. In: Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. pp. 215–223. JMLR Workshop and Conference Proceedings (2011)
[6] Le, Y., Yang, X.: Tiny ImageNet Visual Recognition Challenge. CS 231N 7(7), 3 (2015)