UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment with Part-to-Whole Semantic Representativeness

1Dept. of Electrical and Computer Engineering, 2INMC & IPAI
*These authors contributed equally to this work

Seoul National University, Republic of Korea

CVPR 2026

We present UNCHA, a novel framework for hyperbolic vision-language models that captures part-to-whole semantic representativeness via uncertainty-guided alignment. UNCHA improves hierarchical compositional understanding and achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.

Abstract

While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., a whole scene and its part images) through entailment. However, existing approaches do not account for the fact that parts differ in how semantically representative they are of the whole.

We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, assigning lower uncertainty to parts that are more representative of the whole scene and higher uncertainty to less representative ones. This representativeness is then incorporated into the contrastive objective through uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure of an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks.

Motivation


Varying representativeness of part images to the whole scene. Not all part images equally represent the whole scene. Some parts capture the overall context (e.g., the street), while others correspond to less informative or ambiguous regions (e.g., small objects like traffic signs). As a result, part images differ significantly in how well they reflect the global scene. When all parts are treated equally, the model cannot distinguish more representative regions from less relevant ones, which leads to suboptimal multi-object alignment and inefficient use of the embedding space.

To address this, we model the representativeness of each part as uncertainty—assigning lower uncertainty to more representative parts and higher uncertainty to less informative ones. This enables uncertainty-aware part–whole alignment, improving compositional understanding and overall alignment quality.

Method

UNCHA method overview

Previous approaches have explored hyperbolic representations to capture hierarchical relationships in vision-language models. For instance, MERU focuses on modeling inter-modal entailment between whole scene images and text representations. HyCoCLIP extends this idea by additionally incorporating intra-modal entailment, enabling alignment between part-level and whole-scene representations within the same modality. Building on these works, UNCHA further introduces uncertainty to explicitly quantify how well each part represents the whole scene. By assigning lower uncertainty to more representative parts and higher uncertainty to less informative ones, UNCHA enables adaptive weighting in the contrastive objective, leading to more accurate part–whole alignment. In addition, uncertainty is calibrated through the entailment loss, and entropy regularization is applied to ensure stable and balanced utilization of the hyperbolic embedding space across different uncertainty levels and modalities. Together, these components allow UNCHA to achieve more effective compositional understanding and improved alignment performance.

Results

Zero-shot classification results

Quantitative results of zero-shot image classification. UNCHA (Ours) consistently outperforms prior approaches across all benchmark datasets, demonstrating strong generalization and robustness on downstream tasks. Bold numbers indicate the best performance within each architecture, and $\dagger$ denotes ATMG trained on GRIT.

Zero-shot retrieval and hierarchical classification results

Quantitative results on zero-shot retrieval and hierarchical classification benchmarks. UNCHA (Ours) consistently outperforms prior methods across both retrieval and hierarchical metrics, demonstrating its improved ability to preserve the structural hierarchy of the class labels within the embedding space, partly due to the uncertainty-guided alignment.

Multi-label classification results

Quantitative results on multi-object representation and multi-label classification. UNCHA (Ours) consistently achieves higher mAP across varying object configurations and datasets, demonstrating its effectiveness in compositional multi-object understanding. UNCHA outperforms all baselines on both multi-label classification and multi-object representation benchmarks, indicating that our uncertainty-aware modeling provides substantially stronger compositional understanding. These results highlight UNCHA's ability to better disentangle object-level semantics and maintain robust alignment in complex multi-object scenes.

Analysis

Uncertainty modeling analysis

Analysis of uncertainty modeling. (a) Part images are sorted by uncertainty (low to high), where more semantically representative parts exhibit lower uncertainty, while less informative or blurred parts show higher uncertainty. (b) On an ImageNet subset, part-to-whole similarity shows a strong negative correlation with uncertainty ($r = -0.739$), indicating that less representative parts are assigned higher uncertainty.

Embedding distribution analysis

Analysis of embedding distribution. (a) Visualization on the COCO val2017 dataset. We compare the embedding distributions of HyCoCLIP and UNCHA using HoroPCA based on their distance from the origin. HyCoCLIP embeddings tend to lie closer to the origin, whereas UNCHA embeddings are positioned farther away in the hyperbolic space. In addition, UNCHA produces a more dispersed distribution with reduced overlap between part and whole representations, indicating more effective utilization of the available hyperbolic space. (b) Visualization of hyperbolic embedding radii for 10,000 ImageNet images and their randomly cropped parts. In HyCoCLIP, embeddings of whole images and their parts tend to collapse into a narrowly concentrated region, resulting in minimal separation. In contrast, UNCHA produces a more structured geometry, where part embeddings lie closer to the origin while whole-scene embeddings are positioned farther away, forming a clear separation between the two. This behavior is driven by uncertainty calibration and entropy regularization, which encourage more meaningful and well-organized representations.
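The radius statistics in panel (b) can be reproduced in spirit by measuring each embedding's geodesic distance from the hyperboloid origin. Below is a minimal numpy sketch; the curvature value, the lifting convention, and the toy "part vs. whole" data (smaller vs. larger space components) are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def lift(v, curv=1.0):
    """Lift Euclidean space components v onto the hyperboloid of curvature -curv."""
    x0 = np.sqrt(1.0 / curv + np.sum(v * v, axis=-1, keepdims=True))
    return np.concatenate([x0, v], axis=-1)

def hyperbolic_radius(x, curv=1.0):
    """Geodesic distance from the hyperboloid origin (1/sqrt(curv), 0, ..., 0).

    For a point x on the hyperboloid, -curv * <x, origin>_L = sqrt(curv) * x0,
    so the radius reduces to arccosh(sqrt(curv) * x0) / sqrt(curv).
    """
    return np.arccosh(np.clip(np.sqrt(curv) * x[..., 0], 1.0, None)) / np.sqrt(curv)

# Toy check of the qualitative claim: embeddings with larger space components
# (standing in for whole-scene embeddings) sit farther from the origin than
# embeddings with smaller ones (standing in for part embeddings).
parts  = lift(np.random.default_rng(0).normal(size=(1000, 16)) * 0.2)
wholes = lift(np.random.default_rng(1).normal(size=(1000, 16)) * 1.0)
print(hyperbolic_radius(parts).mean() < hyperbolic_radius(wholes).mean())  # True
```

Plotting histograms of these radii for part and whole embeddings gives exactly the kind of separation-versus-collapse comparison shown in panel (b).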

Loss analysis

Analysis of our newly introduced loss terms. (a) Cosine similarity between gradients of different loss components, showing that the uncertainty calibration loss acts as a regularizer by opposing the entailment loss, while the uncertainty-guided contrastive loss remains aligned with the main contrastive objective. (b) Visualization of embedding distributions using HoroPCA on COCO, where the full model exhibits well-structured representations, while removing each loss term leads to degraded alignment or concentrated embeddings.

BibTeX

@inproceedings{kim2026uncha,
  author    = {Kim, Hayeon and Jang, Ji Ha and Kim, Junghun James and Chun, Se Young},
  title     = {UNCHA: Uncertainty-guided Compositional Hyperbolic Alignment with Part-to-Whole Semantic Representativeness},
  booktitle = {CVPR},
  year      = {2026},
}