The same reference under identity-preserving contextual changes and identity-changing edits with similar context.
Many vision tasks depend on whether two images depict the same visual identity under changing conditions. Yet most existing metrics capture general similarity, not identity consistency.
ID-Sim is a feed-forward similarity metric designed for identity-focused evaluation. It stays stable under identity-preserving changes like pose, viewpoint, background, and lighting, while responding to true identity changes. Across 49 evaluation setups, it outperforms prior methods in 48.
This behavior reflects a key requirement: invariance to contextual changes and sensitivity to identity changes.
To make this precise, we adopt a strictly visual definition of identity.
Under this definition, pose, viewpoint, lighting, and background can vary without changing identity, while changes to intrinsic properties alter it.
A useful metric therefore requires selective sensitivity: invariance to context and sensitivity to identity changes.
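As a toy illustration of what selective sensitivity means (not the paper's evaluation protocol), one can compare how far a metric moves under an identity-changing edit versus a context-only change. The embeddings and the `selectivity` helper below are hypothetical; a metric with the desired behavior scores well above 1.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def selectivity(ref, context_variant, identity_variant, dist=cosine_distance):
    """Ratio of identity-change distance to context-change distance.

    A selectively sensitive metric moves far for identity edits and
    barely at all for contextual edits, so the ratio is well above 1.
    """
    return dist(ref, identity_variant) / max(dist(ref, context_variant), 1e-9)

# Toy embeddings: the context variant stays close to the reference,
# while the identity variant points in a clearly different direction.
ref = [1.0, 0.0, 0.2]
ctx = [0.95, 0.05, 0.22]   # pose/lighting change: nearly identical embedding
idn = [0.1, 1.0, -0.3]     # intrinsic change: far-away embedding

print(selectivity(ref, ctx, idn))  # well above 1 for a selective metric
```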
Perceptual metrics emphasize overall appearance, while foundation model embeddings often prioritize semantic alignment. As a result, they can be sensitive to contextual changes such as background or viewpoint, and may fail to distinguish visually similar but distinct instances.
Instance retrieval and re-identification systems address identity, but are often either domain-specific or lack a consistent notion of instance-level identity across settings.
ID-Sim measures identity consistency directly across settings.
Human judgments compared against prior metrics and ID-Sim on fine-grained identity decisions.
Clean supervision for identity is difficult to obtain from natural data alone. Real-world images rarely provide controlled contrast between identity change and contextual variation.
ID-Sim is trained on 10k triplets spanning roughly 10k instances across 10 diverse domains. The training data combines real instance matches with synthetic edits that introduce identity-preserving variation and identity-altering hard negatives.
The data pipeline combines real instance-level datasets with carefully designed generative edits. Real images provide broad domain coverage, while the edits create controlled identity-preserving positives and identity-altering negatives that are difficult to collect cleanly at scale from natural data alone.
Generative edits provide controlled identity-preserving positives and identity-breaking negatives that are difficult to obtain cleanly from real data alone.
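A minimal sketch of how one such training triplet might be represented. The field names and file paths are illustrative only, not the paper's actual schema: the positive is an identity-preserving variant (real second view or contextual edit) and the negative is an identity-altering edit with similar context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IdentityTriplet:
    """One hypothetical training example.

    anchor:   a real image of some instance
    positive: same identity, context changed (pose, lighting, background)
    negative: similar context but different identity, e.g. an
              identity-altering generative edit of the anchor
    """
    anchor: str
    positive: str
    negative: str
    domain: str

# Illustrative paths; not from the actual dataset.
triplet = IdentityTriplet(
    anchor="shoe_0413/view_a.jpg",
    positive="shoe_0413/relit_edit.jpg",
    negative="shoe_0413/identity_edit.jpg",
    domain="products",
)
print(triplet.domain)
```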
The model uses a DINOv3 ViT-L backbone with lightweight projection heads and LoRA adapters. Training combines a global CLS contrastive objective for holistic identity discrimination with a patch-level contrastive objective for fine-grained local correspondence.
Global and patch-level supervision together encourage both holistic identity discrimination and fine-grained correspondence.
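A pure-Python sketch of what such a dual objective could look like, using an InfoNCE-style contrastive loss at both levels. The loss form, temperature, and patch weighting here are assumptions for illustration; the paper's actual objective may differ.

```python
import math

def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one anchor, given its similarities to all candidates."""
    logits = [s / temperature for s in similarities]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[positive_index] / sum(exps))

def combined_loss(cls_sims, cls_pos, patch_sims_rows, patch_pos, patch_weight=1.0):
    """Global CLS objective plus a mean patch-level objective.

    cls_sims:        anchor CLS similarities to all candidates
    patch_sims_rows: one similarity row per anchor patch
    patch_weight:    relative weight of the patch term (illustrative value)
    """
    global_term = info_nce(cls_sims, cls_pos)
    patch_term = sum(
        info_nce(row, p) for row, p in zip(patch_sims_rows, patch_pos)
    ) / len(patch_sims_rows)
    return global_term + patch_weight * patch_term

# The positive (index 0) is far more similar than the negatives at both
# the global and the patch level, so the total loss is small.
loss = combined_loss(
    cls_sims=[0.9, 0.1, -0.2],
    cls_pos=0,
    patch_sims_rows=[[0.8, 0.0, -0.1], [0.7, 0.2, 0.0]],
    patch_pos=[0, 0],
)
print(round(loss, 3))
```

The two terms pull in complementary directions: the CLS term separates whole-image identities, while the patch term rewards correct local matches.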
We evaluate ID-Sim across three task types: concept preservation (DreamBench++, Subjects2k), instance retrieval (PODS, DeepFashion2), and re-identification (PetFace, AerialCattle, CUTE). Baselines include perceptual metrics such as LPIPS and DreamSim, foundation models such as DINOv3 and CLIP, and a strong universal embedding retrieval model. Gains are largest when matching the same instance across strong context shifts while separating extremely similar but distinct identities.
Across concept preservation, instance retrieval, and re-identification tasks, ID-Sim outperforms perceptual metrics, foundation embeddings, and strong supervised retrieval baselines in 48 of 49 cases.
Beyond a global similarity score, ID-Sim learns strong patch-level embeddings. These local features outperform DINOv3 patch features across identity-focused evaluations and transfer effectively to personalized segmentation in PerSAM.
The learned representation is useful both globally for comparison and locally for correspondence.
Select a reference shoe to compare against the shared scene. The overlay shows where ID-Sim finds the strongest local correspondence.
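An overlay like this can be approximated by scoring, for each scene patch, its best cosine similarity against all reference patches. The sketch below uses toy 2-D patch embeddings and a hypothetical `correspondence_map` helper; the page's actual visualization pipeline may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two patch embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def correspondence_map(ref_patches, scene_patches):
    """For each scene patch, its best similarity to any reference patch.

    High values mark scene regions that locally match the reference
    object; these are what a heatmap overlay would highlight.
    """
    return [max(cosine(s, r) for r in ref_patches) for s in scene_patches]

# Toy embeddings: scene patch 1 closely matches the reference patches.
ref_patches = [[1.0, 0.0], [0.9, 0.1]]
scene_patches = [[0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]
heat = correspondence_map(ref_patches, scene_patches)
best = max(range(len(heat)), key=heat.__getitem__)
print(best)  # index of the scene patch with the strongest correspondence
```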
The paper performs a controlled analysis across four factors: identity, background, viewpoint, and lighting. Compared with prior metrics, ID-Sim shows the most desirable pattern: high sensitivity to identity changes and relatively low sensitivity to contextual variation.
ID-Sim achieves the strongest trade-off: high sensitivity to identity changes with low sensitivity to background, viewpoint, and lighting.
@misc{chae2026idsimidentityfocusedsimilaritymetric,
  title         = {ID-Sim: An Identity-Focused Similarity Metric},
  author        = {Julia Chae and Nicholas Kolkin and Jui-Hsien Wang and Richard Zhang and Sara Beery and Cusuh Ham},
  year          = {2026},
  eprint        = {2604.05039},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.05039},
}