The same reference under identity-preserving contextual changes and identity-changing edits with similar context.
Many vision tasks depend on whether two images depict the same visual identity under changing conditions. Yet most existing metrics capture general similarity, not identity consistency.
ID-Sim is a feed-forward similarity metric designed for identity-focused evaluation. It stays stable under identity-preserving changes like pose, viewpoint, background, and lighting, while responding to true identity changes. Across 49 evaluation setups, it outperforms prior methods in 48.
This behavior reflects a key requirement: invariance to contextual changes and sensitivity to identity changes.
To make this precise, we adopt a strictly visual definition of identity.
Under this definition, pose, viewpoint, lighting, and background can vary without changing identity, while changes to intrinsic properties alter it.
A useful metric therefore requires selective sensitivity: invariance to context and sensitivity to identity changes.
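As a toy illustration of what selective sensitivity means (not the paper's evaluation protocol), one can compare how far a metric moves under an identity-changing edit versus a context-only change. The embeddings and the `selectivity` helper below are hypothetical; a metric with the desired behavior scores well above 1.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def selectivity(ref, context_variant, identity_variant, dist=cosine_distance):
    """Ratio of identity-change distance to context-change distance.

    A selectively sensitive metric moves far for identity edits and
    barely at all for contextual edits, so the ratio is well above 1.
    """
    return dist(ref, identity_variant) / max(dist(ref, context_variant), 1e-9)

# Toy embeddings: the context variant stays close to the reference,
# while the identity variant points in a clearly different direction.
ref = [1.0, 0.0, 0.2]
ctx = [0.95, 0.05, 0.22]   # pose/lighting change: nearly identical embedding
idn = [0.1, 1.0, -0.3]     # intrinsic change: far-away embedding

print(selectivity(ref, ctx, idn))  # well above 1 for a selective metric
```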
Perceptual metrics emphasize overall appearance, while foundation model embeddings often prioritize semantic alignment. As a result, they can be sensitive to contextual changes such as background or viewpoint, and may fail to distinguish visually similar but distinct instances.
Instance retrieval and re-identification systems address identity, but are often either domain-specific or lack a consistent notion of instance-level identity across settings.
ID-Sim measures identity consistency directly across settings.
Human judgments compared against prior metrics and ID-Sim on fine-grained identity decisions.
Clean supervision for identity is difficult to obtain from natural data alone. Real-world images rarely provide controlled contrast between identity change and contextual variation.
ID-Sim is trained on 10k triplets spanning roughly 10k instances across 10 diverse domains. The training data combines real instance matches with synthetic edits that introduce identity-preserving variation and identity-altering hard negatives.
The data pipeline combines real instance-level datasets with carefully designed generative edits. Real images provide broad domain coverage, while the edits create controlled identity-preserving positives and identity-altering negatives that are difficult to collect cleanly at scale from natural data alone.
Generative edits provide controlled identity-preserving positives and identity-breaking negatives that are difficult to obtain cleanly from real data alone.
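A minimal sketch of how one such training triplet might be represented. The field names and file paths are illustrative only, not the paper's actual schema: the positive is an identity-preserving variant (real second view or contextual edit) and the negative is an identity-altering edit with similar context.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IdentityTriplet:
    """One hypothetical training example.

    anchor:   a real image of some instance
    positive: same identity, context changed (pose, lighting, background)
    negative: similar context but different identity, e.g. an
              identity-altering generative edit of the anchor
    """
    anchor: str
    positive: str
    negative: str
    domain: str

# Illustrative paths; not from the actual dataset.
triplet = IdentityTriplet(
    anchor="shoe_0413/view_a.jpg",
    positive="shoe_0413/relit_edit.jpg",
    negative="shoe_0413/identity_edit.jpg",
    domain="products",
)
print(triplet.domain)
```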
The model uses a DINOv3 ViT-L backbone with lightweight projection heads and LoRA adapters. Training combines a global CLS contrastive objective for holistic identity discrimination with a patch-level contrastive objective for fine-grained local correspondence.
Global and patch-level supervision together encourage both holistic identity discrimination and fine-grained correspondence.
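A pure-Python sketch of what such a dual objective could look like, using an InfoNCE-style contrastive loss at both levels. The loss form, temperature, and patch weighting here are assumptions for illustration; the paper's actual objective may differ.

```python
import math

def info_nce(similarities, positive_index, temperature=0.07):
    """InfoNCE loss for one anchor, given its similarities to all candidates."""
    logits = [s / temperature for s in similarities]
    m = max(logits)                       # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[positive_index] / sum(exps))

def combined_loss(cls_sims, cls_pos, patch_sims_rows, patch_pos, patch_weight=1.0):
    """Global CLS objective plus a mean patch-level objective.

    cls_sims:        anchor CLS similarities to all candidates
    patch_sims_rows: one similarity row per anchor patch
    patch_weight:    relative weight of the patch term (illustrative value)
    """
    global_term = info_nce(cls_sims, cls_pos)
    patch_term = sum(
        info_nce(row, p) for row, p in zip(patch_sims_rows, patch_pos)
    ) / len(patch_sims_rows)
    return global_term + patch_weight * patch_term

# The positive (index 0) is far more similar than the negatives at both
# the global and the patch level, so the total loss is small.
loss = combined_loss(
    cls_sims=[0.9, 0.1, -0.2],
    cls_pos=0,
    patch_sims_rows=[[0.8, 0.0, -0.1], [0.7, 0.2, 0.0]],
    patch_pos=[0, 0],
)
print(round(loss, 3))
```

The two terms pull in complementary directions: the CLS term separates whole-image identities, while the patch term rewards correct local matches.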
We evaluate ID-Sim across three task types: concept preservation (DreamBench++, Subjects2k), instance retrieval (PODS, DeepFashion2), and re-identification (PetFace, AerialCattle, CUTE). Baselines include perceptual metrics such as LPIPS and DreamSim, foundation models such as DINOv3 and CLIP, and a strong universal embedding retrieval model. Gains are largest when matching the same instance across strong context shifts while separating extremely similar but distinct identities.
Across concept preservation, instance retrieval, and re-identification tasks, ID-Sim outperforms perceptual metrics, foundation embeddings, and strong supervised retrieval baselines in 48 of 49 cases.
Beyond a global similarity score, ID-Sim learns strong patch-level embeddings. These local features outperform DINOv3 patch features across identity-focused evaluations and transfer effectively to personalized segmentation in PerSAM.
The learned representation is useful both globally for comparison and locally for correspondence.
Select a reference shoe to compare against the shared scene. The overlay shows where ID-Sim finds the strongest local correspondence.
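An overlay like this can be approximated by scoring, for each scene patch, its best cosine similarity against all reference patches. The sketch below uses toy 2-D patch embeddings and a hypothetical `correspondence_map` helper; the page's actual visualization pipeline may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two patch embeddings."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def correspondence_map(ref_patches, scene_patches):
    """For each scene patch, its best similarity to any reference patch.

    High values mark scene regions that locally match the reference
    object; these are what a heatmap overlay would highlight.
    """
    return [max(cosine(s, r) for r in ref_patches) for s in scene_patches]

# Toy embeddings: scene patch 1 closely matches the reference patches.
ref_patches = [[1.0, 0.0], [0.9, 0.1]]
scene_patches = [[0.0, 1.0], [1.0, 0.05], [-1.0, 0.0]]
heat = correspondence_map(ref_patches, scene_patches)
best = max(range(len(heat)), key=heat.__getitem__)
print(best)  # index of the scene patch with the strongest correspondence
```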
The paper performs a controlled analysis across four factors: identity, background, viewpoint, and lighting. Compared with prior metrics, ID-Sim shows the most desirable pattern: high sensitivity to identity changes and relatively low sensitivity to contextual variation.
ID-Sim achieves the strongest trade-off: high sensitivity to identity changes with low sensitivity to background, viewpoint, and lighting.
@misc{chae2026idsimidentityfocusedsimilaritymetric,
  title         = {ID-Sim: An Identity-Focused Similarity Metric},
  author        = {Julia Chae and Nicholas Kolkin and Jui-Hsien Wang and Richard Zhang and Sara Beery and Cusuh Ham},
  year          = {2026},
  eprint        = {2604.05039},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2604.05039},
}