TY - JOUR
T1 - DICE
T2 - Tuning-Free Dynamic High-Fidelity Identity Customization and Enhancement using Multi-Modal Contrastive Fusion for Consumer Devices
AU - Khowaja, Sunder Ali
AU - Pathan, Muhammad Salman
AU - Dev, Kapal
AU - Lee, Ik Hyun
N1 - Publisher Copyright:
© 1975-2011 IEEE.
PY - 2025
Y1 - 2025
N2 - High-Fidelity (HiFi) identity customization with text-to-image generation has attracted significant interest from industry, consumers, researchers, and digital content creators. Such generative models can personalize images with pretrained diffusion models without extensive fine-tuning. However, existing works often compromise either high fidelity or the generative behavior of the original model due to the computational constraints of training identity customization on consumer electronic devices. Furthermore, when using auxiliary images for fusion, existing models often compromise identity customization. To address this, we propose Dynamic high-fidelity Identity Customization and Enhancement (DICE), which integrates a vision transformer (ViT) that handles both facial and non-facial images to extract semantic features, a dynamic multi-modal contrastive fusion strategy, a denoising diffusion model, and a composite loss function. DICE leverages evolved feature extraction, multi-scale feature fusion, adaptive contrastive paths, and an adaptive composite loss to achieve high fidelity, editability, and minimal refinement of the base model, even when fusing the base image with an auxiliary one. Such tuning-free identity customization suits consumers' resource-constrained electronic devices, as it requires no retraining, shifting the computational burden to a one-time, server-side training process. Experiments demonstrate that DICE outperforms existing state-of-the-art methods while offering a flexible solution for personalized image generation.
AB - High-Fidelity (HiFi) identity customization with text-to-image generation has attracted significant interest from industry, consumers, researchers, and digital content creators. Such generative models can personalize images with pretrained diffusion models without extensive fine-tuning. However, existing works often compromise either high fidelity or the generative behavior of the original model due to the computational constraints of training identity customization on consumer electronic devices. Furthermore, when using auxiliary images for fusion, existing models often compromise identity customization. To address this, we propose Dynamic high-fidelity Identity Customization and Enhancement (DICE), which integrates a vision transformer (ViT) that handles both facial and non-facial images to extract semantic features, a dynamic multi-modal contrastive fusion strategy, a denoising diffusion model, and a composite loss function. DICE leverages evolved feature extraction, multi-scale feature fusion, adaptive contrastive paths, and an adaptive composite loss to achieve high fidelity, editability, and minimal refinement of the base model, even when fusing the base image with an auxiliary one. Such tuning-free identity customization suits consumers' resource-constrained electronic devices, as it requires no retraining, shifting the computational burden to a one-time, server-side training process. Experiments demonstrate that DICE outperforms existing state-of-the-art methods while offering a flexible solution for personalized image generation.
KW - Diffusion Models
KW - Identity Customization
KW - Multi-Modal Contrastive Fusion
KW - Personalized Image Generation
KW - Vision Transformers
UR - https://www.scopus.com/pages/publications/105019933785
U2 - 10.1109/TCE.2025.3624567
DO - 10.1109/TCE.2025.3624567
M3 - Article
AN - SCOPUS:105019933785
SN - 0098-3063
JO - IEEE Transactions on Consumer Electronics
JF - IEEE Transactions on Consumer Electronics
ER -