T-Person-GAN: Text-to-Person image generation with identity-consistency and manifold mix-up

Deyin Liu; Lin Yuanbo Wu; Bo Li; Ye Zhao; Zongyuan Ge; Jian Zhang

doi:10.1016/j.eswa.2025.128178

School of Electronic Information

Research output: Contribution to journal › Article › Peer-reviewed

Abstract

In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fréchet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.
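The mix-up step described in the abstract (linear interpolation between samples of two identities) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state where the mixing is applied, how the coefficient is chosen, or the exact loss. The function names `mix_up` and `sample_lambda`, the toy feature vectors, the use of soft (interpolated) one-hot labels, and the Beta(alpha, alpha) sampling borrowed from standard mix-up are all assumptions made for illustration.

```python
import random


def mix_up(feat_a, feat_b, label_a, label_b, lam):
    """Linearly interpolate two feature vectors and their one-hot labels.

    Assumption: labels are mixed with the same coefficient as the features,
    as in standard mix-up; the paper's exact formulation may differ.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(feat_a) != len(feat_b) or len(label_a) != len(label_b):
        raise ValueError("inputs must have matching lengths")
    mixed_feat = [lam * a + (1.0 - lam) * b for a, b in zip(feat_a, feat_b)]
    mixed_label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return mixed_feat, mixed_label


def sample_lambda(alpha=1.0, rng=random):
    """Draw a mixing coefficient from Beta(alpha, alpha) (standard mix-up choice)."""
    return rng.betavariate(alpha, alpha)


# Hypothetical features of two generated images with different identities.
feat_id0 = [1.0, 0.0, 2.0]
feat_id1 = [0.0, 4.0, 2.0]

# A fixed coefficient keeps the example deterministic.
mixed_feat, mixed_label = mix_up(
    feat_id0, feat_id1, [1.0, 0.0], [0.0, 1.0], lam=0.25
)
print(mixed_feat)   # [0.25, 3.0, 2.0]
print(mixed_label)  # [0.25, 0.75]
```

In training, `lam` would typically be drawn with `sample_lambda()` per pair, and a linear classifier on the mixed features would be trained against the mixed labels; that training loop is omitted here because the abstract gives no details about it.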

Original language	English
Article number	128178
Journal	Expert Systems with Applications
Volume	288
DOI	https://doi.org/10.1016/j.eswa.2025.128178
Publication status	Published - 1 Sep 2025

Cite this

@article{3393cc5e69444eafa09ceb6cac1cd78f,

title = "T-Person-GAN: Text-to-Person image generation with identity-consistency and manifold mix-up",

abstract = "In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fr{\'e}chet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.",

keywords = "Conditional generative adversarial networks, Manifold mix-up, Text-to-Person image generation",

author = "Deyin Liu and Wu, {Lin Yuanbo} and Bo Li and Ye Zhao and Zongyuan Ge and Jian Zhang",

note = "Publisher Copyright: {\textcopyright} 2025 Elsevier Ltd",

year = "2025",

month = sep,

day = "1",

doi = "10.1016/j.eswa.2025.128178",

language = "English",

volume = "288",

journal = "Expert Systems with Applications",

issn = "0957-4174",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - T-Person-GAN

T2 - Text-to-Person image generation with identity-consistency and manifold mix-up

AU - Liu, Deyin

AU - Wu, Lin Yuanbo

AU - Li, Bo

AU - Zhao, Ye

AU - Ge, Zongyuan

AU - Zhang, Jian

PY - 2025/9/1

Y1 - 2025/9/1

N2 - In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fréchet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.

AB - In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fréchet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.

KW - Conditional generative adversarial networks

KW - Manifold mix-up

KW - Text-to-Person image generation

UR - http://www.scopus.com/inward/record.url?scp=105006699560&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2025.128178

DO - 10.1016/j.eswa.2025.128178

M3 - Article

AN - SCOPUS:105006699560

SN - 0957-4174

VL - 288

JO - Expert Systems with Applications

JF - Expert Systems with Applications

M1 - 128178

ER -
