T-Person-GAN: Text-to-Person image generation with identity-consistency and manifold mix-up

Deyin Liu; Lin Yuanbo Wu; Bo Li; Ye Zhao; Zongyuan Ge; Jian Zhang

doi:10.1016/j.eswa.2025.128178

School of Electronics and Information

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fréchet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.
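The linear interpolation described for T-Person-GAN-ID-MM can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify the loss, the mixing coefficient, or the layer at which mixing happens, so the helper names `mixup` and `soft_cross_entropy`, the value `lam = 0.3`, the 8-dimensional features, and the random classifier weights `W` are all assumptions made for illustration.

```python
import numpy as np


def mixup(x_a, x_b, y_a, y_b, lam):
    """Linearly interpolate two samples and their one-hot identity labels."""
    x_mix = lam * x_a + (1.0 - lam) * x_b
    y_mix = lam * y_a + (1.0 - lam) * y_b
    return x_mix, y_mix


def soft_cross_entropy(logits, target):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    shifted = logits - logits.max()  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-(target * log_probs).sum())


rng = np.random.default_rng(0)

# Hypothetical feature vectors of two generated images with different identities.
feat_a = rng.normal(size=8)
feat_b = rng.normal(size=8)
y_a = np.array([1.0, 0.0])  # identity A, one-hot
y_b = np.array([0.0, 1.0])  # identity B, one-hot

lam = 0.3  # assumed mixing coefficient
feat_mix, y_mix = mixup(feat_a, feat_b, y_a, y_b, lam)

# Hypothetical linear classifier over the feature space (weights are random here).
W = rng.normal(size=(2, 8))
loss = soft_cross_entropy(W @ feat_mix, y_mix)
```

Because the classifier is linear, its logits for the mixed feature equal the same interpolation of the logits for the two original features, which is why training against the interpolated label encourages a linear decision boundary between the two identities.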

Original language	English
Article number	128178
Journal	Expert Systems with Applications
Volume	288
DOIs	https://doi.org/10.1016/j.eswa.2025.128178
State	Published - 1 Sep 2025

Keywords

Conditional generative adversarial networks
Manifold mix-up
Text-to-Person image generation

Cite this

@article{3393cc5e69444eafa09ceb6cac1cd78f,

title = "T-Person-GAN: Text-to-Person image generation with identity-consistency and manifold mix-up",

abstract = "In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fr{\'e}chet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.",

keywords = "Conditional generative adversarial networks, Manifold mix-up, Text-to-Person image generation",

author = "Deyin Liu and Wu, {Lin Yuanbo} and Bo Li and Ye Zhao and Zongyuan Ge and Jian Zhang",

note = "Publisher Copyright: {\textcopyright} 2025 Elsevier Ltd",

year = "2025",

month = sep,

day = "1",

doi = "10.1016/j.eswa.2025.128178",

language = "English",

volume = "288",

journal = "Expert Systems with Applications",

issn = "0957-4174",

publisher = "Elsevier Ltd",

}

TY - JOUR

T1 - T-Person-GAN

T2 - Text-to-Person image generation with identity-consistency and manifold mix-up

AU - Liu, Deyin

AU - Wu, Lin Yuanbo

AU - Li, Bo

AU - Zhao, Ye

AU - Ge, Zongyuan

AU - Zhang, Jian

PY - 2025/9/1

Y1 - 2025/9/1

AB - In this paper, we introduce an end-to-end solution for generating high-resolution person images based solely on textual descriptions. While text-to-image models have made great strides in generating images of objects like flowers and birds, creating person images presents a unique set of challenges: 1) Identity Consistency: For the same person, it's crucial that the generated images exhibit visual details that maintain identity consistency. This means that features like identity-related textures, clothing, and even footwear should be consistent across different images of the same person. 2) Discriminative Power: The generated person images need to be robust in the face of inter-person variations caused by visual ambiguities. To tackle these challenges, we propose a generative model that leverages two novel mechanisms: 1) T-Person-GAN-ID: This mechanism integrates a one-stream generator with an identity-preserving network. It regularizes the representations of generated data in their feature space to ensure identity-consistency. This ensures that images of the same person maintain their unique identity-related features. 2) T-Person-GAN-ID-MM: Manifold mix-up is introduced to create mixed images, which involves linear interpolation between generated images from different manifold identities. We further enforce these interpolated images to be linearly classified in the feature space, essentially learning a linear classification boundary that can perfectly separate images from two distinct identities. The proposed method demonstrates a significant improvement in the challenging task of generating person images from text descriptions. We achieve impressive results with a Fréchet Inception Distance of 47.81, an Inception Score of 3.96, and a Visual-Semantic Similarity of 0.21 on the benchmark dataset.

KW - Conditional generative adversarial networks

KW - Manifold mix-up

KW - Text-to-Person image generation

UR - http://www.scopus.com/inward/record.url?scp=105006699560&partnerID=8YFLogxK

U2 - 10.1016/j.eswa.2025.128178

DO - 10.1016/j.eswa.2025.128178

M3 - Article

AN - SCOPUS:105006699560

SN - 0957-4174

VL - 288

JO - Expert Systems with Applications

JF - Expert Systems with Applications

M1 - 128178

ER -
