Adaptive Contextual Feature Grafting and Hierarchical Structure-Aware Initialization for Training-Free Subject-Driven Text-to-Image Generation

Salma Ali; Noah Fang

doi:10.20944/preprints202512.1688.v1

Submitted: 17 December 2025

Posted: 18 December 2025


Abstract

Subject-driven text-to-image (T2I) generation must balance subject fidelity against text alignment, and traditional fine-tuning approaches are inefficient. We introduce ContextualGraftor, a novel training-free framework for robust subject-driven T2I generation built on the FLUX.1-dev multimodal diffusion transformer. It integrates two core innovations: Adaptive Contextual Feature Grafting (ACFG) and Hierarchical Structure-Aware Initialization (HSAI). ACFG enhances feature matching in attention layers through a lightweight contextual attention module that dynamically modulates reference feature contributions based on local semantic consistency, which promotes natural integration and reduces semantic mismatches. HSAI provides a structurally rich starting point by employing multi-scale structural alignment during latent inversion together with an adaptive dropout strategy, preserving both global geometry and fine-grained subject details. Comprehensive experiments demonstrate that ContextualGraftor outperforms state-of-the-art training-free methods such as FreeGraftor across key metrics. Our method also maintains competitive inference efficiency, making it a practical solution for integrating subjects seamlessly into diverse, text-prompted environments.

Keywords: text-to-image generation; subject-driven; diffusion transformer; training-free; feature grafting

Subject: Computer Science and Mathematics - Computer Vision and Graphics

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
