Subject-driven text-to-image (T2I) generation faces a persistent trade-off between subject fidelity and text alignment, and traditional fine-tuning approaches are inefficient. We introduce ContextualGraftor, a novel training-free framework for robust subject-driven T2I generation built on the FLUX.1-dev multimodal diffusion transformer. It integrates two core innovations: Adaptive Contextual Feature Grafting (ACFG) and Hierarchical Structure-Aware Initialization (HSAI). ACFG improves feature matching in attention layers through a lightweight contextual attention module that dynamically modulates the contribution of reference features according to local semantic consistency, yielding natural subject integration and fewer semantic mismatches. HSAI provides a structurally rich starting point by combining multi-scale structural alignment during latent inversion with an adaptive dropout strategy, preserving both global geometry and fine-grained subject details. Comprehensive experiments demonstrate that ContextualGraftor achieves superior performance across key metrics, outperforming state-of-the-art training-free methods such as FreeGraftor, while maintaining competitive inference efficiency. This makes it an efficient, high-performance solution for seamlessly integrating subjects into diverse, text-prompted environments.
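To make the ACFG idea concrete, the following is a minimal, illustrative sketch of how reference features might be grafted into an attention layer with a per-token gate driven by local semantic consistency. The abstract does not specify the exact formulation, so the function name `contextual_feature_grafting`, the temperature `tau`, and the cosine-similarity gate are assumptions for illustration only, not the paper's implementation.

```python
import torch
import torch.nn.functional as F


def contextual_feature_grafting(q_gen, k_gen, v_gen, k_ref, v_ref, tau=0.1):
    """Hypothetical sketch of contextual feature grafting in one attention layer.

    q_gen, k_gen, v_gen: (B, N, D) query/key/value tokens of the generation pass.
    k_ref, v_ref:        (B, M, D) key/value tokens cached from the reference subject.
    tau:                 assumed temperature controlling how sharply the gate reacts.
    """
    scale = q_gen.shape[-1] ** -0.5

    # Standard attention over the generation's own tokens.
    attn_self = torch.softmax(q_gen @ k_gen.transpose(-1, -2) * scale, dim=-1)
    out_self = attn_self @ v_gen

    # Attention over the injected reference-subject tokens.
    attn_ref = torch.softmax(q_gen @ k_ref.transpose(-1, -2) * scale, dim=-1)
    out_ref = attn_ref @ v_ref

    # Local semantic consistency: for each query token, take the cosine similarity
    # to its best-matching reference key and squash it into a gate in (0, 1).
    sim = F.cosine_similarity(
        q_gen.unsqueeze(2), k_ref.unsqueeze(1), dim=-1
    ).amax(dim=-1, keepdim=True)          # (B, N, 1)
    gate = torch.sigmoid(sim / tau)

    # The gate decides, per token, how strongly reference features are grafted in.
    return gate * out_ref + (1.0 - gate) * out_self
```

In this reading, tokens that locally match the reference subject receive mostly reference features (preserving subject identity), while background tokens fall back to the ordinary self-attention output, which is one plausible way to obtain the "natural integration and reduced semantic mismatches" described above.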