FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Haohang Xu, Lin Liu *, Zhibo Zhang, Rong Cong, Xiaopeng Zhang, Qi Tian
Huawei Inc.
* Project Leader     Corresponding Author
ECCV 2026

Abstract

Diffusion-based image editing models have achieved significant progress in real-world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Because these models also regenerate the entire image, they struggle to keep the background consistent. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we use bounding boxes as guidance to explicitly define the editing target. This approach lets the diffusion model localize the target accurately while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to use spatial conditions more effectively. To support this high-precision guidance, we present FineEdit-1.2M, a large-scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to evaluate region-based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.
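
To make the notion of background preservation concrete, the following is a minimal sketch of how a bounding box separates an edit region from the background, and how change outside the box could be measured. It is an illustration only, not the metric used in FineEdit-Bench: the box convention `(x0, y0, x1, y1)` with exclusive upper bounds, the function names `box_to_mask` and `background_mae`, and the use of mean absolute error are all assumptions. Images are plain nested lists to keep the sketch dependency-free.

```python
from typing import List, Tuple

# Hypothetical conventions for this sketch (not taken from the paper):
# a box is (x0, y0, x1, y1) with x1 and y1 exclusive,
# and an image is a grayscale grid stored as rows of floats.
Box = Tuple[int, int, int, int]
Image = List[List[float]]


def box_to_mask(height: int, width: int, box: Box) -> List[List[bool]]:
    """Rasterize a box into a boolean mask that is True inside the edit region."""
    x0, y0, x1, y1 = box
    return [[x0 <= x < x1 and y0 <= y < y1 for x in range(width)]
            for y in range(height)]


def background_mae(source: Image, edited: Image, box: Box) -> float:
    """Mean absolute pixel change outside the box (0.0 means background untouched)."""
    height, width = len(source), len(source[0])
    mask = box_to_mask(height, width, box)
    diffs = [abs(source[y][x] - edited[y][x])
             for y in range(height) for x in range(width)
             if not mask[y][x]]
    return sum(diffs) / len(diffs) if diffs else 0.0


source = [[0.0] * 4 for _ in range(4)]
local_edit = [row[:] for row in source]
local_edit[1][1] = 1.0                        # change one pixel inside the box
global_edit = [[0.5] * 4 for _ in range(4)]   # every pixel regenerated

box = (1, 1, 3, 3)
print(background_mae(source, local_edit, box))   # 0.0
print(background_mae(source, global_edit, box))  # 0.5
```

In this toy case the localized edit leaves the background score at zero, while the globally regenerated image is penalized even though both changed the boxed region.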

Object Removal

Object Replacement

Object Style

Addition & Compose

Method Overview

Overview of the FineEdit framework, which comprises two training stages: (a) the pre-training stage establishes multi-level spatial priors using early and deep fusion; (b) the post-training stage applies reinforcement learning with a novel decoupled reward function.
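
The caption does not specify how the bounding box is injected. As a rough illustration of one common reading of early fusion, the sketch below rasterizes the box into a binary map and appends it to the image as an extra input channel. This is an assumption for illustration, not the FineEdit architecture: the names `box_channel` and `early_fuse`, the channel-first nested-list layout, and the box convention `(x0, y0, x1, y1)` with exclusive upper bounds are all hypothetical. Deep fusion is not sketched here.

```python
from typing import List, Tuple

# Hypothetical conventions for this sketch (not taken from the paper):
# a box is (x0, y0, x1, y1) with exclusive upper bounds, and an image is a
# channel-first list of 2D grids, i.e. image[channel][row][column].
Box = Tuple[int, int, int, int]
Channel = List[List[float]]


def box_channel(height: int, width: int, box: Box) -> Channel:
    """Build a binary map that is 1.0 inside the box and 0.0 elsewhere."""
    x0, y0, x1, y1 = box
    return [[1.0 if (x0 <= x < x1 and y0 <= y < y1) else 0.0
             for x in range(width)]
            for y in range(height)]


def early_fuse(image: List[Channel], box: Box) -> List[Channel]:
    """Append the box map as one extra channel; the input image is not modified."""
    height, width = len(image[0]), len(image[0][0])
    return image + [box_channel(height, width, box)]


# A 3-channel image with 2 rows and 3 columns.
rgb = [[[0.2] * 3 for _ in range(2)] for _ in range(3)]
fused = early_fuse(rgb, (1, 0, 3, 1))

print(len(fused))   # 4
print(fused[3])     # [[0.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
```

With this kind of input-level conditioning, the spatial cue is available to the network from its first layer, at the cost of widening the input by one channel.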

U-GAF Pipeline

The proposed U-GAF Pipeline (Unified Generation, Annotation, and Filtering) consists of four synergistic stages for high-quality data synthesis: (1) Data Curation, (2) Data Annotation, (3) Edit Generation, and (4) Data Refinement.