3D MedDiffusion: A 3D Medical Diffusion Model for Controllable and High-quality Medical Image Generation

Haoshen Wang1, Zhentao Liu1, Kaicong Sun1, Xiaodong Wang2, Dinggang Shen1,3,4, Zhiming Cui1
1School of Biomedical Engineering & State Key Laboratory of Advanced Medical Materials and Devices, ShanghaiTech University, Shanghai, China,
2Shanghai United Imaging Healthcare Co., Ltd., Shanghai, China,
3Shanghai Clinical Research and Trial Center, Shanghai, China,
4Shanghai United Imaging Intelligence Co., Ltd., Shanghai, China

Fig. 1. Unconditional generation results across multiple modalities (CT & MRI) and multiple regions (from head to leg).

Abstract

The generation of medical images presents significant challenges due to their high-resolution, three-dimensional nature. Existing methods often yield suboptimal performance in generating high-quality 3D medical images, and there is currently no universal generative framework for medical imaging. In this paper, we introduce the 3D Medical Diffusion (3D MedDiffusion) model for controllable, high-quality 3D medical image generation. 3D MedDiffusion incorporates a novel, highly efficient Patch-Volume Autoencoder that compresses medical images into a latent space through patch-wise encoding and recovers them back into image space through volume-wise decoding. Additionally, we design a new noise estimator to capture both local details and global structure during the diffusion denoising process. 3D MedDiffusion can generate finely detailed, high-resolution images (up to 512×512×512) and adapts effectively to various downstream tasks, as it is trained on large-scale datasets covering CT and MRI modalities and diverse anatomical regions (from head to leg). Experimental results demonstrate that 3D MedDiffusion surpasses state-of-the-art methods in generative quality and exhibits strong generalizability across tasks such as sparse-view CT reconstruction, fast MRI reconstruction, and data augmentation.

How It Works

Our method introduces the Patch-Volume Autoencoder, which compresses images into a latent space in a patch-wise manner and decodes them volume-wise. In the latent space, we perform diffusion and denoising processes. The proposed noise estimator, BiFlowNet, denoises the latent representation through two branches: the intra-patch flow, which independently restores each latent patch, and the inter-patch flow, which reconstructs the entire latent volume cohesively.
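For concreteness, below is a minimal PyTorch sketch of the patch-wise encode / volume-wise decode idea behind the Patch-Volume Autoencoder. It is not the released implementation: the layer widths, latent channel count, patch size, and 8× compression factor are illustrative assumptions.

```python
import torch
import torch.nn as nn


class PatchVolumeAutoencoder(nn.Module):
    def __init__(self, channels=32, latent_channels=4, patch_size=64):
        super().__init__()
        self.patch_size = patch_size
        self.latent_channels = latent_channels
        self.down = 8  # three stride-2 stages -> 8x spatial compression per axis
        # Encoder: compresses a small 3D patch into a compact latent.
        self.encoder = nn.Sequential(
            nn.Conv3d(1, channels, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(channels, channels * 2, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv3d(channels * 2, latent_channels, 4, stride=2, padding=1),
        )
        # Decoder: maps latents back to image space; after stage-two fine-tuning
        # it decodes the stitched latent volume jointly (see Fig. 2).
        self.decoder = nn.Sequential(
            nn.ConvTranspose3d(latent_channels, channels * 2, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(channels * 2, channels, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose3d(channels, 1, 4, stride=2, padding=1),
        )

    def encode_patchwise(self, volume):
        """Split a (B, 1, D, H, W) volume into patches, encode each one
        independently, and stitch the latents into a single latent volume."""
        p, s = self.patch_size, self.down
        b, _, d, h, w = volume.shape
        latents = torch.zeros(b, self.latent_channels, d // s, h // s, w // s,
                              device=volume.device)
        for zi in range(0, d, p):
            for yi in range(0, h, p):
                for xi in range(0, w, p):
                    patch = volume[:, :, zi:zi + p, yi:yi + p, xi:xi + p]
                    latents[:, :, zi // s:(zi + p) // s,
                                  yi // s:(yi + p) // s,
                                  xi // s:(xi + p) // s] = self.encoder(patch)
        return latents

    def decode_volumewise(self, latents):
        """Decode the full latent volume in one pass so patch borders stay consistent."""
        return self.decoder(latents)


# Example: a 128^3 volume is encoded patch-by-patch and decoded as a whole.
ae = PatchVolumeAutoencoder()
vol = torch.randn(1, 1, 128, 128, 128)
z = ae.encode_patchwise(vol)      # -> (1, 4, 16, 16, 16)
recon = ae.decode_volumewise(z)   # -> (1, 1, 128, 128, 128)
```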

Fig. 2. Patch-Volume Autoencoder. In the first stage, the model is trained solely to compress and reconstruct small patches from high-resolution volumes. In the second stage, all parameters are fixed except for the decoder, which is fine-tuned on high-resolution volumes to become a joint decoder.
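A minimal sketch of the second-stage recipe in Fig. 2, assuming a generic encoder/decoder pair: everything trained in stage one is frozen, and only the decoder is fine-tuned on full volumes. The toy modules and the L1 reconstruction loss are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Stand-ins for the stage-one pretrained encoder and decoder (illustrative only).
encoder = nn.Sequential(nn.Conv3d(1, 8, 4, stride=2, padding=1))
decoder = nn.Sequential(nn.ConvTranspose3d(8, 1, 4, stride=2, padding=1))

# Stage two: freeze the encoder; only the decoder receives gradients.
for p in encoder.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)
volume = torch.randn(1, 1, 64, 64, 64)   # a full training volume (not a patch)

with torch.no_grad():
    latent = encoder(volume)              # patch-wise in practice; whole volume here for brevity
recon = decoder(latent)                   # joint, volume-wise decoding
loss = nn.functional.l1_loss(recon, volume)
loss.backward()
optimizer.step()
```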
Fig. 3. BiFlowNet noise estimator. The intra-patch flow focuses on denoising each patch and recovering fine-grained local details, while the inter-patch flow is designed to capture and reconstruct the global structures across the entire volume.
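To make the two-branch structure concrete, here is a toy sketch of the idea. The actual BiFlowNet is a diffusion noise estimator (each flow conditioned on the timestep, with far richer blocks and fusion); in this sketch each branch is a plain conv stack and the fusion is a simple sum, all of which are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ToyBiFlowNet(nn.Module):
    def __init__(self, latent_channels=4, width=32, patch_size=8):
        super().__init__()
        self.patch_size = patch_size
        # Intra-patch flow: applied to each latent patch independently (local detail).
        self.intra = nn.Sequential(
            nn.Conv3d(latent_channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv3d(width, latent_channels, 3, padding=1),
        )
        # Inter-patch flow: applied to the whole latent volume (global structure).
        self.inter = nn.Sequential(
            nn.Conv3d(latent_channels, width, 3, padding=1), nn.SiLU(),
            nn.Conv3d(width, latent_channels, 3, padding=1),
        )

    def forward(self, z_t):
        p = self.patch_size
        b, c, d, h, w = z_t.shape
        # Intra-patch branch: estimate noise patch by patch.
        local = torch.zeros_like(z_t)
        for zi in range(0, d, p):
            for yi in range(0, h, p):
                for xi in range(0, w, p):
                    patch = z_t[:, :, zi:zi + p, yi:yi + p, xi:xi + p]
                    local[:, :, zi:zi + p, yi:yi + p, xi:xi + p] = self.intra(patch)
        # Inter-patch branch: estimate noise over the full latent volume.
        global_branch = self.inter(z_t)
        # Fuse the two flows into a single noise estimate.
        return local + global_branch


noise_estimator = ToyBiFlowNet()
z_t = torch.randn(1, 4, 16, 16, 16)   # noisy latent volume at some timestep
eps_hat = noise_estimator(z_t)        # predicted noise, same shape as z_t
```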

Results

Fig. 4. Qualitative comparison on unconditional MR image generation.
Fig. 5. Qualitative comparison on unconditional CT image generation.

Downstream Tasks

The pre-trained 3D MedDiffusion can be seamlessly adapted to various downstream tasks by integrating it with ControlNet. By keeping the 3D MedDiffusion parameters fixed and efficiently fine-tuning only the ControlNet branch, the general-purpose 3D MedDiffusion is transformed into a task-specific generative model.
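A minimal, hypothetical sketch of this adaptation recipe: the pretrained noise estimator stays frozen and only a ControlNet-style side branch, fed with the task condition (e.g., a degraded scan), is optimized. The single-conv modules and the additive fusion are stand-ins, not the actual architecture.

```python
import torch
import torch.nn as nn

latent_channels = 4
pretrained_denoiser = nn.Conv3d(latent_channels, latent_channels, 3, padding=1)  # stands in for 3D MedDiffusion
control_branch = nn.Conv3d(latent_channels, latent_channels, 3, padding=1)       # stands in for ControlNet

# Freeze the generative backbone; only the control branch is fine-tuned.
for p in pretrained_denoiser.parameters():
    p.requires_grad = False

optimizer = torch.optim.Adam(control_branch.parameters(), lr=1e-4)

z_t = torch.randn(1, latent_channels, 16, 16, 16)        # noisy latent at some timestep
condition = torch.randn(1, latent_channels, 16, 16, 16)  # encoded task condition (e.g., sparse-view CT prior)
target_noise = torch.randn_like(z_t)                     # noise added during the forward diffusion process

# The control branch's output is added to the frozen backbone's prediction.
eps_hat = pretrained_denoiser(z_t) + control_branch(condition)
loss = nn.functional.mse_loss(eps_hat, target_noise)
loss.backward()                                          # gradients flow only into control_branch
optimizer.step()
```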

Fig. 6. Fine-tuning 3D MedDiffusion to adapt to downstream tasks.
Fig. 7. Qualitative results of sparse-view CT reconstruction.
Fig. 8. Qualitative results of fastMRI reconstruction.

BibTeX

BibTex Code Here