nature.com

A deep learning model based on Mamba for automatic segmentation in cervical cancer brachytherapy

Introduction

Brachytherapy plays a crucial role in the comprehensive treatment of tumors, especially cervical cancer. Accurate delineation of target volumes and organs at risk (OARs) is essential for evaluating the efficacy and safety of brachytherapy1,2,3. According to reports and guidelines from the Groupe Européen de Curiethérapie (GEC) and the European Society for Radiotherapy and Oncology (ESTRO)4,5,6, the delineation of OARs in 3D image-guided brachytherapy is subject to both inter-observer and intra-observer variation. Furthermore, the 3D image-guided brachytherapy process is complex and labor-intensive, with limited time for contouring. Therefore, rapid and accurate automatic delineation of target volumes and OARs is crucial in brachytherapy for cervical cancer.

In recent years, deep learning models based on convolutional neural networks (CNNs) have significantly advanced automatic segmentation in various tumor radiotherapy applications1,7,8,9,10,11,12. Among these, the U-net architecture13 has been widely adopted and has demonstrated exceptional performance in clinical target volume segmentation for cancers such as prostate cancer12, cervical cancer1,11,14, and breast cancer8,9. Despite their effectiveness, CNNs are inherently limited by their local receptive fields, which makes it difficult to capture global contextual information15,16. Transformer-based models capture long-range dependencies in input sequences through self-attention mechanisms and position encoding. However, they often require extensive computational resources, particularly when applied to high-resolution 3D medical image segmentation17,18.

To enhance global modeling and reduce computational overhead, Mamba, based on the state space model (SSM)19, was designed to facilitate long-range dependency modeling while improving training speed and inference efficiency20. This approach has been extensively applied in medical image segmentation. U-Mamba21 was the first to apply it to medical image segmentation, constructing a novel SSM-CNN hybrid model. Xing et al.22 introduced a 3D Mamba model by combining SSM with CNNs, achieving excellent performance in colorectal cancer segmentation.

To our knowledge, research on the automatic segmentation of the high-risk clinical target volume (HRCTV) and OARs for cervical cancer brachytherapy remains sparse and largely relies on conventional methods. Therefore, we developed the AM-UNet model based on the Mamba and CNN architectures, which takes 3D CT as input to achieve improved automatic segmentation performance at lower computational cost. Our experimental results demonstrate that the proposed AM-UNet segments HRCTV and OARs accurately and efficiently.

Methods

The study was approved by the Ethics Committee of Fujian Cancer Hospital (No.SQ2022-191), and the requirement for informed consent was waived. All research involving human participants was conducted in accordance with the Declaration of Helsinki. The dataset comprised patients’ CT images, which were used for model training and testing, and the model’s performance was evaluated both quantitatively and qualitatively. Additionally, the impact of automatic segmentation on dose distribution was assessed. The overall workflow of the study design is illustrated in Fig. 1.

figure 1

Flowchart of automatic segmentation and dosimetric evaluation of the clinical target volume in this study.


Dataset and data preprocessing

In this study, we retrospectively analyzed 645 groups of CT scans from 179 patients who underwent brachytherapy for cervical cancer at Fujian Cancer Hospital between 2006 and 2020. Prior to each CT scan, the patient took an oral contrast agent (iopamidol) for small bowel preparation, and the bladder was filled with 100 cc of saline. CT images were acquired on a Brilliance CT Big Bore scanner (Philips Medical Systems Inc., Cleveland, OH, USA) with an in-plane matrix of 512 × 512 pixels and a slice thickness of 2.5 mm.

According to the GEC-ESTRO definition of HRCTV, the prescription dose is administered to the surface of the uterus, cervix, parametrium and the upper vagina using an intrauterine applicator and interstitial needles. The OARs in the brachytherapy plan include the bladder, rectum, and sigmoid. According to ICRU Report No. 89 (ref. 23), the volumetric dose to 2–3 cm³ of normal tissue and organ walls is linked to brachytherapy side effects. The dose exposure to normal tissue, assessed by outlining the organ’s outer wall with a single line, is comparable to that obtained using a double line. The contours of the HRCTV and OARs were delineated by two experienced radiation oncologists (LLZ and JL, with ten and six years of clinical experience, respectively) using 3D Slicer, under the guidance of a senior radiation oncologist (QX). To reduce the computational cost and time required for model training, we cropped all images based on the contour locations to a size of 64 × 224 × 224. The radiotherapy plans were developed from the manual contours and the automatic contours by a medical physicist (JHC, with seven years of clinical experience) using Oncentra (Nucletron, Elekta AB, Stockholm, Sweden, V.4.3). The prescribed dose was 7 Gy to the HRCTV.
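The contour-based cropping described above can be illustrated with a short sketch. The text does not state the exact cropping rule, so the version below is an assumption: it centers a fixed 64 × 224 × 224 window on the bounding box of the contours and shifts the window back inside the volume when it would overhang an edge. The function names and the example scan dimensions are hypothetical.

```python
def crop_bounds(dim, lo, hi, size):
    """Return (start, stop) of a window of length `size` along one axis.

    dim    : number of voxels along this axis
    lo, hi : first and last index occupied by any contour along this axis
    size   : target crop length (64 or 224 in the text)
    """
    if dim <= size:
        # The volume is not larger than the target; a caller would have to pad.
        return 0, dim
    center = (lo + hi) // 2
    start = center - size // 2
    # Shift the window back inside the volume if it overhangs an edge.
    start = max(0, min(start, dim - size))
    return start, start + size


def crop_box(volume_shape, contour_min, contour_max, target=(64, 224, 224)):
    """Apply crop_bounds per axis; all arguments are (z, y, x) tuples."""
    return tuple(
        crop_bounds(d, lo, hi, s)
        for d, lo, hi, s in zip(volume_shape, contour_min, contour_max, target)
    )


# Hypothetical scan: 90 slices of 512 x 512 pixels, with made-up contour extents.
box = crop_box((90, 512, 512), (20, 180, 200), (70, 330, 300))
print(box)  # ((13, 77), (143, 367), (138, 362))
```

A fixed output size keeps every training sample the same shape, which simplifies batching; whether the original pipeline pads, resamples, or centers differently is not specified here.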

3D model AM-UNet

The AM-UNet consists of three main components: (1) a 3D feature encoder composed of 3D Mamba blocks and downsampling convolutions, (2) a 3D decoder that performs upsampling through transposed convolutions, and (3) a channel attention mechanism composed of convolutional layers and residual connections. Figure 2 provides a detailed illustration of the overall architecture of the proposed AM-UNet. The encoder consists of a stem layer and four 3D Mamba layers. The stem layer is composed of two convolutional layers with a kernel size of 3, padding of 1, and a stride of 1. Each 3D Mamba layer comprises a Mamba layer, an MLP, and a downsampling convolutional layer with a kernel size of 3, padding of 1, and a stride of 2. The specific structure of the Mamba layer is detailed in Supplementary Material 1. The convolutional channel attention (CCA) module consists of five blocks of convolutional layers and Rectified Linear Unit (ReLU) activation functions, leveraging convolutional layers and residual connections to further learn spatial relationships and inter-channel correlations within the feature maps. Finally, the output from each layer is processed through convolutional layers to generate a four-channel segmentation result.
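To make the encoder geometry concrete, the sketch below traces how the spatial size of a 64 × 224 × 224 input would change through the stem and the four downsampling convolutions, using the standard convolution output-size formula with the kernel, padding, and stride values given above. It is only a shape calculation under those stated settings (dilation 1 assumed); it does not implement the Mamba layers, and the function names are illustrative.

```python
def conv_out(n, kernel=3, padding=1, stride=1):
    """Output length of a convolution along one axis (dilation assumed to be 1)."""
    return (n + 2 * padding - kernel) // stride + 1


def encoder_shapes(input_shape=(64, 224, 224), n_stages=4):
    """Spatial shapes after the stem and after each downsampling stage."""
    # Stem: two stride-1 convolutions, which leave the spatial size unchanged.
    shape = tuple(conv_out(conv_out(n)) for n in input_shape)
    shapes = [shape]
    # Each 3D Mamba layer ends with a stride-2 convolution that halves each axis.
    for _ in range(n_stages):
        shape = tuple(conv_out(n, stride=2) for n in shape)
        shapes.append(shape)
    return shapes


for stage, shape in enumerate(encoder_shapes()):
    print(stage, shape)
# 0 (64, 224, 224)
# 1 (32, 112, 112)
# 2 (16, 56, 56)
# 3 (8, 28, 28)
# 4 (4, 14, 14)
```

Under these settings the deepest feature map is 16 times smaller along each axis than the input, which is where most of the reduction in sequence length for the Mamba layers would come from.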

figure 2

The overview of the proposed AM-UNet.


Model training

All data were divided into training, validation, and test sets by patient in a ratio of 6:2:2. Specifically, 426 groups of CT scans from 107 patients were used for training, 137 groups of CT scans from 36 patients for validation, and 130 groups of CT scans from 36 patients for independent testing. During training, all models were optimized using the AdamW optimizer, with an initial learning rate ranging from 1e−3 to 1e−2 and a weight decay of 1e−4. The best hyperparameters for the baseline models were determined from the settings in the original literature and further fine-tuned through our experiments. The loss function combined cross-entropy and Dice loss, with dynamic weight adjustment based on validation performance. Data augmentation was applied with a 30% probability, including random flipping, scaling, and other operations to enhance model generalization. The batch size was set to 2, and training was conducted over 100 epochs. The specific training hyperparameters for all models are detailed in Supplementary Material 2. To ensure reproducibility, random seeds were set for all training processes. All experiments were performed on an NVIDIA RTX A5000 24GB GPU. The code is available online: https://github.com/khuanging/AM-UNet.
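Because each patient contributes several CT scans, splitting by patient rather than by scan keeps all scans of one patient in the same subset and avoids leakage between training and testing. The following is a minimal sketch of such a seeded 6:2:2 patient-level split; the patient identifiers, the seed value, and the rounding rule are assumptions for illustration and are not taken from the released code.

```python
import random


def split_patients(patient_ids, ratios=(0.6, 0.2, 0.2), seed=0):
    """Shuffle patients with a fixed seed and cut them into train/val/test lists."""
    ids = sorted(patient_ids)           # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_train = round(ratios[0] * len(ids))
    n_val = round(ratios[1] * len(ids))
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]        # the remainder goes to the test set
    return train, val, test


# Hypothetical identifiers for the 179 patients.
patients = [f"P{i:03d}" for i in range(179)]
train, val, test = split_patients(patients)
print(len(train), len(val), len(test))  # 107 36 36
```

With 179 patients this rounding reproduces the 107/36/36 patient counts reported above. Every scan would then be assigned to the subset of its patient.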

Dosimetric assessment

To evaluate the impact of geometric errors in automatic segmentation on dose distribution, two independent brachytherapy plans were generated for each CT scan in the test set: (1) one based on manually segmented contours and (2) one based on contours automatically segmented by the AM-UNet model. Dose-volume indices (DVI) and dose-volume histograms (DVH) were used for dose quantification analysis. For HRCTV, we compared the differences in Dmean and D90% between the two plans, where Dmean and D90% represent the mean dose and the minimum dose to 90% of the HRCTV, respectively. For the OARs, Dmean, D0.1cc, D2cc, and D5cc were used to compare dose differences between the two brachytherapy plans, where Dxcc represents the minimum dose received by the most irradiated x cm³ of the OAR. The specific brachytherapy plan structures and corresponding dosimetric indices are detailed in Supplementary Material 3.
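The dose-volume indices defined above can be illustrated on a list of per-voxel doses. The sketch below is a simplified, assumed implementation: it ranks voxels from hottest to coldest and reads off the dose at the requested volume, ignoring partial-voxel interpolation that a treatment planning system would normally apply. The voxel volume, the sample doses, and the sign convention of the relative difference (automatic minus manual) are hypothetical.

```python
import math


def d_percent(doses, percent):
    """Minimum dose received by the hottest `percent` % of the voxels (e.g. D90%)."""
    ranked = sorted(doses, reverse=True)
    k = max(1, math.ceil(percent * len(ranked) / 100))
    return ranked[k - 1]


def d_cc(doses, volume_cc, voxel_volume_mm3):
    """Minimum dose received by the hottest `volume_cc` cm^3 of the structure (e.g. D2cc)."""
    ranked = sorted(doses, reverse=True)
    k = round(volume_cc * 1000.0 / voxel_volume_mm3)  # 1 cm^3 = 1000 mm^3
    k = max(1, min(k, len(ranked)))                   # clamp to the structure size
    return ranked[k - 1]


def relative_difference(auto, manual):
    """Relative difference in percent; the sign convention here is an assumption."""
    return 100.0 * (auto - manual) / manual


# Toy structure of ten voxels, each 1 cm^3 (1000 mm^3), with doses in Gy.
doses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
print(d_percent(doses, 90))              # 2.0
print(d_cc(doses, 2, 1000.0))            # 9.0
print(relative_difference(9.0, 10.0))    # -10.0
```

In the toy example, nine of the ten voxels receive at least 2 Gy, so D90% is 2 Gy, and the two hottest voxels (2 cm³) receive at least 9 Gy, so D2cc is 9 Gy.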

Performance evaluation

For analyzing the performance of the automatic segmentation model, we used both quantitative metrics and subjective evaluations to assess its clinical usability. The quantitative analysis involved using the Dice similarity coefficient (DSC) and 95% Hausdorff distance (HD95) to measure the differences between manual delineations and model-predicted contours. Details of the performance indicator calculations are in Supplementary Material 4.
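As a rough illustration of the two geometric metrics, the sketch below computes the DSC and a 95th-percentile Hausdorff distance on small sets of voxel coordinates. The exact definitions used in the study are given in Supplementary Material 4 and may differ; here HD95 is assumed to be the larger of the two directed 95th-percentile distances (nearest-rank percentile), computed by brute force on whatever points are passed in (typically surface voxels). Both input sets must be non-empty for `hd95`.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def _directed_distances(src, dst, spacing):
    """Distance (in mm) from every point in src to its nearest point in dst."""
    return [
        min(
            math.sqrt(sum(((p_i - q_i) * s) ** 2 for p_i, q_i, s in zip(p, q, spacing)))
            for q in dst
        )
        for p in src
    ]


def hd95(a, b, spacing=(1.0, 1.0, 1.0)):
    """95% Hausdorff distance: max of the two directed 95th-percentile distances."""
    def pct95(values):
        ranked = sorted(values)
        k = max(1, math.ceil(95 * len(ranked) / 100))  # nearest-rank percentile
        return ranked[k - 1]

    return max(pct95(_directed_distances(a, b, spacing)),
               pct95(_directed_distances(b, a, spacing)))


# Two toy masks in (z, y, x) index space that overlap in a single voxel.
manual = {(0, 0, 0), (1, 0, 0)}
auto = {(1, 0, 0), (2, 0, 0)}
print(dice(manual, auto))                           # 0.5
print(hd95(manual, auto, spacing=(2.5, 1.0, 1.0)))  # 2.5
```

The `spacing` argument converts index offsets to millimetres; the 2.5 mm value along z mirrors the slice thickness reported for the CT scans, while the in-plane spacing of 1.0 mm is a placeholder. For full-size volumes a distance-transform implementation would be far faster than this brute-force loop.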

To further analyze the clinical applicability of the automatic segmentation model, model predictions were visually evaluated by two experienced radiation oncologists (Observer #1: LLC and Observer #2: JL) in collaboration. In this part of the analysis, the oncologists subjectively scored the model predictions according to the following four-grade clinical criteria11: A, Acceptance (contours are clinically acceptable); B, Minor Revision (contours need minor adjustments); C, Major Revision (contours need major adjustments); and D, Rejection (contours are clinically unacceptable).

Statistical analysis

The nonparametric Wilcoxon 2-sided test (Wilcox) was used to compare the performance differences between different models. A p-value strictly less than 0.05 was considered statistically significant. All statistical analyses were performed using Python software (version 3.8.18, Anaconda Inc.).

Results

Automatic segmentation performance

Figure 3 shows the automatic segmentation results of HRCTV and OARs for each model in the same patient undergoing cervical cancer brachytherapy. Table 1 summarizes the specific metric results of each model on the test set. Among all models, AM-UNet demonstrated superior performance in both quantitative metrics for HRCTV and OAR segmentation. For HRCTV, AM-UNet achieved the best DSC (0.862), and its HD95 (3.638 mm) significantly outperformed other models (p = 0.108) except nnU-Net. AM-UNet demonstrated superior performance in both DSC (0.937) and HD95 (2.578 mm) for bladder segmentation, surpassing other models (p < 0.05 and p = 0.096), except for nnU-Net. For the rectum, AM-UNet also had the best metrics (DSC, 0.823; HD95, 5.312 mm), significantly outperforming other models (p < 0.05). Although all models performed moderately in sigmoid segmentation, AM-UNet still showed the best performance with a DSC of 0.725 and an HD95 of 18.739 mm, which was statistically significant (p < 0.05). The box plots of the results of different models on the test set are shown in Supplementary Material 5.

figure 3

Axial, sagittal, coronal, and 3D views showing HRCTV, bladder, rectum and sigmoid as predicted by the model and as delineated manually. HRCTV high-risk clinical target volume.


Table 1 Summary of quantitative metrics for HRCTV and OAR automatic segmentation results on the test set by different models.


Dosimetric evaluation

Table 2 compares the dose metrics of HRCTV and OARs between AM-UNet automatic segmentation and manual contouring in brachytherapy planning. For HRCTV, the relative differences in Dmean and D90% between manual and automatic segmentation were less than 1%, with no significant difference (p > 0.05). Among all OARs, the radiation doses for Bladder_D0.1cc, Bladder_D2cc, and Bladder_D5cc showed significant differences (p < 0.05), while the other metrics showed no significant differences. Except for Bladder_D0.1cc, which had a relatively substantial difference (−8.36%), the other differences were all less than 5%. Figure 4 illustrates the dose distribution and DVH for the cases with the best and worst DSC. In Fig. 4A, B, differences in dose distribution between manual and automated segmentation were minimal for the patient with the best DSC, while the patient with the worst DSC showed some gaps between the dose distributions. In the DVH curves of Fig. 4C, D, the dosimetric curves for the best DSC nearly overlapped, while those for the worst DSC showed slight differences in the sigmoid and bladder dose curves.

Table 2 Mean dose for each clinical target volume and assessment segment of the OAR calculated from the manual contour (Manual), AM-UNet (Auto), and minimum value of the HRCTV.



figure 4

Dose distributions and DVH curves for treatment evaluation using manual contouring and automatic segmentation of contours. (A,B) Dose distribution plots for the test patients with the best and worst DSC, (C,D) DVH curves for the test patients with the best and worst DSC. DVH dose-volume histogram, DSC Dice similarity coefficient.


Qualitative evaluation

The oncologists evaluated and confirmed all predictions of the test cases on a slice-by-slice basis. Figure 5 shows the experts’ qualitative assessment of the model segmentation results. Among all cases in the test set, 0.77% (n = 1) were deemed acceptable, 63.85% (n = 83) required minor revisions and 35.38% (n = 46) required major revisions; no patients were rejected. When considering specific site evaluations, HRCTV achieved the best results (Acceptance, 15.38%; Minor revision, 77.69%; Major revision, 6.92%), followed by the rectum (10.77%, 84.62%, 4.62%). In the automatic segmentation of the sigmoid, one patient’s CT was deemed to require re-outlining. The specific results of the oncologists’ qualitative scores are presented in Supplementary Material 6. Additionally, the visualization results for patients of each grade are provided in Supplementary Material 7.

figure 5

Stacked histograms showing the oncologists’ qualitative ratings of the AM-UNet segmentation results.


Discussion

The delineation of HRCTV and OARs is crucial in brachytherapy for cervical cancer. In clinical practice, radiation oncologists delineate these structures through time-consuming and laborious manual contouring. Despite the availability of standard guidelines, the process remains highly dependent on the expertise of clinicians, and there are issues with intra- and inter-observer consistency24. Currently, there are few studies on the automatic segmentation of HRCTV and OARs in cervical cancer brachytherapy1,11,25, most of which use traditional convolutional neural networks. Therefore, we collected 645 groups of CT scans from 179 cervical cancer patients receiving brachytherapy at our institution and proposed a novel deep learning network, AM-UNet, for the automatic segmentation of HRCTV and OARs.

The Mamba architecture, based on SSM, offers a novel design that combines the ability to capture long-range dependencies akin to Transformers, without being constrained to the local features of 3D images as seen in traditional CNNs21,26,27. Unlike the self-attention mechanism in Transformers, which involves quadratic computational complexity, Mamba significantly reduces computational demands, making it particularly advantageous for tasks like 3D image segmentation that are computationally intensive.
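The linear-versus-quadratic contrast drawn above can be made concrete with a toy example. The sketch below runs a scalar, time-invariant state space recurrence, h_t = a·h_(t−1) + b·x_t and y_t = c·h_t, in a single pass over the sequence, and compares that cost with the number of pairwise interactions full self-attention would need. This is not Mamba itself (Mamba makes the SSM parameters input-dependent and uses a hardware-aware parallel scan), and treating every voxel of a 64 × 224 × 224 crop as one token is purely an illustrative assumption.

```python
def ssm_scan(x, a, b, c):
    """Scalar linear state space recurrence, evaluated in one pass (O(L) steps)."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t   # state update carries information from all earlier inputs
        y.append(c * h)       # readout at the current position
    return y


# An impulse at the first position decays geometrically through the state.
print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))  # [2.0, 1.0, 0.5]

# Illustrative cost comparison for a sequence with one token per voxel.
n_tokens = 64 * 224 * 224
print(n_tokens)             # recurrence steps grow linearly with sequence length
print(n_tokens * n_tokens)  # pairwise attention scores grow quadratically
```

The first output shows how a single input still influences later positions through the hidden state, which is the mechanism behind long-range dependency modeling in state space models.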

We found that the quantitative metrics for HRCTV, bladder, rectum, and sigmoid predicted by AM-UNet were superior to other models, except for nnU-Net. Due to nnU-Net’s unique adaptive training and data processing methods, it outperforms AM-UNet in certain aspects. Among these, bladder segmentation performed exceptionally well across all deep-learning models, primarily due to the bladder’s distinct shape and volume compared to HRCTV and other OARs. In contrast, all models performed poorly in segmenting the sigmoid. However, this outcome is consistent with findings from other studies2,28,29, which have noted that the sigmoid is challenging to segment due to its complex anatomy and low image contrast2,30. AM-UNet, based on the Mamba architecture, effectively captures long-range three-dimensional features in CT images, which enhances segmentation performance for larger or irregularly shaped organs like the bladder, rectum, and sigmoid. Additionally, for HRCTV, which has blurred boundaries and a complex structure, the model employs convolutional attention and multi-scale feature fusion for more precise feature aggregation. As shown in Fig. 3, AM-UNet demonstrates better segmentation performance compared to other models.

In their study on automatic CT segmentation for cervical cancer brachytherapy patients, Zhang et al.28 introduced the DSD-UNET model, achieving DSC of 0.83, 0.87, 0.82, and 0.65 for HRCTV, bladder, rectum, and sigmoid, respectively. These values are lower compared to our study, possibly due to limitations in CNN’s ability to extract spatial contextual features, thereby struggling with the varied shapes and volumes of OARs in CT images. On the other hand, Wang et al.2 proposed an enhanced CNN model, achieving higher DSC values of 0.87, 0.94, 0.86, and 0.79 for HRCTV, bladder, rectum, and sigmoid, respectively, slightly outperforming our results. However, their study was limited by a small sample size of only 10 cases in the test group, making it challenging to fully assess the model’s generalizability and stability. This issue is prevalent in deep learning research, particularly in medical imaging30,31,32. Deep-learning models require independent testing with broad datasets. Personalized studies based on multiple CT scans per patient better reflect clinical realities, confirming the effectiveness of our model in practical clinical applications.

Dose assessment is more informative than geometric assessment in automated segmentation studies for brachytherapy. Yoganathan et al.33 found that geometric metrics for the sigmoid were poor but that its dosimetry was highly consistent with that of manual delineation. This finding is similar to our study, where all dose metrics for the sigmoid showed no significant differences (p > 0.05). In Wang et al.’s study2, substantial dose variations were observed in the bladder and rectum, similar to our findings where, despite favorable geometric parameters of the bladder, dosimetric parameters such as bladder_D0.1cc, bladder_D2cc, and bladder_D5cc exhibited significant differences. These results emphasize the importance of considering both geometric and dosimetric parameters when assessing automatic segmentation performance, and the critical role of experienced radiation oncologists in reviewing structures, particularly those in high-dose regions.

To further demonstrate the clinical applicability of AM-UNet, we conducted a subjective evaluation of the segmentation results. Upon comparing quantitative metrics with subjective evaluation results, we found that although the bladder had better results in the quantitative analysis, it had poorer results in the subjective assessment (only one patient was deemed clinically acceptable). This discrepancy may be because the bladder’s distinctiveness prompts oncologists to adopt stricter evaluation criteria. Meanwhile, HRCTV performed best in subjective evaluation, with 15.38% of cases deemed clinically acceptable and 77.69% requiring only minor adjustments (Supplementary Material 6). This can be attributed to the complexity of HRCTV, as there is no detailed consensus or standard for its contouring. Different clinical centers or oncologists may have varying standards and preferences for HRCTV delineation.

One significant advantage of automated HRCTV and OAR segmentation is the reduction in manual contouring time for radiation oncologists, which decreases the complexity of the brachytherapy process. The average time to manually contour HRCTV and OARs for a cervical cancer patient is 90–120 min34, while our AM-UNet model requires only 5 s. Even considering potential further manual adjustments to the contours, the total processing time per patient does not exceed 10 min on average. The AM-UNet significantly simplifies the treatment workflow and reduces subjective inconsistencies among experts.

Despite the excellent results achieved in our work, all current data come from a single center. In the future, it will be necessary to include more centers for independent testing to evaluate the model’s generalization performance. Although our model performed best in the segmentation of the sigmoid, the variability of this structure necessitates further consideration of data processing and model structure optimization to enhance its segmentation ability. On the other hand, considering our success with CT images of patients with cervical cancer, this approach may apply to the volumetric segmentation of other cancers or to MRI images. We will explore this possibility in future work.

Conclusion

The delineation of HRCTV and OARs is a critical component of brachytherapy planning, directly impacting dose distribution to these organs and target volumes. Automatic segmentation technology enables rapid and repeatable contour delineation, effectively reducing the time and labor costs associated with the treatment process. The proposed AM-UNet model achieved good consistency between automatic and manual segmentation contours for HRCTV and OARs in cervical cancer brachytherapy, although some results still have clinical acceptability issues. Nonetheless, this study offers an efficient solution for the demanding clinical workload.
