1. Introduction
Cancer has become a crucial public health concern worldwide, and breast cancer is one of the most prevalent types of cancer among women [
1]. Today, with advancements in prevention, diagnosis, and treatment methods, survival rates for this disease have increased. According to previous research, the average survival rate for women with breast cancer is approximately 90% after 5 years and approximately 84% after 10 years. Breast cancer results from the uncontrolled growth of abnormal cells in breast tissue, including ducts (tubes that transfer milk to the nipple) and lobules (glands that produce milk). This cancer is the most common type among Iranian women, constituting nearly one-fourth of all cancers [
2]. Recent studies evaluating the incidence of this cancer in Iran report a lower average age of patients compared with other parts of the world. In other words, the peak incidence in Iranian women occurs in the fourth and fifth decades of life, a decade earlier than the global peak incidence [
3]. Quality of life (QL) is a multidimensional construct that includes physical, social, and psychological domains. It reflects the well-being of individuals and communities, delineating both negative and positive life characteristics [
2]. Breast cancer patients have various concerns regarding treatment, family, and financial matters. Although medical professionals provide care, patient concerns are sometimes overlooked in clinical settings [
4]. Moreover, the prevalence of depression and anxiety in patients with breast cancer is approximately 30%. The patient’s adaptation to breast cancer treatment significantly impacts their QL. Consequently, appropriate and timely interventions may be crucial in improving patient adaptation and QL [
5].
Furthermore, hope, defined as an individual’s belief in the ability to achieve goals, especially in situations where the individual can influence the outcome, plays a significant role. Pain resulting from chemotherapy can impede the adaptation of cancer patients to treatment and affect subsequent outcomes. A qualitative study conducted in 2022 demonstrated that hope is a positive coping strategy for cancer patients, fostering courage and resilience [
6]. In this research, the relationship between hope for life (HL) and QL is investigated, and QL is predicted. Shen et al., exploring the relationship between hope and quality of life, showed that income, hope, self-efficacy, and social support are positive predictors of quality of life, whereas cancer stage is a negative predictor. They recommended that support programs and interventions aimed at increasing hope levels, self-efficacy, and social support during the care of this group receive attention [
7]. In 2022, Li et al. studied the relationship between HL and QL in women who have undergone breast cancer chemotherapy. Regression analysis revealed that the QL of patients was significantly associated with age, marital status, education level, chemotherapy cycle, and hope. However, researchers believe that further studies are needed to determine whether nurses can influence this aspect of care [
6]. Quality of life is a crucial issue for cancer patients. According to the findings of Savić et al., general, physical, self-efficacy, and hope parameters significantly affect the QL of women with breast cancer in Iran [
8]. In 2023, Zhang et al. examined the effectiveness of family-centered positive psychological intervention on resilience, hope, perceived benefits, and QL in breast cancer patients and their caregivers.
The study findings showed that implementing a family-centered positive psychological intervention led to significant improvements in psychological well-being and quality of life for both breast cancer patients and their caregivers [
9].
All studies that investigated the relationship between QL and hope are listed in
Table 1.
With the emergence of medical databases containing extensive information related to quality of life, it is possible to predict quality of life using machine learning techniques. Machine learning models have been used to assess QL in various cancer fields. For example, early detection and intervention for lymphedema are essential to improve QL in breast cancer survivors. Therefore, Wei et al. presented a predictive model for the early detection of lymphedema associated with breast cancer.
The resulting model has been implemented as an open-access, web-based application that lets users estimate the likelihood of lymphedema in real time [
10].
A decrease in QL is common in thyroid cancer patients after thyroidectomy, yet predictive methods for estimating the extent of this reduction are lacking. In a study conducted in 2022, researchers presented a model to predict QL in thyroid cancer patients with relatively high accuracy. They believe that these findings should be clinically employed to optimize healthcare interventions [
11]. Appropriate and timely interventions may enhance the adaptability, resilience, and QL of breast cancer patients during the treatment process and post-disease period.
Nutinen et al. conducted a study to investigate whether machine learning in a clinical decision support system improves physician performance in predicting patients’ QL during the treatment process.
They found that physicians’ performance in evaluating patients’ QL improved when machine learning model predictions were used [
12].
As sleep disturbance is a primary symptom of breast cancer and can seriously affect QL during and years after treatment, a study in Japan recommends routine screening. Their predictive model, which uses machine learning, provides important clinical insights for early diagnosis of insomnia and intervention in breast cancer survivors [
13]. In addition, artificial intelligence data analysis provides the highest predictive score for stress hormones and inflammation in breast cancer survivors. Disease control, health, and QL are important factors associated with the best predictive outcomes [
14].
In 2023, Nascimben et al. studied Italian breast cancer (BC) patients with unilateral upper limb lymphedema (BCRL). BCRL is a condition with multifactorial origins that affects both the physical capabilities and the overall QL of breast cancer survivors over the medium to long term. To stratify the risk of BCRL, the researchers used unsupervised low-dimensional data embeddings and clustering techniques to identify distinct patient groups and their characteristics. This analysis identified factors associated with heightened risk within specific clusters and yielded a prognostic map delineating three patient clusters, each with its own attributes and associated risk factors [
15].
Kang et al. used a machine learning algorithm to analyze data collected from patients with breast cancer along different survivorship trajectories to identify patient-centered factors associated with their QL. The study ultimately identified important factors related to QL among breast cancer survivors across various survival trajectories. Based on these results, emotional and physical functions were the most important features before surgery and within 1 year after surgery, respectively [
16].
In 2024, Choe et al. used machine learning (ML) to develop models predicting diminished quality of life (QL) among post-treatment cancer survivors in South Korea. The Random Forest (RF) model outperformed the support vector machine and extreme gradient boosting models, as well as three deep learning models. Survivorship concerns such as distress, pain, and fatigue emerged as key factors in reduced QL. The ML framework presented in the study shows promise for supporting clinical decision-making by facilitating the early identification of survivors susceptible to diminished QL [
17].
All studies examining the application of machine learning in predicting QL are presented in
Table 2.
In this study, we examined the relationship between hope for life and quality of life (QL), the factors influencing QL, and the best machine learning model for predicting QL in breast cancer survivors in Iran. This research not only uses hope for life in the prediction model but also, for the first time, focuses on predicting the QL of Iranian patients, which is its main novelty.
2. Method
In this article, we aimed to predict QL among Iranian women who are breast cancer (BC) survivors and to investigate the influential factors. As depicted in
Figure 1, we first collected the data and then preprocessed and prepared them. Subsequently, we developed a model that incorporates feature selection, parameter tuning, and model fitting to predict QL. Finally, we evaluated the model’s performance using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared, and compared the results.
3.1. Problem Statement Analysis
In contemporary society, QL is a vital parameter for assessing the status of communities. Compared with the past, this issue has gained greater importance and is considered an influential sociological criterion. Given that breast cancer is prevalent among women, investigating and predicting the quality of life (QL) of women affected by this type of cancer can be a significant step toward improving this aspect of society.
Recent research indicates a strong correlation between breast cancer and individuals’ mental well-being and hope in life. Therefore, in addition to data related to QL, special attention has been given to collecting information regarding hope for life.
This study aimed to utilize regression models to forecast the QL of Iranian breast cancer patients, aiding advancement in associated research and offering valuable insights.
3.2. Data Collection
In this research, the dataset of the National Institute for Medical Research Development (NIMAD) was used. The dataset contains information on 1114 breast cancer patients, collected after treatment at four Iranian specialized hospitals: Imam Hussein, Imam Khomeini, Mahdiyeh, and Khatam al-Anbia. Data were collected over one year; after treatment, patients completed three questionnaires: EORTC_C30, EORTC_BR23, and the Schneider Hope Scale.
3.3. Ethical Approval and Consent to Participate
This study was approved by the National Institute for Medical Research Development (NIMAD), Iran (reference number: IR.NIMAD.REC.1397.322, registration date: June 22, 2018). All procedures were performed in accordance with relevant guidelines and regulations. Written informed consent was obtained from all participants by NIMAD in accordance with these guidelines.
3.4. Dataset Examination
The European Organization for Research and Treatment of Cancer (EORTC) established the Quality of Life (QL) Study Group in 1980 and, in 1986, initiated a unified approach for assessing patients participating in clinical trials. The EORTC_C30 questionnaire comprises functional scales, symptom scales, QL scales, and the financial impact reported by cancer patients [
41].
According to
Table 3, the dimensions of the EORTC-C30 questionnaire can be used to assess QL in cancer patients.
The percentage frequency of questions covering each dimension is specified in
Figure 2.
In the EORTC-C30 questionnaire, physical aspects comprised 19% of the responses, whereas social functioning accounted for 11.5%.
“EORTC_BR23”: This scale encompasses specific subscales concerning psychopathology. The questionnaire addresses symptoms and side effects associated with various treatment methods. It includes aspects such as body image, sexual functioning, and future perspectives [
41].
Table 4 shows the dimensions used in the questionnaire to assess the QL of breast cancer patients.
Figure 3 shows the percentage frequency of questions related to each dimension.
In the EORTC-BR23 questionnaire, general aspects accounted for 26.4% of the responses, whereas symptoms and signs constituted 15.1%.
Using the Schneider Hope Scale, we can measure the level of hope in individuals. This questionnaire evaluates two dimensions of hope: energy for achieving goals in life and personal plans for achieving goals in life. Scoring was based on a 5-point Likert scale [
42].
Table 5 shows the dimensions used by the questionnaire to assess hope for life.
Figure 4 shows the percentage frequency of questions related to each dimension.
Based on the Schneider Hope Scale, the analysis indicated that Anticipation of Future Success accounted for 25% of the responses, while Effort constituted 33.3%.
These extensive and cohesive data were carefully gathered from reputable hospitals in Iran, improving the accuracy and validity of the research findings.
3.5. Data Preprocessing
Data preprocessing is a crucial stage in data analysis and modeling. This stage contributes to the precision and validity of the analysis and modeling results, preventing the impact of unintended data variations.
The data-cleaning stage was performed as follows:
In the next stage, the dataset was encoded. Categorical data must be converted into numerical values before modeling; therefore, according to
Table 6 and
Table 7, the categorical variables were encoded as numerical values.
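The encoding step can be illustrated with a minimal sketch. The category labels below are hypothetical placeholders, since the actual levels and codes are those defined in Tables 6 and 7; the sketch only shows two common ways of turning categories into numbers (an ordered integer code and a one-hot code) and is not claimed to be the scheme used in this study.

```python
# Hypothetical category levels; the real ones are defined in Tables 6 and 7.
EDUCATION_ORDER = ["primary", "secondary", "university"]


def encode_ordinal(values, order):
    """Map each category to its position in an explicit ordering."""
    mapping = {label: index for index, label in enumerate(order)}
    return [mapping[value] for value in values]


def encode_one_hot(values):
    """Return the sorted category list and one 0/1 indicator row per value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


education_codes = encode_ordinal(["secondary", "primary", "university"],
                                 EDUCATION_ORDER)
marital_categories, marital_rows = encode_one_hot(["married", "single", "married"])
```

An ordinal code suits variables with a natural order (such as education level), whereas a one-hot code avoids imposing an artificial order on unordered variables.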
In data preprocessing, we used feature scaling to align the range of variation of all features, so that no feature dominates merely because of its scale. In this study, we employed the standardization method given in Equation 1. After this transformation, each feature has a mean of 0 and a standard deviation of 1 [
43].
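As an illustration of standardization, the sketch below applies the usual z-score transformation, z = (x - mean) / standard deviation, to a single feature column. The function name is hypothetical, and the use of the population standard deviation is an assumption, as the text does not state which variant was used.

```python
from statistics import fmean, pstdev


def standardize(column):
    """Z-score a feature column: subtract the mean, divide by the std. deviation.

    Assumption: the population standard deviation is used.
    """
    mu = fmean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


scaled = standardize([1, 2, 3, 4, 5])
```

To avoid information leakage, the mean and standard deviation would normally be computed on the training data only and then reused for the validation and test data.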
- 2.
Statistical Analysis
Various statistical analysis methods were used to explore the connection between hope and QL in the dataset of this study:
- I.
General Linear Model (GLM): The GLM explores the impact of various independent variables on a dependent variable. By accounting for other variables, it offers a versatile model [
44].
- II.
Analysis of Variance (ANOVA): ANOVA is a robust method for comparing variations between groups, allowing for statistical comparisons among more than two groups [
45].
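To make the ANOVA idea concrete, the following sketch computes the one-way ANOVA F statistic as the ratio of between-group to within-group mean squares. The grouping of patients (for example, by low, medium, and high hope) and the scores are hypothetical toy values, not data from this study.

```python
from statistics import fmean


def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of scores."""
    all_values = [value for group in groups for value in group]
    grand_mean = fmean(all_values)
    k = len(groups)       # number of groups
    n = len(all_values)   # total number of observations
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((value - fmean(g)) ** 2 for g in groups for value in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical QL scores for three hope-level groups (toy numbers).
f_statistic = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F value indicates that the group means differ by more than would be expected from the variation within groups; the corresponding p-value comes from the F distribution with (k - 1, n - k) degrees of freedom.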
3.6. Modeling
The process of fitting the model is a crucial and fundamental stage in machine learning and data analysis. During this stage, a machine learning model learns from the training data and strives to make the best possible predictions for the data.
3.6.1. Feature Selection
Feature selection is a crucial stage in the data analysis and machine learning process that significantly impacts the performance of the models. In this stage, essential and meaningful features are chosen from the available variables in the data to create a more accurate and efficient model.
Feature selection involves either eliminating or selecting several variables based on their importance. This selection can be performed using statistical sampling methods or machine learning models, depending on the knowledge acquired in the problem domain. The primary purpose of feature selection is to enhance model performance, decrease data dimensions, and avoid overfitting in models [
46].
We used the mutual information approach for feature selection in this study because of its ability to capture dependencies between variables effectively. Mutual information measures the information gained about one variable from observing another, so the features that provide the most information to the predictive model can be identified. Moreover, mutual information is valuable in real-world datasets where feature relationships may not be linear or direct. Furthermore, it does not depend on a specific data distribution, making it suitable for datasets with different characteristics.
Moreover, the mutual information approach provides a systematic method for selecting features without requiring complex parameter adjustments or assumptions about data distribution. The mutual information approach identifies and preserves the most informative features while reducing the dataset’s dimensionality. This ultimately leads to improved model performance, enhanced generalization, and greater interpretability of the resulting machine-learning models.
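The idea behind mutual information ranking can be sketched as follows for discrete variables. This is an illustrative estimator under stated assumptions, not the study's implementation: the function and feature names are hypothetical, and a continuous target such as QL would first have to be binned (or handled with a different estimator), which the text does not specify.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats for two equally long discrete sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    total = 0.0
    for (x, y), joint in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        total += (joint / n) * log(joint * n / (count_x[x] * count_y[y]))
    return total


def rank_features(features, target, top_k):
    """Return the names of the top_k features by mutual information with target."""
    scores = {name: mutual_information(column, target)
              for name, column in features.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Toy example: "q1" determines the target, "q2" is independent of it.
toy_target = [0, 0, 1, 1]
toy_features = {"q1": [0, 0, 1, 1], "q2": [0, 1, 0, 1]}
selected = rank_features(toy_features, toy_target, top_k=1)
```

A score of zero means the feature tells us nothing about the target, whereas larger scores indicate stronger (possibly non-linear) dependence.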
3.6.2. Splitting Dataset
In the healthcare dataset, we allocated 60% of the data as “training data” for the model to learn from. We then set aside another 20% as “validation data” to evaluate the model; if necessary, we repeated the training phase to improve the model. The remaining 20% of the data was designated as “test data,” which the model has never seen (Table 8). We therefore evaluate the final model by its error on these data.
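A 60/20/20 split of this kind can be sketched as below. The sketch assumes a simple random shuffle with a fixed seed; the text does not state whether the split was random or stratified, and the function name and seed are hypothetical. The exact subset sizes reported in Table 8 may differ from the rounding used here.

```python
import random


def split_60_20_20(rows, seed=0):
    """Shuffle the rows and split them into train, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.6 * len(shuffled))
    n_val = int(0.2 * len(shuffled))
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # the remainder, roughly 20%
    return train, val, test


# Patient indices standing in for the 1114 records of the dataset.
train_rows, val_rows, test_rows = split_60_20_20(range(1114))
```

Fixing the seed makes the split reproducible, so that every model is trained and evaluated on the same subsets.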
3.6.3. Regression Models
Throughout model fitting, the model uncovers information from the training data and recognizes patterns and connections between input features and the desired output. Because the response variable is continuous and the dataset is small, we employed regression models. In addition, the models are classified into two groups: basic algorithms and advanced algorithms. The basic models encompass Random Forest Regression, k-Nearest Neighbors Regression, and Classification and Regression Trees, whereas the advanced models include Extreme Gradient Boosting, AdaBoost, and Gradient Boosting Machines.
3.6.4. Model Parameter Optimization
Machine learning models have parameters that can be adjusted to optimize the model’s performance. Optimizing these parameters is crucial to ensure that the model delivers the best performance [
47].
In this study, the grid search method was selected for parameter optimization because of its comprehensiveness and simplicity. Grid search systematically explores a predefined set of hyperparameters, covering a range of candidate values for each parameter. This exhaustive approach ensures that no combination within the predefined grid is overlooked, thereby increasing the likelihood of finding a well-performing configuration for the model.
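The mechanics of grid search can be sketched in a few lines. The example below is an illustration only: it tunes the number of neighbors of a toy one-dimensional k-nearest-neighbors regressor on made-up data, and the function names, the grid, and the data are assumptions rather than the settings used in this study.

```python
from itertools import product


def fit_knn(train, k):
    """Return a predictor averaging the targets of the k nearest training points."""
    def predict(x):
        nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return predict


def mse(model, data):
    """Mean squared error of a predictor on (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)


def grid_search(param_grid, fit, score, train, val):
    """Try every parameter combination; keep the one with the lowest score."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        current = score(fit(train, **params), val)
        if current < best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Toy data (hypothetical): y = 2x on the training points.
train = [(0, 0), (1, 2), (2, 4), (3, 6), (4, 8)]
val = [(0.2, 0.4), (3.9, 7.8)]
best_params, best_score = grid_search({"k": [1, 3, 5]}, fit_knn, mse, train, val)
```

Because the number of combinations grows multiplicatively with each added parameter, the grid is usually kept small, and the search only finds the best configuration among the values listed in the grid.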
3.6.5. Model Evaluation
Evaluating machine learning models is crucial to ensure that the selected model performs well on test data and provides accurate predictions. Various evaluation metrics are used for this purpose, allowing comparison between different models.
- i
Mean Squared Error (MSE) is the average of the squared differences between predicted and actual values (Equation 2). The lower the MSE, the more accurate the model’s predictions [
48].
- ii
Mean Absolute Error (MAE) is the average of the absolute differences between predicted and actual values, revealing the typical size of prediction errors (Equation 3). MAE does not consider the direction of these errors [
48].
- iii
The R-squared (R2) value reflects the proportion of variation in the response variable that the model explains (Equation 4). A higher R2 value indicates a more explanatory model [
49].
These evaluation metrics play a vital role in assessing the performance of machine learning models and facilitating meaningful comparisons between them.
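The three metrics can be written out directly from their standard definitions, as in the sketch below. The function names and the toy values are hypothetical and serve only to show how each quantity is computed.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average of absolute differences between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical), not results from this study.
actual = [3, 5, 7]
predicted = [2, 5, 9]
toy_mse = mean_squared_error(actual, predicted)
toy_mae = mean_absolute_error(actual, predicted)
toy_r2 = r_squared(actual, predicted)
```

Because MSE squares each error, it penalizes large errors more heavily than MAE, which is why the two are usually reported together.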
4. Results and Discussion
4.1. Investigating the Relationship between Hope Life and QL
To investigate the correlation between hope in life and QL, two statistical methods were employed: the General Linear Model (GLM) and Analysis of Variance (ANOVA). The outcomes from both analyses demonstrate a significant and positive relationship between hope in life and QL.
Figure 5 illustrates that higher hope corresponds to higher QL. This relationship is described by the equation Y = 5.88X + 47.54, where Y denotes QL and X denotes hope.
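For readers who want to see how such a line is obtained and used, the sketch below fits a straight line by ordinary least squares and evaluates the reported equation. The fitting procedure is an assumption (the text does not say how the line was estimated), the function names are hypothetical, and the toy points are generated from the reported equation rather than taken from the study data. The scale of X is not stated in this section.

```python
def predict_ql(hope, slope=5.88, intercept=47.54):
    """Evaluate the reported line Y = 5.88 X + 47.54."""
    return slope * hope + intercept


def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy points lying exactly on the reported line (not study data).
toy_hope = [1, 2, 3, 4]
toy_ql = [predict_ql(x) for x in toy_hope]
fitted_slope, fitted_intercept = fit_line(toy_hope, toy_ql)
```

Read this way, the slope means that each one-unit increase in the hope score is associated with a QL score about 5.88 points higher.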
The findings of this research highlight a clear linear relationship between QL and HL, indicating a significant association between the two. These results align closely with prior research, particularly the studies conducted by Shen et al. in 2020 and Li et al. in 2022. Notably, these studies also identified hope for life as a crucial predictor of QL, further underscoring its importance in understanding the dynamics at play [
6,
7].
4.2. Influential Parameters
According to our findings, 29 questions have a more pronounced impact on QL. These questions encompass dimensions of the patient’s life concerning general condition, physical well-being, functionality, pain, sleep, self-efficacy, economic conditions, disease symptoms, and hope for life.
As a result, we not only identified significant factors affecting the QL of Iranian breast cancer survivors but also reduced the dimensionality of the problem: by selecting 29 of the 52 questions in the questionnaires, our machine learning algorithms can predict the QL of this group while asking fewer questions.
Furthermore, these observations are consistent with those of previous research. Shen et al. similarly demonstrated that income, hope, self-efficacy, and social support are positive predictors of QL, whereas cancer stage is a negative predictor of QL [
7]. Li et al. also found a significant relationship between the QL of these women and variables such as age, marital status, education level, chemotherapy cycle, and hope [
6].
Because clinical data from patients were not available and only questionnaire information was accessible, other dimensions of the disease, such as the stage of cancer, were not investigated in this study.
4.3. Model Result
Because QL, as measured by the QL questionnaires, and hope for life are numerical variables ranging from 0 to 100, the response variable is continuous. In this section, six regression algorithms, namely K-Nearest Neighbors Regression (K-NN), Random Forest Regression (RF), Classification and Regression Trees (CART), Extreme Gradient Boosting (XGBoost), AdaBoost, and Gradient Boosting Machines (GBM), were fitted to the dataset; the validation results are given in
Table 9. The test results in
Table 10 show that the classification and regression trees performed worst, with high error rates and the lowest R-squared score. Extreme Gradient Boosting and Gradient Boosting Machines showed superior performance, achieving accuracy levels above 80% with minimal errors. The related illustrations are shown in
Figure 6,
Figure 7, and
Figure 8.
In this context, the 2022 study by Liu et al. closely aligns with the results of this research. They used basic machine learning algorithms to forecast the decline in quality of life (QL) in individuals with thyroid cancer and attained accuracies of 89.7% and 83.4% on the training and testing datasets, respectively, using the random forest algorithm.
In machine learning, larger datasets generally lead to more accurate and reliable results. The limited size of the dataset therefore likely kept the model’s accuracy from reaching higher levels, and the use of deep learning tools was not feasible.
In this study, only six algorithms were analyzed, and the use of deep learning algorithms was restricted by data volume limitations and study period constraints. Because larger datasets generally improve the accuracy and performance of models, there is potential for further exploration in this area with a focus on more extensive data.