Breast cancer survival analysis and machine learning to predict the impact of different treatments

Breast cancer is the most common form of cancer among women, impacting approximately one million women worldwide. New treatments are developed every year, improving breast cancer patients’ survival rates. To explore the impact of different treatments, we conducted this study using data from the Surveillance, Epidemiology, and End Results (SEER) database. The study employed Kaplan-Meier analysis to examine breast cancer-specific survival (BCSS) and overall survival (OS) rates across four treatment options: ‘chemotherapy’, ‘radiotherapy’, ‘both therapies’, and ‘no therapy’. The log-rank test was used to assess the statistical significance of differences between survival curves. Based on the BCSS analysis, the recommended treatment for most breast cancer cases is the combination of chemotherapy and radiotherapy. According to the OS analysis, on the other hand, radiotherapy, either alone or in conjunction with chemotherapy, is the superior treatment for most breast cancer cases and is often preferable to ‘chemotherapy only’. Machine learning was used to develop ten models predicting survivability for OS and BCSS. The C5.0 algorithm consistently achieves strong overall performance: an accuracy of 0.98 and a sensitivity of 0.99 for both OS and BCSS, a reasonable RMSE (0.14 for BCSS and 0.15 for OS), and a good ROC score (0.91 for BCSS and 0.88 for OS).


Introduction
Breast cancer is a widely prevalent disease, accounting for approximately one out of every eight cancer diagnoses globally. In 2020, more than 2 million new breast cancer cases were recorded, resulting in 685,000 deaths worldwide. By 2040, breast cancer cases are expected to exceed 3 million new cases annually (a 50% increase), with over 1 million deaths per year (about a 32% increase) [1]. Among women, breast, lung, and colorectal cancers are the three most common types, collectively accounting for half of all newly diagnosed cancer cases [2]. Breast cancer alone represents 25% of all new cancer diagnoses in women [3]. Promisingly, breast cancer death rates among women decreased by 43% from 1989 to 2020, primarily owing to improved early-stage detection and diagnosis [4]. Early cancer detection plays a crucial role in increasing the chances of patient survival. Despite advancements in breast cancer awareness and treatment, mortality rates remain relatively high [5]. Cancer can be aggressive and life-threatening, but the chances of survival are considerably higher with early detection and timely treatment [6,7].
A study [8] compared the effect of breast-conserving therapy (BCT) and mastectomy on OS using the SEER database. It showed that patients receiving BCT had improved survival compared with those undergoing mastectomy in early-stage breast cancer. It encourages patients to receive BCT rather than mastectomy in the early stages.

A retrospective cohort study [9] of patients aged ≥ 55 with stage II/III breast cancer compared patients aged ≥ 75 with younger patients. The study found that BCSS was approximately 95% at all ages but was better in younger patients with triple-negative and HER2+ subtypes. OS was poorer in the older group for all subtypes. Chemotherapy improved OS in different subtypes.
A study [10] investigated risk factors and treatment impacts for male breast cancer. It suggested that surgery and chemotherapy are recommended for early stages (I-III), whereas omitting other treatments (radiation or chemotherapy) worsened outcomes. For stage IV, both chemotherapy and radiation are suggested, as they improved BCSS.
A study [11] determined the comparative effect of chemotherapy and non-chemotherapy on breast cancer patients at different stages. The study aimed to clarify the potential differences in treatment effectiveness depending on the cancer stage. In the model, the patient's marital status, age, race, tumor type, ER, PR, HER2, nodes, and primary location significantly influenced the overall survival rate (OS) and the breast cancer-specific survival rate (BCSS), in contrast to mastectomy. Chemotherapy significantly lowered the death rate in both. The study observed that married patients receiving chemotherapy had significantly improved survival, highlighting the potential impact of marriage on prognosis.
One study suggests that breast cancer is more aggressive in young patients, whereas middle-aged patients tend to exhibit better outcomes than their younger and older counterparts [12]. Multiple studies have indicated that women under 45 have lower survival rates and a higher risk of recurrence than older women. However, a comparative study between patients under 40 years old and middle-aged women found that the former group had better survival prospects, except for those diagnosed with stage III [13]. Although around 30-40% of breast cancer patients are over 70 years old, managing breast cancer in elderly patients remains controversial due to the lack of clinical trial data (the enrollment rate is less than 20%) [14,15]. Furthermore, male patients have been found to have higher mortality rates across all stages of breast cancer compared to their female counterparts [16].
The impact of therapy on survival in the context of cancer treatment is a complex topic, and it can vary depending on several factors, such as:
• Cancer stages: these determine the extent and spread of cancer within the body. As the cancer progresses and spreads to other parts of the body, it is classified as stages I through IV [4,17,18,19]. The 5-year survival rate for cases diagnosed with local, regional, and distant cancer is reported to be 99%, 85%, and 27%, respectively [20,21].
• The TNM system [22]: this employs alphanumeric codes to characterize the tumor size, lymph node involvement, and metastasis of cancer, which vary depending on the specific cancer type.
• The grade of cancer [23]: this describes the microscopic appearance of cancer cells. Lower grades signify slower-growing cancer, while higher grades indicate faster growth.
• Hormone status [24]: certain breast cancers are influenced by naturally occurring female hormones such as estrogen and progesterone. Breast cancer cells possess receptors on their surface that bind to these hormones.
• Treatment: chemotherapy uses drugs to target and destroy cancer cells throughout the body, while radiation therapy uses high-energy beams to target and destroy cancer cells in specific areas. The combination of these treatments aims to attack cancer from multiple angles and improve overall outcomes [14,15].
• The sequence of radiation with surgery [25]: this sequence can have an impact on survival. It is chosen based on individual patient characteristics and the type and stage of the cancer being treated.
Most previous research on the impact of breast cancer treatment focused on analyzing single treatment options such as 'chemotherapy only' [13,26,27,28], 'radiotherapy only' [2,29,30,31], 'both chemotherapy and hormone therapy' [32], or 'both chemotherapy and radiotherapy' [33]. In contrast, our research explores the effects of four treatment approaches: 'chemotherapy only', 'radiotherapy only', 'both chemotherapy and radiotherapy', and 'no therapy'. Furthermore, while most previous studies have conducted a stratified analysis based on a single variable such as stage [34], risk groups [23,35,36], or age [12,5,16], our research takes a more comprehensive approach by performing stratified analysis using 15 variables: grade, stage, age, breast subtype, ER status, HER2 status, laterality, marital status, metastasis status, nodal status, PR status, race, radiation sequence with surgery, sex, and tumor size. By considering a broader range of variables, our research aims to provide a more nuanced understanding of the impact of treatment on different breast cancer patients.

Methodology
The study focused on BCSS, measuring the time from breast cancer diagnosis to breast cancer-specific death and treating other deaths as censored data. OS, a secondary outcome, tracked the time from diagnosis to death or last follow-up, treating patients lost to follow-up as censored data [15,19]. In our statistical analysis, survival curves were generated using the Kaplan-Meier method, and the log-rank test was used to determine the statistical significance of differences in BCSS or OS rates between the survival curves. We considered a P-value of less than 0.05 statistically significant. We used the Cox proportional hazards regression model to identify factors (treatment, tumor characteristics, and patient demographics) that influence OS and BCSS. The model estimates hazard ratios (HRs) with 95% CIs to quantify how the risk of death or disease progression changes with different factors, considering censored data and accounting for the timing of events. The significant variables from the univariate analysis were included in the multivariate analysis. All statistical analysis was performed using IBM SPSS Statistics, version 25 [37].
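To illustrate how a hazard ratio and its 95% CI relate to a fitted Cox coefficient, the following minimal Python sketch exponentiates a coefficient and its Wald interval. The function name `hazard_ratio_ci` and the example coefficient and standard error are hypothetical and are not taken from the study, which used SPSS.

```python
import math

def hazard_ratio_ci(beta, se, z=1.96):
    """Return (HR, lower, upper) for a Cox regression coefficient.

    beta : estimated log-hazard coefficient for a covariate
    se   : standard error of beta
    z    : normal quantile (1.96 gives a 95% Wald interval)
    """
    hr = math.exp(beta)               # hazard ratio
    lower = math.exp(beta - z * se)   # lower confidence limit
    upper = math.exp(beta + z * se)   # upper confidence limit
    return hr, lower, upper

# Hypothetical values for illustration only (not results from the study).
beta, se = -1.27, 0.03
hr, lo, hi = hazard_ratio_ci(beta, se)
print(f"HR = {hr:.3f}, 95% CI = {lo:.3f}-{hi:.3f}")
```

An HR below 1 with an interval that excludes 1 indicates a lower hazard of the event relative to the reference category.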
2.1. Survival analysis
Survival analysis [24,38] studies time-to-event data using Kaplan-Meier curves and the log-rank test. The Kaplan-Meier estimator calculates the survival function S(t) over successive time intervals. The survival function is estimated as:
$$S(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right)$$
where
• S(t) is the estimated survival probability at time t.
• n_i is the number of individuals at risk (alive and under observation) just before time t_i.
• d_i is the number of observed events (deaths) that occur at time t_i.
The log-rank test compares survival curves to determine significant differences between groups. The test statistic χ² is calculated as:
$$\chi^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j}$$
where
• O_j is the observed number of events (deaths) in group j.
• E_j is the expected number of events in group j, assuming no survival difference.
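The Kaplan-Meier estimator and the log-rank statistic described above can be sketched in plain Python as follows. This is an illustrative sketch, not the SPSS procedure used in the study: the function names and the toy follow-up data are hypothetical, and the statistic uses the simple observed-versus-expected form, which is a common approximation to the variance-based log-rank statistic.

```python
import math
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_i, S(t_i))] at each distinct event time.

    times  : follow-up time of each patient
    events : 1 if the death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    curve, surv = [], 1.0
    for t in sorted(deaths):
        n_at_risk = sum(1 for u in times if u >= t)  # n_i: at risk just before t
        surv *= 1.0 - deaths[t] / n_at_risk          # multiply by (1 - d_i / n_i)
        curve.append((t, surv))
    return curve

def log_rank(groups):
    """groups: list of (times, events) pairs, one pair per group.

    Returns (chi2, observed, expected) with chi2 = sum((O - E)^2 / E).
    """
    event_times = sorted({t for times, events in groups
                          for t, e in zip(times, events) if e == 1})
    observed = [0.0] * len(groups)
    expected = [0.0] * len(groups)
    for t in event_times:
        at_risk = [sum(1 for u in times if u >= t) for times, _ in groups]
        died = [sum(1 for u, e in zip(times, events) if u == t and e == 1)
                for times, events in groups]
        n, d = sum(at_risk), sum(died)
        for j in range(len(groups)):
            observed[j] += died[j]
            # Expected deaths if every group shared the same hazard at time t.
            expected[j] += d * at_risk[j] / n
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return chi2, observed, expected

def p_value_two_groups(chi2):
    """Upper-tail chi-square probability with 1 degree of freedom (two groups)."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Hypothetical toy data: (follow-up months, event indicator) for two groups.
group_a = ([6, 7, 10, 15, 19, 25], [1, 0, 1, 1, 0, 1])
group_b = ([9, 13, 18, 23, 28, 31], [1, 0, 1, 0, 0, 1])

print(kaplan_meier(*group_a))
chi2, obs, exp = log_rank([group_a, group_b])
print(chi2, p_value_two_groups(chi2))
```

With more than two groups, the p-value would need a chi-square distribution with (number of groups minus 1) degrees of freedom rather than the one-degree-of-freedom shortcut used here.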

Data sources and data cleaning
The study used the

Comparison of survival between different groups
Univariate and multivariate analyses were conducted to identify prognostic factors that could predict OS and BCSS in the cohort. In the univariate analysis (Table 2), all variables except year of diagnosis were found to significantly affect OS and BCSS, so they were included in the multivariate analysis. The multivariate analysis (Table 3) revealed better survival in patients who received radiotherapy, according to both OS and BCSS (HR = 0.280, 95% CI = 0.265-0.296, p < 0.001; HR = 0.267, 95% CI = 0.246-0.289, p < 0.001, respectively), followed by patients who received a combination of both chemotherapy and radiotherapy (HR = 0.492, 95% CI = 0.467-0.519, p < 0.001; HR = 0.849, 95% CI = 0.797-0.905, p < 0.001, respectively). Table 4 presents a comparison of OS and BCSS between patients receiving different treatments, stratified by grade. Kaplan-Meier survival curves for the effect of different treatments stratified by grade are shown in Fig. 1 for OS and in Fig. 2 for BCSS. The key findings from this analysis are as follows:
• All breast cancer grades can benefit from different treatments according to OS and BCSS, except for grade I and grade II patients in BCSS, who do not benefit from 'chemotherapy only' but can benefit from other treatments. The most beneficial treatment for grade I, grade II, and grade IV patients is 'radiotherapy only', according to both OS and BCSS.
• For patients in grade III, 'both chemotherapy and radiotherapy' provides the highest survival according to OS, while 'radiotherapy only' provides the highest survival according to BCSS.
These results highlight the importance of considering the grade of breast cancer when determining the most effective treatment approach.The findings emphasize that different treatments can yield favorable outcomes based on the grade of the disease, and personalized treatment strategies can optimize survival rates for patients in different grades of breast cancer.
The study investigated the impact of various treatments on breast cancer patients at different stages. Table 5 compares the outcomes, helping to identify optimal treatments. Kaplan-Meier curves in Fig. 3 (OS) and Fig. 4 (BCSS) illustrate the effect of different treatments stratified by stage. Overall, patients across all breast cancer stages can benefit from various treatments, as both OS and BCSS indicate. However, the findings highlight the importance of tailoring treatment strategies to the specific stage of breast cancer. 'Radiotherapy only' appears to be the optimal choice for stage I patients, while 'both chemotherapy and radiotherapy' benefits stage II and stage IV patients. The results emphasize the need for personalized treatment approaches to optimize survival outcomes for breast cancer patients at different stages.
The study employed a more comprehensive approach by conducting a stratified analysis for 15 variables. By comparing OS and BCSS among patients receiving different treatments across all variables, we aimed to identify the optimal treatment strategy for each variable category (Table 6).
The BCSS and OS analyses differ in the treatments they identify as optimal for various groups of patients. While 'both therapies' is typically linked to improved survival rates in BCSS, distinct interventions can yield favorable survival outcomes in OS, depending on the patient subgroup. Here is a summary of the findings:
• Age: For patients under 70 years old, 'chemotherapy only' and 'both' treatments have the highest survival rates according to BCSS. However, according to OS, 'radiotherapy only' and 'both' treatments yield better survival rates for all ages.
• Breast Subtypes: According to BCSS, the highest survival rates are observed in the HER2-/HR+ and triple-negative subtypes when patients receive 'both' treatments. However, OS suggests that different treatments provide high survival rates across various subtypes.
• Grade: BCSS suggests that patients with different grades who receive 'both' treatments have better survival rates. However, OS indicates that different treatments provide high survival rates across different grades.
• HER2 Status: HER2-negative patients who receive 'both' treatments have better survival rates according to BCSS. OS suggests that different treatments provide high survival rates for most HER2-negative patients, except for those who receive 'chemotherapy only'.
• Laterality: According to BCSS, 'chemotherapy only' and 'both' treatments provide better survival for patients with either right or left breast laterality. However, according to OS, different treatments provide high survival rates regardless of breast laterality.
• Marital Status: BCSS indicates that 'both' treatments provide better survival rates for all marital statuses. In addition, for 'married' patients, 'chemotherapy only' also yields high survival rates. On the other hand, according to OS, 'radiotherapy only' and 'both' treatments result in the highest survival rates for both 'single' and 'married' patients. All treatments provide better survival rates for patients with 'other' marital status.
• PR Status: BCSS suggests that 'both' treatments provide better survival rates for all PR statuses. In contrast, according to OS, different treatments provide high survival rates across different PR statuses.
• Race: According to BCSS, 'both' treatments provide better survival rates for both 'white' and 'black' patients. For 'other' patients, 'chemotherapy only' and 'both' treatments yield the highest survival rates. However, according to OS, different treatments provide high survival rates for most ethnicities, except for 'black' patients who receive 'chemotherapy only'.
• Sex: BCSS indicates that 'chemotherapy only' and 'both' treatments result in the highest survival rates for 'male' patients. For 'female' patients, receiving 'both' treatments leads to the highest survival rates. Conversely, according to OS, different treatments provide high survival rates for both sexes.
• Tumor Size: According to BCSS, patients with tumor sizes less than 50 mm who receive 'both' treatments have the highest survival rates. However, patients with tumor sizes larger than 50 mm have lower survival rates regardless of the treatment received. In contrast, OS suggests that different treatments provide high survival rates across different tumor sizes.

Machine-learning-based outcome prediction
Based on the results for both BCSS and OS classification using various algorithms shown in Table 7:
• C5.0, GBM, and AdaBoost excel in accuracy, sensitivity, and ROC scores, with C5.0 slightly outperforming the other two.
• NB has a top ROC score (0.99) but lower sensitivity, impacting positive instance identification.
• LDA achieves a good ROC score (0.98) but slightly lower accuracy.
• Treebag, RPART, and RF have high accuracy but lower sensitivity and ROC scores.
• Treebag has good discrimination ability with slightly lower accuracy and sensitivity.
• RPART boasts perfect sensitivity but slightly lower accuracy and ROC score.
• LDA, NN, and AdaBoost have lower sensitivity, potentially leading to more false negatives.
• NB has the lowest accuracy and sensitivity.
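For reference, the accuracy, sensitivity, and RMSE measures used in this comparison can be computed as in the minimal sketch below. It assumes binary labels (1 for the positive class), predicted probabilities, a 0.5 decision threshold, and an RMSE taken between labels and predicted probabilities; these choices and the function name `classification_metrics` are illustrative assumptions, since the exact evaluation setup is not described here.

```python
import math

def classification_metrics(y_true, y_prob, threshold=0.5):
    """Compute accuracy, sensitivity, and RMSE for a binary classifier.

    y_true    : true labels (1 = positive class, 0 = negative class)
    y_prob    : predicted probability of the positive class
    threshold : assumed cut-off for turning probabilities into labels
    """
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Sensitivity (recall): share of true positives that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # RMSE between the 0/1 labels and the predicted probabilities.
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true))
    return {"accuracy": accuracy, "sensitivity": sensitivity, "rmse": rmse}

# Hypothetical labels and predictions for illustration only.
labels = [1, 1, 0, 0]
probs = [0.9, 0.4, 0.2, 0.6]
print(classification_metrics(labels, probs))
```

A low sensitivity means many positive cases are missed (false negatives), which is why it is reported alongside accuracy.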

Discussion
This study suggests that the recommended treatment approach differs according to the patient's individual circumstances. It may help healthcare professionals assess potential outcomes and plan appropriate treatment strategies for patients with different characteristics. Differences in individual patient characteristics contribute to variations in the optimal treatment for breast cancer patients, resulting in distinct recommendations for OS and BCSS. The results underscore the importance of personalized treatment strategies for breast cancer patients, taking into account factors such as:
• Age: Age plays a role in cancer patients' treatment and survival rates. Patients older than 70 tend to have worse survival rates than younger patients. The combination of chemotherapy and radiotherapy significantly reduces the risk of death, as reflected by lower HRs in both BCSS and OS with highly significant p-values.
• Subtypes: Patients with HR+ and HER2- subtypes generally have a more favorable prognosis than those with HR- and HER2+ subtypes. Supplemental therapies are recommended to improve treatment outcomes. Triple-negative and hormone receptor-negative (ER/PR) subtypes have a worse prognosis. Treatment for triple-negative breast cancer often involves a combination of chemotherapy and radiation therapy, similar to patients with the HER2- subtype.
• Metastasis and nodal status: Both profoundly influence survival. Patients without metastasis (M0) generally experience improved outcomes, with treatments, especially chemotherapy, reducing the risk of death. Furthermore, the number of affected lymph nodes (nodal status) plays a pivotal role, with lower nodal stages (N0 and N1) associated with better treatment responses. These results underscore the importance of early detection and intervention to prevent the progression of the disease to advanced stages. Patients with N2 and N3 nodal status may need supplemental therapy.
• Laterality: Breast cancer patients with tumors in the right breast tend to have better survival rates than those with tumors in the left breast. However, the difference in survival rates based on laterality is not consistently observed in all studies and may vary among patient populations.
• Marital status: Marital status appears to influence the effectiveness of the combination of chemotherapy and radiotherapy across all marital status groups.
• Race: Medical treatment recommendations should not be based solely on a patient's race or ethnicity. The results show that white patients generally experience better survival outcomes than Black patients. Regardless of race, both chemotherapy and radiation therapy are recommended for breast cancer patients.
• The sequence of radiation with surgery: This sequence plays a significant role in determining patient survival rates. The current study suggests that administering radiation after surgery, using intraoperative radiation, or combining radiation before and after surgery generally leads to better survival outcomes than radiation before surgery or no radiation. However, the optimal treatment approach should be determined by considering the specific characteristics of the patient and their cancer, in consultation with healthcare professionals.
• Sex: Female patients have a survival advantage over male patients in breast cancer. Male patients may benefit from 'chemotherapy only' or a combination of chemotherapy and radiotherapy, while female patients may benefit from the combination of chemotherapy and radiotherapy.
• Tumor size: Tumor size is one of the key factors considered in treatment decision-making for breast cancer. Larger tumors generally have a higher risk of spreading to lymph nodes or distant sites and may require more aggressive treatment approaches. If the tumor size is less than 50 mm, 'both' therapy is recommended. For tumors larger than 50 mm, a supplemental therapy is recommended. The term 'supplemental therapy' is broad and can include additional treatments such as targeted therapy, hormonal therapy, or extended adjuvant therapy.
• According to BCSS, the combination of 'both' chemotherapy and radiotherapy is the recommended treatment for the following patients: HER2-, triple-negative, stage II, grade III, HER2-/HR+, and tumor size < 50 mm. 'Radiotherapy' is the recommended treatment for patients with stage I and all grades except grade III.
• According to OS, radiotherapy only or in combination with chemotherapy is recommended for the majority of cases.
The results of BCSS classification using the accuracy, sensitivity, RMSE, and ROC scores for each algorithm show that:
• The majority of algorithms, including C5.0, GBM, AdaBoost, NN, Treebag, and RPART, have relatively high accuracy, sensitivity, and ROC scores, indicating strong overall performance and effectiveness in accurately classifying patients into BCSS categories. Additionally, the low RMSE values suggest an accurate prediction of survival durations. These algorithms show promise for BCSS prediction. However, the RPART and RF models show lower sensitivity compared to the others.
• These algorithms achieve a high accuracy of 0.98 or above. This indicates that the models are capable of making correct predictions for a large proportion of the instances.
• Most models, such as C5.0, GBM, AdaBoost, NN, and Treebag, exhibit a high sensitivity score of 0.99. This means that these models are effective in correctly identifying positive instances, as they have a low false negative rate.
• However, both the RPART and RF models have lower sensitivity scores than the other models, with values of 0.80 and 0.78, respectively. This suggests that these models may struggle to accurately detect positive instances, resulting in a higher false negative rate.
• On the lower end, models like NB and MLP show comparatively lower performance. They have lower sensitivity scores and lower ROC scores, indicating a higher rate of false negatives and poor discrimination between positive and negative instances.
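The ROC score discussed above is commonly summarized as the area under the ROC curve (AUC), which equals the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; whether the reported ROC scores were obtained this way is an assumption, and the function name `roc_auc` and the example data are hypothetical.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    y_true : true labels (1 = positive class, 0 = negative class)
    scores : predicted scores or probabilities for the positive class
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical labels and scores for illustration only.
print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.2, 0.6]))
```

An AUC near 0.5 corresponds to discrimination no better than chance, while values close to 1 indicate that positive and negative instances are well separated.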

Conclusion
The research addresses the following objectives: (1) determine the effect of radiotherapy and chemotherapy on breast cancer survival by analyzing survival curves and hazard ratios; (2) identify factors associated with improved survival outcomes, including tumor characteristics, patient demographics, and treatment regimens; (3) develop and compare predictive models using statistics and machine learning algorithms to accurately estimate survival probabilities based on treatment variables. Based on the findings of the stratified analysis considering OS and BCSS, it can be concluded that the optimal treatment for breast cancer patients varies with several factors, including age, breast subtype, metastasis status, nodal status, ER/PR status, laterality, marital status, sex, tumor size, and the sequence of radiation with surgery. The 'both' treatment, which combines chemotherapy and radiation therapy, generally emerges as the most effective treatment option, consistently demonstrating higher survival rates across many analyzed variables. However, there are certain subgroups where alternative approaches may be more beneficial. In terms of BCSS, patients with specific criteria such as HER2-, triple-negative, stage II, grade III, HER2-/HR+, and tumor size < 50 mm benefited most from a combined chemotherapy and radiotherapy approach, while for those in stage I or with grades other than III, 'radiotherapy only' was adequate. In the OS analysis, 'radiotherapy only' or radiotherapy in combination with chemotherapy emerged as the more effective treatment across a wide range of cases, often outperforming 'chemotherapy only'. Machine learning models were developed to forecast OS and BCSS, and the C5.0 algorithm consistently demonstrated robust overall performance. These findings can support the decision-making process for breast cancer treatment.

Future Work:
Future research on breast cancer treatment should consider several important aspects:
• Combined variable analysis: It is crucial to conduct combined variable analysis, which takes into account multiple factors simultaneously. This approach provides a more comprehensive understanding of the complex interactions between variables and treatment outcomes. By considering various factors together, more precise treatment strategies can be identified.
• Exploring immunotherapy and targeted therapy: The investigation of emerging treatment modalities, such as immunotherapy and targeted therapy, is of utmost importance. These therapies have demonstrated promising results in various cancer types, and assessing their effectiveness specifically in breast cancer patients can yield valuable insights for improving treatment outcomes.
• Evaluating long-term side effects and quality of life: Understanding the long-term side effects associated with different treatments is essential. It is also crucial to assess the impact of these side effects on patients' quality of life. By evaluating these factors, we can gain a better understanding of the overall treatment experience and make informed decisions that prioritize both efficacy and patients' well-being.
By addressing these research areas, we can enhance our understanding of breast cancer treatment, improve patient outcomes, and make strides towards reducing the burden of this disease.

Figure 1. Kaplan-Meier survival curves for the effect of different treatments stratified by grade for OS

Figure 2. Kaplan-Meier survival curves for the effect of different treatments stratified by grade for BCSS

Figure 3. Kaplan-Meier survival curves for the effect of different treatments stratified by stage for OS

Figure 4. Kaplan-Meier survival curves for the effect of different treatments stratified by stage for BCSS

Table 4. Comparison of BCSS and OS between different treatments in a specific grade

Table 5. Comparison of BCSS and OS between different treatments in a specific stage

Table 6. Comparison of BCSS and OS between different treatment patients for all other variables

Table 7. Model performance for BCSS and OS

• The RMSE values are relatively consistent across all models, ranging from 0.14 to 0.21. These values indicate the average difference between the predicted values and the actual values. Lower RMSE values generally indicate better predictive performance.
• ROC scores measure the overall classification performance of the models. Most models, including C5.0, GBM, AdaBoost, NN, Treebag, and RPART, achieve relatively high ROC scores between 0.89 and 0.91. These scores indicate that these models have a good ability to distinguish between positive and negative instances.
• Notably, the MLP model has a lower ROC score of 0.52, suggesting that it may struggle with classification and distinguishing between positive and negative instances.
The results for OS classification using accuracy, sensitivity, RMSE, and ROC scores for each algorithm show that:
• C5.0, RF, and GBM performed well in terms of accuracy, achieving scores above 0.97. They also showed competitive performance in sensitivity and ROC scores, indicating their ability to accurately classify patients into OS categories. Additionally, the RMSE values suggest a relatively accurate prediction of survival durations. These algorithms demonstrate promise for OS prediction.
• C5.0 appears to be the best-performing model overall. It achieves high accuracy (0.98) and sensitivity (0.99), indicating that it has a low rate of both false positives and false negatives. Additionally, it has a reasonably high ROC score (0.88), indicating good discrimination between positive and negative instances.
• RF and GBM come next in the ranking. While their accuracy (0.98) is comparable to that of C5.0, their sensitivity values are slightly lower, indicating a higher rate of false negatives. However, they still demonstrate good overall performance.
• Treebag, RPART, and AdaBoost also perform well, with high accuracy values (0.97) and reasonably good sensitivity scores. They have good ROC scores, suggesting relatively good discrimination capabilities. The high ROC score suggests excellent discrimination ability.
• LDA and NN achieved a high accuracy score, but NN has a lower sensitivity score than LDA. However, the lower sensitivity scores compared to the top-performing algorithms indicate potential limitations in identifying patients with adverse OS outcomes and discriminating between different OS categories. The low ROC score indicates fair discrimination ability.