The evaluation of both centralized and federated models was conducted using a combination of predictive, efficiency, and privacy-preservation metrics. Discriminative ability was primarily assessed through the Area Under the Receiver Operating Characteristic Curve (AUROC), which is widely regarded as a robust measure in imbalanced clinical prediction tasks. Overall predictive accuracy was also reported to provide a complementary perspective on performance. Given the imbalance between readmitted and non-readmitted patients, Precision-Recall (PR) curves were additionally employed to capture the trade-offs between sensitivity and positive predictive value.
Trade-offs between performance, efficiency, and privacy in multi-hospital collaborative learning.
Table 1.
Patient Records Table (Main Dataset).
| Column Name |
Description |
| patient_id |
Unique anonymized patient identifier |
| hospital |
Hospital of admission (ABC, FGH, XYZ) |
| admission_date |
Admission date |
| age |
Patient age |
| sex |
Male / Female |
| residence |
Urban / Rural |
| systolic_bp |
Systolic blood pressure (mmHg) |
| diastolic_bp |
Diastolic blood pressure (mmHg) |
| heart_rate |
Heart rate (beats per minute) |
| glucose_mg_dl |
Blood glucose level (mg/dL) |
| creatinine_mg_dl |
Serum creatinine level (mg/dL) |
| diabetes |
Binary indicator for diabetes diagnosis |
| hypertension |
Binary indicator for hypertension |
| heart_failure |
Binary indicator for heart failure |
| copd |
Binary indicator for chronic obstructive pulmonary disease (COPD) |
| length_of_stay_days |
Length of stay in days |
| prior_admissions_30d |
Number of admissions in the past 30 days |
| discharge_disposition |
Discharge outcome (Home / Transfer / AMA / Death) |
| medications_count |
Number of prescribed medications |
| procedure_flag |
Binary indicator for major procedure |
| readmitted_30d |
Outcome: readmitted within 30 days (0 = No, 1 = Yes) |
Table 2.
Hospital Metadata Table.
| Column Name |
Description |
| hospital_id |
Hospital identifier (ABC, FGH, XYZ) |
| location |
Hospital location (Ibadan, Oyo State) |
| capacity |
Number of available beds |
| EHR_system |
Electronic Health Record system used (e.g., OpenMRS, Epic) |
| data_sharing_policy |
Hospital policy on inter-hospital data sharing |
Table 3.
Ethical & Compliance Table.
| Column Name |
Description |
| hospital_id |
Hospital identifier |
| IRB_approval_number |
Institutional Review Board (IRB) approval reference |
| HIPAA_compliance |
Whether the hospital complies with HIPAA (Yes/No) |
| GDPR/NDPR_status |
GDPR (Europe) or NDPR (Nigeria) compliance status |
| data_sharing_restrictions |
Restrictions on patient data usage/sharing |
Table 4.
Model Training Metadata Table.
| Column Name |
Description |
| experiment_id |
Unique identifier for training experiment |
| model_type |
Type of model used (Logistic Regression, Random Forest, MLP, FedAvg) |
| hyperparameters |
Key training parameters (learning rate, batch size, epochs) |
| local_epochs |
Number of epochs trained locally at each hospital |
| communication_rounds |
Number of rounds of parameter exchange in federated learning |
| convergence_time |
Time (minutes) taken for training to converge |
Table 5.
Model Evaluation Results Table.
| Column Name |
Description |
| model_type |
Model evaluated (LR, RF, MLP, FedAvg) |
| hospital |
Hospital source of evaluation (ABC, FGH, XYZ, or Global) |
| AUROC |
Area Under the Receiver Operating Characteristic Curve |
| Accuracy |
Overall correct classification rate |
| Precision |
Positive predictive value |
| Recall |
Sensitivity / true positive rate |
| F1_score |
Harmonic mean of Precision and Recall |
| PR_AUC |
Area under the Precision-Recall curve |
| communication_efficiency |
Bandwidth usage per round (MB) |
Data Sources and Structure
The current research utilized a comprehensive dataset that combines patient-specific clinical records, hospital-level information, ethical and compliance documents, along with records for training and assessing machine learning (ML) models. This organized framework facilitated conventional statistical analyses as well as sophisticated federated learning experiments across various healthcare institutions.
The Patient Records
Table 1 served as the main dataset, containing anonymized clinical data at the individual level. Every record was associated with a distinct patient identifier (patient_id) to maintain confidentiality while allowing for ongoing monitoring. Important demographic factors comprised age, gender, and living environment (urban versus rural), enabling the classification of patients based on socio-demographic attributes.
Clinical indicators included vital signs like systolic and diastolic blood pressure, heart rate, and blood glucose along with laboratory biomarkers such as serum creatinine. These variables aided in the description of comorbid conditions. Moreover, binary indicators for diabetes, hypertension, heart failure, and COPD were added to reflect significant chronic disease burdens.
Metrics of healthcare utilization, including duration of stay, count of readmissions within 30 days, and prescribed medications, offered valuable insights into patient complexity and healthcare usage. Crucially, the dataset contained the discharge disposition (e.g., home, transfer, death) and the primary outcome variable, 30-day readmission, which was vital for model development.
The Hospital Metadata
Table 2 provided contextual details regarding each participating institution. Incorporating hospital capacity (bed availability) allowed for adjustments based on institutional size, while information on the Electronic Health Record (EHR) systems (OpenMRS, Epic) offered perspectives on variations in data management platforms. Furthermore, institutional guidelines regarding inter-hospital data sharing were recorded, an essential element for the feasibility and adherence to federated learning.
To guarantee compliance with international and regional data governance standards, the Ethical and Compliance
Table 3 documented hospital-level supervision and legal limitations. Every site needed to record IRB approval numbers, verify HIPAA adherence (if relevant), and reveal compliance with either the GDPR (hospitals in Europe) or the NDPR (hospitals in Nigeria). The table also detailed data-sharing limitations, which directly affected the creation of the federated learning protocols, guaranteeing that sensitive patient information remained within institutional boundaries.
The Model Training Metadata
Table 4 recorded the experimental framework for machine learning advancement. Every training example was associated with a distinct experiment ID, indicating the type of model (Logistic Regression, Random Forest, Multilayer Perceptron, or Federated Averaging). Key hyperparameters like learning rate, batch size, and epoch count were meticulously documented. In federated learning, other parameters such as local epochs, communication round counts, and convergence duration were crucial for assessing the trade-offs between performance and communication efficiency. This framework offered clarity and consistency in model development, conforming to leading standards in machine learning research.
The Model Evaluation Outcomes
Table 5 included performance metrics at the hospital-specific and overall levels. Metrics like AUROC, Precision, Recall, F1-score, and PR-AUC provided a thorough evaluation of predictive performance and the equilibrium between sensitivity and specificity. Additionally, incorporating communication efficiency measured bandwidth needs for each training round, an essential factor in federated learning contexts where resource limitations can vary among organizations. This table facilitated a multi-dimensional assessment of the deployed models by integrating predictive performance with computational and infrastructural metrics.
Collectively, these five tables created a thorough framework that connected patient-level outcomes (
Table 1) to institutional context (
Table 2), ethical protections (
Table 3), model development procedures (
Table 4), and assessment metrics (
Table 5). This cohesive framework guaranteed both methodological precision and adherence to ethical norms and practical viability, thus improving the generalizability and reliability of results across various hospital environments.
Discussion
Table 6 displays the descriptive statistics for patient groups from the three involved hospitals (ABC, FGH, and XYZ), emphasizing the demographic and clinical diversity that highlights the variability within the study population. ABC Hospital provided the largest group (N = 5,800), with FGH and XYZ Hospitals each contributing the same number (N = 4,700). The average age of patients varied from 55.9 years at FGH to 58.7 years at XYZ, suggesting minor yet possibly significant variations in the age distribution among admitted patients. Female representation was fairly evenly distributed across locations, varying from 50% at FGH to 53% at XYZ, indicating similar gender distributions among the institutions. The occurrence of diabetes ranged from 20% (FGH) to 25% (XYZ), whereas hypertension was most prevalent at XYZ (48%) and least prevalent at FGH (42%). In the same way, heart failure rates varied from 7% (FGH) to 9% (XYZ). These differences suggest that patient groups in the hospitals display unique chronic disease characteristics, potentially affecting clinical management approaches and the generalizability of models. The shortest average length of stay (LOS) was at FGH (4.9 days), while XYZ had the longest (5.5 days), with ABC Hospital yielding a middle value (5.2 days). Extended LOS at XYZ might indicate increased case complexity, which could be linked to its somewhat older patient demographic and higher incidence of comorbidities.
The detected variability carries two significant consequences. Initially, differences in age, comorbidity load, and LOS indicate that predictive models developed using combined data should consider site-specific variations to prevent bias. Moreover, these variations highlight the necessity for strong validation in various hospitals to guarantee generalizability. In federated learning scenarios, site-level differences can influence convergence speeds and performance reliability, highlighting the necessity for adaptive algorithms and calibration tailored to individual hospitals.
Table 7 contrasts the effectiveness of three centralized models: Logistic Regression (LR), Random Forest (RF), and Multilayer Perceptron (MLP) against a Federated Learning (FedAvg) method to assess if FL can reach a performance level similar to that of conventional centralized training. The MLP model recorded the best AUROC (0.83) and F1-score (0.71) among all centralized methods, showcasing its enhanced ability to model nonlinear interactions in the patient data.
FedAvg nearly matched MLP’s performance, reaching an AUROC of 0.82 and an F1-score of 0.70, indicating only slight reductions (ΔAUROC = -0.01; ΔF1 = -0.01). Accuracy rates for both MLP (0.75) and FedAvg (0.75) were comparable, suggesting that the federated approach did not undermine overall classification dependability. FedAvg also achieved comparable Precision (0.71) and Recall (0.69) in comparison to MLP’s Precision (0.72) and Recall (0.70). Additionally, the PR-AUC for FedAvg (0.66) was greater than that of RF (0.65) and LR (0.60), yet it was marginally lower than MLP (0.67), indicating that federated aggregation successfully maintained the ability to detect minority classes while adhering to privacy-preserving limitations.
In comparison, RF and LR trailed FedAvg and MLP in nearly all performance measurements. LR, specifically, showed the lowest AUROC (0.77) and PR-AUC (0.60), highlighting its constrained capacity to identify intricate, nonlinear connections in the dataset. RF achieved a moderate performance (AUROC = 0.81) but did not exceed FedAvg in any major metric, highlighting the effectiveness of the federated method compared to a robust ensemble baseline. The small performance difference between FedAvg and the top centralized model strongly suggests that federated learning can reach similar predictive capabilities without needing centralized data aggregation. This is especially critical in clinical environments, where data privacy laws like HIPAA, GDPR, and NDPR limit direct sharing of data between institutions. By upholding local data sovereignty and ensuring predictive precision, FedAvg stands out as a feasible and ethically responsible approach for healthcare analytics across multiple institutions.
Table 8 contrasts the predictive capabilities of locally trained models against the federated model (FedAvg) across the three involved hospitals (ABC, FGH, and XYZ), highlighting the advantages of multi-institutional cooperation without the need for direct data exchange. Federated learning consistently surpassed local training in AUROC and accuracy across all sites. At ABC Hospital, the AUROC rose from 0.78 (local) to 0.82 (federated), and accuracy grew from 0.72 to 0.75. Likewise, FGH Hospital showed significant improvements, with AUROC increasing from 0.75 to 0.81 and accuracy going from 0.70 to 0.74. XYZ Hospital showed similar advancements, with AUROC rising from 0.76 to 0.82 and accuracy increasing from 0.71 to 0.75. These enhancements suggest that federated aggregation successfully utilizes cross-hospital knowledge transfer to boost local predictive performance, even amidst differences in patient populations, clinical workflows, and electronic health record systems. The uniform level of enhancement seen across hospitals indicates that federated learning is resilient to variations between sites, which is essential for practical application in varied healthcare environments. The findings showed that hospitals can reach centralized model quality while maintaining complete control over their patient information, a notable benefit in settings limited by data privacy laws like HIPAA, GDPR, and NDPR. This method minimizes legal and ethical dangers while promoting cooperation among organizations. Federated learning tackles a major obstacle to implementing machine learning in healthcare by enhancing performance while preserving data sovereignty, balancing data utility with patient confidentiality. These results strengthen the importance of FL as an effective approach for multi-site predictive modeling, particularly in situations where centralizing data is unfeasible or not allowed.
Table 9 presented a summary of the communication and computational features of the assessed models, measuring the efficiency trade-offs linked to federated learning (FL). All models were normalized to 50 communication rounds for consistency. Within centralized methods, Logistic Regression (LR) demanded the lowest average bandwidth (12 MB), with Random Forest (RF) next (20 MB) and Multilayer Perceptron (MLP) after that (35 MB). The MLP implementation of FedAvg experienced the greatest bandwidth consumption (38 MB), indicating the extra burden of sending model updates between institutions. Nonetheless, the increase compared to centralized MLP (+3 MB) was slight, indicating that FL's communication needs remain controllable under standard network circumstances. The convergence speed was quickest for LR (30 epochs) and RF (32 epochs), while the centralized MLP needed 40 epochs. FedAvg needed a few more epochs (45) to reach convergence, due to the asynchronous aggregation of parameters and variability in local data distributions. The training durations showed a comparable trend: LR took 18 minutes, RF took 25 minutes, centralized MLP required 42 minutes, and FedAvg needed 48 minutes. These findings suggest that federated learning adds a slight computational burden compared to its centralized version. Although FL required more training time and convergence, the additional bandwidth and computational expenses might be justified considering the significant privacy and data-sovereignty advantages shown in
Table 7 and
Table 8. In environments with restricted network capabilities or limited hardware resources, these efficiency compromises need to be balanced with regulatory and ethical considerations. In numerous healthcare settings, the extra expense seems warranted due to FL’s capacity to attain nearly centralized performance without the need for data exchange. These results emphasize that infrastructure planning, such as adequate bandwidth allocation and somewhat lengthened training timelines, must be taken into account when implementing federated systems on a large scale. Future research could investigate adaptive communication methods (such as gradient compression, partial parameter exchange) or asynchronous federated protocols to decrease overhead further while maintaining performance.
Table 10 assessed the privacy and security aspects of centralized compared to federated learning (FL), focusing on RQ2: “In what ways does FL tackle privacy and collaboration issues?” The success rate of membership inference attacks was significantly greater for centralized training (22%) than for the federated method, which was 8%. This decrease illustrates FL’s capability to restrict information exposure via parameter aggregation, thus reducing the likelihood of adversaries determining if particular patient records were included in the training dataset. The centralized collection of sensitive data was determined to be only partially in line with HIPAA and GDPR/NDPR, mainly because of the increased risk of unauthorized access during the storage and transfer of data. Conversely, federated learning maintained complete adherence to these regulations because patient-level data stayed on local servers during the entire training process. By minimizing cross-border or inter-hospital data exchange, FL adheres to legal requirements that stress local data storage and limited data mobility, essential in multi-jurisdictional healthcare partnerships. These results highlight the importance of FL in facilitating collaboration across multiple institutions while maintaining privacy standards. Hospitals can enhance the performance of predictive models together (as demonstrated in
Table 7 and
Table 8) while maintaining data sovereignty, thus preventing possible legal and reputational issues. Moreover, diminished vulnerability to membership inference attacks indicates that FL can be incorporated into practical systems with increased patient confidence and fewer ethical issues. The showcased security benefits establish FL as an ethically strong option compared to centralized machine learning for confidential health information. Although there is some extra computational and communication burden (
Table 9), these expenses are balanced by FL’s capacity to achieve high predictive accuracy alongside strict privacy protections. Future research might explore sophisticated privacy-preserving improvements (secure aggregation or differential privacy) to further strengthen FL implementations across various clinical environments.
Table 11 provided an in-depth error analysis among hospitals to improve the clinical understanding of the predictive models and pinpoint possible areas for enhancement. Misclassification rates differed slightly among institutions. FGH Hospital indicated the highest rates for false positives (15%) and false negatives (13%), while XYZ Hospital exhibited the lowest rates (13% false positives and 11% false negatives). ABC Hospital exhibited moderate error rates (14% false positives, 12% false negatives). These fairly consistent trends indicate that although the overall performance of the model was strong (refer to
Table 7), some subgroups continue to pose difficulties for accurate classification. Essential characteristics linked to misclassification offer understanding of the underlying clinical intricacy:
ABC Hospital: Patients over 70 years old, those diagnosed with COPD, and individuals with brief hospital stays were often misidentified. These results could indicate unusual readmission risk patterns in older adults with chronic respiratory issues who are discharged quickly.
FGHHospital:Instancesofdiabetesandhypertensionoccurringtogetherwerefrequentlymisclassified,suggestingpossibleinteractioneffectsbetweenthesecomorbiditiesthatthemodelsdidnotcompletelyreflect.
XYZHospital:Patientswithmultipleprioradmissionsandthoseprescribedmorethan10medicationsweremorepronetoclassificationerrors,possiblyreflectingcomplex,multi-morbiditycaseswithvariablecaretrajectories.
This assessment emphasized chances to enhance model sensitivity and specificity by including interaction terms (diabetes–hypertension interaction), temporal characteristics (trends of recent admissions), or measures of medication complexity. Hospitals might also explore site-specific calibration or feature engineering aimed at these subgroups to minimize systematic bias. These findings improve clinical interpretability by recognizing particular patient traits associated with misclassification, an essential condition for incorporating predictive models into decision-making processes. Recognizing error patterns fosters trust between clinicians and informs specific interventions like improved discharge planning or follow-up for high-risk, multi-morbid patients, potentially lowering readmission rates.
Figure 3 illustrates the distribution of patient groups categorized by hospital, age category, gender, and coexisting conditions, offering a visual overview of population diversity among the three involved sites. The distribution indicates that ABC Hospital accounted for the highest share of patients, whereas FGH and XYZ Hospitals brought in comparable but slightly smaller groups. This variable enrollment indicates disparities in hospital capacity and patient flow (see
Table 2), which may affect case-mix complexity and patterns of readmission. In all hospitals, most admissions were concentrated in the 50-69 year age category, with a significant secondary peak in patients aged 70 years and older. XYZ Hospital exhibited a relatively older demographic, in line with its higher average age (
Table 6) and increased occurrence of chronic illnesses like hypertension and heart failure. The gender distribution among hospitals was fairly uniform, showing only a minor female majority, which supports the demographic uniformity previously noted in
Table 6. No individual hospital displayed a significant gender bias, indicating that sex-related bias in admissions probably won't affect model performance.
The illustration also shows clear co-morbidity burdens:
ABC Hospital exhibited a greater prevalence of COPD cases compared to the other locations.
FGHHospitalshowedahighernumberofpatientswithbothdiabetesandhypertension,apairinglinkedtothemisclassificationtrendsobservedinTable11.
XYZHospitalaccountedforthehighestproportionofmulti-morbidpatients,frequentlyshowingahistoryofseveralprioradmissionsorintricatemedicationplans.
The noted distribution highlights the need to consider site-specific and demographic variability in federated learning models. Variations in age composition and the prevalence of co-morbid conditions might influence the rankings of feature importance and lead to distinct error patterns within hospitals. Stratified or weighted methods may enhance fairness and broad applicability by reducing biases due to imbalanced cohort makeup.
Conclusion
This research showed that federated learning (FL) can achieve predictive performance similar to centralized models while maintaining patient privacy and institutional independence. Through extensive assessment across various hospitals, FL attained strong AUROC and F1-scores, minimized vulnerability to membership inference attacks, and ensured complete compliance with regulatory requirements including HIPAA, GDPR, and NDPR.
The results emphasize the importance of FL in healthcare partnerships, offering a structure that facilitates knowledge exchange and strong predictive modeling without the need for centralizing sensitive data. This method reduces compliance risks and fosters trust between participating institutions, thus enabling collaborative analytics across diverse clinical settings.
Additionally, the study presents a definitive route for implementation in practical healthcare systems. With tolerable communication overhead and training efficiency compromises, FL provides a scalable answer for inter-institutional activities like readmission risk forecasting, chronic condition management, and hospital resource allocation. Future initiatives that include differential privacy, the integration of multimodal data, and the extension to wider healthcare networks will strengthen FL’s position as an essential technology for innovative, data-driven healthcare that prioritizes privacy.