1. Introduction
Hospitalization among older adults is a critical public health issue in Mexico. This age group, currently exceeding 17 million individuals aged 60 and over, is projected to triple in the next 40 years [1,2]. As this population ages, it faces increasing health problems, including a high prevalence of chronic diseases such as obesity, diabetes, hypertension, and cardiovascular diseases [1]. In 2021, the most prevalent conditions among individuals aged 53 and over were hypertension (43.3%), diabetes (25.6%), and arthritis (10.7%). Additionally, 62.3% of this age group perceived their health status as fair to poor [3]. Patients with comorbidities such as hypertension, obesity, and diabetes have been shown to face a significantly higher risk of hospitalization [4].
This situation, combined with the health issues of the rest of the population (individuals < 60 years), anticipates an unfavorable scenario for hospitalization services in Mexico. For instance, Rojas-Martínez et al. [
5] estimate that approximately 10 million Mexicans could be at risk of developing diabetes or hypertension over the next decade. Additionally, events such as the COVID-19 pandemic can overwhelm public health services, further limiting timely and adequate medical care for older adults. This situation not only affects the health and well-being of the population but also imposes a significant economic burden due to the high demand for hospital services.
Given these challenges, research on predictive models that anticipate hospitalization in older adults becomes crucial. Such models enable health-sector stakeholders to make informed decisions, improve planning, and allocate resources more effectively, ensuring more equitable and efficient medical care for this vulnerable group.
Several studies have developed predictive health models for various prediction objectives and application scenarios. For example, Carrillo-Vega et al. [
4] analyzed risk factors for hospitalization and mortality in COVID-19 patients in Mexico, using two multiple logistic regression models. Their results indicated a significant increase in the risk of hospitalization in individuals reporting hypertension, obesity, and diabetes (p<0.01), compared to other combinations of chronic diseases. They also identified that men are 1.54 times more likely to be hospitalized than women (p<0.001, 95% CI 1.37–1.74), and that individuals aged 75 and older have a higher risk of hospitalization compared to those aged 49, with an odds ratio (OR) of 3.84.
Although logistic regression models are useful to identify associations, their limitation in capturing complex variable relationships can reduce predictive effectiveness, as they rely on an assumed linear relationship between independent variables and the log odds of the outcome [
6]. In this regard, recent advances in machine learning (ML) have overcome these limitations, allowing for the analysis of large volumes of longitudinal data and capturing complex patterns in health determinants.
For instance, Kraus et al. [7] proposed an ML-based approach to predict performance on the Timed Up and Go (TUG) test in older adults with orthopedic disabilities. In their study, they used 67 multifactorial parameters unrelated to mobility and six feature selection algorithms in the preprocessing phase. Subsequently, they trained four ML models: generalized linear model, support vector machine (SVM), random forest (RF), and XGBoost, to predict the time required for the TUG, using five-fold internal cross-validation resampling and splitting the data into an 80/20 ratio for training and validation. The RF algorithm demonstrated the best performance among the four. This study highlights the potential of ML models for risk stratification in clinical settings, emphasizing the need to incorporate real-time clinical data and expand the dataset size to improve prediction accuracy.
Another case of using ML in the health field is that of Song et al. [
8], who demonstrated the advantages of ML in predicting hospitalizations in patients with chronic diseases, revealing how health factors interact non-linearly to influence hospitalization risk. They implemented four machine learning approaches: regularized logistic regression, SVM, RF, and neural networks, to predict hospitalization in patients over 65 years old with COVID-19. The RF model achieved the best performance, with an AUC of 0.83, due to its ability to capture complex relationships between variables and avoid overfitting by combining multiple decision trees. Their findings highlight the potential of predictive models not only to identify individuals at risk but also to enable early interventions that prevent complications.
Taloba et al. [
9] present a comparative analysis of linear regression, Naive Bayes, and RF models to predict hospitalization and healthcare costs, using data from patients with risk factors such as body mass index (BMI) and other demographic characteristics. Their results show that the linear regression model had the best performance, achieving a predictive accuracy of 97.89%. However, a key limitation of the study is its exclusive focus on the accuracy metric; in health prediction with imbalanced data, metrics such as sensitivity and specificity are essential to measure a model’s ability to correctly identify both positive and negative cases.
Kandel et al. [10] used a predictive model for hospitalization of patients in skilled nursing facilities, built with the LightGBM algorithm, which is known for its high computational efficiency in handling large volumes of data. They identified important risk factors, such as comorbidities and levels of dependency in daily activities, providing a valuable tool for resource planning and management in these facilities. LightGBM achieved good results in predicting hospitalizations, as measured by F1-score, sensitivity, and positive predictive value. However, the authors noted that while LightGBM is accurate, its implementation requires careful hyperparameter tuning, which increases the risk of overfitting and adds complexity. In contrast, simpler and more robust methods, such as RF, offer a better balance between accuracy and ease of implementation, making them more suitable for clinical applications.
The study by Friesner et al. [11] focuses on the development and validation of ML models based on physical activity data collected through wearable devices, aiming to predict unplanned hospitalizations during concurrent chemotherapy and radiotherapy (CRT). The analysis included 214 patients with various types of cancer, employing approaches such as regularized logistic regression, RF, and neural networks to assess hospitalization risk. The results showed that the regression model achieved an area under the curve (AUC) of 0.83, followed by the neural network with an AUC of 0.80 and the RF with an AUC of 0.76. This suggests that, although the regression model had the best performance, the other models were also effective. The study highlights the importance of including physical activity data in models to improve the identification of patients at risk of complications during oncological treatments and facilitate preventive interventions. For more insights, refer to the work of Durán-Vega et al. [12], which emphasizes the significance of wearables in health monitoring. However, the study by Friesner et al. [11] has some limitations. First, the data collection period was only one month, which may not be sufficient to train a robust model. Additionally, the study was conducted in an academic setting, which could introduce selection biases and limit the generalizability of the findings. The absence of more complex clinical data may also have affected the model's predictive capacity. These limitations highlight the importance of considering multiple data dimensions in future studies, especially for developing predictive models in older adult populations.
Another relevant study is the work by Ermak et al. [
13], who evaluated ten machine learning models to predict hospitalizations in patients with coronary artery disease, highlighting CatBoost for its ability to handle imbalanced data and categorical variables, achieving an AUC of 0.875. Additionally, the authors emphasize the use of activation thresholds, such as the Youden index, and the application of techniques to handle missing data, such as imputation and resampling, which ensure the model’s generalizability. However, a significant limitation of this study was that the selection of the best model was based solely on AUROC, without considering metrics such as sensitivity and specificity that are crucial in medical prediction. Furthermore, the use of techniques like RandomGridSearch and multiple iterations in hyperparameter optimization adds complexity to the model. This complexity, combined with CatBoost’s inherent complexity and lower interpretability, presents significant challenges for its adoption in clinical settings, where simplicity and speed are essential [
14].
Finally, a recent study that highlights the advantages of ML models in the health domain is the work by Amanollahi et al. [15], who conducted a systematic review, following the PRISMA guidelines for systematic reviews and meta-analyses (Page et al. [16]), on the prediction of relapses, hospitalizations, and suicides in patients with bipolar disorder. The review included 18 studies with over 30,000 patients, noting that RF, SVM, and logistic regression are the most frequently used models. The authors emphasize the robustness of RF in handling imbalanced and noisy data, which reduces prediction error and makes it a reliable option in clinical contexts, where data are often quite complex and unpredictable. Additionally, they highlighted the importance of applying nested cross-validation, a technique rarely used in the reviewed studies, to ensure more reliable results. Regarding evaluation, Amanollahi et al. [15] stress the importance of using metrics such as sensitivity and specificity to predict adverse clinical events.
The objective of this study is to develop a predictive model for hospitalizations in older adults using longitudinal data from the Mexican Health and Aging Study (MHAS) [
17] and machine learning techniques. This model will enable the identification of individuals at higher health risk and anticipate medical care needs. In a context where the demand for hospital services exceeds the healthcare system’s capacity, the implementation of predictive models is crucial to address these challenges and enhance resource planning. The combination of advanced age and comorbidities in the Mexican population presents a significant challenge for the healthcare system, and the use of data analysis and ML tools is key to ensuring that older adults receive the necessary care during critical moments.
2. Materials and Methods
The objective of this study is to develop a predictive model to identify future hospitalizations in older adults. To achieve this, a prediction strategy was designed using the RF algorithm and the MHAS dataset [17]. The MHAS dataset comes from a national longitudinal study; this work uses information from the 2012, 2015, and 2018 waves. MHAS includes socioeconomic, health, and lifestyle information on individuals aged 50 and older residing in various states of Mexico.
The RF algorithm was selected for this study due to its ability to handle datasets with multiple predictor variables and its robustness against overfitting, a crucial aspect in medical applications [15]. Additionally, this algorithm is particularly suitable for processing binary, categorical, and ordinal variables, which constitute the majority of our dataset. Because RF does not assume linear relationships between variables, it can capture more complex interactions among predictive factors, regardless of the variables used [8,18]. RF is therefore well suited to predicting events such as hospitalizations, where the factors influencing the likelihood of hospitalization are complex and often do not follow linear patterns. Factors such as age, health status, access to medical services, and socioeconomic conditions interact in ways that cannot always be represented by simple cause-and-effect relationships, which motivated the choice of RF for this analysis.
2.1. General Review of the Dataset
The dataset used in this study is derived from the Mexican Health and Aging Study (MHAS), a national longitudinal study of adults aged 50 and older, partially funded by the National Institutes of Health (NIH), the National Institute on Aging (NIA), and the Mexican National Institute of Statistics and Geography (INEGI). Data collection waves were conducted in 2001, 2003, 2012, 2015, 2018, and 2021. The study's protocols and instruments are highly comparable to those used in the U.S. Health and Retirement Study (HRS), ensuring cross-national comparability. The study included private households with at least one resident aged 50 or older, randomly selecting one eligible participant per household. If the selected participant was married or cohabiting, their spouse was also recruited, regardless of age [17].
For the purposes of this research, the MHAS dataset underwent a pre-processing phase specifically designed and executed to meet the study's requirements. This phase included the removal of empty fields, handling incomplete data through deletion, selection of individuals aged 50 and older, and the appropriate encoding of each input variable (e.g., binary and ordinal categorical encoding). After this pre-processing stage, the final dataset comprised 30,603 rows and 17 columns. A description of these columns is provided below:
Fourteen columns correspond to the predictor variables: sex, age, diabetes, stroke, educational level, place of residence (urban or rural), living with someone, body mass index, hypertension, level of physical limitation, falls, access to public health services, smoking history, and current alcohol consumption;
Two columns represent the individual’s ID and the year in which the information was collected;
Finally, the column for the target variable is included: hospitalization in the last 12 months. The MHAS dataset shows a marked imbalance in this variable (9.9% in 2012, 12.8% in 2015 and 15.6% in 2018 were hospitalized), which must be considered in the machine learning model design to avoid biased results.
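The pre-processing steps and the per-wave hospitalization rate described above could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the column names (`age`, `sex`, `education`, `year`, `hospitalized`), the category labels, and the toy records are hypothetical, since the actual MHAS variable names and codings are not given in the text.

```python
import pandas as pd

# Hypothetical ordered labels for educational level (assumption).
EDUCATION_ORDER = ["none", "primary", "secondary", "higher"]


def preprocess(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop empty fields and incomplete records, filter by age, encode inputs."""
    df = raw.dropna(axis="columns", how="all")  # remove fields that are entirely empty
    df = df.dropna(axis="index", how="any")     # handle incomplete records by deletion
    df = df[df["age"] >= 50].copy()             # keep individuals aged 50 and older
    # Binary encoding (the 0/1 assignment is an arbitrary choice for this sketch).
    df["sex"] = df["sex"].map({"female": 0, "male": 1})
    # Ordinal encoding: integer codes follow the declared category order.
    df["education"] = pd.Categorical(
        df["education"], categories=EDUCATION_ORDER, ordered=True
    ).codes
    return df.reset_index(drop=True)


def hospitalization_rate_by_wave(df: pd.DataFrame) -> pd.Series:
    """Share of hospitalized individuals per wave, to inspect class imbalance."""
    return df.groupby("year")["hospitalized"].mean()


# Toy records, invented for illustration only.
raw = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "year": [2012, 2012, 2015, 2015],
    "age": [49, 63, 71, 80],
    "sex": ["female", "male", "female", "male"],
    "education": ["primary", "none", "higher", None],
    "hospitalized": [0, 1, 0, 1],
    "unused": [None, None, None, None],
})
clean = preprocess(raw)
rates = hospitalization_rate_by_wave(clean)
```

In this toy input, the record with a missing education value and the record of a 49-year-old are removed, and the entirely empty column is dropped before row-wise deletion so that it does not eliminate every record.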
A descriptive summary of the dataset is presented in Table 1. The data are organized by year and expressed as percentages for each variable used in the machine learning model.
The analysis focused on data from the years 2012, 2015, and 2018 to maintain a consistent temporal pattern. Earlier waves from 2001 and 2003 were excluded due to the significant gap between them and subsequent years, which could disrupt the temporal continuity necessary for robust longitudinal analysis. Additionally, the 2021 wave was excluded because the COVID-19 pandemic represents a different scenario for hospitalization. By focusing on these three years, the model was able to better capture the progression of hospitalization risks over time.
2.2. Designs of Machine Learning Models
To predict hospitalizations, three machine learning models were constructed using the RF algorithm (M1, M2, and M3), each focused on different data sets. The data partitions for training and testing were made using information from different years. The configuration of the models was as follows:
Model M1: Predicts hospitalization events in the three years studied (2012, 2015, and 2018) for each individual;
Model M2: Predicts hospitalization events in 2015 and 2018 using the data available from 2012 as part of the training sample;
Model M3: Predicts hospitalization events in 2018, using data collected in 2012 and 2015 as part of the training set.
This combination in model design allows for the evaluation of the predictive capacity of the RF algorithm in different temporal scenarios, providing a comprehensive view of the evolution of hospitalizations over time.
2.3. Validation of Machine Learning Models
To ensure the robustness of the models and avoid overfitting, a nested cross-validation scheme was implemented. This method consists of an inner cross-validation and an outer cross-validation, applied as follows:
Inner cross-validation: Used to select the hyperparameters of the RF algorithm. Instead of traditional k-fold cross-validation, 10 random partitions of the dataset (repeated random subsampling) were employed.
Outer cross-validation: Once the optimal hyperparameters were selected, a second cross-validation was performed to evaluate the model’s performance. Ten random partitions were also used to measure the model’s generalization capacity on unseen data, applying the training and testing scheme on external partitions different from those used in the inner validation. The average of these 10 partitions is reported as the predictive capacity of the models.
This nested procedure ensures that the hyperparameters are not overfitted to a particular subset of data, improving the model’s predictive capacity by evaluating its performance across different data partitions.
In the outer cross-validation, four test/training partition proportions were evaluated as a complementary prediction strategy: 10%/90%, 20%/80%, 30%/70%, and 40%/60%. Testing these different proportions gives a broader and more robust view of the model’s performance, allowing the optimal configuration for different scenarios to be identified and increasing the reliability of the predictions.
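One possible reading of this nested scheme, with random partitions in both loops and the four test proportions, is sketched below using scikit-learn. Several details are assumptions rather than the authors' settings: the stratified splitter, the `class_weight="balanced"` option, the `balanced_accuracy` tuning criterion, the reuse of the outer test proportion in the inner loop, the hyperparameter grid, and the synthetic data standing in for MHAS. The demonstration also uses 2 partitions per loop to run quickly, whereas the text reports 10.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit


def nested_random_cv(X, y, test_size, param_grid, n_outer=10, n_inner=10, seed=0):
    """Return mean (sensitivity, specificity, kappa) over the outer partitions."""
    outer = StratifiedShuffleSplit(
        n_splits=n_outer, test_size=test_size, random_state=seed
    )
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        # Inner loop: hyperparameters are chosen on the outer training part only.
        inner = StratifiedShuffleSplit(
            n_splits=n_inner, test_size=test_size, random_state=seed + 1
        )
        search = GridSearchCV(
            RandomForestClassifier(class_weight="balanced", random_state=seed),
            param_grid,
            scoring="balanced_accuracy",  # assumed tuning criterion
            cv=inner,
        )
        search.fit(X[train_idx], y[train_idx])
        # The refitted best model is evaluated on the unseen outer test part.
        pred = search.predict(X[test_idx])
        scores.append((
            recall_score(y[test_idx], pred, pos_label=1),  # sensitivity
            recall_score(y[test_idx], pred, pos_label=0),  # specificity
            cohen_kappa_score(y[test_idx], pred),
        ))
    return np.mean(scores, axis=0)


# Synthetic, imbalanced stand-in for the 14-predictor dataset (not MHAS data).
X, y = make_classification(
    n_samples=400, n_features=14, weights=[0.87, 0.13], random_state=0
)
grid = {"n_estimators": [25], "max_depth": [3, None]}  # hypothetical grid

results = {
    ts: nested_random_cv(X, y, ts, grid, n_outer=2, n_inner=2)
    for ts in (0.1, 0.2, 0.3, 0.4)
}
for ts, (sens, spec, kappa) in results.items():
    print(f"test={ts:.0%} sensitivity={sens:.3f} specificity={spec:.3f} kappa={kappa:.3f}")
```

Because the inner search only ever sees the outer training indices, the outer test scores are not influenced by hyperparameter selection, which is the property the nested design is meant to provide.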
2.4. Study Evaluation Metrics
The primary objective of the experiments was to identify the model that optimizes the balance between sensitivity and specificity, with the aim of minimizing both false positives and false negatives, a critical factor in medical applications such as hospitalization prediction [cite source]. Therefore, the performance of the models was evaluated using the following metrics:
Kappa coefficient: A statistic that measures the agreement between predictions and observations adjusted for the possibility that predictions could be made by chance;
Sensitivity: Also known as "recall," it measures the model's ability to correctly identify positive cases (hospitalizations);
Specificity: This metric assesses the model's ability to correctly identify true negative cases (non-hospitalizations), helping to reduce the number of false positives that could lead to an unnecessary strain on hospital resources.
Due to the significant imbalance in the dataset, where the percentage of hospitalizations is relatively low (only 12.8% of the 30,603 records), the Accuracy metric was excluded from the analysis. When the target variable is imbalanced, Accuracy tends to favor the majority class (non-hospitalizations) and does not adequately reflect the model's performance in identifying the minority class. Consequently, Accuracy is not an appropriate metric in this context, as it could lead to misleading interpretations and mask poor performance in predicting the minority class (hospitalizations). Instead, we focused on sensitivity, specificity, and the kappa coefficient, which more accurately capture the model's behavior in the face of data imbalance and align with the study’s goal of correctly identifying both true positive and true negative cases.
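The three metrics, and the reason Accuracy is uninformative here, can be illustrated with a short sketch based on the standard definitions from a binary confusion matrix. The function names are illustrative, and the 1,000-record example is constructed (not taken from the study) to mirror the reported 12.8% hospitalization share.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for 0/1 labels (1 = hospitalized)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """Share of actual hospitalizations that the model identifies."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of actual non-hospitalizations that the model identifies."""
    return tn / (tn + fp)


def cohen_kappa(tp, fp, tn, fn):
    """Observed agreement corrected for the agreement expected by chance."""
    n = tp + fp + tn + fn
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    if p_chance == 1.0:  # degenerate case: only one class in both vectors
        return 0.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Constructed example: 128 hospitalizations among 1,000 records, and a
# trivial model that always predicts "not hospitalized".
y_true = [1] * 128 + [0] * 872
y_pred = [0] * 1000

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, sensitivity(tp, fn), specificity(tn, fp), cohen_kappa(tp, fp, tn, fn))
```

The trivial model reaches an accuracy of 0.872 while its sensitivity and kappa are both 0, which is the kind of misleading result that motivates reporting sensitivity, specificity, and kappa instead.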
2.5. Case Studies
The primary objective of these experiments is to evaluate the performance of each model under various data partition scenarios to identify the most effective machine learning model for predicting hospitalizations in individuals aged 50 and older. The final model selection was based exclusively on the previously described evaluation metrics, aiming for an optimal balance between sensitivity and specificity. In this context, two case studies were designed to assess the effectiveness of models M1, M2, and M3:
Case 1 (Variables without Interaction): Uses 14 predictor variables, as previously mentioned, along with the individual's ID and the year the data were collected;
Case 2 (Variables with Interaction): Includes the 14 original predictor variables and all possible first-order interactions, meaning combinations between each pair of the 14 input variables.
The inclusion of variable interactions aims to capture additional relationships between predictive factors that could influence hospitalizations. This approach enriches the information provided to the machine learning models, allowing for an exploration of whether these interactions enhance predictive performance.
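To make the size of the Case 2 feature set concrete, the sketch below builds one term for every pair of the 14 predictors. Representing an interaction as the product of the two encoded variables is an assumption (the text does not state how the interactions were computed), and the variable names and example values are hypothetical.

```python
from itertools import combinations
from math import comb

# Hypothetical names for the 14 predictor variables listed in Section 2.1.
PREDICTORS = [
    "sex", "age", "diabetes", "stroke", "education", "residence",
    "lives_with_someone", "bmi", "hypertension", "physical_limitation",
    "falls", "public_health_access", "smoking_history", "alcohol_current",
]


def add_pairwise_interactions(row: dict) -> dict:
    """Return the original predictors plus one product term per variable pair."""
    out = {name: row[name] for name in PREDICTORS}
    for a, b in combinations(PREDICTORS, 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]  # product term (assumed form)
    return out


# Invented record with already-encoded values.
example = {name: 1 for name in PREDICTORS}
example["age"] = 67
example["bmi"] = 2

expanded = add_pairwise_interactions(example)
n_interactions = comb(len(PREDICTORS), 2)
print(len(PREDICTORS), n_interactions, len(expanded))
```

With 14 predictors there are 91 distinct pairs, so under this reading Case 2 supplies 105 input features per record in place of the 14 used in Case 1.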
These two case studies enable a comprehensive evaluation of the models under different data configurations, providing a holistic view of their performance and permitting the identification of the model that best fits the data, thus ensuring robust and reliable application in real-world settings.
4. Discussion and Future Research Directions
The results of this study demonstrate the usefulness of machine learning models for predicting hospitalizations in older adults in Mexico, using longitudinal data from the MHAS. The performance of models M1, M2, and M3, in their respective variants, generally yielded good results. The M2 model, which included variable interactions and a test proportion of 20%, was particularly noteworthy, achieving the best balance between sensitivity (0.7215, with a standard error of 0.0038) and specificity (0.4935, with a standard error of 0.0039). This balance allowed M2 to more effectively capture hospitalization cases without excessively increasing false positives, which is crucial in medical care to prevent inefficient allocation of hospital resources. However, it is important to recognize that the specificity result and the kappa coefficient are low, revealing areas of improvement for future research in predicting hospitalizations in older adults with this dataset.
The moderate performance of the model may also relate to the dataset's reliance on self-reported data, common in large-scale population studies. Future research could explore the potential of adding clinical biomarkers (such as glucose, cholesterol, or genetic marker levels) to refine risk distinctions, particularly in a heterogeneous population. Additionally, our study’s broader design approach, which applied the model to a diverse population of older adults with varying health conditions, likely contributed to the moderate performance observed. Unlike previous studies that focused on patients with specific diseases or comorbidities [
4,
13,
15], this general application increased the heterogeneity of the dataset and introduced a significant imbalance between hospitalization cases (12.8%) and non-hospitalized cases (87.2%). This imbalance likely impacted the model’s predictive effectiveness, resulting in fluctuations in sensitivity and specificity values. Furthermore, our research employed a rigorous methodology to enhance the robustness of our results. This included the use of nested cross-validation, the evaluation of four testing proportions (10%, 20%, 30%, and 40%), and the comparison between models with and without variable interactions. This rigorous design methodology, along with the inherent complexity of predicting hospitalizations in a diverse population, may have resulted in a more conservative performance of our proposed models compared to previous studies. Song et al. [
8], for example, achieved more favorable results in predicting hospitalizations in patients with specific comorbidities by working with more homogeneous datasets using the RF algorithm. Nevertheless, while the methodological rigor of our approach may have exposed some limitations in the performance metrics, it also provides a clearer view of the model’s actual performance under more complex conditions. Similarly, Friesner et al. [
11] highlight the importance of incorporating real-time data, collected through wearable devices, to predict unplanned hospitalizations during concurrent chemotherapy and radiotherapy (CRT). Moreover, the authors also noted that the absence of more complex clinical data, such as biomarkers or specific medical parameters, may have affected the predictive capacity of the models. This underscores the importance of integrating multiple data sources, both real-time and clinical, to enhance the accuracy of prediction models in future studies.
Another important direction would be to implement advanced techniques for handling missing data, rather than eliminating entire records, to preserve relevant information about individuals and contribute to better model performance. Additionally, although the utility of the RF algorithm has been demonstrated, exploring other algorithms could further improve the predictive capacity of the model in future research.
The implications of these findings extend beyond the local context of the Mexican population. Socioeconomic factors, chronic health conditions, and access to healthcare services are common variables in studies of older adults in other regions. This suggests that the M2 model developed in this study could be adaptable and useful in similar contexts. For instance, this model could generate information to identify older adults at risk of hospitalization in advance, thus facilitating the planning and allocation of healthcare resources in hospitals and clinics across diverse populations.
In summary, this study provides a solid foundation for the implementation of predictive models in public health resource planning in Mexico, particularly in the context of an aging population. The findings highlight that as the healthcare system faces increasing hospital demand, the ability to anticipate hospitalizations and efficiently manage resources will be key to improving the medical care of older adults. The M2 model, with its strengths in sensitivity, offers a promising starting point for application in healthcare systems aiming to optimize the allocation of such health resources.