
Prediction and feature selection of low birth weight using machine learning algorithms

Abstract

Background and aims

The birth weight of a newborn is a crucial factor that affects overall health and future well-being. Low birth weight (LBW), which the World Health Organization defines as a birth weight below 2,500 g, is a widespread global issue. LBW can have severe negative consequences for an individual's health, including neonatal mortality and various health concerns throughout life. To address this problem, this study was conducted using Bangladesh Demographic and Health Survey (BDHS) 2017–2018 data to uncover important aspects of LBW using a variety of machine learning (ML) approaches and to determine the best feature selection technique and the best predictive ML model.

Methods

To pick out the key features, the Boruta algorithm and the wrapper method were used. Logistic Regression (LR) was used as the traditional method, and several machine learning classifiers were then applied, including Decision Tree (DT), Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and Adaptive Boosting (AdaBoost), to determine the best model for predicting LBW. Model performance was evaluated based on specificity, sensitivity, accuracy, F1 score, and AUC value.

Results

The Boruta algorithm identified eleven significant features: respondent's age, highest education level, educational attainment, wealth index, age at first birth, weight, height, BMI, age at first sexual intercourse, birth order number, and whether the child is a twin. Using the Boruta algorithm's significant features, the performance of traditional LR and the ML methods DT, SVM, NB, RF, XGBoost, and AB was evaluated. LR achieved a specificity, sensitivity, accuracy, and F1 score of 0.85, 0.5, 85.15%, and 0.915, while the ML methods DT, SVM, NB, RF, XGBoost, and AB achieved accuracy values of 85.35%, 85.15%, 84.54%, 81.18%, and 84.41%, respectively. Based on specificity, sensitivity, accuracy, F1 score, and AUC, RF (specificity = 0.99, sensitivity = 0.58, accuracy = 85.86%, F1 score = 0.9243, AUC = 0.549) outperformed the other methods. The performance of both the classical (LR) and machine learning (ML) models improved dramatically when important features were extracted using the wrapper method. The LR method identified five significant features, with a specificity, sensitivity, accuracy, and F1 score of 0.87, 0.33, 87.12%, and 0.9309. Region, whether the infant is a twin, and cesarean delivery were the three key features discovered by the DT and RF models implemented with the wrapper technique, both achieving an identical F1 score of 0.9318. "Child is twin" was recognized as a significant feature by the SVM, NB, and AB models, each with an F1 score of 0.9315, and the XGBoost model identified "child is twin" and "age at first sex" as relevant features, also with an F1 score of 0.9315. Random Forest again outperformed the other approaches.

Conclusions

The study identifies the wrapper method as the optimal feature selection technique. The ML methods outperform the traditional method, with Random Forest (RF) being the most effective predictive model for low birth weight. The study suggests that policymakers in Bangladesh can reduce the incidence of low birth weight by addressing the identified risk factors.

Introduction

Low birth weight describes the weight of a newborn infant at delivery. When an infant’s birth weight is less than 5 pounds or 2,500 g, it is defined as having a low birth weight [1]. Low birth weight infants are born weighing less than average. This can cause health issues like respiratory distress syndrome, jaundice, anemia, and difficulty regulating body temperature. They are also at higher risk of developmental delays and neonatal death. It has been estimated by the World Health Organization (WHO) that roughly 15–20% of global births result in low birth weight, due to prematurity and intrauterine growth restriction. In developed nations, preterm birth is the primary cause of LBW, while underdeveloped countries tend to experience intrauterine growth restriction as the primary cause [2]. The prevalence of LBW is more than twice as high in developing nations [3]. The prevalence of LBW in Nepal in 2011 was 29.7%; in Pakistan in 2012–2013 was 35.1%; in Indonesia in 2012 was 12.9%; in Armenia in 2010 was 9%; in Jordan in 2012 was 22%; in Uganda in 2011 was 16.9%; in Zimbabwe in 2010–2011 was 14.5%; in Colombia in 2010 was 11.8%; in Cambodia in 2010 was 14.2% and in Tanzania in 2010 was 13.9% [4], which is shown in Fig. 1.

Fig. 1
figure 1

LBW percentage of different countries

Much research has been performed to predict the features related to LBW, and many studies have taken place to predict birth weight. According to [5], ethnicity, compliance with iron and folic acid (IFA) supplements, and maternal antenatal care visits are significantly correlated with low birth weight. Other researchers have concluded that gestational age, the baby’s height, and head circumference can account for around 60% of the variation in newborn weight [6]. In 2014 [7], researchers found that a younger white mother who does not smoke, has hypertension or uterine irritability, has a higher weight at the last menstrual period, and has not experienced premature labor is less likely to give birth to an infant with low birth weight.

Southern Asia has the highest number of LBW newborns. LBW is a major healthcare concern in Bangladesh, with a prevalence rate of 17.7% in 2011, 20% in 2014, and 16% in 2017. Researchers used the chi-square test, DT, and LR models to predict LBW [3]. Evidently, the prevalence of LBW in Bangladesh consistently exceeds 10%, underscoring the importance of eradicating this issue. The 2003–2004 National Low Birth Weight Survey (NLBWS) in Bangladesh found that approximately 36% of all newborns had low birth weight, with a prevalence of 29% in urban areas [8]. As low birth weight (LBW) accounts for 40–60% of newborn mortality, much study is necessary in this regard [5]. Furthermore, one of the Millennium Development Goals was to decrease the mortality rate of children under the age of 5 by two-thirds before 2015 [9]. Despite significant progress in child and maternal health, Bangladesh still faces challenges in this area. Research is ongoing to reduce the problem of low birth weight, and actions are being adopted worldwide. Therefore, to enable the government to take appropriate action against low birth weight, extensive research is required to highlight the general factors that contribute to the problem.

In recent years, ML has been gaining popularity in various fields, including disease diagnosis in healthcare [10]. For instance, researchers in India have developed an automated ML-based coronary heart disease diagnosis system, yielding around 89% accuracy [11]. Recent advancements in ML- and DL-based kidney disease diagnosis may offer solutions for countries unable to handle diagnostic tests [12]. In the past, traditional statistical methods were commonly used to identify the factors that influence birth weight. However, only a small number of researchers have used machine learning techniques to predict and explore the factors associated with LBW in Bangladesh [3, 13]. To the best of our knowledge, no research has considered a large number of different machine learning approaches to determine which model would be most effective for LBW forecasting in Bangladesh. In a previous study [3], researchers employed only two machine learning (ML) methods: Logistic Regression (LR) and Decision Tree (DT). The study by [13] utilized the BDHS 2011 and 2014 datasets and implemented six classification algorithms, including LR, NB, k-nearest neighbors (k-NN), RF, SVM, and multilayer perceptron (MLP), to anticipate LBW. Furthermore, neither of these studies used a feature selection technique to extract important features. Apart from conventional clinical techniques, machine learning could help predict the risk of LBW more accurately [14]. Thus, the purpose of this work is to uncover important aspects of LBW using a variety of machine learning (ML) approaches and to determine the best feature selection technique as well as the best predictive ML model for LBW.

Related works

Many researchers have worked on predicting Low Birth Weight (LBW) using different methodologies, as it is a worldwide problem, especially in developing countries. Newborns with meager birth weight are vulnerable and have a high mortality rate. One of the Millennium Development Goals is to decrease the mortality rate of children under the age of 5 by two-thirds before 2015 [9]. In 2014, researchers from the USA found that a younger white mother who does not smoke, has hypertension or uterine irritability, has a higher weight at the last menstrual period, and has not experienced premature labor, is less likely to give birth to an infant with low birth weight [7]. In 2015, researchers in India conducted a study on predicting low birth weight and identifying associated risk factors using several machine learning algorithms. The tested algorithms included Logistic Regression, Naïve Bayes, Support Vector Machine (SVM), Neural Network, and Classification Tree. The study found that the Classification Tree algorithm, with its high overall prediction accuracy, specificity, AUC, F-value, and Precision, instilled confidence in the research methods. It had an accuracy rate of 89.95% [15]. Researchers in Indonesia (2018) used Binary Logistic Regression and Random Forest techniques to predict and classify Low Birth Weight (LBW). This study collected data from the Indonesian Demographic and Health Survey (IDHS) of 12,055 women aged 15–49 who gave birth from 2007 to 2012. The independent variables, including place of residence, time zone, wealth index, mother’s education, father’s education, age of the mother, job of the mother, and number of children, were significant risk factors. After completing the research, it was found that the Binary Logistic regression model gave a poor AUC value, whereas the performance of the Random Forest model was outstanding [16].

In Portugal in 2019, researchers used machine learning methods to predict low birth weight (LBW). They gathered data from 2328 individuals and implemented six techniques: random forest (RF), adaptive boosting (AdaBoost), naïve Bayes (NB), K-nearest neighbor (KNN), decision tree (DT), and support vector machine (SVM). The results showed that AdaBoost had the highest accuracy rate of 98%, a sensitivity of 0.91, and a specificity of 0.99 [17]. In 2017, researchers from India employed three classification methods, Random Forest (RF), XGBoost, and a Bayes minimum error rate classifier, to differentiate between Low Birth Weight (LBW) and Normal Birth Weight (NBW) infants. The Bayes minimum error classifier exhibited the highest accuracy rate (96.7%), with a sensitivity of 1.0 and a specificity of 0.85 [1]. Low Birth Weight (LBW) is associated with several maternal and fetal factors [18]. In that research (2017), low birth weight was predicted using Logistic Regression (LR) and Random Forest (RF) techniques. The study was conducted on 600 women from the Milad Hospital in Iran, and relevant data were collected. The findings showed that RF was more precise than LR, with an accuracy rate of 95% compared to 93% for LR. Low birth weight, a significant public health concern globally, is a key contributor to neonatal mortality rates in developing countries [5]. In that study (2022), 308 women were interviewed face to face to collect data. The findings revealed that one in every seven infants had low birth weight. The researchers used multivariate logistic regression for further analysis and discovered a significant correlation between low birth weight and ethnicity, iron and folic acid (IFA) compliance, and maternal antenatal care visits. The related studies discussed above are summarized in Table 1.

Table 1 Review on existing prediction models for LBW

Methods and materials

We meticulously devised a study plan to achieve our goal and rigorously adhered to it, from data collection to presenting results, which is shown in Fig. 2.

Fig. 2
figure 2

Flowchart of overall methodology of the study

In brief, the study used the BDHS 2017–18 dataset, and descriptive statistics were computed to illustrate the respondents' fundamental characteristics. The data were processed and then randomly divided into training and test sets in a 70:30 ratio. The Boruta and wrapper approaches were used to extract important features. Additionally, Logistic Regression (LR) and various machine learning methods, including Decision Tree (DT), Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), XGBoost, and AdaBoost, were employed. Ten-fold cross-validation was used to validate each algorithm. Specificity, sensitivity, accuracy, F1 score, and AUC values were used to evaluate and compare each model's performance. Ultimately, the most optimal feature selection technique and machine learning predictive model were selected, and the most important LBW features were extracted.
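As a rough sketch of this split-and-validate pipeline, the scikit-learn calls might look as follows. The synthetic data and approximate class balance are placeholders for the BDHS records, which are not reproduced here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the processed BDHS 2017-18 data (1863 respondents,
# 18 candidate features, roughly 15% positive class); not the real survey
X, y = make_classification(n_samples=1863, n_features=18,
                           weights=[0.85, 0.15], random_state=0)

# Random 70:30 train/test split, as described in the study
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Ten-fold cross-validation on the training data
clf = RandomForestClassifier(random_state=0)
scores = cross_val_score(clf, X_train, y_train, cv=10, scoring="accuracy")
print(f"10-fold CV accuracy: {scores.mean():.3f}")

# Held-out test accuracy
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```

The same skeleton applies to each of the classifiers listed above by swapping the estimator passed to `cross_val_score`.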

Independent variables

The study considered variables that may have an impact on the outcome variable: age of the respondent, highest education level, age of respondent at first birth, age at first sex, wealth index, BMI, height, weight, place of delivery, educational attainment, region, place of residence, sex of child, whether the child is a twin, whether the child is alive, birth order number, delivery by caesarean section, and receipt of antenatal care. The overall description of the independent variables is shown in Table 2.

Table 2 Description of the response and independent variables with their categorization

Feature selection method

To select the most significant features, two types of methods were applied: the Boruta algorithm and the wrapper method.

Boruta Algorithm

The steps of Boruta algorithm can be described as follows:

  • It first introduces randomness by creating a duplicate of each feature and shuffling the values in each duplicated column. These shuffled duplicates are referred to as shadow features.

  • It then trains a classifier (Random Forest) on the extended dataset and determines feature significance using Mean Decrease Accuracy or Mean Decrease Impurity.

  • After that, the algorithm determines whether any of the real features are more important than the shadow features, i.e., whether a feature's Z-score is higher than the maximum Z-score among the shadow features.

At every iteration, the algorithm compares the Z-scores of the original features to those of their shuffled duplicates; a feature is marked as important if its Z-score exceeds the maximum Z-score of the shadow features [19].
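A single iteration of this shadow-feature comparison can be sketched as follows. This is a simplified illustration on synthetic data, not the full Boruta procedure, which accumulates "hits" over many iterations and applies a statistical test before confirming or rejecting features:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# shuffle=False places the 3 informative features first
X, y = make_classification(n_samples=500, n_features=6, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

# Step 1: create shadow features by shuffling each column independently
X_shadow = X.copy()
for j in range(X_shadow.shape[1]):
    rng.shuffle(X_shadow[:, j])
X_ext = np.hstack([X, X_shadow])

# Step 2: train a Random Forest and read Mean Decrease Impurity importances
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_ext, y)
imp = rf.feature_importances_
real_imp, shadow_imp = imp[:6], imp[6:]

# Step 3: a real feature scores a "hit" if it beats the best shadow feature
hits = real_imp > shadow_imp.max()
print("hit mask:", hits)
```

In the full algorithm, features whose hit counts are statistically higher than chance over repeated iterations are confirmed as important.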

Wrapper technique

Wrapper techniques use a classification algorithm to evaluate feature subsets, making them computationally expensive. Selected features depend on applied classification techniques, and different techniques identify varying risk factor combinations. Despite being slow, it consistently achieves superior feature selection results [20]. The mechanism of the Wrapper is shown in Fig. 3.

Fig. 3
figure 3

Mechanism of the wrapper method
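As an illustration of the wrapper idea, scikit-learn's `SequentialFeatureSelector` (available in scikit-learn 0.24+) scores candidate feature subsets by repeatedly retraining the chosen classifier, which is why the approach is computationally expensive. Synthetic data is used here in place of the study's dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector

X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# Wrapper-style selection: each candidate subset is evaluated by retraining
# the classifier itself (forward selection down to 3 features, 5-fold CV)
rf = RandomForestClassifier(n_estimators=50, random_state=0)
sfs = SequentialFeatureSelector(rf, n_features_to_select=3,
                                direction="forward", cv=5)
sfs.fit(X, y)
print("selected feature indices:", sfs.get_support(indices=True))
```

Because the classifier is part of the evaluation loop, swapping in a different estimator (SVM, NB, etc.) can yield a different selected subset, as observed in the results.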

Traditional method

Logistic regression (LR)

The logistic function is the cumulative distribution function of the logistic distribution. It is used to estimate probabilities and determine the relationship between a categorical dependent variable and one or more independent variables [21]. It employs methods similar to probit regression, which uses a cumulative normal distribution curve. In their latent-variable interpretations, probit regression assumes a standard normal distribution of errors, whereas logistic regression assumes a standard logistic distribution of errors.
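The logistic function itself can be written down directly. The coefficients below are illustrative placeholders, not estimates from this study:

```python
import numpy as np

def logistic(z):
    """Logistic (sigmoid) function: CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class given a linear predictor
# z = b0 + b1*x1 + b2*x2 (coefficients are hypothetical)
beta = np.array([-1.5, 0.8, -0.4])   # intercept and two slopes
x = np.array([1.0, 0.5, 2.0])        # 1 for the intercept, then two features
z = beta @ x
print(f"P(LBW) = {logistic(z):.4f}")
```

Fitting LR amounts to estimating the coefficient vector `beta` by maximum likelihood.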

Machine learning algorithms

Various machine learning algorithms have been employed in this study for LBW prediction, such as Decision Tree (DT), Support Vector Machine (SVM), Naïve Bayes (NB), Random Forest (RF), XGBoost, and AdaBoost.

Decision tree (DT)

Decision Tree is a technique used to approximate discrete-valued target functions by representing the learned function as a decision tree [22]. A decision tree focuses on deciding which attribute is the best classifier at each node. Statistical measures such as information gain, Gini index, chi-square, and entropy are calculated at each node to assess its worth. When developing a machine learning model, it is important to select the method best suited to the dataset and task at hand.
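The node-level measures can be computed directly. In the sketch below, the parent node uses the 16.2% LBW class balance reported later in Table 3, while the child distributions and split proportion are hypothetical (chosen so that they average back to the parent distribution):

```python
import numpy as np

def entropy(p):
    """Shannon entropy of a class distribution p (in bits)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gini(p):
    """Gini impurity of a class distribution p."""
    p = np.asarray(p, dtype=float)
    return 1.0 - np.sum(p ** 2)

# Parent node: 16.2% LBW vs 83.8% normal
parent = [0.162, 0.838]
print(f"entropy = {entropy(parent):.4f} bits, gini = {gini(parent):.4f}")

# Information gain of a candidate split into two hypothetical children;
# 32% of samples go left: 0.32*0.40 + 0.68*0.05 = 0.162, consistent with parent
left, right = [0.40, 0.60], [0.05, 0.95]
w_left = 0.32
gain = entropy(parent) - (w_left * entropy(left) + (1 - w_left) * entropy(right))
print(f"information gain = {gain:.4f} bits")
```

The attribute whose split maximizes this gain (or minimizes Gini impurity) is chosen at each node.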

Support Vector Machine (SVM)

Using a technique known as the kernel trick, SVMs can effectively perform non-linear classification in addition to linear classification by implicitly mapping their inputs into high-dimensional feature spaces [23]. For the specified kernel and kernel parameters, the SVM computes kernel values between the data points. After training, the SVM algorithm performs classification using the decision function \(f\left(x\right)={w}^{T}x+b\).
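For a linear kernel, the decision function \(f(x)=w^{T}x+b\) can be checked directly against a fitted scikit-learn model; synthetic data is used here:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Linear-kernel SVM: the decision function is exactly f(x) = w^T x + b
svm = SVC(kernel="linear").fit(X, y)
w, b = svm.coef_[0], svm.intercept_[0]
assert np.allclose(X @ w + b, svm.decision_function(X))

# With a non-linear kernel (e.g. RBF), the same linear rule is applied
# implicitly in a high-dimensional feature space via the kernel trick,
# so no explicit weight vector w is available
svm_rbf = SVC(kernel="rbf").fit(X, y)
print("linear decision values match; RBF accuracy:", svm_rbf.score(X, y))
```
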

Naïve bayes (NB)

Naïve Bayes is a probabilistic machine learning method for classification, founded on Bayes' theorem. The Bayesian classifier relies on the presence or absence of a particular class feature, independently of other features [24]. It is a supervised machine learning algorithm that operates under the assumption that the features are conditionally independent given the class.
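Under the independence assumption, the posterior is proportional to the class prior times the per-feature likelihoods. In the sketch below, only the class prior reflects the study's reported class balance; the feature likelihoods are invented for illustration:

```python
# Bayes' rule with the naive independence assumption:
#   P(class | x1, x2)  ∝  P(class) * P(x1 | class) * P(x2 | class)
# All likelihoods below are hypothetical, not estimates from the study.
prior = {"LBW": 0.162, "Normal": 0.838}          # class priors
lik = {                                          # P(feature value | class)
    "LBW":    {"twin": 0.10, "rural": 0.55},
    "Normal": {"twin": 0.01, "rural": 0.52},
}

def unnormalized_posterior(cls):
    """Prior times the product of per-feature likelihoods for one class."""
    return prior[cls] * lik[cls]["twin"] * lik[cls]["rural"]

unnorm = {c: unnormalized_posterior(c) for c in prior}
total = sum(unnorm.values())
probs = {c: v / total for c, v in unnorm.items()}
print({c: round(p, 4) for c, p in probs.items()})
```

The class with the larger posterior probability is the predicted label.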

Random forest (RF)

Random forest, also known as a decision tree forest, is an ensemble-based learning method that focuses solely on ensembles of decision trees. It uses a bagging approach to create numerous decision trees on random subsets of the data, which are combined to make a final decision [23]. Random forests can be used for both classification and regression problems. It is a method of combining various classifiers to address complex issues and improve model performance.
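The bagging step can be sketched by hand on synthetic data: each tree is fit on a bootstrap resample, and the forest takes a majority vote. This simplified version omits the per-split random feature subsampling that a full Random Forest also performs:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Bagging: each tree sees a bootstrap resample of the training data
trees = []
for i in range(25):
    idx = rng.integers(0, len(X), size=len(X))    # sample with replacement
    trees.append(DecisionTreeClassifier(random_state=i).fit(X[idx], y[idx]))

# The ensemble prediction is the majority vote across trees
votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print("training accuracy of the vote:", (majority == y).mean())
```

In practice one would simply use `sklearn.ensemble.RandomForestClassifier`, which adds the feature-subsampling step and out-of-bag evaluation.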

Extreme gradient boosting (XGBoost)

XGBoost is a type of ensemble learning method, which combines various predictive abilities to build a strong and accurate model [25]. In this method, multiple models are created and their outputs are combined to form a consolidated model. These models, also known as base learners, can originate from the same or different learning algorithms. The concept of ensemble learning involves combining individual models to enhance overall model performance.
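Since the XGBoost library itself may not be available in every environment, scikit-learn's `GradientBoostingClassifier` is used below as a stand-in to illustrate the same sequential-ensemble idea on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Gradient boosting: shallow trees (weak base learners) are added one at a
# time, each fit to the residual errors of the ensemble built so far.
# GradientBoostingClassifier stands in here for the XGBoost library.
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3,
                                learning_rate=0.1, random_state=0)
gb.fit(X_tr, y_tr)
print(f"test accuracy: {gb.score(X_te, y_te):.3f}")
```

XGBoost adds regularization terms and engineering optimizations on top of this basic gradient-boosting scheme.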

Adaptive boosting (AdaBoost)

AdaBoost is used in machine learning for boosting, which is a technique for improving the accuracy of a model by combining multiple weak models [26]. The algorithm entails training a weak classifier on a sample set during each iteration. Due to the numerous attributes of each sample, identifying the most effective weak classifier from a vast array of features necessitates significant computational power [27].
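A minimal AdaBoost run in scikit-learn, whose default base learner is a depth-1 decision stump, looks like this on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# AdaBoost: each round fits a weak learner (by default a depth-1 decision
# stump) on reweighted data, up-weighting the samples that earlier learners
# misclassified, then combines the learners by a weighted vote
ada = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(f"training accuracy: {ada.score(X, y):.3f}")
```
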

Results

Following the meticulous selection and processing of pertinent data, we examined the variables' demographic properties, which are displayed in Table 3. The study involved 1863 participants. The largest group of mothers was aged 20–24, accounting for 35.2% of the participants. About 51% of the new mothers had completed secondary education, while 16.2% of the newborns had low birth weights. Among the 1863 participants, 635 belonged to the richest wealth index group. Surprisingly, despite their privileged status, 12.1% of the newborns in this group had low birth weights, a finding that challenges conventional assumptions. Additionally, a substantial 86% of the participants received antenatal care from at least two centers. Roughly 55.3% of the mothers had a normal BMI, and almost 85% of their babies were of normal weight. Among the participants, 53% resided in rural areas and 47% in urban areas. Of the female newborns, 83.7% had normal birth weights, and 86% of single births resulted in normal-weight babies.

Table 3 Demographic characteristics of independent variables

Then, this study used the Boruta algorithm to select features. The Boruta algorithm revealed that of the 18 initially chosen features, 11 were important. These crucial features are presented in Table 4, and a visual representation is provided in Fig. 4. The selected important features were used in the machine learning algorithms for prediction. The dataset was randomly divided into training and test data in a 70:30 ratio. Specificity, sensitivity, accuracy, Area Under the Curve (AUC), and F1 score were used to evaluate the models generated by the algorithms and compare them against each other, as provided in Table 5.

Table 4 Summary results of the Boruta algorithm
Fig. 4
figure 4

Feature selection using Boruta algorithm

Table 5 Performance evaluation of traditional and ML models using Boruta significant features

Based on the findings presented in Table 5, the traditional method, Logistic Regression (LR), has an accuracy rate of 85.15%. 85% of negative instances were predicted to be negative (specificity = 0.85), whereas 50% of positive cases were predicted to be positive (sensitivity = 0.5). Additionally, it has an F1 score of 0.9195 and an AUC of 0.545. Among the considered ML methods, Support Vector Machine (SVM), Random Forest (RF), and AdaBoost all have specificity values of 0.99, and the other models also display good specificity values. However, all models have substandard sensitivity ratings, with the exception of Decision Tree (DT = 0.57) and Random Forest (RF = 0.58).

By comparison, among the ML methods, both Random Forest and SVM achieve a higher accuracy rate of 85.86% and higher F1 scores, with RF at 0.9243 and SVM at 0.9231. This indicates that these ML methods (SVM and RF) perform better than LR in terms of accuracy and F1 score. Upon assessing the F1 scores, it is evident that RF outperforms all other models, while XGBoost lags with the lowest score. Additionally, when comparing the AUC values of RF and SVM, RF has the larger value at 0.549. In summary, RF has the highest specificity, sensitivity, accuracy, AUC, and F1 score among all the models tested, which makes it the optimal model for predicting low birth weight using the Boruta significant features. Another method applied for feature selection is the Wrapper method, which performs feature selection and prediction for each classification model simultaneously.
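The reported metrics can be derived from a confusion matrix as follows; the toy labels below are illustrative only and unrelated to the study's data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score

# Toy labels and predictions (hypothetical, for metric definitions only)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
specificity = tn / (tn + fp)          # true-negative rate
sensitivity = tp / (tp + fn)          # true-positive rate (recall)
accuracy = (tp + tn) / len(y_true)
print(f"specificity={specificity:.3f} sensitivity={sensitivity:.3f} "
      f"accuracy={accuracy:.3f}")
print("F1:", f1_score(y_true, y_pred))
print("AUC:", roc_auc_score(y_true, y_pred))
```

Note that the AUC is normally computed from predicted probabilities rather than hard labels; hard labels are used here only to keep the example compact.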

Table 6 Performance evaluation of traditional and ML models using wrapper method significant features

Based on the results shown in Table 6, the traditional method LR identified five features as significant. Among the ML techniques, DT and RF identified three significant features using the wrapper method: region, whether the child is a twin, and delivery by cesarean section. SVM, NB, and AB identified whether the child is a twin as a significant feature. In the XGBoost model, age at first sex and whether the child is a twin were significant features.

The performance of both the classical (LR) and machine learning (ML) models improved dramatically when important features were extracted using the wrapper method. Notably, LR is marginally less accurate than the ML techniques, although all ML models are equally accurate. The sensitivity of every model has risen significantly, with RF and DT exhibiting the greatest sensitivity value of 0.67. Based on the F1 score, both DT and RF achieved the highest score of 0.9318. Nonetheless, when evaluating based on the AUC value, LR obtained the highest value at 0.582, while RF achieved 0.5749 and DT 0.555. Conversely, XGBoost yielded the lowest AUC value at 0.508. Comparing LR, DT, and RF, it is evident that DT has the lowest AUC value, while the AUC values for LR and RF are nearly identical. Moreover, it is crucial to note that RF outperforms LR in terms of specificity, sensitivity, accuracy, and F1 score. In conclusion, utilizing the Wrapper method across the different classifiers, RF emerged as the superior classifier, with the highest accuracy and F1 score and the highest AUC among the ML models.

The entire dataset can be divided into training and testing data in different ratios, such as 60:40, 70:30, 80:20, and 90:10. All the algorithms were applied with 10-fold cross-validation. However, the 70:30 ratio resulted in the highest precision, so for this study the dataset was divided into 70% training and 30% testing data. Tables 5 and 6 show that the ML methods perform better than the traditional method and that RF is the most effective machine learning model for predicting LBW under both feature selection methods. Moreover, when comparing accuracy, F1 score, and AUC value, the RF model using Wrapper-selected features outperforms the RF model using the Boruta significant features. Additionally, the Boruta algorithm identified 11 significant features, while the Wrapper method using RF identified only three. This indicates that the Wrapper method reduces model complexity and improves model performance, making it a better feature selection method than the Boruta algorithm. Figure 5 displays the feature importance of the three significant features identified by the RF Wrapper method as related to low birth weight: region, whether the child is a twin, and delivery by caesarean section. These are considered the most significant features.

Among the three significant features, “child is twin” has the highest importance score, followed by “region” and “delivery by caesarean section” as the second and third most important features, respectively (Fig. 5).

Fig. 5
figure 5

Feature importance plot using RF wrapper method

Discussion

Researchers worldwide have been trying to predict Low Birth Weight (LBW) using different techniques, as it is a significant problem, particularly in developing countries. Newborns with low birth weight and preterm birth are at high risk and have a higher rate of mortality. Birth weight, preterm birth, and neonatal mortality are closely linked [15]. One of the Millennium Development Goals was to reduce child mortality by two-thirds before 2015 [9]. Research has demonstrated that individuals with either low or high birth weight are at a higher risk of developing obesity, which leads to cardiovascular disease [28]. This study aims to predict LBW to address these issues using machine learning algorithms. Additionally, the study aims to identify the most relevant features and the best ML model to predict LBW, along with the technique of identifying these features.

Demographic characteristics

According to the findings, 14.9% of babies had low birth weight, while 85.1% had normal birth weight. The largest group of mothers (35.2%) was aged 20–24. In Nepal, 84.7% of respondents gave birth to babies weighing at least 2.5 kg, while 15.3% gave birth to babies weighing less than 2.5 kg, and most participants (35.7%) belonged to the 20–24 age group [5]. In this study, 13.6% of babies born via cesarean section had low birth weights, while 86.4% had normal birth weights. In a study conducted in Ethiopia, 9.5% of babies delivered by caesarean section had low birth weights and 90.5% had normal birth weights. Concerning residence, 15.4% of newborns in rural areas had low birth weights [2]. This aligns with our findings, as 15.7% of newborns in rural areas were found to have low birth weights. A study in Turkey revealed gender differences in low birth weight percentages: 5.8% of male and 7.6% of female children were born with low birth weight [29]. In contrast, our findings indicate that 16.3% of female and 13.6% of male newborns were born with low birth weight, suggesting that the percentage of female babies born with low birth weight is slightly higher than that of male newborns.

Prediction of LBW

In various studies across Asia, different machine-learning models have been used to predict low birth weight (LBW) in infants. These studies have shown varying levels of effectiveness depending on the dataset and context. For example, in Bangladesh [3] a study found that logistic regression (LR) was more effective than decision tree (DT) algorithms for predicting LBW, achieving an accuracy of 85% with a 70:30 training and test dataset split. This suggests that LR, known for its simplicity and interpretability, may be more effective in datasets with linear relationships between predictor variables and the outcome.

In contrast, research from India indicated that the Classification Tree algorithm, a type of decision tree, had the highest overall prediction accuracy (89.95%) and better performance across multiple metrics, including specificity, area under the curve (AUC), F-value, and precision [15]. This suggests that for certain datasets, especially those with non-linear interactions or complex decision boundaries, tree-based algorithms might provide better performance. In Iran, a study further highlighted the versatility of tree-based models, demonstrating that the Random Forest (RF) algorithm exceeded logistic regression in accuracy, with RF achieving a 95% accuracy rate compared to 93% for LR [18]. The present study likewise found the Random Forest (RF) algorithm to be the most suitable for its data, outperforming all other algorithms under both the Boruta algorithm and the Wrapper method in terms of accuracy, F1 score, and AUC values, using a 70:30 training-and-test split with 10-fold cross-validation.

When utilizing the Boruta algorithm for feature selection, the RF model achieved an accuracy of 85.86%, F1 score of 0.9243 and an AUC value of 0.549. In contrast, employing the Wrapper method led to the RF model demonstrating an accuracy of 87.298%, F1 score of 0.9318 and an AUC of 0.5749. It’s crucial to emphasize that larger and more balanced datasets result in higher accuracy, F1 score and AUC values. Considering our use of secondary data and the cleanliness of our dataset, it’s evident that our dataset is not sufficiently balanced, leading to a marginally smaller AUC value. Consequently, among all machine learning algorithms, RF has unequivocally proven to be the most effective in predicting LBW. It is also claimed that with a larger sample size (through oversampling or a big dataset), the ML algorithm performs better in terms of AUC and accuracy [17].

These findings collectively suggest that while logistic regression can be effective, particularly in simpler scenarios or where interpretability is crucial, tree-based methods such as Classification Trees and Random Forests often provide superior predictive performance, especially when dealing with complex, non-linear relationships in the data. The choice of model should thus be tailored to the specific characteristics of the dataset and the study’s objectives.

Feature selection and important features

Various maternal and fetal factors influence low birth weight (LBW) [18]. Researchers in India have suggested that maternal socio-demographic features and blood Polycyclic Aromatic Hydrocarbon (PAH) concentration are associated with LBW [30]. It is crucial to identify the factors that contribute to LBW.

Numerous studies in South Asia have identified various risk factors for low birth weight (LBW), with both commonalities and regional differences. In Chandigarh, India, a study by Sharma et al. identified key socio-demographic and maternal factors associated with LBW, including low maternal literacy, low per capita income, birth order of two or more, and maternal age over 30 years as significant contributors to LBW risk [31]. Similarly, research from Pakistan found that factors such as teenage pregnancy, illiteracy, inadequate antenatal care, maternal anemia, and pregnancy-induced medical conditions are strongly linked to LBW. This highlights the critical role of maternal health, nutrition, and access to healthcare services in preventing LBW. Social determinants such as maternal education and healthcare accessibility appear particularly influential in Pakistan, emphasizing the importance of targeted interventions in these areas [32]. In contrast, a study in the rural community of Rajshahi district in Bangladesh found that maternal weight, birth interval, and the female sex of the newborn were significant risk factors for LBW [33]. This study underscores the importance of maternal nutrition and reproductive behaviors in determining birth outcomes. Interestingly, the female sex of the newborn was also associated with LBW, suggesting potential gender-based biological or cultural factors influencing birth weight outcomes in this context.

The results of the current study suggest that the Wrapper method is a superior feature selection algorithm to the Boruta algorithm. The Wrapper method reduces model complexity by identifying fewer but more significant features essential for predicting LBW. Selecting relevant features through the Wrapper method also improves model specificity, sensitivity, accuracy, and AUC. According to a previous study [3], the chi-square method identified eight statistically significant features for LBW; however, using these features yielded only 85% accuracy, lower than the accuracy achieved with either the Boruta algorithm or the Wrapper method in this study. Therefore, both the Boruta algorithm and the Wrapper method are better feature selection methods for predicting LBW.
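The Boruta comparison above can be illustrated in code. The sketch below is not the study's implementation (the authors do not publish code); it reproduces Boruta's core idea with scikit-learn on synthetic data: permuted "shadow" copies of each feature are appended, a Random Forest is fit, and only real features whose importance beats the best shadow feature are retained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the survey data: 8 predictors, 3 of them informative.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           n_redundant=0, random_state=42)

rng = np.random.default_rng(42)
# Boruta's core idea: append "shadow" features (independent column-wise
# permutations of the real ones, which destroys any real signal) and keep
# only real features whose importance beats the best shadow.
X_shadow = rng.permuted(X, axis=0)
X_aug = np.hstack([X, X_shadow])

forest = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_aug, y)
real_imp = forest.feature_importances_[:X.shape[1]]
shadow_max = forest.feature_importances_[X.shape[1]:].max()

selected = np.where(real_imp > shadow_max)[0]
print("features beating the best shadow:", selected)
```

The full Boruta algorithm repeats this comparison over many forests and applies a statistical test to each feature [19]; a single pass is shown here only to make the shadow-feature mechanism concrete.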

Advanced data science methods have been used to predict LBW, with the Wrapper method proving more effective than the Boruta algorithm in identifying significant features affecting LBW. Using the Random Forest classification technique, the Wrapper method identified critical factors such as whether the child is a twin, region, and delivery by cesarean section. Among these, being a twin had the most substantial impact on LBW, with the second twin more likely to suffer from low birth weight due to intrauterine growth constraints. The region of residence was also crucial, highlighting disparities in healthcare availability across areas that directly affect newborn health outcomes. Lastly, delivery by cesarean section emerged as a significant feature, possibly reflecting underlying medical conditions or complications that necessitate such interventions.
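A wrapper-style selection of this kind can be sketched with scikit-learn's recursive feature elimination, one common wrapper strategy: the model is refit repeatedly and the weakest feature discarded until the desired number remains. This is an illustrative stand-in, not the study's code; the data are synthetic and the column names (child_is_twin, region, cesarean_delivery, and so on) are hypothetical labels echoing the survey variables.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Hypothetical feature names standing in for the BDHS variables.
names = ["child_is_twin", "region", "cesarean_delivery",
         "mother_age", "wealth_index", "birth_order"]
X, y = make_classification(n_samples=600, n_features=len(names),
                           n_informative=3, n_redundant=0, random_state=0)

# Wrapper selection: repeatedly fit the forest and eliminate the
# least important feature until only three remain.
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=0),
               n_features_to_select=3).fit(X, y)
kept = [name for name, keep in zip(names, selector.support_) if keep]
print("wrapper-selected features:", kept)
```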

The comparative analysis of various studies in South Asia shows that while there are common risk factors for LBW (low birth weight), the significance of specific factors can vary widely by region. This variation highlights the necessity for region-specific strategies in addressing low birth weight, considering the distinct socio-cultural, economic, and healthcare contexts of each country.

Conclusion

The research strongly advocates using the Wrapper method to select features associated with LBW. Among the array of machine learning algorithms, Random Forest classification emerged as the most effective for LBW prediction. The features that exhibited a substantial impact on LBW are region, whether the child is a twin, and delivery by cesarean section. To reduce the incidence of low birth weight newborns in Bangladesh, policymakers may therefore take into account the risk factors identified in this study. Given the many detrimental effects of LBW, health promotion programs should include information on how to achieve an ideal newborn weight. To help Bangladesh achieve SDG 3, the authors propose integrating nutrition and health education recommendations into Bangladesh’s educational system and incorporating temporal and spatial heterogeneity in follow-up research.

Importantly, the study’s results could be improved by applying hyperparameter tuning, which is expected to enhance the performance of the machine learning models by optimizing their parameters; this is particularly relevant given the limited sample size and time constraints, which nonetheless yielded high-quality outcomes. Additionally, Bayesian methods offer significant potential for handling imbalanced datasets: they incorporate prior knowledge about class distributions and provide probabilistic outputs. This approach is particularly valuable in public health contexts, where interpreting results in the light of existing knowledge is crucial.
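As a concrete illustration of the hyperparameter tuning suggested here, a small grid search over a Random Forest can be run with scikit-learn and scored on F1 to reflect the class imbalance. The grid and the synthetic data below are assumptions for demonstration, not the study's setup; a real search would cover a wider range of values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data with roughly 85/15 class imbalance, mimicking LBW prevalence.
X, y = make_classification(n_samples=400, n_features=10, weights=[0.85],
                           random_state=1)

# Small illustrative grid; each combination is evaluated by 3-fold CV.
grid = {"n_estimators": [100, 300], "max_depth": [3, None],
        "class_weight": [None, "balanced"]}
search = GridSearchCV(RandomForestClassifier(random_state=1), grid,
                      scoring="f1", cv=3).fit(X, y)
print("best params:", search.best_params_)
print("best CV F1:", round(search.best_score_, 3))
```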

Data availability

Data is available at https://dhsprogram.com/data/.

References

  1. Yarlapati AR, Roy Dey S, Saha S. Early prediction of LBW cases via Minimum Error Rate Classifier: A Statistical Machine Learning Approach. 2017 IEEE Int Conf Smart Comput SMARTCOMP 2017. 2017. https://doi.org/10.1109/SMARTCOMP.2017.7947002.

  2. Bekele WT. Machine learning algorithms for predicting low birth weight in Ethiopia. BMC Med Inf Decis Mak. 2022;22(1):1–16. https://doi.org/10.1186/s12911-022-01981-9.

  3. Ashikul Islam Pollob SM, Abedin MM, Islam MT, Islam MM, Maniruzzaman M. Predicting risks of low birth weight in Bangladesh with machine learning. PLoS ONE. 2022;17(5):1–12. https://doi.org/10.1371/journal.pone.0267190.

  4. Mahumud RA, Sultana M, Sarker AR. Distribution and determinants of low birth weight in developing countries. J Prev Med Public Heal. 2017;50(1):18–28. https://doi.org/10.3961/jpmph.16.087.

  5. Thapa P, et al. Prevalence of low birth weight and its associated factors: Hospital based cross sectional study in Nepal. PLOS Glob Public Heal. 2022;2(11):e0001220. https://doi.org/10.1371/journal.pgph.0001220.

  6. Abdollahian M, Gunaratne N. Low birth weight prediction based on maternal and fetal characteristics, Proc. – 12th Int. Conf. Inf. Technol. New Gener. ITNG 2015. 2015;646–650, https://doi.org/10.1109/ITNG.2015.108

  7. Das RN, Devi RS, Kim J. Mothers’ lifestyle characteristics impact on her neonates’ low birth weight. Int J Women’s Heal Reprod Sci. 2014;2(4):229–35. https://doi.org/10.15296/ijwhr.2014.33.

  8. Khan JR, Islam MM, Awan N, Muurlink O. Analysis of low birth weight and its co-variants in Bangladesh based on a sub-sample from nationally representative survey. BMC Pediatr. 2018;18(1):1–9. https://doi.org/10.1186/s12887-018-1068-0.

  9. Ballot DE, Chirwa TF, Cooper PA. Determinants of survival in very low birth weight neonates in a public sector hospital in Johannesburg. BMC Pediatr. 2010;10:30. https://doi.org/10.1186/1471-2431-10-30.

  10. Ahsan MM, Luna SA, Siddique Z. Machine-learning-based disease diagnosis: a comprehensive review. Healthcare. 2022;10(3):541.

  11. Ansari AQ, Gupta NK. Automated diagnosis of coronary heart disease using neuro-fuzzy integrated system. Proc 2011 World Congr Inf Commun Technol (WICT 2011). 2011:1379–84. https://doi.org/10.1109/WICT.2011.6141450.

  12. Levey AS, Coresh J. Chronic kidney disease. Lancet. 2012;379(9811):165–80. https://doi.org/10.1016/S0140-6736(11)60178-5.

  13. Borson NS, Kabir MR, Zamal Z, Rahman RM. Correlation analysis of demographic factors on low birth weight and prediction modeling using machine learning techniques. In 2020 Fourth World Conference on Smart Trends in Systems, Security and Sustainability (WorldS4). (2020, July). IEEE. pp. 169–173).

  14. Hange U, Selvaraj R, Galani M, Letsholo K. A data-mining model for predicting low birth weight with a high AUC. Comput Inform Sci 2018;109–21.

  15. Senthilkumar S, Paulraj D. Prediction of Low Birth Weight Infants and Its Risk Factors Using Data Mining Techniques, Proc. 2015 Int. Conf. Ind. Eng. Oper. Manag. Dubai, United Arab Emirates, vol. 3, pp. 186–194, 2015.

  16. Faruk A, Cahyono ES, Eliyati N, Arifieni I. Prediction and classification of low birth weight data using machine learning techniques. Indones J Sci Technol. 2018;3(1):18–28. https://doi.org/10.17509/ijost.v3i1.10799.

  17. Loreto P, Peixoto H, Abelha A, Machado J. Predicting low birth weight babies through data mining. Adv Intell Syst Comput. 2019;932:568–77. https://doi.org/10.1007/978-3-030-16187-3_55.

  18. Ahmadi P et al. Prediction of low birth weight using Random Forest: A comparison with Logistic Regression, J. Paramed. Sci., 2017;8(3): 36–43. Available: https://journals.sbmu.ac.ir/aab/article/view/15412

  19. Kursa MB, Jankowski A, Rudnicki WR. Boruta - A system for feature selection. Fundam Informaticae. 2010;101(4):271–85. https://doi.org/10.3233/FI-2010-288.

  20. Hsu HH, Hsieh CW, Lu MD. Hybrid feature selection by combining filters and wrappers. Expert Syst Appl. 2011;38(7):8144–50. https://doi.org/10.1016/j.eswa.2010.12.156.

  21. Das A. Logistic regression. Encyclopedia of Quality of Life and Well-Being Research. Cham: Springer International Publishing; 2024. pp. 3985–6.

  22. Shokri R, Stronati M, Song C, Shmatikov V. Membership Inference Attacks against Machine Learning Models. Proc - IEEE Symp Secur Priv. 2017;3–18. https://doi.org/10.1109/SP.2017.41.

  23. Alzubi J, Nayyar A, Kumar A. Machine learning from theory to algorithms: an overview. J Phys Conf Ser. 2018;1142(1). https://doi.org/10.1088/1742-6596/1142/1/012012.

  24. Amiri M, Eftekhari M, Keynia F. Using Naïve Bayes classifier to accelerate constructing fuzzy intrusion detection systems. 2013;(6):453–9.

  25. Li M, Fu X, Li D. Diabetes prediction based on XGBoost Algorithm. IOP Conf Ser Mater Sci Eng. 2020;768(7). https://doi.org/10.1088/1757-899X/768/7/072093.

  26. Freund Y, Schapire RE. Experiments with a new boosting algorithm. Proc 13th Int Conf Mach Learn. 1996:148–56.

  27. Xiahou X, Harada Y. Customer churn prediction using AdaBoost classifier and BP Neural Network Techniques in the E-Commerce industry. Am J Ind Bus Manag. 2022;12(03):277–93. https://doi.org/10.4236/ajibm.2022.123015.

  28. Barker M, Robinson S, Osmond C, Barker DJP. Birth weight and body fat distribution in adolescent girls. Arch Dis Child. 1997;77(5):381–3. https://doi.org/10.1136/adc.77.5.381.

  29. Çam HH, Harunoğulları M, Polat Y. A study of low birth weight prevalence and risk factors among newborns in a public hospital in Kilis, Turkey. 2020;20(2):709–14.

  30. Kumar SN, et al. Predicting risk of low birth weight offspring from maternal features and blood polycyclic aromatic hydrocarbon concentration. Reprod Toxicol. 2020;94(March):92–100. https://doi.org/10.1016/j.reprotox.2020.03.009.

  31. Sharma M, et al. Maternal risk factors of low birth weight in Chandigarh, India. Internet J Health. 2008;9(1):56–9.

  32. Anjum F, Javed T, Afzal MF, Sheikh GA. Maternal risk factors Associated with Low Birth Weight: a Case Control Study. Annals. 2011;17(3):223–8.

  33. Ullah M, Haque M, Hafez M, Khanam M. Biological risk factors of low birth weight in rural Rajshahi. TAJ J Teach Assoc. 2003;16(2):50–3. https://doi.org/10.3329/taj.v16i2.3881.

Funding

No particular grant from a governmental, commercial, or nonprofit funding agency was given for this research.

Author information

Authors and Affiliations

Authors

Contributions

Tasneem Binte Reza: Conceptualization, data analysis, writing original draft, editing. Nahid Salma: Conceptualization, methodology, manuscript revision, editing, critical review, supervision.

Corresponding author

Correspondence to Nahid Salma.

Ethics declarations

Ethics approval and consent to participate

This study used secondary dataset which came from the website of the Demographic and Health Surveys (DHS) Programme (https://dhsprogram.com/data/). For this dataset, ethics approval is not necessary.

Consent for publication

The manuscript has been approved by all authors, who have provided full consent for publication.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Reza, T.B., Salma, N. Prediction and feature selection of low birth weight using machine learning algorithms. J Health Popul Nutr 43, 157 (2024). https://doi.org/10.1186/s41043-024-00647-8
