- Research
- Open access
Decision tree-based learning and laboratory data mining: an efficient approach to amebiasis testing
Parasites & Vectors volume 18, Article number: 33 (2025)
Abstract
Background
Amebiasis represents a significant global health concern. This is especially evident in developing countries, where infections are more common. The primary diagnostic method in laboratories involves the microscopy of stool samples. However, this approach can sometimes result in the misinterpretation of amebiasis as other gastroenteritis (GE) conditions. The goal of the work is to produce a machine learning (ML) model that uses laboratory findings and demographic information to automatically predict amebiasis.
Method
Data extracted from Jordanian electronic medical records (EMR) between 2020 and 2022 comprised 763 amebic cases and 314 nonamebic cases. Patient demographics, clinical signs, microscopic diagnoses, and leukocyte counts were used to train eight decision tree algorithms and compare their accuracy of predictions. Feature ranking and correlation methods were implemented to enhance the accuracy of classifying amebiasis from other conditions.
Results
The primary dependent variables distinguishing amebiasis include the percentage of neutrophils, mucus presence, and the counts of red blood cells (RBCs) and white blood cells (WBCs) in stool samples. Prediction accuracy and precision ranged from 92% to 94.6% when employing decision tree classifiers including decision tree (DT), random forest (RF), XGBoost, AdaBoost, and gradient boosting (GB). However, the optimized RF model demonstrated an area under the curve (AUC) of 98% for detecting amebiasis from laboratory data, utilizing only 300 estimators with a max depth of 20. This study highlights that amebiasis is a significant health concern in Jordan, responsible for 17.22% of all gastroenteritis episodes in this study. Male sex and age were associated with higher incidence of amebiasis (P = 0.014), with over 25% of cases occurring in infants and toddlers.
Conclusions
The application of ML to EMR can accurately predict amebiasis. This finding significantly contributes to the emerging use of ML as a decision support system in parasitic disease diagnosis.
Background
More than a million people are impacted by food and waterborne parasites each year, despite massive international efforts to control and prevent their spread [1]. Entamoeba histolytica is a protozoan parasite that infects humans and causes two forms of pathology: intestinal amebiasis (or amoebic dysentery) and amoebic liver abscesses, parasitic diseases most frequently associated with fatalities globally [2]. Determining the epidemiology of amebiasis is challenging because of its silent nature and technical difficulties. There may be overlap between symptoms of amebiasis and other GE diseases, resulting in variable estimates of the prevalence [3], ranging from approximately 5% [4] to as high as 40% [5].
Amebiasis remains endemic in several developing countries, particularly those with low economic level and high population density. For instance, pathogenic E. histolytica has been found in 22 of 30 American countries, with an overall prevalence of 9% [6]. Amebic infections are particularly prevalent in regions with tropical or subtropical climates, such as Latin America, where 22% of Brazilians have Entamoeba spp. [3], and 28–42% of patients in rural Mexico were found to have pathogenic E. histolytica in a seroprevalence investigation [5]. However, despite being subtropical, only a small percentage of the North Indian population (5%) is affected by E. histolytica [4]. In Asia, a 10-year retrospective study conducted in Taiwan discovered that 16.1 out of every 1,000,000 people develop amebiasis [7].
Middle Eastern countries are considered key endemic foci of amebiasis owing to their large young population and inadequate sanitation and water systems in many areas [8]; however, variations are found in patient records. For instance, amebiasis is low in Iran [9], whereas in Erbil City, northern Iraq, pathogenic E. histolytica accounted for 81.4 % of asymptomatic Entamoeba infections, with an overall frequency of 6% [10]. Furthermore, in Saudi Arabia up to 83% of adult patients in Jeddah City [11] and 40.7% in Najran City [12] had intestinal infections caused by E. histolytica. In patients from different parts of Jordan, Giardia lamblia and E. histolytica/E. dispar were shown to be the most common intestinal parasites. The southern region of Jordan had the highest prevalence (80.7%) [13], while hospitals in the Jordanian city of Amman and its northern region reported prevalence rates of 27.81% and 33%, respectively [14, 15]. Amebiasis has also been detected in specific Jordanian populations. For example, patients with GE symptoms have an E. histolytica infection rate of 10% in Bedouin tribes [16]. In contrast, only one such case has been documented among culinary hotel employees [17]. Overall, amebiasis is endemic, according to previous studies carried out in Jordan. But there are still several unresolved questions. The degree of amebiasis in Jordanian patients of all ages and sexes is unclear, in addition to the clinical and laboratory symptoms.
Frequent episodes of diarrheal to semisoft stools, often containing blood, mucus, and live trophozoites, are hallmarks of amebic dysentery [18]. Abdominal pain can range from slight soreness to intense discomfort. Numerous investigations have documented adverse systemic effects, including elevated leukocyte counts [19, 20]. Amoebic colitis is often accompanied by abdominal soreness, and in some cases, peritonitis or toxic megacolon can exacerbate fulminant colitis [21]. Persistent amebic infection in the colon can also mimic inflammatory bowel disease [22]. Additionally, amoebic dysentery might be mistaken for other causes of GE conditions, including ulcerative colitis, salmonellosis, or shigellosis [23]. Compared with bacillary dysentery, amoebic dysentery typically results in less frequent and less watery stools. Amebiasis is characterized by the presence of mucus and blood specks [24]. However, symptoms of shigellosis often include bloody diarrhea. It is also noted that complete blood count (CBC) measures overlap between amebiasis and other GE cases [25]. In fact, it remains difficult to distinguish amebiasis from other causes of GE as a result of these variables.
Testing for amebiasis involves a variety of laboratory techniques, beginning with simpler methods such as microscopy and advancing to more sophisticated methods such as molecular diagnostics for detecting parasite DNA in feces, immunoassays involving serology, and direct antigen identification [26]. Sometimes, combining multiple tests can yield conclusive results. Although rapid antigen testing is available, it may not effectively distinguish between an acute infection and a past infection [26], especially in populations where the frequency of infection is high [27]. In Jordan, as in many other countries, the primary method used to diagnose amoebic and other infectious gastroenteritis is traditional microscopic examination, which requires qualified personnel. Sensitive and specific immunoassays can detect E. histolytica antigens in feces, thus confirming the diagnosis. However, the majority of public health sector laboratories in Jordan and many other countries lack the immunoassay equipment necessary to confirm infection. Such testing is limited to private sector laboratories, which are not fully accessible to all patients owing to insufficient health insurance coverage. In such cases, integrating electronic system techniques with medical data may offer solutions to diagnostic challenges.
Artificial intelligence (AI) and machine learning (ML) are increasingly used in the medical industry, for instance as diagnostic decision support systems [28]. Machine learning has considerable potential for diagnosing infectious diseases. Machine learning and deep learning (DL) models have been applied to parasitology to predict disease, detect parasites [29], and identify risk factors [30]. However, these applications are limited, with most techniques related to malaria [31], trypanosomiasis [32], and toxoplasmosis [33]. Few published studies have discussed the use of ML for other protozoal infections [34], and none have included the prediction of amebiasis. Since it is impossible to pinpoint the exact cause of GE based solely on clinical features, AI techniques based on laboratory data processing for infectious GE are a crucial field of study. An amoebic trophozoite with characteristic morphology and red blood cell ingestion must be found in a fresh stool specimen to confirm a diagnosis of invasive amebiasis. Even skilled technicians, however, commonly mistake partially digested food particles, WBCs, hematophagous macrophages, and nonpathogenic protozoa for amoebic trophozoites. The application of AI has the potential to automate accurate evaluations, which in turn reduces diagnostic errors and turnaround time and improves performance in prediction and detection. This is especially relevant for amebiasis.
The main goal of this study is to model the identification of amebiasis based on laboratory results and subsequent separation from other GE causes using machine learning. In particular, our objectives are:
1. Assess the current state of amebiasis in Jordan’s Al-Salt Province.
2. Examine the key features that set amebiasis apart from other GE instances. These features are directly generated from clinical, demographic, hematological, and microscopic data.
3. Examine how well decision tree classifiers work with EMR data to identify amebiasis.
4. Verify the outcomes of the selected model (the best performer) by repeatedly using a holdout dataset with different sizes. This makes it more likely that the model will function as anticipated in the actual world and generalize effectively to new unseen inputs.
This study presents, for the first time, the use of feature selection and ML-based decision tree classifier approaches in the context of amebiasis detection from structured data.
Methods
Study environment
The targeted population inhabits Al-Salt, one of Jordan’s largest cities, which is located north of Amman. In this study, we looked at parasitology data from patients at Al-Hussein Hospital, Al-Salt, the city’s only hospital, which is also a government-run university hospital. Thus, the statistics may be viewed as indicative of the 180,090-person majority population of the city, which consists of 51% males and 49% females.
Description of the dataset
Between 7 September 2020 and December 2022, information was compiled from Jordan’s Hakeem electronic health record system [35]. The data used for the analysis included wet-mount reports of the detection of intestinal parasites. Laboratory tests were performed according to the standard operating procedures of the Jordanian health centers. Briefly, stool samples were collected in labeled, clean, dry, leak-proof, and sterile plastic containers for laboratory diagnosis from all patients with GE with suspected involvement of an intestinal parasite. Within 30 min of sample collection, direct stool examinations were performed to check for the presence of common intestinal protozoa (E. histolytica/dispar, Giardia lamblia) and worms (Ascaris lumbricoides, Hymenolepis nana, Enterobius vermicularis, Strongyloides stercoralis, Trichuris trichiura, hookworms, Taenia, and other cestode species). After stool examinations, the infected patients were treated according to the national guidelines.
Positive parasitology results for Entamoeba species (n = 763) were included in this study, as it is the predominant protozoan reported. A few sporadic cases of Giardia (fewer than 20) were reported during the study period; however, they were not included in our analysis. For ML engineering, we added 314 more examples to our analysis. Those instances exhibited GE symptoms, but parasites were not microscopically detected in them.
Hakeem’s sociodemographic data and laboratory results were examined and compiled using a worksheet designed for this purpose. The dataset contained the date, sex, age (years), diagnosis (diarrhea, semisoft stool, or abdominal pain without diarrhea), and stool examination results with severity indicators such as stool red blood cells (RBCs) and stool leukocytes, reported as white blood cell (WBC) counts. The stool WBC count was categorized into three groups based on the number of cells/high-power field: low = 0–3; medium = 3–6; and high = more than 6. Additionally, complete blood count (CBC) results were used to obtain WBC and neutrophil counts because they are considered signs of a reaction to the infection and its severity. The patients were categorized into four age groups: toddlers (less than 5 years), school-going age (5–18 years), active workforce (19–59 years), and adults (older than 60 years).
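The binning described above can be sketched in a few lines of Python. This is a minimal illustration, not the worksheet logic used in the study. The function names are hypothetical, and two boundary choices are assumptions: the stated WBC ranges overlap at 3 cells/high-power field (treated here as "low"), and an age of exactly 60 years is not assigned by the stated groups (placed here in the oldest group).

```python
def wbc_category(cells_per_hpf: float) -> str:
    """Bin a stool WBC count (cells per high-power field) into low/medium/high."""
    # Assumption: the source ranges 0-3 and 3-6 overlap at 3; 3 is treated as "low".
    if cells_per_hpf <= 3:
        return "low"
    if cells_per_hpf <= 6:
        return "medium"
    return "high"


def age_group(age_years: float) -> str:
    """Assign one of the four age groups named in the text."""
    if age_years < 5:
        return "toddlers"
    if age_years < 19:
        return "school-going age"
    # Assumption: age 60 itself is placed in the oldest group.
    if age_years < 60:
        return "active workforce"
    return "adults"


print(wbc_category(5), age_group(0.5), age_group(42))
```

Making the boundary rules explicit in code is useful mainly because the published ranges leave the edge cases open.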
Tables 1 and 2 provide a statistical summary of observations in each class of the dataset. Table 1 presents the summary statistics for the numerical features of the amebiasis and non-amebiasis cases. Similarly, Table 2 shows the frequency distribution of categorical features by their respective values.
As Table 1 shows, for the amebiasis class, the average age is 27.9 ± 25.22 years, with hemoglobin (Hb) levels averaging 12.79 ± 1.88 g/dL and WBC at 9.22 ± 3.58. The neutrophil percentage in this class averages 65.99 ± 15.5%. In contrast, the non-amebiasis class has a slightly younger average age of 24.59 ± 24.91 years, with a lower mean Hb level of 11.91 ± 1.86 g/dL and a higher mean WBC count of 10.93 ± 3.10. The neutrophil percentage in this class is lower, averaging 40.02 ± 21.45%. These differences in hematological parameters may suggest variations in immune response between the two classes.
Table 2 summarizes the distribution of categorical variables in both the amebiasis and non-amebiasis classes by count and percentage. In terms of sex distribution, females represent 50.1% of the amebiasis class and 53.8% of the non-amebiasis class. For diagnosis, diarrhea is the most common condition, affecting 55.8% of individuals with amebiasis and 61.5% of those without. In the diarrhea RBC category, high RBC presence is seen in 53.3% of the amebiasis class but only 1.3% in the non-amebiasis class. Regarding diarrhea-WBC counts, high levels are observed in 43.9% of amebiasis cases, while moderate levels are dominant in the non-amebiasis class (75.5%). For mucus, its absence is most frequent, particularly in the non-amebiasis class (99.7%). These variations highlight differences in symptom distribution between the two classes.
Data processing, analysis, and visualization
After being extracted from Hakeem, the data were cleaned using the Pandas library for Python [36]. The NumPy [37] and SciPy [38] packages were used to categorize the data, establish the association between variables, and display their distribution patterns. To compare qualitative data, a chi-squared test was performed to examine the distributions of categorical variables, specifically the relationship between amebiasis and its risk factors, including sex, age, and laboratory findings that indicated the severity of infection, such as diarrhea or the presence of RBCs or WBCs in the stool. Associations among age, WBC count, neutrophil count, and neutrophil percentage were tested using one-way analysis of variance. Student’s t-tests were used to compare numerical variables (WBC count, neutrophil count, and neutrophil percentage) and to determine significant differences between groups. A significance threshold of P < 0.05 was applied throughout. Graphical data representations were generated using the Matplotlib Pyplot package [39] and Seaborn [40] in Python.
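To make the chi-squared comparison concrete, the following sketch computes the Pearson chi-squared statistic and its P value for a 2 × 2 contingency table using only the Python standard library. It is an illustration under stated assumptions, not the study's analysis script: the function name is hypothetical, the counts are toy values rather than study data, and no continuity correction is applied (SciPy's `chi2_contingency` applies Yates' correction to 2 × 2 tables by default, so its output would differ slightly).

```python
import math


def chi_squared_2x2(table):
    """Pearson chi-squared test of independence for a 2x2 table (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Toy counts only (hypothetical, not the study data):
# rows = two diagnostic classes, columns = a binary finding present / absent.
toy_table = [[10, 20], [20, 10]]
chi2, p = chi_squared_2x2(toy_table)
print(f"chi2 = {chi2:.3f}, P = {p:.4f}, significant at 0.05: {p < 0.05}")
```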
For encoding categorical variables, we employed a straightforward factorization approach, assigning values such as “male” = 0, “female” = 1, “mucus present” = 1, “mucus absent” = 0, and “numerous” = 2, and so on for all categorical variables including “Diagnosis,” “Diarrhea-RBC,” and “Diarrhea-WBC.” Although we initially tested one-hot encoding in our pilot experiments, the results were not as favorable as those obtained with the factorization method. Consequently, we chose to use factorization encoding for all of our reported experiments.
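A minimal sketch of this kind of factorization encoding is shown below. The helper `factorize` is hypothetical and written in plain Python for clarity. The optional `order` argument is an assumption added here so that the codes quoted above ("male" = 0, "female" = 1) can be pinned explicitly; without it, codes are assigned in order of first appearance, which is also how `pandas.factorize` behaves by default.

```python
def factorize(values, order=None):
    """Map each category to an integer code; return (codes, mapping)."""
    # Optional fixed ordering, e.g. to force "male" = 0 and "female" = 1.
    mapping = {category: code for code, category in enumerate(order or [])}
    codes = []
    for value in values:
        if value not in mapping:
            # Unseen category: give it the next unused integer code.
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping


sex_codes, sex_map = factorize(["female", "male", "male"], order=["male", "female"])
mucus_codes, mucus_map = factorize(["mucus absent", "mucus present", "mucus absent"])
print(sex_codes, sex_map)
print(mucus_codes, mucus_map)
```

Integer codes impose an ordering on the categories. Tree-based models split on thresholds and tend to tolerate this, which may partly explain why factorization compared well with one-hot encoding here, although the source does not test that explanation.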
Machine learning model
The goal of this research was to create a model that can correctly identify between amebiasis and non-amebic gastrointestinal data. Classification is the process of predicting an unlabeled object’s class using its properties [28, 41]. The process consists of the following steps:
1. Training: A training dataset is utilized to develop a classification model. We used supervised learning techniques in our research because our data were labeled with both text and numerical values, and the dataset was large and varied. Parametric approaches such as logistic regression (LR) and Naive Bayes (NB) and nonparametric methods such as multilayer perceptron neural networks (MLP), K-nearest neighbor (KNN), Hassanat KNN (HKNN), and decision trees (DT) are examples of machine learning (ML) techniques used in this study. We also looked at techniques such as support vector machine (SVM) and linear SVM (LSVM), which are classified as both parametric and nonparametric. Random forest (RF), AdaBoost, eXtreme gradient boosting (XGBoost), and gradient boosting (GB) were also used as ensemble approaches. We also used additional well-known models that fit linear regressors and classifiers under convex loss functions, such as linear discriminant analysis (LDA) and stochastic gradient descent (SGD) [42,43,44].
2. Testing: Using a different test dataset, this stage assesses the trained model’s effectiveness and quality. To evaluate the classifiers’ prediction accuracy and generalization capacity, the test data are fed into the same classifier models that were obtained in the training phase.
We used fivefold cross-validation to train and test on all of the data, assess the aforementioned ML techniques, and determine which classifier best suited our data for the identification of amebiasis [45, 46]. Each classifier was evaluated using the most common metrics (accuracy, precision, recall, and F1 score) to determine the most effective model for our data. We used the default parameters for each classifier because the aim was to find the classifier that best suits our data rather than to optimize the performance of each one. Such optimization would be the next step only if the data proved too complex to be classified by any of the classifiers examined.
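The evaluation protocol described above can be sketched with scikit-learn as follows. This is an illustration under stated assumptions rather than the authors' script: the data are a synthetic stand-in generated with `make_classification` (the EMR dataset is not reproduced here), only three of the classifiers are shown, and the `random_state` values are arbitrary choices made for repeatability.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the EMR features (assumption: not the study data).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Default hyperparameters, as in the comparison stage described in the text.
classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "NB": GaussianNB(),
}
metrics = ["accuracy", "precision", "recall", "f1"]

results = {}
for name, clf in classifiers.items():
    # cv=5 gives fivefold cross-validation; every sample is tested exactly once.
    scores = cross_validate(clf, X, y, cv=5, scoring=metrics)
    results[name] = {m: scores[f"test_{m}"].mean() for m in metrics}
    print(name, {m: round(v, 3) for m, v in results[name].items()})
```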
Results
Feature engineering
In this study, we employed two feature selection methods: feature correlation [47] and feature importance ranking [48].
The random forest (RF) technique was employed to rank the features in the dataset under investigation on the basis of their importance concerning amebiasis. Several decision trees were built using the robust ensemble learning methodology known as the RF method, and their outputs were combined to increase prediction accuracy and reduce overfitting. In RF, the contribution of each feature to the overall model performance is used to define the feature’s relevance. To be more precise, we used the Gini index as a standard for determining the significance of each attribute. A dataset’s impurity is measured by the Gini index, which indicates the probability that an element selected at random would be incorrectly classified if its labels were distributed in accordance with the subset’s label distribution. A purer node—one in which its feature successfully divides the classes—is indicated by a lower Gini value.
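For a node whose samples fall into classes with proportions \(p_k\), the Gini index is \(G = 1 - \sum_k p_k^2\). The short sketch below computes it for a node and for a candidate split. The function names are hypothetical and the labels are toy values, so this illustrates the criterion only, not the RF implementation used in the study.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini_after_split(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)


parent = [0, 0, 1, 1]                               # evenly mixed node
print(gini_impurity(parent))                        # maximally impure for two classes
print(weighted_gini_after_split([0, 0], [1, 1]))    # a split that separates the classes
```

A feature's importance in a tree ensemble is then commonly taken as the total impurity decrease it produces across all splits that use it.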
A variety of factors, including clinical signs and laboratory data from the patients under study, were incorporated into our analysis. These included total and differential leukocyte counts, red blood cell and fecal leukocyte counts, and other factors including patient age and sex. Our goal was to improve the diagnosis accuracy by taking these aspects into account [49, 50].
Our goal was to pinpoint the most important risk factors that, when taken as a whole, contribute to 95% of the overall feature significance rating that the RF model produced. The proportion of neutrophil percentage, diarrhea-RBC, diarrhea-WBC, WBC, Hb, and mucus presence were the main parameters that were found. Figure 1a lists these parameters in decreasing order of significance.
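The 95% cutoff can be read as a cumulative-importance rule: rank the features, then keep adding them until their combined share of the total importance reaches the threshold. The sketch below illustrates that rule. The function name is hypothetical and the feature names and importance values are placeholders, not the values reported in Fig. 1a. With scikit-learn, the input would typically come from a fitted forest's `feature_importances_` attribute.

```python
def select_by_cumulative_importance(importances, threshold=0.95):
    """Return the top-ranked features whose cumulative importance reaches the threshold."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    total = sum(importances.values())
    selected, cumulative = [], 0.0
    for name, value in ranked:
        selected.append(name)
        cumulative += value / total  # normalize so the shares sum to 1
        if cumulative >= threshold:
            break
    return selected


# Placeholder importances (hypothetical values for illustration only).
toy_importances = {"f1": 0.5, "f2": 0.3, "f3": 0.16, "f4": 0.04}
print(select_by_cumulative_importance(toy_importances))
```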
Using this method not only helped us determine the most important amebiasis predictors but also shed light on the underlying laboratory and clinical traits linked to the illness. We can more effectively guide focused actions and clinical decision-making by concentrating on traits with high significance ratings.
The cor() function from the basic R language package is utilized to compute the correlation; by default, it computes the Pearson correlation coefficient. The Pearson correlation coefficient between variables X and Y can be calculated using the following formula:
\[ r_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}} \]
where \(X_i\) and \(Y_i\) are individual data points for variables X and Y, \(\bar{X}\) and \(\bar{Y}\) are the means of the variables X and Y, respectively, and \(n\) is the number of observations.
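As a cross-check on the definition, the Pearson coefficient can be computed directly from the sums of centered products. The sketch below is a plain-Python illustration with a hypothetical function name and toy inputs. It is not the R call used in the study, although R's cor() computes the same quantity by default.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: centered cross-products over the product of centered norms."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)


# Toy inputs: a perfectly increasing and a perfectly decreasing linear relationship.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
print(pearson_r([1, 2, 3], [3, 2, 1]))
```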
The heatmap displayed in Fig. 1 (row 4) illustrates the Pearson correlation coefficients between all variables and the dependent variable (amebiasis), which show a significant relationship between the aforementioned top six strongest features and the microscopic diagnosis of amebiasis. While neutrophil (%), diarrhea-RBC, diarrhea-WBC, mucus, and Hb have comparatively greater positive associations (0.56, 0.36, 0.30, 0.30, and 0.21, respectively), WBC exhibits a relatively substantial negative correlation (−0.22). The classification of correlation strength (\(r_{XY}\)) is not widely agreed upon; nonetheless, it is generally understood to be weak when \(r_{XY}\) is in the range (−0.4, 0.4), moderate when \(r_{XY}\) falls between 0.4 and 0.7 or between −0.7 and −0.4, and strong when \(r_{XY} > 0.7\) or \(r_{XY} < -0.7\) [51].
In the context of this study, these correlations are seen as having a relative strength even if they do not meet traditional criteria for being called “strong.” Their inclusion as essential features is justified by the relative intensity of the connection with amebiasis. This connection, which is consistent with the RF results as shown by the correlation heatmap (Fig. 1), demonstrates the selection of these six variables, which collectively account for 95% of the RF classification importance.
The Pearson correlation coefficients between all variables and the dependent variable (amebiasis), which are illustrated by the heatmap shown in Fig. 1 (row 4), demonstrate that the aforementioned top six strongest features have a significant link with the microscopic diagnosis of amebiasis. WBC shows a relatively significant negative correlation (−0.22), while neutrophil (%), diarrhea-RBC, diarrhea-WBC, mucus, and Hb show relatively stronger positive relationships (0.56, 0.36, 0.30, 0.30, and 0.21, respectively). The selection of these six features, which together account for 95% of the RF classification importance, is demonstrated by this correlation, which aligns with the RF results, as seen by the correlation heatmap (Fig. 1).
Because Pearson correlation presumes that both variables are continuous and normally distributed, it is typically not advised to utilize it with binary dependent (amebiasis) and continuous independent variables (the other features). In practice, it is still occasionally utilized to quickly understand the linear connection between a continuous and binary variable. Other approaches are advised to obtain a more suitable measure of the relationship between a continuous and binary variable, such as the point-biserial correlation, a special case of the Pearson correlation in which one variable is binary [52]. However, we obtained the same coefficients using the point-biserial correlation as shown in Fig. 2.
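The agreement between the two coefficients is expected algebraically: when one variable is coded 0/1, the point-biserial formula \(r_{pb} = \frac{M_1 - M_0}{s_n}\sqrt{\frac{n_1 n_0}{n^2}}\) (with \(s_n\) the population standard deviation of the continuous variable) reduces to the Pearson coefficient. The sketch below checks this numerically. The function names are hypothetical and the data are toy values, not study measurements.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from centered sums."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    ss_x = sum((xi - mean_x) ** 2 for xi in x)
    ss_y = sum((yi - mean_y) ** 2 for yi in y)
    return cross / math.sqrt(ss_x * ss_y)


def point_biserial(binary, values):
    """Point-biserial correlation between a 0/1 variable and a continuous one."""
    group1 = [v for b, v in zip(binary, values) if b == 1]
    group0 = [v for b, v in zip(binary, values) if b == 0]
    n1, n0, n = len(group1), len(group0), len(values)
    mean1, mean0 = sum(group1) / n1, sum(group0) / n0
    mean_all = sum(values) / n
    # Population standard deviation (divide by n, not n - 1).
    s_n = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    return (mean1 - mean0) / s_n * math.sqrt(n1 * n0 / n**2)


# Toy data (hypothetical): a binary class label and a continuous measurement.
labels = [1, 1, 1, 0, 0, 0, 1, 0]
values = [70.0, 65.0, 80.0, 40.0, 35.0, 50.0, 60.0, 45.0]
print(point_biserial(labels, values), pearson_r(labels, values))
```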
Amebiasis prevalence and related risk factors
A total of 4429 stool tests were performed over the study period on patients with GE symptoms. Among these, 763 were confirmed to have amebiasis (17.22%), with 381 (49.9%) male and 382 (50.1%) female. The study patients’ characteristics are shown in Fig. 3. The study participants were between the ages of less than 1 month and 99 years. Male patients had a median age of 19 years, while female patients had a median age of 24 years. The median ages of males and females, however, did not differ significantly (P = 0.43). The prevalence of amebiasis among males and females across various age groups was not significantly different (P = 0.39). However, among toddlers and preschoolers, more males were observed than females (Fig. 3).
In contrast, considerably more females were observed than males in the adult group (aged 19–59 years). Male sex and age both indicated a higher incidence of amebiasis when comparing males of different ages to females of similar ages (P = 0.014). Among all age groups studied, adults aged 20–59 years had the highest infection rate, whereas those older than 60 years had the lowest infection rate (Fig. 4). The percentages of amoebic infections across age groups did not differ significantly (P = 0.06). According to the clinical signs, the intensity of the symptoms was as follows: abdominal pain without diarrhea, 145 (19%); semisoft stools, 189 (24.7%); watery diarrhea, 426 (55.8%); and rectal bleeding, 3 (0.3%) (Fig. 5).
Individuals aged 1 month to 4 years accounted for 27% of the total, with 107/206 cases (51.9%) diagnosed with watery diarrhea. As expected, WBCs and neutrophil (%) showed a correlation with the diagnosis. These measurements increased when patients were diagnosed with watery diarrhea (Fig. 5).
Machine learning results
All classifiers performed well on our data, scoring at least 70% accuracy and performing well on the other metrics, as can be seen from Table 3. Interestingly, DT, RF, AdaBoost, XGBoost, and GB achieved the highest scores across all metrics, showcasing their strong performance in identifying amebiasis cases even without optimization and parameter tuning. Meanwhile, KNN, LDA, and NB displayed comparatively lower scores, suggesting a need for further optimization or their exclusion from our classification task. These results show how performance varies across models and provide valuable insights for selecting the most suitable classifier for the given dataset, which justifies our comparison of a large set of classifiers to find the best one(s). In terms of theory, DT, RF, AdaBoost, XGBoost, and GB utilize decision trees as their base classifiers. Decision trees are the building blocks of these models, and they are either directly utilized (as in DT and RF) or used as base learners in boosting algorithms such as AdaBoost and gradient boosting. Therefore, it is worth examining the decision tree resulting from the training process. Table 6 presents the decision rules that are extracted from the output decision tree of the DT classifier. As can be seen from Table 6, the resultant decision tree classifies cases with high accuracy based mainly on strong factors such as diarrhea-RBC, diarrhea-WBC, mucus, and neutrophils (%), which are shown to be critical for distinguishing amebiasis from non-amebiasis cases.
In summary, the decision tree, alone or as part of an ensemble approach, seems to effectively categorize cases into amebiasis and non-amebiasis on the basis of the aforementioned strong factors, with high accuracy. This suggests a strong predictive power of these factors in diagnosing amebiasis in the five decision tree-based models. This is also supported by the results shown in Fig. 1b, where these factors were selected as the best among all the factors in our dataset by both the feature selection approach and the correlation of these variables with the amebiasis-dependent variable.
Using its default parameters (max-depth = None, number estimators = 100), the RF classifier was the best performer in all measures, with a relatively small standard deviation across the five runs of the fivefold cross-validation; therefore, it was worth fine-tuning its parameters to attempt to achieve the best performance. Figure 6 shows a grid search to find the best parameters for RF using AUC as the optimization criterion.
The grid search results show that the RF can perform better on our data when using 300 estimators, and the maximum depth of each tree/estimator = 20, showing a very high area under curve (AUC = 0.98) when using these parameters. The classification results using these parameters are presented in Table 4.
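A grid search of this kind can be sketched with scikit-learn's `GridSearchCV`, scoring each parameter combination by cross-validated AUC. The example is illustrative only. The data are synthetic, and the grid contains just the default and the reported best settings, which is an assumption because the full grid explored in Fig. 6 is not listed in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the EMR features (assumption: not the study data).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Assumed grid: the default settings plus the best settings reported in the text.
param_grid = {
    "n_estimators": [100, 300],
    "max_depth": [None, 20],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="roc_auc",  # area under the ROC curve as the selection criterion
    cv=5,               # fivefold cross-validation for every combination
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

On synthetic data the selected combination carries no meaning; the point is only the mechanics of scoring each setting by cross-validated AUC.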
Even though we used fivefold cross-validation, which guarantees that each data sample is used for both training and assessment over the runs, validating the proposed model on a fresh dataset would be a more robust test of generalizability. However, finding a dataset whose properties exactly matched ours proved quite challenging. Furthermore, considering the possible harm that Monte Carlo simulations and other oversampling approaches may do to the integrity of medical data, we are hesitant to use them to synthesize new data for extra validation (see Refs. [45, 46]).
In light of these limitations, we chose to use a holdout set to supplement the findings from our fivefold cross-validation as an alternate validation technique. This method offers a workable option in the event that an external dataset is not available, in addition to enabling us to validate the model’s performance on unseen data [53].
Table 5 illustrates how we used many training–test splits to validate our model’s robustness. In particular, we tested with several ratios, such as 10% training and 90% test, 20%, 30%, 40%, and up to 90% training with the remaining percentage serving as a test set. We were able to verify the model’s performance under various circumstances and sample sizes by iteratively adjusting the training set size.
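The repeated holdout procedure can be sketched as a loop over training fractions from 10% to 90%. As before, this is an illustration on synthetic data, not the study's code. Stratified splitting and the fixed `random_state` are assumptions added here for repeatability, while the RF settings are the tuned values reported in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the EMR features (assumption: not the study data).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

holdout_accuracy = {}
for train_pct in range(10, 100, 10):  # 10%, 20%, ..., 90% used for training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_pct / 100, stratify=y, random_state=0
    )
    model = RandomForestClassifier(n_estimators=300, max_depth=20, random_state=0)
    model.fit(X_train, y_train)
    # Score only on the held-out portion, which the model never saw in training.
    holdout_accuracy[train_pct] = accuracy_score(y_test, model.predict(X_test))

for train_pct, acc in holdout_accuracy.items():
    print(f"train {train_pct}% / test {100 - train_pct}%: accuracy = {acc:.3f}")
```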
Scores are expected to rise as the training set gets larger. However, even with a very small training sample (10% of the data for training and 90% for testing), the validation scores were consistently excellent. This result indicates that the model generalizes well and retains its predictive power despite changes in the volume of training data. It also supports repeated holdout validation as a viable substitute for external validation when no external dataset is available.
As can be seen from Table 4, the RF benefited from the parameter tuning process, slightly increasing its accuracy by about 0.04% and, interestingly, increasing its precision, recall, and F1 score by at least 1.5%. As a result, we recommend using the RF with max-depth = 20 and number estimators = 300 as the best performer among all the classifiers tested for identifying amebiasis cases, given the data at hand.
We reduced the table size by including only the rules that pertain to identifying amebiasis, excluding the rules for non-amebiasis as shown in Table 6. It is worth noting that these rules are primarily composed of the top selected and correlated features, which in themselves can accurately identify amebiasis on the basis of the method used. We did not need to experiment on the selected features because certain machine learning methods incorporate feature selection as a built-in process. A decision tree is generated after evaluating the strength of each feature, starting with the strongest. Some methods utilize information gain, while others use the Gini index to determine the best feature to start the rule with.
The confidence in training is computed on the basis of how accurate a rule is during the training process, that is, the number of correctly classified amebiasis cases divided by the total number of cases classified by the rule. Training rules are expected to be more accurate because the method continues to train until the rules best fit the training data. On the other hand, the confidence in testing is usually lower because the trained model is applied to unseen (test) data. Nevertheless, there was no notable difference between the two, signaling the strong generalizability of the trained model and indicating that it has not been overfit to the training data. While the average of each confidence indicates the effectiveness of the training and testing phases, such measures might not fully represent the performance of an ML method owing to variations in the number of cases handled by each rule.
It is interesting to note that some rules achieved 100% confidence in both training and testing. For instance, the last rule, “If Diarrhea-RBC is very high then Amebiasis is positive,” accurately identified 30 cases during the training phase and 15 cases during the testing phase as amebiasis based solely on this rule. ML methods can easily provide such rules, which might be of great benefit to the medical field.
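The confidence measure, and the caveat about averaging it across rules of different coverage, can be illustrated with a short sketch. The first rule uses the test-phase figure quoted above (15 of 15 cases); the second rule and both helper names are hypothetical and serve only to show how a simple mean and a coverage-weighted mean can diverge.

```python
def rule_confidence(correct, covered):
    """Correctly classified amebiasis cases divided by all cases the rule fired on."""
    return correct / covered if covered else float("nan")


def average_confidence(rules, weighted=False):
    """rules is a list of (correct, covered) pairs, one per rule."""
    if weighted:
        # Coverage-weighted mean: every case counts once, whichever rule fired.
        return sum(c for c, _ in rules) / sum(cov for _, cov in rules)
    # Simple mean: every rule counts equally, regardless of coverage.
    return sum(rule_confidence(c, cov) for c, cov in rules) / len(rules)


rules_test = [
    (15, 15),    # "Diarrhea-RBC is very high" rule, test phase (from the text)
    (160, 200),  # hypothetical high-coverage rule, 80% confidence
]

print("simple mean:  ", round(average_confidence(rules_test), 3))
print("weighted mean:", round(average_confidence(rules_test, weighted=True), 3))
```

The simple mean gives the small, perfect rule the same weight as the large one, which is why averaged confidences alone may overstate or understate overall performance.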
The rules presented in Table 6 are not the sole rules that can be derived from the available data. Each decision tree produces its own set of rules, and any of these rules could be highly beneficial to the medical field. However, exploring all these rules is outside the scope of this paper.
Discussion
On the basis of our evaluation of EMR, a significant proportion of Jordanians living in Al-Salt City are afflicted with amebiasis. It is therefore an ongoing public health issue. Conventional microscopic analysis was the main method used to detect the existence of the infection, and the diagnosis was supported by the discovery of amoebic trophozoites, cysts, or both in the specimen under examination. Verified cases of amebiasis showed signs of gastric distress, such as nausea, diarrhea, and in some cases, bloody feces. It is advised to use molecular approaches to accurately distinguish between various Entamoeba species, such as the pathogenic E. histolytica and other species, which are frequently asymptomatic and have unknown virulence, such as E. dispar, E. hartmanni, and E. moshkovskii [54]. However, specialized PCR-based DNA detection techniques for E. histolytica are only available to reference laboratories.
Previous studies conducted in various districts of Jordan demonstrated significant variability in the prevalence rate [13,14,15]. This variability may be related to geographical factors, such as the kind and supply of drinking water, which may affect the occurrence of amebiasis in different locations. According to a UNICEF report, 93% of Jordanians have access to a safe water source, and 86% have access to a piped network. In urban areas, water is typically available once a week, and less than once every 2 weeks in rural areas, with a reduced frequency during the summer. Only 77.3% of the existing sanitation systems are safely managed, and only one-third of schools have basic sanitation services [55]. This may represent an additional risk factor for the circulation of amebiasis in different regions of the country. Owing to the ongoing water shortage, many Jordanians drink commercially available filtered water, which could reduce the risk of contracting infection. Commercial water is typically offered in open or closed tanks in local markets in Jordan, and these tanks can be filled with reusable oxygen-sterilized containers upon request. However, how often these tanks and containers are inspected remains unclear.
In this study, we demonstrated that it is possible to predict the most crucial features of amoebic gastroenteritis using ML models based on medical records. Because it is challenging to accurately determine the exact species microscopically, the cases were identified as Entamoeba sp. As this is the first study about the prediction of amebiasis using ML, we focused on evaluating the suitability of algorithms using a small number of features that are chiefly derived from microscopic and hematological parameters. Similarly, Sandri et al. used a few hematological markers and the naïve Bayes classifier to successfully separate toxoplasmosis patients from healthy controls [33]. The optimized RF model in our experiment achieved an AUC of up to 98%, whereas their model achieved an AUC of 70%. We may have achieved this high performance by using a substantial quantity of data. In response to the tremendous health crisis caused by the outbreak and reemergence of infections, several research teams have created AI systems aimed at automating the identification of infectious diseases. Chadaga, for example, employed ML to differentiate between COVID-19 and non-COVID-19 pneumonia based on hematological indicators, although relatively few studies have used clinical markers as a method [41]. In comparison with our research, the investigations on malaria focused on increasing accuracy through the application of both conventional ML and DL-based algorithms [31, 56]. The use of ML for protozoal diagnostics currently depends on parasite detection using image recognition [57], while our work maximizes EMR potential for ML parasitological applications. Following this inquiry, we will apply ML for Entamoeba species detection using microscopic images.
Finding the ideal feature combinations to include or leave out of the models is necessary. We employed the RF feature importance ranking method and cross-validation. Using this strategy allows the establishment of a more accurate and useful model to forecast amebiasis in a GE data pool. Cross-validation, the de facto standard in conventional ML research, has only been used in a small number of studies [58]; the majority of studies use hold-out validation. In contrast, our method assesses the reliability and consistency of the features chosen across the different models in use and assists in identifying features with outstanding predictive value. Prior research revealed a lack of diverse datasets, failing to include individuals with different clinical diagnoses, different laboratory results, and different demographics [57]. The development of prediction systems is severely hampered by these characteristics. The relevant studies employed a small number of ML models (e.g., Refs. [33, 34, 57]), which may induce bias and/or limit the search for more efficient ML techniques. In contrast, a variety of models were used in this study to make predictions, which produced an ideal landscape for selecting the model that best fit the available data. The predictive model would have had more statistical power if a larger sample size had been included. Nevertheless, compared with most other research that examined the suitability of ML for diagnosing infectious disorders, this analysis included a comparatively large dataset.
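One plausible way to combine RF feature importance ranking with cross-validation is sketched below. It assumes scikit-learn and NumPy, uses synthetic data in place of the GE records, and averages impurity-based importances over stratified folds. The function `cv_feature_importance`, the fold count, and the feature names are hypothetical, and the exact procedure used in the study may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

# Synthetic stand-in: with shuffle=False the first 3 columns are the informative ones.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]


def cv_feature_importance(X, y, n_splits=5):
    """Mean and standard deviation of RF importances across stratified folds."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    per_fold = []
    for train_idx, _ in skf.split(X, y):
        rf = RandomForestClassifier(n_estimators=100, random_state=0)
        rf.fit(X[train_idx], y[train_idx])
        per_fold.append(rf.feature_importances_)
    return np.mean(per_fold, axis=0), np.std(per_fold, axis=0)


mean_imp, std_imp = cv_feature_importance(X, y)
ranking = np.argsort(mean_imp)[::-1]  # indices from most to least important
for idx in ranking:
    print(f"{feature_names[idx]}: {mean_imp[idx]:.3f} +/- {std_imp[idx]:.3f}")
```

A small standard deviation across folds suggests that a feature's rank is stable rather than an artifact of one particular split.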
Leukocytosis in the stool, mucus, and red blood cells were the features relevant to the presence of amebiasis in this investigation. These findings were in excellent agreement with traditional laboratory indicators of an inflammatory illness caused by invasive amoebic infections. Indeed, the concordance between the outcomes of the ML approach and the conventional laboratory indicators serves as evidence of the ML model’s robustness. Increased mucus production and stool RBCs are typically indicators of invasive enteric infections that breach the mucosa and cause tissue damage [59]. Many polymorphonuclear leukocytes (PMNs) on stool microscopy suggest an inflammatory condition in the colon. Patients with amebiasis may display fecal PMNs, although they are often less numerous [60]. Generally speaking, microscopic analysis of mucus, PMNs, and RBCs does not by itself establish the etiology, because these findings could also indicate a more serious disease such as ulcerative colitis, Crohn’s disease, or colon cancer. In comparison, using ML models combined with a feature selection technique relies on each feature’s strong correlation with the class (amebiasis, non-amebiasis), while also accounting for the intercorrelation between the features used. These considerations lend statistical support to each computed correlation. This is evident from the decision rules extracted from the resulting decision tree model for determining amebiasis (Table 6). As a result, the use of AI/ML diagnostic approaches has been shown to be crucial for optimizing the outcomes of microscopic examinations that identify amebiasis infections.
In our experiments, we found that decision tree-based classifiers consistently outperformed other types of classifiers, and there are several reasons for this strong performance. First and foremost, our dataset primarily consisted of categorical variables, which we encoded using a straightforward factorization approach (e.g., assigning values such as 0, 1, 2, 3, etc.). Decision tree models, such as random forests and other tree ensembles, are particularly well suited for handling categorical data. They can easily split the data based on these categorical values without needing complicated transformations or assumptions about how the data are distributed, unlike models such as SVM or LDA. Our results are consistent with the work of Mathison et al. [61], who used a convolutional neural network (CNN) model to detect intestinal protozoa and obtained a good agreement of 98%.
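The factorization encoding described above can be illustrated with pandas, which the study lists among its tools. The column names and category values below are hypothetical examples, not the study's actual variables. One-hot encoding is shown alongside only for contrast.

```python
import pandas as pd

# Hypothetical categorical laboratory findings.
df = pd.DataFrame({
    "mucus": ["absent", "present", "present", "absent"],
    "stool_rbc": ["low", "very high", "moderate", "low"],
})

# Factorization: each distinct category gets an integer code in order of appearance.
codes, uniques = pd.factorize(df["stool_rbc"])
print(list(codes))    # integer code per row
print(list(uniques))  # category that each code stands for

# One-hot encoding for comparison: one indicator column per category.
one_hot = pd.get_dummies(df["stool_rbc"], prefix="stool_rbc")
print(one_hot.columns.tolist())
```

Integer codes impose an arbitrary order on the categories. Tree splits tolerate this reasonably well, whereas distance-based or linear models can be misled by it.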
Secondly, we observed considerable class overlap in our dataset, as shown in Fig. 7. This overlap posed a significant challenge for models that depend on linear decision boundaries, such as SVM and LDA, which struggled to differentiate between the classes effectively. In contrast, decision trees are nonparametric and can create nonlinear decision boundaries, allowing them to adapt better to the data’s structure, including areas where classes overlap. This flexibility enabled tree-based models to capture subtle patterns and relationships that other models missed.
Additionally, ensemble methods based on decision trees, such as random forests, gained an advantage by reducing variance through the aggregation of multiple trees, each trained on different subsets of the data. This approach not only enhanced predictive performance but also made the models more resilient to noise and variations in the dataset, improving their ability to generalize to new, unseen data.
We may be able to enhance the performance of the least-performing algorithms, such as KNN, by employing other encoding techniques, such as one-hot encoding [49], and alternative distance measures, such as the Hassanat distance, which has been shown to be unaffected by outliers and data noise [62, 63]. However, such an enhancement is outside the scope of this paper. This distance is the core of HKNN, in which the results of KNN were increased by at least 10%; this finding is supported by Refs. [64,65,66,67,68,69], among others.
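For readers unfamiliar with the Hassanat distance, the sketch below follows its commonly published definition: each dimension contributes a bounded term based on the ratio of the smaller to the larger value, shifted when negative values occur. The formula is reproduced here from memory of the literature and should be verified against Refs. [62, 63] before use. The function name is an assumption made for illustration.

```python
def hassanat_distance(a, b):
    """Sum of per-dimension bounded distances, each in the interval [0, 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for x, y in zip(a, b):
        lo, hi = min(x, y), max(x, y)
        if lo >= 0:
            total += 1.0 - (1.0 + lo) / (1.0 + hi)
        else:
            # Shift both values by |lo| so the ratio stays well defined.
            shift = abs(lo)
            total += 1.0 - (1.0 + lo + shift) / (1.0 + hi + shift)
    return total


print(hassanat_distance([1.0], [3.0]))        # moderate difference
print(hassanat_distance([0.0], [1_000_000]))  # extreme outlier, still below 1
```

Because each dimension contributes less than 1 regardless of magnitude, a single extreme value cannot dominate the total distance, which is the property behind the robustness to outliers noted above.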
Accuracy is a statistic commonly used to assess a classifier’s overall performance in binary classification settings [45]. The F1 score is a metric that combines precision and recall. It is particularly useful for evaluating a classifier’s performance in scenarios involving class imbalance or where false positives and false negatives are both important. The F1 score is especially useful when attempting to find a compromise between precision and recall, as it provides a single statistic that takes both into consideration. The F1 score thus emerges as a suitable metric for classifier comparisons in our case [70,71,72], as it accounts for the classifier’s performance on imbalanced data, even when the imbalance is slight as in our case, and reflects the importance in a medical application of trading off false positives against false negatives [63]. Consequently, it was determined that the best classifier for the amebiasis prediction system was a decision tree approach.
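The relationship between these metrics can be made explicit with a small worked example. The confusion-matrix counts below are hypothetical and are not results from this study; the helper name `classification_metrics` is likewise illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from binary confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 90 true positives, 10 false positives,
# 30 false negatives, 70 true negatives.
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here precision is 0.90 and recall is 0.75, and the F1 score of about 0.818 sits closer to the lower of the two, which is how the harmonic mean penalizes an imbalance between them.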
An additional finding from this research indicates a negative correlation between WBCs and amebiasis. This finding has clinical implications because, unlike in invasive bacterial infections such as those caused by Salmonella or Escherichia coli [73], leukocytosis is not typically expected in amebiasis. On the other hand, neutrophil (%) correlated positively with amebiasis (0.56). In fact, a sizable fraction of amebiasis cases have neutrophilic leukocytosis (Supplementary file S1: Table S1) [74]. These figures could potentially indicate the presence of a concurrent or secondary bacterial infection. However, the most common form of invasive amebiasis, amoebic liver abscess, might be associated with neutrophilic leukocytosis, along with clinical manifestations such as watery diarrhea, and less frequently with amebic colitis or even necrotizing colitis [75,76,77]. Hegazi et al. (2021) showed that more than 50% of hospitalized young children in Saudi Arabia had aberrant leukocytosis and neutrophilia according to age, although they did not show any amebic liver abscess on abdominal ultrasonography [20].
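A correlation between a continuous laboratory value and a binary diagnosis can be computed as a point-biserial coefficient, which is numerically identical to Pearson's r when one variable is coded 0/1. The sketch below assumes this is the kind of statistic behind the reported 0.56, which the text does not state explicitly at this point. The toy values and function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def point_biserial(x, y):
    """Point-biserial correlation between continuous x and binary y (0/1)."""
    g1 = [xi for xi, yi in zip(x, y) if yi == 1]
    g0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n, n1, n0 = len(x), len(g1), len(g0)
    # Difference of group means, scaled by the population standard deviation.
    return (mean(g1) - mean(g0)) / pstdev(x) * sqrt(n1 * n0 / n**2)


def pearson(x, y):
    """Plain Pearson correlation, used here only as a cross-check."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return cov / sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))


# Hypothetical neutrophil percentages and diagnoses (1 = amebiasis).
neutrophil_pct = [70, 65, 80, 75, 50, 60, 55, 45]
amebiasis = [1, 1, 1, 1, 0, 0, 0, 0]

r_pb = point_biserial(neutrophil_pct, amebiasis)
print(round(r_pb, 3), round(pearson(neutrophil_pct, amebiasis), 3))
```

A positive coefficient means the laboratory value tends to be higher in the positive class; it does not by itself indicate how well the groups can be separated.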
Clinical symptoms, such as loose stools, bleeding, and abdominal pain, along with neutrophilia, suggested that amebiasis is aggressive in this cohort and signaled the possibility of early invasive amoebic disease, requiring prompt and appropriate diagnosis and treatment. Young children and toddlers under the age of 5 years made up more than a quarter of the patients in this study, and more than half of them were diagnosed with watery diarrhea, which indicates a high severity of infection. This could have severe effects on children of this age, as diarrheal infections linked to amebiasis are more likely to cause dehydration, stunted growth, and malnutrition [73, 78].
Severe complications caused by dehydration necessitate immediate hospitalization. The risk of infection is increased by inadequate breastfeeding, mixed feeding, and daycare [20]. This finding highlights the significance of closely monitoring GE in Jordanian children. It is essential to determine whether the rising prevalence of amebiasis in children is consistent in different regions across the country. The few available reports indicate inconsistent results. Specifically, prior surveillance research conducted in the capital city of Jordan found that the frequency of intestinal parasites (including E. histolytica) was the highest in children under the age of 5 years [14].
Additionally, children under the age of 15 years make up more than 60% of affected patients in northern Jordanian cities [15]. In contrast, adult patients in southern sites have E. histolytica at higher rates than children [13]. Obtaining more information on the socioeconomic situation, water quality, dietary background, nursery, and daycare is essential to understanding the causes of amebiasis in Jordan’s toddlers and young children.
This study had some limitations. The survey was performed in the Jordanian city of Al-Salt and thus focused on a single community. A more comprehensive, multicentric study involving different cohorts of patients from diverse ethnic, socioeconomic, and regional backgrounds would provide more concrete data and reproducible patterns on the distribution of potentially diverse parasite genotypes and respective differences in the pathogenesis and immune responses upon infection. However, the number of patients admitted to the city’s only hospital reflected a sufficient sample of the local population. In addition, information on immunological parameters, such as cytokine levels and the genotyping of the recovered Entamoeba spp., would help clarify the possible causes behind the severity of amebiasis in patients from Jordan and would improve ML applications in the future.
Conclusions
This research adds to the current application of AI technology by training existing ML-based decision tree models to identify amebiasis, with greater accuracy achieved through optimization. It was possible to detect amebiasis cases with excellent accuracy. The application of ML demonstrates the technology’s ability to mine electronic medical information and trigger therapeutic action. Our methodology has the benefit of incorporating feature ranking with multiple classifiers and comparing their performance on a GE pool dataset. We used the decision tree classifier to create the best prediction system possible. A variety of laboratory tests are included in this system, with a focus on diarrhea-RBC, diarrhea-WBC, neutrophil (%), and WBC measures as the strongest features. Using the data on hand, we have depicted the epidemiological picture of amebiasis among patients in all age and sex categories. Large-scale research on amebiasis in Jordan is required, as is the development of preventative strategies to lower the disease’s prevalence in high-risk settings, including daycare centers and educational institutions.
Data availability
A novel dataset supporting this study’s finding will be shared alongside the publication, and a suitable link to the data will be provided.
Code availability
The code can be made available upon request from the last author.
References
Troeger C, Forouzanfar M, Rao PC, Khalil I, Brown A, Reiner RC, et al. Estimates of global, regional, and national morbidity, mortality, and aetiologies of diarrhoeal diseases: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Infect Dis. 2017;17:909–48.
Shirley DAT, Farr L, Watanabe K, Moonah S. A review of the global burden, new diagnostics, and current therapeutics for amebiasis. In: Open forum infectious diseases. vol. 5. Oxford University Press US; 2018. p. ofy161.
dos Santos Zanetti A, Malheiros AF, de Matos TA, Dos Santos C, Battaglini PF, Moreira LM, et al. Diversity, geographical distribution, and prevalence of Entamoeba spp. in Brazil: a systematic review and meta-analysis. Parasite. 2021;28:17.
Singh A, Banerjee T, Khan U, Shukla SK. Epidemiology of clinically relevant Entamoeba spp. (E. histolytica/dispar/moshkovskii/bangladeshi): a cross sectional study from North India. PLoS Neglect Trop Dis. 2021;15:e0009762.
Alvarado-Esquivel C, Hernandez-Tinoco J, Sanchez-Anguiano LF. Seroepidemiology of Entamoeba histolytica infection in general population in rural Durango, Mexico. J Clin Med Res. 2015;7:435.
Servián A, Helman E, Iglesias MR, Panti-May JA, Zonta ML, Navone GT. Prevalence of human intestinal Entamoeba spp. in the Americas: a systematic review and meta-analysis, 1990–2022. Pathogens. 2022;11:1365.
Lin FH, Chen BC, Chou YC, Chien WC, Chung CH, Hsieh CJ, et al. The epidemiology of Entamoeba histolytica infection and its associated risk factors among domestic and imported patients in Taiwan during the 2011–2020 Period. Medicina. 2022;58:820.
Flaih MH, Khazaal RM, Kadhim MK, Hussein KR, Alhamadani FAB. The epidemiology of amoebiasis in Thi-Qar Province, Iraq (2015–2020): differentiation of Entamoeba histolytica and Entamoeba dispar using nested and real-time polymerase chain reaction. Epidemiol Health. 2021;43:e2021034.
Haghighi A, Riahi SM, Taghipour A, Spotin A, Javanian M, Mohammadi M, et al. Amoebiasis in Iran: a systematic review and meta-analysis. Epidemiol Infect. 2018;146:1880–90.
Mahmood SAF, Bakr HM. Molecular identification and prevalence of Entamoeba histolytica, Entamoeba dispar and Entamoeba moshkovskii in Erbil City, Northern Iraq. Polish J Microbiol. 2020;69:263–72. https://doi.org/10.33073/pjm-2020-028.
Bakhraibah AO. Prevalence of Entamoeba histolytica in adult diarrheic patients of King Fahd Hospital in Jeddah, Saudi Arabia. Int J Pharm Res Allied Sci. 2018;7:177–82.
Fathi A, Bahnass M, Elshahawy I. Seroprevalence of amoebiasis in Najran Saudi Arabia. Tropic Biomed. 2017;34:732–40.
Nawafleh H, Al Hroob AM, Kawafha MM, Altaif KI. Epidemiological study: laboratory data mining in south of Jordan. Am J Infect Dis. 2014;10:137.
Chazal A, Adi H. The prevalence of intestinal parasites in Amman, Jordan. Bull Pharm Sci Assiut Univ. 2007;30:235–9.
Jaran A. Prevalence and seasonal variation of human intestinal parasites in patients attending hospital with abdominal symptoms in northern Jordan. EMHJ-Eastern Mediterr Health J. 2016;22:756–60.
Nimri L, Meqdam M. Enteropathogens associated with cases of gastroenteritis in a rural population in Jordan. Clin Microbiol Infect. 2004;10:634–9.
Abdel-Dayem M, Al Zou’bi R, Hani RB, Amr ZS. Microbiological and parasitological investigation among food handlers in hotels in the Dead Sea area, Jordan. J Microbiol Immunol Infect. 2014;47:377–80.
Zulfiqar H, Mathew G, Horrall S. Amebiasis. Treasure Island: StatPearls Publishing; 2024.
Ghosh S, Sharma S, Gadpayle A, Gupta H, Mahajan R, Sahoo R, et al. Clinical, laboratory, and management profile in patients of liver abscess from northern India. J Trop Med. 2014;2014:142382.
Hegazi MA, Patel TA, El-Deek BS. Prevalence and characters of Entamoeba histolytica infection in Saudi infants and children admitted with diarrhea at 2 main hospitals at South Jeddah: a re-emerging serious infection with unusual presentation. Braz J Infect Dis. 2013;17:32–40.
Shirley DA, Moonah S. Fulminant amebic colitis after corticosteroid therapy: a systematic review. PLoS Negl Tropic Dis. 2016;10:e0004879.
Babić E, Bevanda M, Mimica M, Karin M, Volarić M, Bogut A, et al. Prevalence of amebiasis in inflammatory bowel disease in University Clinical Hospital Mostar. Springerplus. 2016;5:1–4.
Hong SM, Baek DH. A review of colonoscopy in intestinal diseases. Diagnostics. 2023;13:1262.
Tanyuksel M, Petri WA Jr. Laboratory diagnosis of amebiasis. Clin Microbiol Rev. 2003;16:713–29.
Tatliparmak AC, Yilmaz S, Colak FU, Erdil FN. Diagnostic and sentinel surveillance process for amebiasis in the emergency department. J Med Surg Public Health. 2023;1:100004.
Morán P, Serrano-Vázquez A, Rojas-Velázquez L, González E, Pérez-Juárez H, Hernández EG, et al. Amoebiasis: advances in diagnosis, treatment, immunology features and the interaction with the intestinal ecosystem. Int J Mol Sci. 2023;24:11755.
Carrero JC, Reyes-López M, Serrano-Luna J, Shibayama M, Unzueta J, León-Sicairos N, et al. Intestinal amoebiasis: 160 years of its first detection and still remains as a health problem in developing countries. Int J Med Microbiol. 2020;310:151358.
Khanna VV, Chadaga K, Sampathila N, Chadaga R, Prabhu S, Swathi K, et al. A decision support system for osteoporosis risk prediction using machine learning and explainable artificial intelligence. Heliyon. 2023;9:e22456.
Soares FA, Suzuki CTN, Sabadini E, Falcão AX, de Oliveira Baccin A, de Melo LCV, et al. Laboratory validation of the automated diagnosis of intestinal parasites via fecal sample processing for the recovery of intestinal parasites through the dissolved air flotation technique. Parasit Vectors. 2024;17:368.
Hu RS, Hesham AEL, Zou Q. Machine learning and its applications for protozoal pathogens and protozoal infectious diseases. Front Cell Infect Microbiol. 2022;12:882995.
Fuhad KF, Tuba JF, Sarker MRA, Momen S, Mohammed N, Rahman T. Deep learning based automatic malaria parasite detection from blood smear and its smartphone based application. Diagnostics. 2020;10:329.
Benfodil K, Benbouras MA, Ansel S, Mohamed-Cherif A, Ait-Oudhia K. Prediction of Trypanosoma evansi infection in dromedaries using artificial neural network (ANN). Vet Parasitol. 2022;306:109716.
Sandri V, Gonçalves IL, Machado das Neves G, Romani Paraboni ML. Diagnostic significance of C-reactive protein and hematological parameters in acute toxoplasmosis. J Parasit Dis. 2020;44:785–93.
Ligda P, Claerebout E, Kostopoulou D, Zdragas A, Casaert S, Robertson LJ, et al. Cryptosporidium and Giardia in surface water and drinking water: animal sources and towards the use of a machine-learning approach as a tool for predicting contamination. Environ Pollut. 2020;264:114766.
Electronic Health Solutions. Hakeem Program; 2024. https://ehs.com.jo/hakeem-program. Accessed 7 July 2024.
The Pandas Development Team. Pandas Documentation: Getting Started - Overview; 2024. https://pandas.pydata.org/docs/_sources/getting_started/overview.rst.txt. Accessed 7 July 2024.
Harris CR, Millman KJ, van der Walt SJ, Gommers R, Virtanen P, Cournapeau D, et al. Array programming with NumPy. Nature. 2020;585:357–62.
Virtanen P, Gommers R, Oliphant TE, Haberland M, Reddy T, Cournapeau D, et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17:261–72.
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9:90–5.
Waskom M, Botvinnik O, O’Kane D, Hobson P, Lukauskas S, Gemperline DC, et al. mwaskom/seaborn: v0.8.1 (September 2017). Zenodo. 2017.
Chadaga K, Prabhu S, Bhat V, Sampathila N, Umakanth S, Chadaga R. Artificial intelligence for diagnosis of mild-moderate COVID-19 using haematological markers. Ann Med. 2023;55:2233541.
Tarawneh AS, Alamri ES, Al-Saedi NN, Alauthman M, Hassanat AB. CTELC: a constant-time ensemble learning classifier based on KNN for big data. IEEE Access. 2023;11:89791–802.
Hassanat AB, Ali HN, Tarawneh AS, Alrashidi M, Alghamdi M, Altarawneh GA, et al. Magnetic force classifier: a novel method for big data classification. IEEE Access. 2022;10:12592–606.
Hassanat AB. Furthest-pair-based decision trees: experimental results on big data classification. Information. 2018;9:284.
Hassanat A, Altarawneh G, Alkhawaldeh IM, Alabdallat YJ, Atiya AF, Abujaber A, et al. In: 2023 IEEE Symposium on Computers and Communications (ISCC). IEEE. 2023:1–7.
Tarawneh AS, Hassanat AB, Altarawneh GA, Almuhaimeed A. Stop oversampling for class imbalance learning: a review. IEEE Access. 2022;10:47643–60.
Huang Ll, Tang J, Chen Sb, Ding C, Luo B. An efficient algorithm for feature selection with feature correlation. In: Intelligent Science and Intelligent Data Engineering: Third Sino-foreign-interchange Workshop, IScIDE 2012, Nanjing, China, October 15–17, 2012. Revised Selected Papers 3. Springer; 2013. p. 639–46.
Wojtas M, Chen K. Feature importance ranking for deep learning. Adv Neural Inf Process Syst. 2020;33:5105–14.
Alkhawaldeh I, Al-Jafari M, Abdelgalil M, Tarawneh A, Hassanat A. P-358 a machine learning approach for predicting bone metastases and its three-month prognostic risk factors in hepatocellular carcinoma patients using SEER data. Ann Oncol. 2023;34:S140.
Alkhawaldeh IM, Altarawneh G, Al-Jafari M, Abdelgalil MS, Tarawneh AS, Machine Hassanat A. A, et al. In: 2023 IEEE Symposium on Computers and Communications (ISCC). IEEE. 2023:1–5.
Tanni SE, Patino CM, Ferreira JC. Correlation vs. regression in association studies. Jornal Brasileiro de Pneumologia. 2020;46:e20200030.
Tate RF. Correlation between a discrete and a continuous variable. Point-biserial correlation. Ann Math Stat. 1954;25:603–7.
Tam A. Training-validation-test split and cross-validation done right. Mach Learn Mastery. 2021;23.
Al-Dalabeeh EA, Irshaid FI, Roy S, Ali IKM, Al-Shudifat AM. Identification of Entamoeba histolytica in patients with suspected amebiasis in Jordan using PCR-based assays. Pak J Biol Sci. 2020;23:166–72.
UNICEF Jordan. Water, sanitation and hygiene; 2024. https://www.unicef.org/jordan/water-sanitation-and-hygiene. Accessed 7 July 2024.
Dong Y, Jiang Z, Shen H, Pan WD, Williams LA, Reddy VV, et al. In: 2017 IEEE EMBS international conference on biomedical & health informatics (BHI). IEEE. 2017:101–4.
Mbunge E, Batani J. Application of deep learning and machine learning models to improve healthcare in sub-Saharan Africa: emerging opportunities, trends and implications. Telematics and Informatics Reports. 2023:100097.
Rajaraman S, Jaeger S, Antani SK. Performance evaluation of deep neural ensembles toward malaria parasite detection in thin-blood smear images. PeerJ. 2019;7:e6977.
Tatliparmak AC, Yilmaz S, Colak FU, Erdil FN. Diagnostic and sentinel surveillance process for amebiasis in the emergency department. J Med Surg Public Health. 2023;1:100004.
Fernández-López LA, Gil-Becerril K, Galindo-Gómez S, Estrada-García T, Ximénez C, Leon-Coria A, et al. Entamoeba histolytica interaction with enteropathogenic Escherichia coli increases parasite virulence and inflammation in amebiasis. Infect Immun. 2019;87:10–1128.
Mathison BA, Kohan JL, Walker JF, Smith RB, Ardon O, Couturier MR. Detection of intestinal protozoa in trichrome-stained stool specimens by use of a deep convolutional neural network. J Clin Microbiol. 2020;58:10–1128.
Abu Alfeilat HA, Hassanat AB, Lasassmeh O, Tarawneh AS, Alhasanat MB, Eyal Salman HS, et al. Effects of distance measure choice on k-nearest neighbor classifier performance: a review. Big Data. 2019;7:221–48.
Hassanat A, Alkafaween E, Tarawneh AS, Elmougy S. Applications review of hassanat distance metric. In: 2022 International Conference on Emerging Trends in Computing and Engineering Applications (ETCEA). IEEE; 2022. p. 1–6.
Ehsani R, Drabløs F. Robust distance measures for k NN classification of cancer data. Cancer Inform. 2020;19:1176935120965542.
Jiřina M, Krayem S. The distance function optimization for the near neighbors-based classifiers. ACM Trans Knowl Discov Data. 2022;16:1–21.
Hofer E, Mohrenschildt M. Locally-scaled kernels and confidence voting. Mach Learn Knowl Extr. 2024;6:1126–44.
Na J, Wang Z, Lv S, Xu Z. An extended k nearest neighbors-based classifier for epilepsy diagnosis. IEEE Access. 2021;9:73910–23.
Uddin S, Haque I, Lu H, Moni MA, Gide E. Comparative performance analysis of K-nearest neighbour (KNN) algorithm and its different variants for disease prediction. Sci Rep. 2022;12:6256.
Hase VJ, Bhalerao YJ, Verma S, Wakchaure V, Vikhe G. Intelligent threshold prediction in hybrid mesh segmentation using machine learning classifiers. Int J Manag Technol Eng. 2018;8:1426–42.
Grandini M, Bagli E, Visani G. Metrics for multi-class classification: an overview. arXiv preprint arXiv:2008.05756. 2020.
Al-khlifeh EM, Hassanat AB. Predicting the distribution patterns of antibiotic-resistant microorganisms in the context of Jordanian cases using machine learning techniques. J Appl Pharm Sci. 2024;14:174–83.
Al-Khlifeh EM, Alkhazi IS, Alrowaily MA, Alghamdi M, Alrashidi M, Tarawneh AS, et al. Extended spectrum beta-lactamase bacteria and multidrug resistance in Jordan are predicted using a new machine-learning system. Infect Drug Resist. 2024;17:3225–40.
Mercado EH, Ochoa TJ, Ecker L, Cabello M, Durand D, Barletta F, et al. Fecal leukocytes in children infected with diarrheagenic Escherichia coli. J Clin Microbiol. 2011;49:1376–81.
Alkhlifeh EM. Analysis of unique presentation of amebiasis: experience from Jordan. medRxiv. 2023:2023–11.
Yue B, Meng Y, Zhou Y, Zhao H, Wu Y, Zong Y. Characteristics of endoscopic and pathological findings of amebic colitis. BMC Gastroenterol. 2021;21:1–6.
Salles JM, Moraes LA, Salles MC. Hepatic amebiasis. Brazilian J Infect Dis. 2003;7:96–110.
Nakada-Tsukui K, Nozaki T. Immune response of amebiasis and immune evasion by Entamoeba histolytica. Front Immunol. 2016;7:175.
Mondal D, Petri WA Jr, Sack RB, Kirkpatrick BD, Haque R. Entamoeba histolytica-associated diarrheal illness is negatively associated with the growth of preschool children: evidence from a prospective study. Trans Royal Soc Trop Med Hygiene. 2006;100:1032–8.
Acknowledgements
We acknowledge Jordan’s AL-Hussein/Salt Hospital for allowing access to medical records from the microbiology laboratory section. We also thank Ms. Dina Al-Zoubi in the parasitology laboratory for her assistance in laboratory data collection and verification steps.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Author information
Contributions
E.K.: provided concepts and ideas, structured the research, defined the intellectual content, searched for relevant literature, collected data, and wrote the original draft of the manuscript. A.T.: performed data analysis and drafted, revised, and critically reviewed the article. K.A.: drafted, revised, or critically reviewed the article. M.A.: drafted, revised, or critically reviewed the article. R.H.: drafted, revised, or critically reviewed the article. A.H.: carried out data acquisition and data analysis, searched for relevant literature, and wrote the original draft of the manuscript.
Ethics declarations
Ethics approval and consent to participate
This study has been approved by the scientific and administration committee of scientific research at Al-Balqa Applied University and Jordan’s Al-Hussein/Salt Hospital. The material is the authors’ original work that has not been previously published elsewhere.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Al-khlifeh, E., Tarawneh, A.S., Almohammadi, K. et al. Decision tree-based learning and laboratory data mining: an efficient approach to amebiasis testing. Parasites Vectors 18, 33 (2025). https://doi.org/10.1186/s13071-024-06618-6
DOI: https://doi.org/10.1186/s13071-024-06618-6