Subsumption is a Novel Feature Reduction Strategy for High Dimensionality Datasets

  • Donald C. Wunsch III Missouri University of Science and Technology, USA
  • Daniel B. Hier Missouri University of Science and Technology, USA
Keywords: Machine Learning, Feature Reduction, Neurology, Ontology, Principal Components, Relief, Subsumption

Abstract

High dataset dimensionality poses challenges for machine learning classifiers because of high computational costs and the adverse consequences of redundant features. Feature reduction is an attractive remedy for high dimensionality. Three feature reduction strategies (subsumption, Relief F, and principal component analysis) were evaluated using four machine learning classifiers on a high-dimensional dataset with 474 unique features, 20 diagnoses, and 364 instances. All three strategies achieved substantial feature reduction while maintaining classification accuracy. At high levels of feature reduction, the principal components strategy outperformed Relief F and subsumption. Subsumption is a novel strategy for feature reduction when features are organized in a hierarchical ontology.
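The subsumption idea described above can be sketched in a few lines: when each feature is a term in a hierarchical (is-a) ontology, a low-level feature can be replaced by an ancestor term, and any features that map to the same ancestor are merged. The toy hierarchy, feature names, and merge-by-logical-OR rule below are illustrative assumptions, not the paper's actual ontology or implementation.

```python
# Minimal sketch of ontology-based subsumption for feature reduction.
# Assumptions: binary (present/absent) features, a toy is-a hierarchy,
# and OR-merging of features that subsume to the same ancestor term.
from collections import defaultdict

# Toy child -> parent links; the paper used a full neuro-ontology.
PARENT = {
    "hand weakness": "weakness",
    "leg weakness": "weakness",
    "weakness": "motor sign",
    "tremor": "motor sign",
    "aphasia": "cognitive sign",
    "memory loss": "cognitive sign",
}

def ancestor(term: str, levels: int) -> str:
    """Walk `levels` steps up the hierarchy, stopping at a root."""
    for _ in range(levels):
        if term not in PARENT:
            break
        term = PARENT[term]
    return term

def subsume(instances: list[dict], levels: int = 1) -> list[dict]:
    """Replace each feature by its ancestor term, OR-merging features
    that subsume to the same ancestor. More levels -> fewer features."""
    reduced = []
    for inst in instances:
        merged = defaultdict(int)
        for feature, present in inst.items():
            merged[ancestor(feature, levels)] |= present
        reduced.append(dict(merged))
    return reduced

patients = [
    {"hand weakness": 1, "leg weakness": 1, "tremor": 0, "aphasia": 0},
    {"hand weakness": 0, "leg weakness": 0, "tremor": 1, "memory loss": 1},
]

print(subsume(patients, levels=1))  # 4 features collapse to 3
print(subsume(patients, levels=2))  # ...and then to 2
```

Raising `levels` moves every feature further up the hierarchy, which is how subsumption trades feature granularity for dimensionality, analogous to choosing fewer components in PCA or a lower feature count with Relief F.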


Published
2022-02-08
How to Cite
Wunsch III, D. C., & Hier, D. B. (2022). Subsumption is a Novel Feature Reduction Strategy for High Dimensionality Datasets. European Scientific Journal, ESJ, 18(4), 20. https://doi.org/10.19044/esj.2022.v18n4p20