CART (Classification and Regression Trees) is a machine learning algorithm first proposed by Breiman et al. in 1984 and widely used in predictive modeling. Although it is a simple algorithm, CART is important because it lays the foundation for many tree-based methods, such as bagging, random forests, and XGBoost. CART has long been believed to be able to deal with missing data because of its surrogate splits: when an observation is missing a value for a variable used in a split, CART uses other variables whose splits closely mimic that split to send the observation down the tree.
Because missing data can arise through different mechanisms, namely MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random), we conduct simulations using various models to examine CART’s ability to handle each type of missing data, as well as the factors that influence that ability.
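To make the three mechanisms concrete, the following is a minimal sketch of how missingness in one predictor could be generated under each of them. It is an illustration only, not one of the 11 models used in the study: the function name `simulate_missingness`, the variables `x1`, `x2`, and `y`, the logistic missingness model, the slope of 2.0, and the default `target_rate` are all assumptions made for this example.

```python
import math
import random


def simulate_missingness(n=1000, target_rate=0.3, seed=0):
    """Simulate two predictors and an outcome, then delete values of x2
    under MCAR, MAR, and MNAR. Missing values are represented by None.

    All distributions and coefficients here are illustrative assumptions.
    """
    rng = random.Random(seed)

    # Complete data: x2 is correlated with x1, and y depends on both.
    x1 = [rng.gauss(0, 1) for _ in range(n)]
    x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
    y = [a + b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Intercept chosen so the missingness probability equals target_rate
    # when the driving variable is 0; the realized overall rate under
    # MAR and MNAR will differ somewhat from target_rate.
    b0 = math.log(target_rate / (1.0 - target_rate))

    # MCAR: every value of x2 has the same chance of being missing.
    mcar = [None if rng.random() < target_rate else v for v in x2]

    # MAR: the chance that x2 is missing depends only on the observed x1.
    mar = [None if rng.random() < logistic(b0 + 2.0 * a) else v
           for a, v in zip(x1, x2)]

    # MNAR: the chance that x2 is missing depends on the value of x2 itself.
    mnar = [None if rng.random() < logistic(b0 + 2.0 * v) else v
            for v in x2]

    return x1, y, x2, mcar, mar, mnar


if __name__ == "__main__":
    x1, y, x2, mcar, mar, mnar = simulate_missingness()
    for name, col in (("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)):
        rate = sum(v is None for v in col) / len(col)
        print(name, "missing proportion:", round(rate, 3))
```

In this sketch, larger values of `x2` are more likely to be deleted under MNAR, so the observed values are no longer representative of the full variable; under MAR the same kind of distortion arises through `x1`, which remains fully observed.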
In this poster presentation, I will discuss the simulation study I conducted, in which I simulated data using 11 models with varying levels of complexity, sample sizes, and missing proportions. Using mean squared error (MSE) as the criterion, I examine how the models perform relative to one another and what insights the simulation results offer.
Authors: Valerie Huang, Dr. Han Du (Advisor)