Author List: Sinha, Atish P.; May, Jerrold H.
Journal of Management Information Systems, 2004, Volume 21, Issue 3, Page 249-280.
In this study, we conduct an empirical analysis of the performance of five popular data mining methods (neural networks, logistic regression, linear discriminant analysis, decision trees, and nearest neighbor) on two binary classification problems from the credit evaluation domain. Whereas most studies comparing data mining methods have employed accuracy as a performance measure, we argue that, for problems such as credit evaluation, the focus should be on minimizing misclassification cost. We first generate receiver operating characteristic (ROC) curves for the classifiers and use the area under the curve (AUC) measure to compare aggregate performance of the five methods over the spectrum of decision thresholds. Next, using the ROC results, we propose a method for tuning the classifiers by identifying optimal decision thresholds. We compare the methods based on expected costs across a range of cost-probability ratios. In addition to expected cost and AUC, we evaluate the models on the basis of their generalizability to unseen data, their scalability to other problems in the domain, and their robustness against changes in class distributions. We find that logistic regression and neural network models perform best under most conditions. In contrast, decision tree and nearest neighbor models yield higher costs and are much less generalizable and robust than the other models. An important finding of this research is that the models can be effectively tuned post hoc to make them cost-sensitive, even though they were built without incorporating misclassification costs.
Keywords: binary classification; credit evaluation; data mining; decision analysis; misclassification costs; performance evaluation and tuning; predictive models; ROC curves
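
The post hoc threshold tuning described in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the expected-cost formula (class prior times error rate times unit cost, summed over the two error types), the function names, and the toy scores are all assumptions made for illustration. The sketch builds ROC points from classifier scores, computes AUC by the trapezoidal rule, and picks the cutoff with the lowest expected cost.

```python
from typing import List, Sequence, Tuple

RocPoint = Tuple[float, float, float]  # (threshold, false positive rate, true positive rate)


def roc_points(scores: Sequence[float], labels: Sequence[int]) -> List[RocPoint]:
    """One ROC point per distinct score; a case is positive when score >= threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(float("inf"), 0.0, 0.0)]  # cutoff that classifies nothing as positive
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


def auc(points: Sequence[RocPoint]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted((fpr, tpr) for _, fpr, tpr in points)
    return sum((x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:]))


def best_threshold(points: Sequence[RocPoint], p_pos: float,
                   cost_fn: float, cost_fp: float) -> Tuple[float, float]:
    """Return (threshold, expected cost) minimizing the assumed expected cost."""
    def expected_cost(pt: RocPoint) -> float:
        _, fpr, tpr = pt
        # Assumed form: P(pos) * FNR * C_FN + P(neg) * FPR * C_FP
        return p_pos * (1 - tpr) * cost_fn + (1 - p_pos) * fpr * cost_fp
    best = min(points, key=expected_cost)
    return best[0], expected_cost(best)


# Hypothetical scores from an already trained classifier (label 1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
print("AUC:", auc(points))
print("Costly misses:", best_threshold(points, p_pos=0.5, cost_fn=5, cost_fp=1))
print("Costly false alarms:", best_threshold(points, p_pos=0.5, cost_fn=1, cost_fp=5))
```

The classifier itself is never retrained here; only the cutoff moves as the cost ratio or the class prior changes, which is the sense in which a model built without cost information can still be made cost-sensitive afterward.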

List of Topics

#215 0.576 data classification statistical regression mining models neural methods using analysis techniques performance predictive networks accuracy method variables prediction problem measure
#151 0.101 costs cost switching reduce transaction increase benefits time economic production transactions savings reduction impact services reduced affect expected optimal associated
#157 0.081 evaluation effectiveness assessment evaluating paper objectives terms process assessing criteria evaluations methodology provides impact literature potential important evaluated identifying multiple
#209 0.071 results study research information studies relationship size variables previous variable examining dependent increases empirical variance accounting independent demonstrate important addition
#8 0.058 decision making decisions decision-making makers use quality improve performance managers process better results time managerial task significantly help indicate maker