Author List: Li, Xiao-Bai; Sarkar, Sumit
MIS Quarterly, 2014, Volume 38, Issue 3, Page 679-698.
Regression techniques can be used not only for legitimate data analysis, but also to infer private information about individuals. In this paper, we demonstrate that regression trees, a popular data-analysis and data-mining technique, can be used to effectively reveal individuals’ sensitive data. This problem, which we call a regression attack, has not been addressed in the data privacy literature, and existing privacy-preserving techniques are not suited to coping with it. We propose a new approach to counter regression attacks. To protect against privacy disclosure, our approach introduces a novel measure, called digression, which assesses the risk of disclosing sensitive values in the process of building a regression tree model. Specifically, we develop an algorithm that uses this measure to prune the tree and thereby limit disclosure of sensitive data. We also propose a dynamic value-concatenation method for anonymizing data, which preserves data utility better than the user-defined generalization schemes commonly used in existing approaches. Our approach can be used to anonymize both numeric and categorical data. An experimental study is conducted using real-world financial, economic, and healthcare data. The results demonstrate that the proposed approach is very effective in protecting data privacy while preserving data quality for research and analysis.
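The idea of a regression attack can be illustrated with a minimal sketch. It is not the authors' algorithm: the toy records, the function names, and the `min_spread` threshold are all hypothetical, and the leaf standard deviation is used only as a simple stand-in for a disclosure check, since the abstract does not define the digression measure. The sketch fits a one-split regression tree on a non-sensitive attribute (age) and shows that each leaf's mean pins down the sensitive value (salary) of the individuals in that leaf, then collapses the split when the leaves are too homogeneous.

```python
from statistics import mean, pstdev

# Hypothetical toy records: (age, salary). Salary is the sensitive attribute.
records = [
    (25, 30_000), (27, 31_000), (29, 30_500),
    (45, 82_000), (47, 80_500), (52, 81_000),
]

def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(rows):
    """Return the age threshold that minimizes the total SSE of both children."""
    ages = sorted({age for age, _ in rows})
    best = None
    for lo, hi in zip(ages, ages[1:]):
        threshold = (lo + hi) / 2
        left = [s for a, s in rows if a <= threshold]
        right = [s for a, s in rows if a > threshold]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, threshold)
    return best[1]

def leaf_summary(rows, threshold):
    """Return (mean, standard deviation) of the sensitive value in each leaf."""
    left = [s for a, s in rows if a <= threshold]
    right = [s for a, s in rows if a > threshold]
    return [(mean(g), pstdev(g)) for g in (left, right)]

def prune_if_risky(rows, threshold, min_spread):
    """Stand-in disclosure check (NOT the paper's digression measure):
    collapse the split into a single root leaf if any leaf's spread
    falls below min_spread, i.e. if a leaf predicts salaries too precisely."""
    leaves = leaf_summary(rows, threshold)
    if any(sd < min_spread for _, sd in leaves):
        salaries = [s for _, s in rows]
        return [(mean(salaries), pstdev(salaries))]
    return leaves

threshold = best_split(records)
leaves = leaf_summary(records, threshold)
pruned = prune_if_risky(records, threshold, min_spread=5_000)

print("split at age", threshold)
for m, sd in leaves:
    print(f"unpruned leaf: mean={m:.0f}, sd={sd:.0f}")
for m, sd in pruned:
    print(f"after pruning: mean={m:.0f}, sd={sd:.0f}")
```

In the unpruned tree, knowing only a person's age narrows the salary estimate to within a few hundred units, which is the disclosure risk the abstract describes. After the split is collapsed, the prediction is far less precise, at the cost of a less informative model.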
Keywords: Privacy; data analytics; data mining; regression; regression trees; anonymization

List of Topics

#215 0.367 data classification statistical regression mining models neural methods using analysis techniques performance predictive networks accuracy method variables prediction problem measure
#97 0.157 set approach algorithm optimal used develop results use simulation experiments algorithms demonstrate proposed optimization present analytical distribution selection number existing
#239 0.147 privacy information concerns individuals personal disclosure protection concern consumers practices control data private calculus regulation risk individual legislation government sensitive
#44 0.088 approach analysis application approaches new used paper methodology simulation traditional techniques systems process based using proposed method present provides various
#6 0.053 data used develop multiple approaches collection based research classes aspect single literature profiles means crowd collected trend accuracy databases accurate