Author List: Li, Xiao-Bai; Sarkar, Sumit
Information Systems Research, 2006, Volume 17, Issue 3, Page 254-270.
To respond to growing concerns about privacy of personal information, organizations that use their customers' records in data-mining activities are forced to take actions to protect the privacy of the individuals involved. A common practice for many organizations today is to remove identity-related attributes from the customer records before releasing them to data miners or analysts. We investigate the effect of this practice and demonstrate that many records in a data set could be uniquely identified even after identity-related attributes are removed. We propose a perturbation method for categorical data that can be used by organizations to prevent or limit disclosure of confidential data for identifiable records when the data are provided to analysts for classification, a common data-mining task. The proposed method attempts to preserve the statistical properties of the data based on privacy protection parameters specified by the organization. We show that the problem can be solved in two phases, with a linear programming formulation in Phase I (to preserve the first-order marginal distribution), followed by a simple Bayes-based swapping procedure in Phase II (to preserve the joint distribution). Experiments conducted on several real-world data sets demonstrate the effectiveness of the proposed method.
Keywords: Bayesian Estimation; Data Confidentiality; Data Mining; Data Swapping; Linear Programming; Privacy
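
The abstract's claim that records can remain uniquely identifiable after identity-related attributes are removed can be illustrated with a small sketch. This is not the authors' procedure: the function name `unique_record_fraction`, the attribute names, and the toy records below are all hypothetical, chosen only to show how a combination of the remaining categorical attributes can single out one record.

```python
from collections import Counter


def unique_record_fraction(records, attributes):
    """Return the fraction of records whose combination of values on
    `attributes` appears exactly once in the data set."""
    if not records:
        return 0.0
    # Build one key per record from the chosen categorical attributes.
    keys = [tuple(record[a] for a in attributes) for record in records]
    counts = Counter(keys)
    # A record is uniquely identifiable if no other record shares its key.
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(records)


# Hypothetical toy data: names and IDs have already been removed.
records = [
    {"zip": "75080", "age_group": "30-39", "gender": "F", "income": "high"},
    {"zip": "75080", "age_group": "30-39", "gender": "M", "income": "low"},
    {"zip": "75080", "age_group": "30-39", "gender": "M", "income": "high"},
    {"zip": "75081", "age_group": "40-49", "gender": "F", "income": "low"},
]

print(unique_record_fraction(records, ["zip", "age_group", "gender"]))
```

In this toy set, two of the four records have a unique combination of zip, age group, and gender, so anyone who knows those three values for a person could read off that person's confidential income value. Adding more attributes to the combination generally makes more records unique.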

List of Topics

#126 0.409 data database administration important dictionary organizations activities record increasingly method collection records considered perturbation requirements special level efforts administrators analyzed
#97 0.181 set approach algorithm optimal used develop results use simulation experiments algorithms demonstrate proposed optimization present analytical distribution selection number existing
#215 0.121 data classification statistical regression mining models neural methods using analysis techniques performance predictive networks accuracy method variables prediction problem measure
#239 0.097 privacy information concerns individuals personal disclosure protection concern consumers practices control data private calculus regulation risk individual legislation government sensitive
#137 0.072 phase study analysis business early large types phases support provided development practice effectively genres associated different sensemaking including form technologies