Author List: Li, Xiao-Bai; Sarkar, Sumit
Information Systems Research, 2011, Volume 22, Issue 4, Pages 774-789.
Record linkage techniques have been widely used in areas such as antiterrorism, crime analysis, epidemiologic research, and database marketing. On the other hand, such techniques are also being increasingly used for identity matching that leads to the disclosure of private information. These techniques can be used to effectively reidentify records even in deidentified data. Consequently, the use of such techniques can lead to individual privacy being severely eroded. Our study addresses this important issue and provides a solution to resolve the conflict between privacy protection and data utility. We propose a data-masking method for protecting private information against record linkage disclosure that preserves the statistical properties of the data for legitimate analysis. Our method recursively partitions a data set into smaller subsets such that data records within each subset are more homogeneous after each partition. The partition is made orthogonal to the maximum variance dimension represented by the first principal component in each partitioned set. The attribute values of a record in a subset are then masked using a double-bounded swapping method. The proposed method, which we call multivariate swapping trees, is nonparametric in nature and does not require any assumptions about statistical distributions of the original data. Experiments conducted on real-world data sets demonstrate that the proposed approach significantly outperforms existing methods in terms of both preventing identity disclosure and preserving data quality.
Keywords: data partitioning; data swapping; privacy; record linkage
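
The partition-then-swap idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it makes several assumptions the abstract does not state: each subset is split at the median of its projection onto the first principal component, recursion stops when a subset has fewer than 2 * min_leaf records, and the "double-bounded swapping" step is replaced by a plain random permutation of each attribute within a leaf. The names first_pc, partition, mask, and min_leaf are hypothetical.

```python
import numpy as np


def first_pc(X):
    """Return the unit vector along the maximum-variance direction of X."""
    centered = X - X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def partition(X, idx, min_leaf, leaves):
    """Recursively split the rows in idx with a cut orthogonal to the first PC."""
    if len(idx) < 2 * min_leaf:
        leaves.append(idx)
        return
    sub = X[idx]
    # Project the subset onto its own first principal component.
    scores = (sub - sub.mean(axis=0)) @ first_pc(sub)
    order = np.argsort(scores, kind="stable")
    half = len(idx) // 2  # assumption: cut at the median projection
    partition(X, idx[order[:half]], min_leaf, leaves)
    partition(X, idx[order[half:]], min_leaf, leaves)


def mask(X, min_leaf=5, seed=0):
    """Mask X by swapping attribute values among records of the same leaf."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    leaves = []
    partition(X, np.arange(len(X)), min_leaf, leaves)
    masked = X.copy()
    for leaf in leaves:
        for j in range(X.shape[1]):
            # Stand-in for double-bounded swapping: an independent random
            # permutation of attribute j among the records in this leaf.
            masked[leaf, j] = X[rng.permutation(leaf), j]
    return masked, leaves


if __name__ == "__main__":
    data = np.random.default_rng(1).normal(size=(100, 3))
    masked, leaves = mask(data, min_leaf=5)
    print(len(leaves), "leaves")
    print(np.allclose(masked.mean(axis=0), data.mean(axis=0)))
```

Because this simplified swap only rearranges values within a column, each attribute's marginal distribution is preserved exactly, while cross-attribute relationships are disturbed only within the small, relatively homogeneous leaves. The bounds that the actual method places on swaps are not described in the abstract and are not modeled here.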

List of Topics

#215 0.242 data classification statistical regression mining models neural methods using analysis techniques performance predictive networks accuracy method variables prediction problem measure
#97 0.227 set approach algorithm optimal used develop results use simulation experiments algorithms demonstrate proposed optimization present analytical distribution selection number existing
#126 0.185 data database administration important dictionary organizations activities record increasingly method collection records considered perturbation requirements special level efforts administrators analyzed
#239 0.154 privacy information concerns individuals personal disclosure protection concern consumers practices control data private calculus regulation risk individual legislation government sensitive
#68 0.064 business units study unit executives functional managers technology linkage need areas information long-term operations plans mission large understand knowledge current