Machine Learning Model Generation With Copula-Based Synthetic Dataset for Local Differentially Private Numerical Data

Yuichi Sei*, J. Andrew Onesimu, Akihiko Ohsuga

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

10 Citations (Scopus)

Abstract

With the spread of IoT technology, personal data are being collected in many places. These data can be used to create new services, but individuals' privacy must be protected. Differential privacy allows personal data to be collected safely by adding noise to them. However, because such data are very noisy, the accuracy of machine learning models trained on them decreases greatly. In this study, our objective is to build a highly accurate machine learning model from these data. We focus on the decision tree algorithm and, instead of applying it directly, use a preprocessing technique in which pseudodata are generated using a copula while the effect of the noise added by differential privacy is removed. Specifically, the proposed protocol consists of three steps: generating a covariance matrix from the differentially private numerical data, generating a discrete cumulative distribution function from the differentially private numerical data, and generating copula-based numerical samples. Simulation results on synthetic and real datasets verify the utility of the proposed method not only for the decision tree algorithm but also for other machine learning algorithms such as deep neural networks. This method will help create machine learning models, such as recommendation systems, from differentially private data.
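
The three-step idea in the abstract can be illustrated with a minimal sketch. It is not the authors' protocol. It assumes two attributes scaled to [0, 1], a Laplace mechanism applied per attribute with a per-attribute budget `epsilon`, and a Gaussian copula. All function names are hypothetical. The noise-aware estimation of the discrete cumulative distribution function (the second step) is not reproduced here: the marginal CDFs are simply passed in as inputs. Using the denoised Pearson correlation directly as the copula parameter is a further simplification.

```python
import math
import random
from statistics import NormalDist


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def perturb(records, epsilon, rng):
    # Assumption: each value lies in [0, 1] (sensitivity 1), epsilon is per attribute.
    scale = 1.0 / epsilon
    noisy = [[v + laplace_noise(scale, rng) for v in row] for row in records]
    return noisy, scale


def denoised_correlation(noisy, scale):
    # Independent noise leaves the covariance unbiased but inflates each
    # variance by the Laplace variance 2 * scale**2, so subtract that term.
    n = len(noisy)
    xs = [r[0] for r in noisy]
    ys = [r[1] for r in noisy]
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    noise_var = 2.0 * scale ** 2
    vx = sum((x - mx) ** 2 for x in xs) / (n - 1) - noise_var
    vy = sum((y - my) ** 2 for y in ys) / (n - 1) - noise_var
    if vx <= 0 or vy <= 0:
        return 0.0  # too noisy to estimate; fall back to independence
    return max(-0.99, min(0.99, cov / math.sqrt(vx * vy)))


def inverse_discrete_cdf(u, bin_edges, cdf):
    # cdf[i] = P(X <= bin_edges[i + 1]); return the midpoint of the selected bin.
    for i, c in enumerate(cdf):
        if u <= c:
            return 0.5 * (bin_edges[i] + bin_edges[i + 1])
    return 0.5 * (bin_edges[-2] + bin_edges[-1])


def sample_gaussian_copula(rho, marginals, n, rng):
    # Draw correlated normals, map them to uniforms, then through each marginal.
    std = NormalDist()
    samples = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        u = (std.cdf(z1), std.cdf(z2))
        samples.append([inverse_discrete_cdf(ui, edges, cdf)
                        for ui, (edges, cdf) in zip(u, marginals)])
    return samples


if __name__ == "__main__":
    rng = random.Random(0)
    true_data = [[x, x] for x in (rng.random() for _ in range(5000))]
    noisy, scale = perturb(true_data, epsilon=5.0, rng=rng)
    rho = denoised_correlation(noisy, scale)
    # Placeholder marginals: two equal-mass bins per attribute (illustrative only).
    marginals = [([0.0, 0.5, 1.0], [0.5, 1.0])] * 2
    synthetic = sample_gaussian_copula(rho, marginals, 1000, rng)
    print(rho, synthetic[:3])
```

In such a pipeline, the synthetic records would then be used in place of the noisy ones to train a decision tree or another learner.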

Original language: English
Pages (from-to): 101656-101671
Number of pages: 16
Journal: IEEE Access
Volume: 10
Publication status: Published - 2022

All Science Journal Classification (ASJC) codes

  • General Computer Science
  • General Materials Science
  • General Engineering
  • Electrical and Electronic Engineering
